WO2020119350A1 - Video classification method, apparatus, computer device, and storage medium - Google Patents

Video classification method, apparatus, computer device, and storage medium (视频分类方法、装置、计算机设备和存储介质)

Info

Publication number
WO2020119350A1
Authority
WO
WIPO (PCT)
Prior art keywords
classification
image
video
classification result
audio
Prior art date
Application number
PCT/CN2019/116660
Other languages
English (en)
French (fr)
Inventor
屈冰欣
郑茂
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2020119350A1
Priority to US17/192,580 (published as US20210192220A1)


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7847Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/2431Multiple classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/254Fusion techniques of classification results, e.g. of results related to same input data
    • G06F18/256Fusion techniques of classification results, e.g. of results related to same input data of results relating to different input data, e.g. multimodal recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/809Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
    • G06V10/811Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data the classifiers operating on different input data, e.g. multi-modal recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/635Overlay text, e.g. embedded captions in a TV program
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the embodiments of the present application relate to the field of video classification, and in particular, to a video classification method, device, computer device, and storage medium.
  • the recommendation function is a common function in video applications, which is used to recommend videos of interest to users, and the quality of the recommendation function is closely related to the accuracy of video classification.
  • In the related art, videos are classified based on image recognition: the image frames in the video are extracted, the image features of the image frames are extracted, the image features are input into a long short-term memory (LSTM) network according to the timing of the image frames, and the video category is then determined according to the output of the LSTM network.
  • a video classification method, device, computer device, and storage medium are provided.
  • A video classification method, executed by a computer device, the method including: acquiring a target video; classifying image frames in the target video through a first classification model to obtain an image classification result, the first classification model being used to classify based on image features of the image frames; classifying audio in the target video through a second classification model to obtain an audio classification result, the second classification model being used to classify based on audio features of the audio; classifying text description information corresponding to the target video through a third classification model to obtain a text classification result, the third classification model being used to classify based on text features of the text description information; and
  • determining the target classification result of the target video according to the image classification result, the audio classification result, and the text classification result.
  • a video classification device includes:
  • Video acquisition module for acquiring target video
  • the first classification module is used to classify the image frames in the target video through a first classification model to obtain an image classification result, and the first classification model is used to classify based on image features of the image frames;
  • a second classification module used to classify the audio in the target video through a second classification model to obtain an audio classification result, and the second classification model is used to classify based on the audio characteristics of the audio;
  • a third classification module used to classify the text description information corresponding to the target video through a third classification model to obtain a text classification result, and the third classification model is used to classify based on the text features of the text description information;
  • the target classification module is used to determine the target classification result of the target video according to the image classification result, the audio classification result and the text classification result.
  • A computer device includes a processor and a memory, the memory storing computer-readable instructions that, when executed by the processor, cause the processor to perform the steps of the video classification method.
  • A non-volatile computer-readable storage medium stores computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the video classification method.
  • FIG. 1 shows a schematic diagram of an implementation environment provided by an embodiment of the present application
  • FIG. 2 shows a schematic flowchart of video recommendation by a computer device in an embodiment
  • FIG. 3 shows a flowchart of a video classification method provided by an embodiment of the present application
  • FIG. 4 shows a schematic diagram of the principle of the video classification process in an embodiment
  • FIG. 5 shows a flowchart of a video classification process based on image frames in an embodiment
  • FIG. 6 shows a schematic structural diagram of an initial residual network and its Stem layer in an embodiment
  • FIG. 7 is a schematic structural diagram of a target detection network provided by an exemplary embodiment
  • FIG. 8 shows a flowchart of a video classification process based on audio in an embodiment
  • FIG. 9 shows a flowchart of a video classification process based on text description information in an embodiment
  • FIG. 10 shows an implementation schematic diagram of a video classification process through Bi-LSTM and attention mechanism in an embodiment
  • FIG. 11 shows a block diagram of a video classification device provided by an embodiment of the present application.
  • FIG. 12 shows a block diagram when the computer device provided by an embodiment of the present application is specifically implemented as a server.
  • Convolutional layer: consists of the weights and offset (bias) terms of the convolution kernel. The output of the previous layer (also called a feature map) is convolved with the kernel, the offset term is added, and the result is passed through an activation function to obtain the output feature map, which can be expressed as shown below.
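  • A commonly used form of this expression (reconstructed here in standard notation, since the formula itself is not reproduced above; the symbols x, k, b, f, and M_j are the usual ones for this definition rather than quoted from the application) is:

$$x_j^{l} = f\Big(\sum_{i \in M_j} x_i^{l-1} * k_{ij}^{l} + b_j^{l}\Big)$$

  where x_j^l is the j-th output feature map of layer l, M_j is the set of input feature maps connected to it, k_{ij}^l are the convolution kernel weights, b_j^l is the offset term, and f is the activation function.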
  • Pooling layer: used for downsampling operations. Common pooling methods include maximum pooling, summation pooling, and average pooling.
  • MFCC: Mel-Frequency Cepstral Coefficients
  • LSTM: Long Short-Term Memory network
  • Bi-LSTM: Bidirectional Long Short-Term Memory network
  • FIG. 1 shows a schematic diagram of an implementation environment provided by an embodiment of the present application.
  • the implementation environment includes a terminal 120 and a server 140.
  • the terminal 120 is an electronic device with a video playback function, and the electronic device may be a smart phone, a tablet computer, a personal computer, or the like.
  • the terminal 120 is a smartphone as an example.
  • the video playback function of the terminal 120 may be implemented by a third-party application program.
  • the third-party application program may be a video playback application, a page browsing application, a news reading application, a short video application, etc. This embodiment of the present application does not limit this.
  • The terminal 120 also has a video upload function. With the help of the video upload function, the terminal 120 can upload recorded videos or locally stored videos to the server 140. In addition, the server 140 may push received videos to other terminals for sharing and playback.
  • the terminal 120 and the server 140 are connected through a wired or wireless network.
  • The server 140 is an independent server, a server cluster composed of several servers, or a cloud computing center.
  • the server 140 may be a background server of a third-party application program in the terminal 120, and is used to recommend to the terminal 120 videos of interest to its users.
  • The server 140 in the embodiment of the present application has a video classification function. Through the video classification function, the server 140 classifies videos (which may be videos captured by the server from the network, or videos uploaded by the terminal 120) into at least one of the predetermined classification categories, and subsequent video recommendation is performed based on the category corresponding to each video.
  • the server 140 also has the function of generating user portraits.
  • the user portrait is generated based on the user's historical video viewing record, and is used to describe the user's video viewing preference.
  • the server 140 performs video recommendation based on the category corresponding to the video and the user portrait.
  • the wireless or wired network described above uses standard communication technologies and/or protocols.
  • The network is usually the Internet, but it may also be any other network, including but not limited to a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), a mobile, wired, or wireless network, a private network, or any combination of virtual private networks.
  • Technologies and/or formats including HyperText Markup Language (HTML), Extensible Markup Language (XML), and the like are used to represent data exchanged over the network.
  • In addition, conventional encryption technologies such as Secure Socket Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), and Internet Protocol Security (IPsec) may be used to encrypt the data exchanged over the network.
  • the video classification methods provided by the embodiments of the present application are executed by the server 140 in FIG. 1.
  • the video classification method provided in this embodiment of the present application may be used in scenes such as video recommendation scenes or user portrait construction scenes that need to be applied to video categories.
  • the following describes video classification methods in different application scenarios.
  • The server first classifies the original video 20 from the image dimension, the audio dimension, and the text dimension: the image classification model 211 performs image feature extraction and classification on the image frames to obtain the image classification result 212; the audio classification model 221 performs audio feature extraction and classification on the audio of the original video 20 to obtain the audio classification result 222; and the text classification model 231 performs text feature extraction and classification on the text description information of the original video 20 to obtain the text classification result 232.
  • The server then fuses the image classification result 212, the audio classification result 222, and the text classification result 232 to obtain the target classification result 24 of the original video 20, determines the target category 25 of the original video 20 according to the probability corresponding to each category indicated by the target classification result 24, and stores the original video 20 in association with the target category 25.
  • The server's recommendation system 26 obtains the user portrait 27 of the current user (which may be generated based on the user's historical viewing records), and recommends to the user videos matching the video categories of interest indicated by the user portrait 27.
  • the user portrait is used to describe the user's video viewing preferences, and its accuracy is closely related to the accuracy of the video classification.
  • The server first performs multi-dimensional classification of the original video from the image dimension, the audio dimension, and the text dimension, and then comprehensively determines the target category of the original video according to the classification results in the different dimensions.
  • When constructing a user portrait, the server obtains the user's operation behavior (such as watching or ignoring) on the recommended video, determines the user's preference for the video category corresponding to the recommended video according to the operation behavior, and then constructs the user portrait based on the preference corresponding to each video category, for use in subsequent video recommendation.
  • The video classification method provided in this embodiment of the present application can also be used in video sorting scenarios (integrating similar videos based on video categories), video search scenarios (returning videos of the corresponding video category based on search keywords), and other scenarios that make use of video categories.
  • the embodiments of the present application do not limit specific application scenes.
  • When the server performs video classification based only on the image features of the video, the classification effect for videos with similar pictures but large audio differences is poor.
  • For example, when an ordinary selfie video and a selfie video with a funny voiceover are classified based on image features alone, both will be classified as "selfie" because their image features are similar; in reality, however, the selfie video with the funny voiceover should be classified as "funny".
  • In the embodiments of the present application, the server adds audio features and text features to video classification on the basis of image features, which makes up for the limitations of video classification based solely on image features and thereby improves the accuracy of video classification. The improvement is particularly noticeable when classifying videos whose pictures are similar but whose audio or text differs greatly.
  • FIG. 3 shows a flowchart of a video classification method provided by an embodiment of the present application.
  • This embodiment is exemplified by the method applied to the server 140 in FIG. 1.
  • the method may include the following steps:
  • Step 301 Obtain the target video.
  • the target video is a video that the server pulls from the network, or a video uploaded by the terminal.
  • the embodiment of the present application does not limit the source of the target video.
  • the server is the background server of the short video application
  • the target video is the video recorded by the user using the short video application.
  • For the acquired target video, the server performs image feature extraction and classification, audio feature extraction and classification, and text feature extraction and classification through the following steps 302 to 304. There is no strict execution order among steps 302 to 304; in the embodiments of the present application, steps 302 to 304 being executed simultaneously is taken as an example for description.
  • Step 302 Classify the image frames in the target video through the first classification model to obtain an image classification result.
  • the first classification model is used to classify based on the image features of the image frames.
  • the first classification model includes a deep learning network for extracting image features and a classifier for classifying based on the image features.
  • Optionally, the server inputs the image frames into the first classification model, the deep learning network in the first classification model extracts the image features of the image frames, and the classifier further classifies the image features to obtain the image classification result.
  • the image classification result includes various preset classification categories and their corresponding probabilities
  • the preset classification categories are the classification categories of the videos that are pre-divided.
  • the preset classification category includes at least one of the following: selfie, funny, animation, game, dubbing, basketball, football, variety show, movie.
  • the first classification model is trained based on sample image frames labeled with sample categories.
  • Step 303 Classify the audio in the target video through the second classification model to obtain an audio classification result.
  • the second classification model is used to classify based on the audio features of the audio.
  • the second classification model includes a neural network (such as LSTM) for extracting audio features and a classifier for classifying based on the audio features.
  • a neural network such as LSTM
  • the server inputs the audio into the second classification model, and the neural network in the second classification model extracts the audio features of the audio, and further classifies the audio features through the classifier to obtain the audio classification result.
  • the audio classification result includes various preset classification categories and their corresponding probabilities, and the preset classification categories are the classification categories of the pre-divided video.
  • the second classification model is obtained based on sample audio labeled with sample categories.
  • Step 304 Classify the text description information corresponding to the target video through the third classification model to obtain a text classification result.
  • the third classification model is used to classify based on the text features of the text description information.
  • the third classification model includes a neural network (such as LSTM) for extracting text features and a classifier for classifying based on text features.
  • LSTM neural network
  • Optionally, the server inputs the text description information into the third classification model, the neural network in the third classification model extracts the text features of the text description information, and the classifier further classifies the text features to obtain the text classification result.
  • the text classification result includes various preset classification categories and their corresponding probabilities, and the preset classification categories are the classification categories of the videos that are pre-divided.
  • the text description information includes at least one of a video title of the target video, video content description information, video background music information, and video publisher information
  • Optionally, the third classification model is trained based on sample text labeled with sample categories.
  • Step 305 Determine the target classification result of the target video according to the image classification result, audio classification result, and text classification result.
  • After the server obtains the classification results based on the different dimensions, it further fuses the image classification result, the audio classification result, and the text classification result, and finally determines the target classification result of the target video.
  • the server inputs the fused classification result into the pre-trained classifier, thereby obtaining the target classification result output by the classifier.
  • Optionally, each classification model, as well as the classifier used when classifying the fused classification results, may be a softmax classification model.
  • The hypothesis function of the softmax classification model is as follows:

$$h_\theta\big(x^{(i)}\big)=\begin{bmatrix} p\big(y^{(i)}=1\mid x^{(i)};\theta\big)\\ p\big(y^{(i)}=2\mid x^{(i)};\theta\big)\\ \vdots\\ p\big(y^{(i)}=k\mid x^{(i)};\theta\big)\end{bmatrix}=\frac{1}{\sum_{j=1}^{k} e^{\theta_j^{T} x^{(i)}}}\begin{bmatrix} e^{\theta_1^{T} x^{(i)}}\\ e^{\theta_2^{T} x^{(i)}}\\ \vdots\\ e^{\theta_k^{T} x^{(i)}}\end{bmatrix}$$

  where exp(·) denotes the exponential function based on the natural constant e, θ is the model training parameter, and the superscript T denotes the transpose.
  • During training, the cost function used is as follows:

$$J(\theta)=-\frac{1}{m}\left[\sum_{i=1}^{m}\sum_{j=1}^{k} 1\{y^{(i)}=j\}\,\log\frac{e^{\theta_j^{T} x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^{T} x^{(i)}}}\right]$$

  where x^(i) is the input parameter, y^(i) is the output parameter, m is the number of training samples in the training set, k is the number of classification categories, and 1{·} is the indicator function.
  • the target classification result includes the probability corresponding to at least two preset classification categories.
  • the server determines the n preset classification categories with the highest probability as the target category of the target video, where n is an integer greater than or equal to 1.
  • For example, selfie video A corresponds to a higher probability of "selfie" in the text classification result,
  • while selfie video B corresponds to a higher probability of "funny" in the text classification result.
  • Accordingly, the server fuses the classification results of selfie video A and determines its target category as "selfie", and fuses the classification results of selfie video B and determines its target category as "funny".
  • It can be seen that integrating audio features and text features to classify videos can use the complementarity between different video modalities (the image modality, the audio modality, and the text modality) to improve the accuracy of video classification.
  • the image frames are classified by the first classification model to obtain the image classification result
  • the audio is classified by the second classification model to obtain the audio classification result
  • and the text description information is classified by the third classification model to obtain the text classification result, so that the target classification result of the target video is determined according to the image classification result, the audio classification result, and the text classification result. Compared with the related art, in which video classification is performed based only on image features of the video,
  • the embodiments of the present application classify based on the image features, audio features, and text features of the video, fully considering the features of different dimensions of the video and thereby improving the accuracy of video classification.
  • Optionally, the server stitches the probabilities corresponding to each classification category in the image classification result, the audio classification result, and the text classification result according to a predetermined order to obtain a classification feature vector, where the predetermined order is: the image classification result, then the audio classification result, then the text classification result.
  • the preset classification categories of video are selfie, game, sports, beauty, and funny
  • the image classification result is [selfie (0.95), game (0.01), sports (0.01), beauty (0.02), funny (0.01)],
  • the audio classification result is [selfie (0.05), game (0.01), sports (0.01), beauty (0.03), funny (0.90)],
  • and the text classification result is [selfie (0.1), game (0.01), sports (0.01), beauty (0.03), funny (0.85)],
  • then the classification feature vector generated by the server is (0.95, 0.01, 0.01, 0.02, 0.01, 0.05, 0.01, 0.01, 0.03, 0.90, 0.1, 0.01, 0.01, 0.03, 0.85).
  • the target classifier is constructed based on the softmax classification model.
  • the server constructs the original softmax classification model in advance, and trains the original softmax classification model according to the sample classification feature vector labeled with the video category to obtain the target classifier.
  • The server inputs the generated classification feature vector into the target classifier to obtain the target classification result output by the target classifier.
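  • A minimal sketch of this fusion step is shown below (illustrative only; the variable names, the five-category example, and the randomly initialized weights W and b are assumptions standing in for the trained target classifier, not the application's actual parameters):

```python
import numpy as np

# Probabilities over the five preset categories [selfie, game, sports, beauty, funny],
# taken from the example above, stitched in the fixed order image -> audio -> text.
image_result = np.array([0.95, 0.01, 0.01, 0.02, 0.01])
audio_result = np.array([0.05, 0.01, 0.01, 0.03, 0.90])
text_result = np.array([0.10, 0.01, 0.01, 0.03, 0.85])

feature_vector = np.concatenate([image_result, audio_result, text_result])  # 15-dimensional

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# W and b stand in for the trained parameters of the target classifier;
# in practice they are learned from sample classification feature vectors.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(5, 15))
b = np.zeros(5)

target_result = softmax(W @ feature_vector + b)  # probability for each preset category
target_category = int(target_result.argmax())    # index of the most probable category
```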
  • Before performing image feature extraction and classification on the image frames, the server first extracts RGB image frames 411 and RGB difference image frames 412 from the target video, and inputs the RGB image frames 411 and the RGB difference image frames 412 into a residual network (ResNet) 413 for feature extraction. The image features extracted from the RGB image frames 411 are input into the RGB classifier 414A to obtain the first image classification result 414B, and the image features extracted from the RGB difference image frames 412 are input into the RGB difference classifier 415A to obtain the second image classification result 415B.
  • the above step 302 may include the following steps.
  • Step 302A Determine the original image frame extracted from the target video as an RGB image frame.
  • Optionally, the server extracts original image frames from the target video according to a predetermined sampling interval, and determines the extracted original image frames as RGB image frames.
  • For example, the predetermined sampling interval is 1 s, that is, the server extracts one original image frame every second.
  • the server may also dynamically determine the sampling interval according to the video length of the target video, where the sampling interval has a positive correlation with the video length, that is, the longer the video, the longer the sampling interval. This application does not limit the specific method of extracting the original image frame.
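  • A minimal sketch of such interval-based frame sampling, assuming OpenCV is used for decoding (the helper name and the fixed 1 s interval are illustrative choices, not prescribed by the application):

```python
import cv2

def sample_rgb_frames(video_path, interval_s=1.0):
    """Extract one original frame every `interval_s` seconds (hypothetical helper)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0          # fall back if FPS is unavailable
    step = max(int(round(fps * interval_s)), 1)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # OpenCV decodes as BGR
        index += 1
    cap.release()
    return frames
```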
  • Step 302B the RGB image frames are classified by the residual network and the RGB classifier in the first classification model to obtain a first image classification result, and the RGB classifier is used to classify based on static image features.
  • In a possible implementation, the first classification model includes a pre-trained residual network and an RGB classifier. After the server extracts the RGB image frames, the (static) image features of the RGB images are extracted through the residual network, and the (static) image features are further classified through the RGB classifier to obtain a first image classification result indicating the category to which the static images belong; the classification categories in the first image classification result are the same as the preset classification categories.
  • Optionally, the residual network may use Inception-ResNet or another deep convolutional neural network, and the RGB classifier may use a softmax classification model, which is not limited in the embodiments of the present application.
  • the residual network includes an input layer 60, a Stem layer 61, a first residual layer 62, a first dimension reduction layer 63, a second residual layer 64, a second dimension reduction layer 65, a third residual layer 66, and pooling Layer 67, dropout layer 68, and classification layer 69.
  • the input layer 60 is used for input image frames.
  • the server combines the pixel values of the three channels R, G, and B in the RGB image frame into a one-dimensional array and inputs the input layer 60.
  • The data received by the input layer 60 is 299 (the width of the RGB image frame) × 299 (the height of the RGB image frame) × 3 (the number of channels).
  • Stem layer 61 is used to preprocess the data, which includes multiple convolutions and two pooling.
  • In the Stem layer, an optimized convolution form of 7×1 + 1×7 is used for convolution, and a parallel "convolution + pooling" structure is used to prevent bottlenecks.
  • the first residual layer 62 contains 5 residual blocks (for convolution processing), the second residual layer 64 contains 10 residual blocks, and the third residual layer 66 contains 5 residual blocks.
  • the first dimension reduction layer 63 is used for dimension reduction of the output of the first residual layer 62
  • the second dimension reduction layer 65 is used for dimension reduction of the output of the second residual layer 64 to reduce the amount of calculation.
  • the pooling layer 67 is used for down-sampling the output of the third residual layer 66, and the pooling layer 67 here uses average pooling.
  • The dropout layer 68 is used to set part of the input data to 0 according to the keep parameter, so as to prevent overfitting. For example, when the keep parameter is 0.8, 20% of the input data is set to 0 at the dropout layer 68.
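  • A rough sketch of the tail of this network (average pooling, dropout with keep = 0.8, and a softmax classification layer), written in PyTorch purely for illustration; the channel count and class count are assumed values:

```python
import torch
import torch.nn as nn

class RgbClassifierHead(nn.Module):
    """Pooling -> dropout -> softmax head, mirroring layers 67-69 described above."""
    def __init__(self, in_channels=1536, num_classes=5, keep=0.8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)             # average pooling (layer 67)
        self.drop = nn.Dropout(p=1.0 - keep)            # keep=0.8 -> 20% of inputs zeroed (layer 68)
        self.fc = nn.Linear(in_channels, num_classes)   # classification layer (layer 69)

    def forward(self, feature_map):                     # (batch, C, H, W) from the residual layers
        x = self.pool(feature_map).flatten(1)
        x = self.drop(x)
        return torch.softmax(self.fc(x), dim=-1)
```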
  • Step 302C Generate RGB difference image frames according to two adjacent original image frames in the target video.
  • the server further classifies based on the dynamic image features of the video screen.
  • the RGB difference image frame is generated by the difference operation (subtraction of RGB pixel values) of two adjacent original image frames, and is used to represent the difference between the two original image frames, which can be expressed as:
  • rgbdiff_t = rgb_{t+1} - rgb_t
  • where rgbdiff_t is the RGB difference image frame, rgb_{t+1} is the original image frame at time t+1, rgb_t is the original image frame at time t, and times t and t+1 are sampling times.
  • For example, if the pixel value rgb_t of a pixel at time t is (100, 100, 100) and the pixel value rgb_{t+1} of the same pixel at time t+1 is (150, 160, 170), the calculated rgbdiff_t is (50, 60, 70).
  • the image feature extraction of the RGB difference image can obtain the dynamic image feature of the target video.
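  • A minimal sketch of the difference operation, assuming frames are given as NumPy arrays (the helper name is illustrative):

```python
import numpy as np

def rgb_diff(frame_t, frame_t1):
    """rgbdiff_t = rgb_{t+1} - rgb_t, computed per pixel and per channel."""
    # Cast to a signed type so that negative differences are not wrapped around.
    return frame_t1.astype(np.int16) - frame_t.astype(np.int16)

# Example from the text: a pixel (100, 100, 100) at time t and (150, 160, 170) at t+1
# yields a difference of (50, 60, 70).
```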
  • Step 302D the RGB difference image frames are classified by the residual network and the RGB difference classifier in the first classification model to obtain a second image classification result, and the RGB difference classifier is used for classification based on dynamic image features.
  • the first classification model includes a pre-trained residual network and an RGB difference classifier.
  • After the server generates the RGB difference image frames, the (dynamic) image features of the RGB difference images are extracted through the residual network, and the (dynamic) image features are further classified by the RGB difference classifier to obtain a second image classification result indicating the category to which the dynamic images belong; the classification categories in the second image classification result are the same as the preset classification categories.
  • the same residual network or different residual networks may be used for the image feature extraction of the RGB image frame and the RGB difference image frame, which is not limited in this application.
  • the server performs classification based on RGB image frames and RGB difference image frames, comprehensively considering static image features and dynamic image features, thereby improving the comprehensiveness of subsequent image dimension classification.
  • When feature extraction is performed on an RGB image frame or an RGB difference image frame, the features of the entire image are obtained (that is, the entire image is of interest); accordingly, the subsequent classifier can only classify based on the overall image features.
  • the server when the server extracts image features, it not only focuses on the entire image, but also focuses on specific targets in the image, and classifies based on the detection results of the feature targets in the image.
  • The server further inputs the RGB image frame 411 into the target detection network 416 for fine-grained feature extraction, and inputs the extracted fine-grained image features into the fine-grained classifier 417A to obtain the third image classification result 417B.
  • the following steps may also be included.
  • Step 302E Classify the RGB image through the target detection network and the fine-grained classifier in the first classification model to obtain a third image classification result.
  • the target detection network is used to extract the fine-grained image features of the target object in the RGB image.
  • The fine-grained classifier is used to classify based on the fine-grained image features.
  • In a possible implementation, the first classification model further includes a target detection network and a fine-grained classifier, where the target detection network may be a region-based convolutional neural network (RCNN, including RCNN, Fast RCNN, and Faster RCNN), a You Only Look Once (YOLO) network, or a Single Shot MultiBox Detector (SSD) network; this embodiment does not limit the specific type of the target detection network.
  • The target detection network can detect target information such as the type of the target object, the position of the target frame, and the confidence level in the RGB image frame, so as to determine the fine-grained features of the RGB image frame based on the target information.
  • Optionally, the fine-grained features of the RGB image frame include the following (a small computational sketch follows this list):
  • Whether the target object appears in the RGB image frame: 0 indicates that the target object does not appear in the RGB image frame, and 1 indicates that the target object appears in the RGB image frame.
  • The area proportion of the target object: the maximum proportion of the RGB image frame occupied by the target frame corresponding to the target object.
  • The relative displacement of the target object: the displacement of the center point of the target frame corresponding to the target object between two adjacent RGB image frames.
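  • Purely as an illustration of how these three features could be computed from detector outputs (the function name, the detection tuple format, and the choice of the first detection for the displacement are all assumptions):

```python
def fine_grained_features(dets_t, dets_t1, frame_w, frame_h):
    """Hypothetical per-class features derived from detector outputs.

    Each detection is assumed to be a tuple (x1, y1, x2, y2, confidence).
    """
    appears = 1 if dets_t else 0                           # 0/1: does the target object appear
    frame_area = float(frame_w * frame_h)
    max_ratio = max(((x2 - x1) * (y2 - y1) / frame_area    # largest box area ratio
                     for x1, y1, x2, y2, _ in dets_t), default=0.0)

    def center(box):
        x1, y1, x2, y2, _ = box
        return (x1 + x2) / 2.0, (y1 + y2) / 2.0

    if dets_t and dets_t1:                                 # displacement between adjacent frames
        (cx0, cy0), (cx1, cy1) = center(dets_t[0]), center(dets_t1[0])
        displacement = ((cx1 - cx0) ** 2 + (cy1 - cy0) ** 2) ** 0.5
    else:
        displacement = 0.0
    return [appears, max_ratio, displacement]
```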
  • the RGB image frame 71 is first subjected to a convolution process at the convolution layer 72 to output a feature map 73 for representing the image features.
  • The feature map 73 is input into the region prediction network 74 (used to predict the region of the target object in the image), and the region prediction network 74 outputs the prediction map 75.
  • The fused prediction map 75 and feature map 73 are then subjected to region of interest pooling (RoI pooling) to determine the fine-grained features of the RGB image frame 71, and the fine-grained classifier 76 classifies the fine-grained features to obtain the third image classification result.
  • It can be seen that, in addition to extracting overall image features, the server extracts fine-grained image features from the RGB image frames through the target detection network and fuses the extracted fine-grained image features for classification, further improving the accuracy and comprehensiveness of the image classification results.
  • When performing audio feature extraction and classification on the audio of the target video, the server first extracts the MFCC features 421 of the audio, and then performs feature extraction on the MFCC features 421 through the VGGish network 422 to obtain the VGGish features 423. Further, the server classifies the VGGish features 423 using the general classifier 425A and the specific classifier 426B, and finally obtains the first audio classification result 425A and the second audio classification result 425B. As shown in FIG. 8, the above step 303 may include the following steps.
  • Step 303A extract the MFCC of the audio.
  • Optionally, the server separates the audio from the target video, samples the audio according to a predetermined sampling frequency, and then performs pre-emphasis, framing, windowing, FFT, Mel filter bank, and DCT processing to obtain the MFCC of the audio.
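  • A minimal sketch of this extraction using the librosa library (the sampling rate and number of coefficients are illustrative choices; librosa performs the framing, windowing, FFT, Mel filtering, and DCT steps internally, while pre-emphasis would need to be applied separately if required):

```python
import librosa

def extract_mfcc(audio_path, sr=16000, n_mfcc=13):
    """Load the separated audio track and compute its MFCC matrix."""
    y, sr = librosa.load(audio_path, sr=sr)                  # resample to the chosen rate
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, n_frames)
```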
  • Step 303B Perform feature extraction on the MFCC through the VGGish network in the second classification model to obtain VGGish features.
  • the second classification model in the embodiment of the present application includes a VGGish network for feature extraction and a two-layer classifier for classification.
  • the server inputs the MFCC into the VGGish network, thereby obtaining the VGGish features output by the VGGish network.
  • the VGGish network may use an existing network structure.
  • the embodiment of the present application does not limit the specific structure of the VGGish network.
  • For example, 128 × 60-dimensional features can be extracted through the VGGish network.
  • Step 303C Classify the VGGish feature by a general classifier in the second classification model to obtain the first audio classification result.
  • Step 303D Classify the VGGish feature through at least one specific classifier in the second classification model to obtain at least one second audio classification result.
  • In a possible implementation, the second classification model includes a general classifier and at least one specific classifier, where the number of classification categories of the general classifier is the same as the number of preset classification categories of the video, and the specific classifier is used to classify based on a specific category;
  • the specific category belongs to the preset classification categories of the video, and different specific classifiers correspond to different specific categories.
  • For example, when the preset classification categories of the video are the above 5 categories, the classification categories of the general classifier are also the above 5 categories, while a specific classifier classifies based on one specific category out of the 5 categories.
  • a specific classifier is used to classify based on the category of "funny", that is, to classify videos as funny and non-funny.
  • the specific category is a category that differs significantly in audio modality.
  • the server pre-trains a specific classifier for classifying funny and non-funny, so that while using a general classifier to classify, the specific classifier is used to classify whether the video is funny.
  • the server may set a plurality of specific classifiers in the second classification model, thereby further improving the accuracy of the audio classification results.
  • In this embodiment, setting one specific classifier is taken as an example for schematic description, which does not constitute a limitation.
  • In this way, the dimensions used during audio classification are enriched, thereby improving the accuracy and comprehensiveness of the audio classification results.
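  • A rough PyTorch sketch of this two-head arrangement over VGGish features (the pooling choice, dimensions, and category names are assumptions; the application does not prescribe this exact structure):

```python
import torch
import torch.nn as nn

class AudioHeads(nn.Module):
    """General multi-class head plus one specific binary head over pooled VGGish features."""
    def __init__(self, vggish_dim=128, num_classes=5):
        super().__init__()
        self.general = nn.Linear(vggish_dim, num_classes)   # e.g. selfie/game/sports/beauty/funny
        self.specific = nn.Linear(vggish_dim, 2)            # e.g. funny vs. non-funny

    def forward(self, vggish_seq):                          # (batch, 60, 128) frame-level features
        pooled = vggish_seq.mean(dim=1)                     # simple temporal average pooling
        general_probs = torch.softmax(self.general(pooled), dim=-1)
        specific_probs = torch.softmax(self.specific(pooled), dim=-1)
        return general_probs, specific_probs
```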
  • When performing text feature extraction and classification on the text description information of the target video, the server first obtains the text description information 431 of the target video and pre-processes the text description information 431 through the preprocessing module 432. Further, text feature extraction is performed on the pre-processed text description information 431 through Bi-LSTM 433 combined with an attention mechanism 434 (self-attention), the text features are then classified by the text classifier, and the text classification result 435 is finally obtained. As shown in FIG. 9, the above step 304 may include the following steps.
  • Step 304A Acquire text description information corresponding to the target video, where the text description information includes at least one of video title, video content description information, video background music information, and video publisher information.
  • Optionally, the video data of the target video is stored in association with its text description information; when the server acquires the target video, it also obtains the text description information associated with the target video from the database, where the text description information includes at least one of the video title, video content description information, video background music information, and video publisher information.
  • For example, the text description information obtained by the server includes the video title "Challenge to eat 100 buns", the video content description information "The big stomach king anchor surpasses himself today and challenges to eat 100 buns", the video background music "Song A", and the video publisher information "Awei".
  • Step 304B Pre-process the text description information.
  • the pre-processing methods include at least one of noise removal, word segmentation, entity word retrieval, and stop word removal.
  • the server needs to pre-process the text description information, where the pre-processing of the text description information may include the following methods:
  • Denoising: remove noise information that interferes with the classification of the text description information. For example, for the video background music information in the text description information, if the background music is not included in the background music library, a "user upload" field is usually appended to the video background music information; such fields would interfere with subsequent classification and therefore need to be removed.
  • Word segmentation: the long sentences in the text description information are divided into fine-grained words, and the frequency of occurrence of the words is counted after segmentation for statistical modeling.
  • Stop word removal: remove meaningless words (such as "ah" and "wow"), pronouns (such as "you", "me", and "he"), auxiliary words, punctuation marks (such as "," and "."), and numbers.
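  • An illustrative preprocessing sketch (the jieba segmenter, the stop-word list, and the noise pattern are stand-ins chosen for the example, not components named by the application):

```python
import re
import jieba  # a common Chinese word-segmentation library

STOP_WORDS = {"啊", "哇", "你", "我", "他", "的", "了"}   # illustrative stop-word list
NOISE_PATTERNS = [r"用户上传"]                            # e.g. a "user upload" marker field

def preprocess(text):
    for pattern in NOISE_PATTERNS:                        # denoising
        text = re.sub(pattern, "", text)
    words = jieba.lcut(text)                              # word segmentation
    return [w for w in words                              # stop words, punctuation, numbers
            if w.strip() and w not in STOP_WORDS
            and not w.isdigit() and w not in "，。！？,.!?"]
```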
  • Step 304C classify the pre-processed text description information through the Bi-LSTM and text classifier in the third classification model to obtain a text classification result.
  • Optionally, the server converts the pre-processed text description information (consisting of words) into word vectors, and then inputs the word vectors into the third classification model.
  • a mature word vector model such as word2vec may be used, which is not limited in this embodiment.
  • the server inputs the pre-processed video title 1001, video background music information 1002, and video content description information 1003 into Bi-LSTM 1004.
  • the text classification result 1006 is output through the text classifier 1005.
  • this step may include the following steps.
  • After obtaining the output result of the Bi-LSTM, the server does not directly classify it through the text classifier; instead, it first corrects the weights in the output result through the attention mechanism, and after the weight correction is completed, inputs the corrected output result into the text classifier.
  • The weights are corrected by increasing the weights of the outputs that should be attended to and decreasing the weights of the outputs that should not.
  • the attention mechanism 1007 performs weight correction on the output result.
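  • A compact PyTorch sketch of a Bi-LSTM followed by a self-attention weighting and a softmax text classifier, in the spirit of the structure described above (embedding, hidden, and class dimensions are assumed):

```python
import torch
import torch.nn as nn

class BiLstmAttnClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128, num_classes=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)       # scores each time step
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):                      # (batch, seq_len) word indices
        h, _ = self.bilstm(self.embed(token_ids))      # (batch, seq_len, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)   # attention weight per time step
        context = (weights * h).sum(dim=1)             # weighted sum over time
        return torch.softmax(self.fc(context), dim=-1) # text classification result
```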
  • In a possible application scenario, users can use a short video application to shoot and upload short videos, the server classifies the short videos, and videos are then recommended to users according to the short video categories.
  • In the related art, the server cannot distinguish between short videos with similar pictures but large audio differences, resulting in poor video classification results.
  • Using the method provided in the embodiments of the present application, the server extracts image frames from the short video and classifies the short video based on the image features of the image frames;
  • at the same time, the server extracts the audio of the short video and classifies the short video based on the audio features of that audio; in addition, the server obtains the text description information added by the user when uploading the short video (such as the short video title, background music name, and short video content description), and classifies the short video based on the text features of the text description information.
  • the server merges the video classification results obtained in the image dimension, audio dimension, and text dimension, and finally determines the target category of the short video.
  • the server recommends the short video matching the video category and the user portrait to the user according to the user portrait of the current user, thereby improving the fit between the recommended video and the user's preference.
  • FIG. 11 shows a block diagram of a video classification apparatus provided by an embodiment of the present application.
  • the device may be the server 140 in the implementation environment shown in FIG. 1 or may be installed on the server 140.
  • the device may include:
  • Video acquisition module 1110 used to acquire the target video
  • the first classification module 1120 is used to classify the image frames in the target video through a first classification model to obtain an image classification result, and the first classification model is used to classify based on image features of the image frames;
  • the second classification module 1130 is configured to classify the audio in the target video through a second classification model to obtain an audio classification result, and the second classification model is used to classify based on the audio characteristics of the audio;
  • the third classification module 1140 is used to classify the text description information corresponding to the target video through a third classification model to obtain a text classification result, and the third classification model is used to classify based on the text features of the text description information ;
  • the target classification module 1150 is configured to determine the target classification result of the target video according to the image classification result, the audio classification result, and the text classification result.
  • the first classification module 1120 includes:
  • a determining unit configured to determine the original image frame extracted from the target video as a red, green, and blue RGB image frame
  • a first image classification unit, used to classify the RGB image frames through the residual network and the RGB classifier in the first classification model to obtain a first image classification result, where the RGB classifier is used to classify based on static image features;
  • a generating unit configured to generate the RGB difference image frame according to two adjacent original image frames in the target video
  • a second image classification unit, configured to classify the RGB difference image frames through the residual network and the RGB difference classifier in the first classification model to obtain a second image classification result, where the RGB difference classifier is used to classify based on dynamic image features.
  • the first classification module 1120 further includes:
  • a third image classification unit, used to classify the RGB image frame through the target detection network and the fine-grained classifier in the first classification model to obtain a third image classification result, where the target detection network is used to extract fine-grained image features of the target object in the RGB image frame, and the fine-grained classifier is used to classify based on those fine-grained image features.
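  • As a rough illustration of the generating unit's frame-difference step, the following NumPy sketch samples original frames at a fixed interval and subtracts adjacent sampled frames; the sampling interval, array shapes, and random frames are illustrative assumptions rather than values from the application.

```python
import numpy as np

def sample_frames(frames: np.ndarray, interval: int) -> np.ndarray:
    """Sample original frames at a fixed interval (e.g. one frame per second)."""
    return frames[::interval]

def rgb_difference_frames(sampled: np.ndarray) -> np.ndarray:
    """RGB difference frame at time t: pixel-wise rgb_{t+1} - rgb_t,
    capturing the dynamic change between adjacent sampled frames."""
    return sampled[1:].astype(np.int16) - sampled[:-1].astype(np.int16)

# Toy example: 10 frames of a 4x4 RGB clip, sampled every 2 frames.
frames = np.random.randint(0, 256, size=(10, 4, 4, 3), dtype=np.uint8)
rgb_frames = sample_frames(frames, interval=2)     # static (RGB) inputs
rgb_diff = rgb_difference_frames(rgb_frames)       # dynamic (RGB difference) inputs
print(rgb_frames.shape, rgb_diff.shape)            # (5, 4, 4, 3) (4, 4, 4, 3)
```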
  • the second classification module 1130 includes:
  • a first extraction unit, used to extract the Mel-frequency cepstral coefficients (MFCC) of the audio;
  • a second extraction unit configured to perform feature extraction on the MFCC through the VGGish network in the second classification model to obtain VGGish features
  • a first audio classification unit configured to classify the VGGish feature by a general classifier in the second classification model to obtain a first audio classification result
  • a second audio classification unit configured to classify the VGGish feature by at least one specific classifier in the second classification model to obtain at least one second audio classification result
  • wherein the number of classification categories of the general classifier is the same as the number of preset classification categories of the video, the specific classifier is used to classify based on a specific category, the specific category belongs to the preset classification categories of the video, and different specific classifiers correspond to different specific categories.
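  • The split between the general classifier and a specific classifier can be pictured as two heads over the same VGGish feature. The PyTorch snippet below is an assumption-laden sketch: the 128-dimensional feature size is typical of VGGish embeddings, while the number of preset categories and the "funny / not funny" specific category are hypothetical examples.

```python
import torch
import torch.nn as nn

class AudioHeads(nn.Module):
    """General classifier over all preset categories plus one specific
    classifier that only decides a single category (e.g. funny / not funny)."""
    def __init__(self, vggish_dim: int = 128, num_categories: int = 5):
        super().__init__()
        self.general_head = nn.Linear(vggish_dim, num_categories)  # all preset categories
        self.specific_head = nn.Linear(vggish_dim, 2)               # specific category vs. not

    def forward(self, vggish_feature: torch.Tensor):
        first_audio_result = torch.softmax(self.general_head(vggish_feature), dim=-1)
        second_audio_result = torch.softmax(self.specific_head(vggish_feature), dim=-1)
        return first_audio_result, second_audio_result

heads = AudioHeads()
feature = torch.randn(1, 128)          # stand-in for the VGGish feature of one audio clip
general_probs, specific_probs = heads(feature)
print(general_probs.shape, specific_probs.shape)   # torch.Size([1, 5]) torch.Size([1, 2])
```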
  • the third classification module 1140 includes:
  • An information obtaining unit configured to obtain the text description information corresponding to the target video, where the text description information includes at least one of a video title, video content description information, and video background music information;
  • a pre-processing unit which is used to pre-process the text description information, and the pre-processing methods include at least one of noise removal, word segmentation, entity word retrieval, and stop word removal;
  • the text classification unit is used to classify the pre-processed text description information through the bidirectional long-short-term memory network Bi-LSTM and the text classifier in the third classification model to obtain the text classification result.
  • the text classification unit is used to: input the pre-processed text description information into the Bi-LSTM; perform weight correction on the output of the Bi-LSTM through an attention mechanism; and classify the corrected output of the Bi-LSTM through the text classifier to obtain the text classification result.
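  • A minimal PyTorch sketch of such a Bi-LSTM-plus-attention text branch follows. The vocabulary size, embedding size, and hidden size are illustrative assumptions, and the attention here is a generic learned weighting over time steps rather than the application's exact mechanism.

```python
import torch
import torch.nn as nn

class TextBranch(nn.Module):
    """Bi-LSTM over word vectors, attention-weighted pooling, then a softmax classifier."""
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=64, num_categories=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)            # scores each time step
        self.classifier = nn.Linear(2 * hidden_dim, num_categories)

    def forward(self, token_ids: torch.Tensor):
        outputs, _ = self.bilstm(self.embed(token_ids))      # (batch, seq, 2*hidden)
        weights = torch.softmax(self.attn(outputs), dim=1)   # attention re-weights the outputs
        pooled = (weights * outputs).sum(dim=1)              # weighted sum over time steps
        return torch.softmax(self.classifier(pooled), dim=-1)

branch = TextBranch()
tokens = torch.randint(0, 10000, (1, 12))   # a pre-processed, tokenized title/description
print(branch(tokens).shape)                 # torch.Size([1, 5])
```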
  • the target classification module 1150 includes:
  • a stitching unit, used to concatenate (stitch) the probabilities corresponding to each classification category in the image classification result, the audio classification result, and the text classification result to generate a classification feature vector;
  • a target classification unit, used to input the classification feature vector into a target classifier to obtain the target classification result, where the target classifier is constructed based on a softmax classification model.
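  • As a rough sketch of this late-fusion step: the per-dimension probabilities are concatenated into one classification feature vector and passed to a softmax-based target classifier. The category names and probability values below mirror the illustrative example style used elsewhere in the description, and the weight matrix is a hypothetical stand-in for a trained target classifier.

```python
import numpy as np

def fuse(image_result, audio_result, text_result):
    """Concatenate the per-category probabilities from the three dimensions into
    one classification feature vector for the softmax-based target classifier."""
    return np.concatenate([image_result, audio_result, text_result])

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

# Illustrative per-dimension results over [selfie, game, sports, beauty, funny].
image_result = np.array([0.95, 0.01, 0.01, 0.02, 0.01])
audio_result = np.array([0.05, 0.01, 0.01, 0.03, 0.90])
text_result  = np.array([0.10, 0.01, 0.01, 0.03, 0.85])

feature = fuse(image_result, audio_result, text_result)   # 15-dimensional vector
# Stand-in for a trained target classifier: a fixed weight matrix mapping the
# 15-dim feature to scores over the 5 preset categories (weights are hypothetical).
W = np.random.randn(5, feature.size) * 0.1
target_result = softmax(W @ feature)
print(feature.shape, target_result.shape)   # (15,) (5,)
```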
  • In summary, after the target video to be classified is obtained, the image frames are classified by the first classification model to obtain the image classification result,
  • the audio is classified by the second classification model to obtain the audio classification result, and
  • the text description information is classified by the third classification model to obtain the text classification result, so that the target classification result of the target video is determined according to the image classification result, the audio classification result, and the text classification result.
  • Compared with the related art, in which videos are classified based on the image features of the video alone, the embodiments of this application classify by combining the image features, audio features, and text features of the video, fully considering features of different dimensions and thereby improving the accuracy of video classification.
  • FIG. 12 shows a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • the server is used to implement the video classification method provided by the above embodiment. Specifically:
  • the computer device 1200 includes a central processing unit (CPU) 1201, a system memory 1204 including a random access memory (RAM) 1202 and a read-only memory (ROM) 1203, and a system bus 1205 connecting the system memory 1204 and the central processing unit 1201.
  • the computer device 1200 also includes a basic input/output system (I/O system) 1206 that helps transfer information between the devices within the computer, and a mass storage device 1207 for storing an operating system 1213, application programs 1214, and other program modules 1215.
  • the basic input/output system 1206 includes a display 1208 for displaying information and an input device 1209 for a user to input information, such as a mouse and a keyboard.
  • the display 1208 and the input device 1209 are both connected to the central processing unit 1201 through an input and output controller 1210 connected to the system bus 1205.
  • the basic input/output system 1206 may further include an input-output controller 1210 for receiving and processing input from a number of other devices such as a keyboard, mouse, or electronic stylus.
  • the input output controller 1210 also provides output to a display screen, printer, or other type of output device.
  • the mass storage device 1207 is connected to the central processing unit 1201 through a mass storage controller (not shown) connected to the system bus 1205.
  • the mass storage device 1207 and its associated computer-readable medium provide non-volatile storage for the computer device 1200. That is, the mass storage device 1207 may include a computer-readable medium (not shown) such as a hard disk or CD-ROM drive.
  • the computer-readable media may include computer storage media and communication media.
  • Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
  • Computer storage media include RAM, ROM, EPROM, EEPROM, flash memory, or other solid-state storage technologies, CD-ROM, DVD, or other optical storage, tape cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices.
  • the above-mentioned system memory 1204 and mass storage device 1207 may be collectively referred to as a memory.
  • According to various embodiments of the present application, the computer device 1200 may also run by connecting, through a network such as the Internet, to a remote computer on the network. That is, the computer device 1200 may be connected to the network 1212 through the network interface unit 1211 connected to the system bus 1205, or the network interface unit 1211 may be used to connect to other types of networks or remote computer systems.
  • Embodiments of the present application also provide a computer-readable storage medium, in which computer-readable instructions are stored, and the computer-readable instructions are executed by the processor to implement the video classification methods provided by the foregoing embodiments.
  • the present application also provides a computer program product containing instructions, which when run on a computer, causes the computer to execute the video classification methods described in the above embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Library & Information Science (AREA)
  • Medical Informatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Signal Processing (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

A video classification method includes: acquiring a target video; classifying image frames in the target video through a first classification model to obtain an image classification result, the first classification model being used to classify based on image features of the image frames; classifying audio in the target video through a second classification model to obtain an audio classification result, the second classification model being used to classify based on audio features of the audio; classifying text description information corresponding to the target video through a third classification model to obtain a text classification result, the third classification model being used to classify based on text features of the text description information; and determining a target classification result of the target video according to the image classification result, the audio classification result, and the text classification result.

Description

视频分类方法、装置、计算机设备和存储介质
本申请要求于2018年12月14日提交中国专利局,申请号为2018115358370、发明名称为“视频分类方法、装置及服务器”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请实施例涉及视频分类领域,特别涉及一种视频分类方法、装置、计算机设备和存储介质。
背景技术
推荐功能是视频类应用程序中常见的功能,用于向用户推荐其感兴趣的视频,而推荐功能的优劣与视频分类的准确性密切相关。
相关技术中,采用基于图像识别的方式对视频进行分类。在一种基于图像识别的视频分类方法中,通过抽取视频中的图像帧,并提取图像帧的图像特征,从而按照图像帧的时序,将图像特征输入长短期记忆(Long Short-Term Memory,LSTM)网络,进而根据LSTM网络的输出确定视频分类。
然而,基于图像特征进行视频分类时,由于特征维度单一,导致视频分类效果不佳。
发明内容
根据本申请提供的各种实施例,提供一种视频分类方法、装置、计算机设备和存储介质。
一种视频分类方法,由计算机设备执行,所述方法包括:
获取目标视频;
通过第一分类模型对所述目标视频中的图像帧进行分类,得到图像分类结果,所述第一分类模型用于基于所述图像帧的图像特征进行分类;
通过第二分类模型对所述目标视频中的音频进行分类,得到音频分类结果,所述第二分类模型用于基于所述音频的音频特征进行分类;
通过第三分类模型对所述目标视频对应的文本描述信息进行分类,得到文本分类结果,所述第三分类模型用于基于所述文本描述信息的文本特征进行分类;及
根据所述图像分类结果、所述音频分类结果和所述文本分类结果,确定所述目标视频的目标分类结果。
一种视频分类装置,所述装置包括:
视频获取模块,用于获取目标视频;
第一分类模块,用于通过第一分类模型对所述目标视频中的图像帧进行分类,得到图像分类结果,所述第一分类模型用于基于所述图像帧的图像特征进行分类;
第二分类模块,用于通过第二分类模型对所述目标视频中的音频进行分类,得到音频分类结果,所述第二分类模型用于基于所述音频的音频特征进行分类;
第三分类模块,用于通过第三分类模型对所述目标视频对应的文本描述信息进行分类,得到文本分类结果,所述第三分类模型用于基于所述文本描述信息的文本特征进行分类;及
目标分类模块,用于根据所述图像分类结果、所述音频分类结果和所述文本分类结果,确定所述目标视频的目标分类结果。
一种计算机设备,包括处理器和存储器,所述存储器中存储有计算机可读指令,所述计算机可读指令被所述处理器执行时,使得所述处理器执行所述视频分类方法的步骤。
一种非易失性的计算机可读存储介质,存储有计算机可读指令,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器执行所述视频分类方法的步骤。
本申请的一个或多个实施例的细节在下面的附图和描述中提出。本申请的其它特征、目的和优点将从说明书、附图以及权利要求书变得明显。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1示出了本申请一个实施例提供的实施环境的示意图;
图2示出了一个实施例中计算机设备进行视频推荐的流程示意图;
图3示出了本申请一个实施例提供的视频分类方法的流程图;
图4示出了一个实施例中视频分类过程的原理示意图;
图5示出了一个实施例中基于图像帧进行视频分类过程的流程图;
图6示出了一个实施例中初始残差网络及其Stem层的结构示意图;
图7是一个示意性实施例提供的目标检测网络的结构示意图;
图8示出了一个实施例中基于音频进行视频分类过程的流程图;
图9示出了一个实施例中基于文本描述信息进行视频分类过程的流程图;
图10示出了一个实施例中通过Bi-LSTM和注意力机制进行视频分类过程的实施示意图;
图11示出了本申请一个实施例提供的视频分类装置的框图;
图12示出了本申请一个实施例提供的计算机设备具体实现为服务器时的框图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。
为了方便理解,下面对本申请实施例中涉及的名词进行说明。
卷积层：由卷积核的权值和偏置项构成。在一个卷积层中，上一层的输出（又称特征映射图）被一个卷积核进行卷积，并通过一个激活函数得到输出的特征映射图。其中，特征图可以表示为：
$$FM_j^{l+1} = f\Bigl(\sum_{i \in FM_l} FM_i^{l} * w_{ij}^{l} + b_j^{l+1}\Bigr)$$
其中，$w_{ij}^{l}$ 表示连接第 $l$ 层的 $i$ 单元和第 $l+1$ 层的 $j$ 单元的权值参数，$b_j^{l+1}$ 是连接第 $l$ 层偏置单元和第 $l+1$ 层的 $j$ 单元对应的参数，$FM_l$ 是第 $l$ 层的特征映射图集合，$FM_i^{l}$ 表示第 $l$ 层的第 $i$ 个特征映射图。
池化(pooling)层:用于进行降采样操作,常用的池化方式包括最大池化、求和池化和平均池化。
梅尔频率倒谱系数(Mel-Frequency Cepstral Coefficients,MFCC):用于表示语音信号的能量在不同频率范围的分布情况。计算MFCC时通常需要对音频进行预加重、分帧、加窗、快速傅里叶变换(Fast Fourier Transformation,FFT)、梅尔滤波器组以及离散余弦变换(Discrete Cosine Transform,DCT)处理。
长短期记忆网络(Long-Short Term Memory,LSTM):一种时间递归神经网络,适合于处理和预测时间序列中间隔和延迟非常长的重要事件。双向长短期记忆网络(Bi Long-Short Term Memory,Bi-LSTM)则是在LSTM的基础上实现双向记忆的网络(LSTM仅正向记忆,而Bi-LSTM可以实现正向和反向记忆)。
请参考图1,其示出了本申请一个实施例提供的实施环境的示意图。该实施环境中包括终端120和服务器140。
终端120是具有视频播放功能的电子设备,该电子设备可以是智能手机、平板电脑或个人计算机等等。图1中以终端120是智能手机为例进行说明。
本申请实施例中,终端120的视频播放功能可以由第三方应用程序实现,该第三方应用程序可以是视频播放应用程序、页面浏览应用程序、新闻阅读类应用程序、短视频应用程序等等,本申请实施例对此不做限定。
除了具备视频播放功能外,在一个实施例中,终端120还具有视频上传功能,借助视频上传功能,终端120可以将录制的视频,或者,将本地存储的视频上传至服务器140。并且,服务器140可以将接收到的视频分享推送给其他终端,供其他终端进行播放。
终端120与服务器140之间通过有线或无线网络相连。
服务器140是一台服务器、若干台服务器构成的服务器集群或云计算中心。本申请实施例中,服务器140可以是终端120中第三方应用程序的后台服务器,用于向终端120推荐其使用者感兴趣的视频。
本申请实施例中的服务器140具有视频分类功能,通过视频分类功能,服务器140按照预定的分类类别,将视频(可以是服务器从网路中抓取的视频,也可以是终端120上传的视频)划分至其中至少一个类别中,后续即基于各个视频对应的类别进行视频推荐。
在一个实施例中,服务器140还具有生成用户画像的功能。其中,该用户画像根据用户的历史视频观看记录生成,用于描述用户的视频观看喜好。后续进行视频推荐时,服务器140根据视频对应的类别以及用户画像进行视频推荐。
在一个实施例中,上述的无线网络或有线网络使用标准通信技术和/或协议。网络通常为因特网、但也可以是任何网络,包括但不限于局域网(Local Area Network,LAN)、城域网(Metropolitan Area Network,MAN)、广域网(Wide Area Network,WAN)、移动、有线或者无线网络、专用网络或者虚拟专用网络的任何组合)。在一些实施例中,使用包括超文本标记语言(Hyper Text Mark-up Language,HTML)、可扩展标记语言(Extensible Markup Language,XML)等的技术和/或格式来代表通过网络交换的数据。此外还可以使用诸如安全套接字层(Secure Socket Layer,SSL)、传输层安全(Transport Layer Security,TLS)、虚拟专用网络(Virtual Private Network,VPN)、网际协议安全(Internet Protocol Security,IPsec)等常规加密技术来加密所有或者一些链路。在另一些实施例中,还可以使用定制和/或专用数据通信技术取代或者补充上述数据通信技术。
本申请各个实施例提供的视频分类方法即由图1中的服务器140执行。
本申请实施例提供的视频分类方法可用于视频推荐场景或用户画像构建场景等需要应用到视频类别的场景,下面对不同应用场景下的视频分类方法进行说明。
视频推荐场景
视频推荐场景下,如图2所示,对于待分类的原始视频20(服务器本地 存储或由终端上传),服务器首先从图像维度、音频维度和文本维度,采用图像分类模型211对原始视频20的图像帧进行图像特征提取和分类,得到图像分类结果212;采用音频分类模型221对原始视频20的音频进行音频特征提取和分类,得到音频分类结果222;采用文本分类模型231对原始视频20的文本描述信息进行文本特征提取和分类,得到文本分类结果232。进一步的,服务器对图像分类结果212、音频分类结果222以及文本分类结果232进行融合,得到原始视频20的目标分类结果24,进而根据该目标分类结果24所指示各个类别对应的概率,确定原始视频20的目标类别25,并将原始视频20与目标类别25进行关联存储。
在进行视频推荐时,服务器的推荐系统26获取当前用户的用户画像27(可以根据用户历史观看记录生成),从而根据将与用户画像27所指示用户感兴趣视频类别相匹配的视频推荐给用户。
用户画像构建场景
在视频推荐领域,用户画像用于描述用户的视频观看喜好,其准确度与视频分类的准确度密切相关。为了提高用户画像的准确性,以提高后续视频推荐的准确性,服务器首先从图像维度、音频维度和文本维度,对原始视频进行多维度分类,然后根据不同维度下原始视频的分类结果,综合确定出原始视频的目标类别。
构建用户画像时,服务器获取用户对推荐视频的操作行为(比如观看、忽略等等),从而根据操作行为确定用户对推荐视频对应视频类别的喜好程度,进而在视频类别对应喜好程度的基础上,构建出用户画像,供后续进行视频推荐时使用。
当然,除了上述应用场景外,本申请实施例提供的视频分类方法还可以用于视频整理场景(基于视频类别对同类视频进行整合)、视频搜索场景(基于搜索关键字对相应视频类别的视频进行反馈)等其他应用到视频类别的场景,本申请实施例并不对具体应用场景进行限定。
相关技术中,服务器仅基于视频的图像特征进行视频分类时,对画面相似但音频差异较大的视频的分类效果不佳。比如,在短视频应用程序中,对于自拍视频和配有搞笑配音的自拍视频,基于视频图像特征进行分类,由于 两者的图像特征相似,因此两者都会被分类为“自拍”。但是实际情况下,配有搞笑配音的自拍视频应该被分类为“搞笑”。
而本申请实施例中,服务器在图像特征的基础上,加入音频特征和文本特征进行视频分类,能够弥补单纯基于图像特征进行视频分类的局限性,从而提高了视频分类的准确性,在对图像相似但音频或文本差异较大的视频进行分类时尤为明显。下面采用示意性的实施例进行说明
请参考图3,其示出了本申请一个实施例提供的视频分类方法的流程图。本实施例以该方法应用于图1中的服务器140来举例说明,该方法可以包括以下几个步骤:
步骤301,获取目标视频。
其中,该目标视频为服务器从网络中拉取的视频,或者,由终端上传的视频,本申请实施例并不对目标视频的来源进行限定。
在一个示意性的应用场景下,服务器为短视频应用程序的后台服务器,该目标视频即为用户使用短视频应用程序录制的视频。
对于获取到的目标视频,服务器通过下述步骤302至304对其进行图像特征提取分类、音频特征提取分类以及文本特征提取分类。其中,步骤302至304之间并不存在严格的先后顺序,本申请实施例以步骤302至304同时执行为例进行说明。
步骤302,通过第一分类模型对目标视频中的图像帧进行分类,得到图像分类结果,第一分类模型用于基于图像帧的图像特征进行分类。
在一种可能的实施方式中,第一分类模型中包括用于提取图像特征的深度学习网络以及基于图像特征进行分类的分类器。相应的,服务器从目标视频中提取图像帧后,将图像帧输入第一分类模型中,由第一分类模型中的深度学习网络提取图像帧的图像特征,并进一步通过分类器对图像特征进行分类,从而得到图像分类结果。
其中,图像分类结果中包含各种预设分类类别及其对应的概率,该预设分类类别为预先划分出的视频的分类类别。比如,预设分类类别包括如下至少一种:自拍、搞笑、动画、游戏、配音、篮球、足球、综艺、电影。
在一个实施例中,第一分类模型基于标注有样本类别的样本图像帧训练 得到。
步骤303,通过第二分类模型对目标视频中的音频进行分类,得到音频分类结果,第二分类模型用于基于音频的音频特征进行分类。
在一种可能的实施方式中,第二分类模型中包括用于提取音频特征的神经网络(比如LSTM)以及基于音频特征进行分类的分类器。相应的,服务器提取出目标视频的音频后,将音频输入第二分类模型,由第二分类模型中的神经网络提取音频的音频特征,并进一步通过分类器对音频特征进行分类,从而得到音频分类结果。
其中,音频分类结果中包含各种预设分类类别及其对应的概率,且预设分类类别为预先划分出的视频的分类类别。
在一个实施例中,第二分类模型基于标注有样本类别的样本音频训练得到。
步骤304,通过第三分类模型对目标视频对应的文本描述信息进行分类,得到文本分类结果,第三分类模型用于基于文本描述信息的文本特征进行分类。
在一种可能的实施方式中,第三分类模型中包括用于提取文本特征的神经网络(比如LSTM)以及基于文本特征进行分类的分类器。相应的,服务器提取出目标视频的文本描述信息后,将文本描述信息输入第三分类模型,由第三分类模型中的神经网络提取文本描述信息的文本特征,并进一步通过分类器对文本特征进行分类,从而得到文本分类结果。
其中,文本分类结果中包含各种预设分类类别及其对应的概率,且预设分类类别为预先划分出的视频的分类类别。
在一个实施例中,该文本描述信息包括目标视频的视频标题、视频内容描述信息、视频背景音乐信息和视频发布者信息中的至少一种
在一个实施例中,第二分类模型基于标注有样本类别的样本文本训练得到。
步骤305,根据图像分类结果、音频分类结果和文本分类结果,确定目标视频的目标分类结果。
服务器得到基于不同维度的分类结果后,进一步对图像分类结果、音频分类结果和文本分类结果进行融合,最终确定出目标视频的目标分类结果。
在一种可能的实施方式中,服务器将融合后的分类结果输入预先训练的分类器,从而得到分类器输出的目标分类结果。
在一个实施例中,由于视频的预设分类类别通常包含多种,且不同的分类类别之间互斥,因此,各个分类模型以及对融合后分类结果进行分类时采用的分类器可以为softmax分类模型。在一个实施例中,softmax分类模型的假设函数如下:
$$h_\theta\bigl(x^{(i)}\bigr)_j = p\bigl(y^{(i)}=j \mid x^{(i)};\theta\bigr) = \frac{\exp\bigl(\theta_j^{\mathrm{T}} x^{(i)}\bigr)}{\sum_{l=1}^{k} \exp\bigl(\theta_l^{\mathrm{T}} x^{(i)}\bigr)}$$
其中，exp()是以自然常数e为底的指数，θ为模型训练参数，T表示转置。
相应的，对softmax分类模型进行优化时，采用的代价函数如下：
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k} 1\{y^{(i)}=j\}\,\log\frac{\exp\bigl(\theta_j^{\mathrm{T}} x^{(i)}\bigr)}{\sum_{l=1}^{k} \exp\bigl(\theta_l^{\mathrm{T}} x^{(i)}\bigr)}$$
其中，$x^{(i)}$ 为输入参数，$y^{(i)}$ 为输出参数，m为训练集中训练样本的数量。
对softmax分类模型进行优化的过程,即为最小化代价函数的过程,本申请实施例在此不再赘述。
在一个实施例中,目标分类结果中包含至少两个预设分类类别对应的概率,服务器将概率最高的n个预设分类类别确定为目标视频的目标类别,n为大于等于1的整数。
在一个示意性的例子中,对于自拍视频A和配有搞笑配音的自拍视频B,基于视频图像特征进行分类时,由于两者的图像特征相似,因此图像分类结果中“自拍”对应概率较高;而基于视频音频特征进行分类时,由于自拍视频A与自拍视频B的音频差异较大,且自拍视频B的音频具备搞笑音频的特征,因此,因此自拍视频A对应音频分类结果中“自拍”对应概率较高,而自拍视频B对应音频分类结果中“搞笑”对应概率较高;基于视频文本特征进行分类时,由于自拍视频A与自拍视频B的文本描述信息差异较大,且自拍视频B的文本描述信息具备搞笑文本描述的特征,因此,因此自拍视频A对应文本分类结果中“自拍”对应概率较高,而自拍视频B对应文本分类结果中“搞笑”对应概率较高。最终,服务器融合自拍视频A的分类结果,确定自拍视频A的目标类别为“自拍”,融合自拍视频B的分类结果,确定自 拍视频B的目标类别为“搞笑”。
由此可见,在图像特征的基础上,融合音频特征和文本特征对视频进行分类,能够利用视频不同模态(图像模态、音频模态和文本模态)之间的互补性提高视频分类的准确率。
综上所述,本申请实施例中,获取到待分类的目标视频后,分别通过第一分类模型对图像帧进行分类得到图像分类结果,通过第二分类模型对音频进行分类得到音频分类结果,通过第三分类模型对文本描述信息进行分类得到文本分类结果,从而根据图像分类结果、音频分类结果和文本分类结果,确定出目标视频的目标分类结果;相较于相关技术中仅基于视频的图像特征进行视频分类,本申请实施例中综合视频的图像特征、音频特征以及文本特征进行分类,充分考虑视频不同维度的特征,进而提高了视频分类的准确性。
在一种可能的实施方式中,根据不同维度的分类结果确定目标视频的目标分类结果时可以包括如下步骤。
一、对图像分类结果、音频分类结果和文本分类结果中各个分类类别对应的概率进行拼接,生成分类特征向量。
在一个实施例中,服务器根据预定顺序,对图像分类结果、音频分类结果和文本分类结果中各个分类类别对应的概率进行拼接,从而得到分类特征向量,其中,该预定顺序为图像分类结果、音频分类结果和文本分类结果的先后顺序。
在一个示意性的例子中,视频的预设分类类别为自拍、游戏、体育、美妆、搞笑,且图像分类结果为[自拍(0.95),游戏(0.01),体育(0.01),美妆(0.02),搞笑(0.01)],音频分类结果为[自拍(0.05),游戏(0.01),体育(0.01),美妆(0.03),搞笑(0.90)],文本分类结果为[自拍(0.1),游戏(0.01),体育(0.01),美妆(0.03),搞笑(0.85)],服务器生成的分类特征向量即为(0.95,0.01,0.01,0.02,0.01,0.05,0.01,0.01,0.03,0.90,0.1,0.01,0.01,0.03,0.85)。
二、将分类特征向量输入目标分类器,得到目标分类结果,目标分类器基于softmax分类模型构建。
在一个实施例中,服务器预先构建原始softmax分类模型,并根据标注 有视频类别的样本分类特征向量对原始softmax分类模型进行训练,得到目标分类器。在视频分类时,服务器即将生成的分类特征向量输入目标分类器,从而获取目标分类器输出的目标分类结果。
在一种可能的实施方式中,如图4所示,对图像帧进行图像特征提取及分类前,服务器首先从目标视频中提取RGB图像帧411以及RGB差异图像帧412,并分别将RGB图像帧411和RGB差异图像帧412输入残差网络(ResNet)413进行特征提取,从而将从RGB图像帧411中提取到的图像特征输入RGB分类器414A中得到第一图像分类结果414B,将RGB差异图像帧412中提取到的图像特征输入RGB差异分类器415A中得到第二图像分类结果415B。如图5所示,上述步骤302可以包括如下步骤。
步骤302A,将目标视频中提取到的原始图像帧确定为RGB图像帧。
由于后续进行图像特征提取需要耗费大量计算资源,且视频中相邻图像帧之间的差异较小,因此为了降低计算量,在一种可能的实施方式中,服务器按照预定采样间隔,从目标图像中提取原始图像帧,并将提取到的原始视频帧确定为RGB图像帧。比如,该预定采样间隔为1s,即服务器每隔1s提取一帧原始图像帧。
在其他可能的实施方式中,服务器也可以根据目标视频的视频长度,动态确定采样间隔,其中,该采样间隔与视频长度呈正相关关系,即视频越长,采样间隔越长。本申请并不对提取原始图像帧的具体方式进行限定。
步骤302B,通过第一分类模型中的残差网络和RGB分类器对RGB图像帧进行分类,得到第一图像分类结果,RGB分类器用于基于静态图像特征进行分类。
在一个实施例中,本申请实施例中,第一分类模型包含预先训练的残差网络以及RGB分类器,服务器提取到RGB图像帧后,即通过残差网络提取RGB图像的(静态)图像特征,并进一步通过RGB分类器对(静态)图像特征进行分类,得到指示静态图像所属类别的第一图像分类结果,该第一图像分类结果中的分类类别与预设分类类别相同。
其中，残差网络可以采用初始残差网络（Inception-ResNet）等深度卷积神经网络，且RGB分类器可以采用softmax分类模型，本申请实施例对此不做限定。
在一个示意性的例子中,残差网络(Inception-ResNet-v2)的结构如图6所示。该残差网络包括输入层60、Stem层61、第一残差层62、第一降维层63、第二残差层64、第二降维层65、第三残差层66、池化层67、丢弃(dropout)层68和分类层69。
输入层60用于输入的图像帧,对于RGB图像帧而言,该服务器将RGB图像帧中R、G、B三个通道的像素值组成一维数组后输入输入层60。如图6中,输入层60接收到的数据为299(RGB图像帧的宽度)×299(RGB图像帧的宽度)×3(通道数)。
Stem层61用于对数据进行预处理,其中包含多次卷积和两次池化,卷积时采用了7×1+1×7的优化卷积形式,且池化时采用“卷积+池化”的并行结构,以此防止瓶颈问题。
第一残差层62中包含5个残差块(用于进行卷积处理),第二残差层64中包含10个残差块,第三残差层66中包含5个残差块。
第一降维层63用于对第一残差层62的输出进行降维,第二降维层65用于对第二残差层64的输出进行降维,以减少计算量。
池化层67用于对第三残差层66的输出进行下采样处理,此处的池化层67采用平均池化。
丢弃(dropout)层68用于根据keep参数将部分输入数据设置为0,从而达到防治过拟合的效果。比如,当keep参数为0.8时,输入数据中20%的数据在丢弃层68被设置为0。
步骤302C,根据目标视频中相邻两帧原始图像帧生成RGB差异图像帧。
上述步骤中,由于RGB图像帧仅能够反映出视频画面的静态图像特征,因此,为了提高图像分类的效果,服务器进一步基于视频画面的动态图像特征进行分类。
其中,RGB差异图像帧由相邻两帧原始图像帧进行差运算(RGB像素值相减)生成,用于表示两帧原始图像帧之间的差异性,其可以表示为:
$rgbdiff_t = rgb_{t+1} - rgb_t$
其中，$rgbdiff_t$ 为RGB差异图像帧，$rgb_{t+1}$ 为 $t+1$ 时刻的原始图像帧，$rgb_t$ 为 $t$ 时刻的原始图像帧，且 $t$ 时刻和 $t+1$ 时刻为采样时刻。
以图像帧中的一个像素点为例，$t$ 时刻该像素点的像素值 $rgb_t$ 为(100,100,100)，而 $t+1$ 时刻该像素点的像素值 $rgb_{t+1}$ 为(150,160,170)，则计算得到的 $rgbdiff_t$ 为(50,60,70)。
由于RGB差异图像帧能够反映出两帧原始图像帧之间的差异性,因此,对RGB差异图像进行图像特征提取,能够得到目标视频的动态图像特征。
步骤302D,通过第一分类模型中的残差网络和RGB差异分类器对RGB差异图像帧进行分类,得到第二图像分类结果,RGB差异分类器用于基于动态图像特征进行分类。
在一个实施例中,本申请实施例中,第一分类模型包含预先训练的残差网络以及RGB差异分类器,服务器生成RGB差异图像帧后,即通过残差网络提取RGB差异图像的(动态)图像特征,并进一步通过RGB差异分类器对(动态)图像特征进行分类,得到指示动态图像所属类别的第二图像分类结果,该第二图像分类结果中的分类类别与预设分类类别相同。
其中,对RGB图像帧和RGB差异图像帧进行图像特征提取时可以采用同一残差网络,也可以采用不同残差网络,本申请对此不做限定。
本实施例中,服务器基于RGB图像帧和RGB差异图像帧进行分类,综合考虑到静态图像特征和动态图像特征,进而提高了后续图像维度分类的全面性。
上述实施例中,对RGB图像帧或RGB差异图像帧进行特征提取时,得到的都是图像整体的特征(即关注图像整体),相应的,后续使用分类器仅能够基于整体图像特征进行分类。为了进一步提高图像分类结果的准确性,本申请实施例中,服务器进行图像特征提取时,不仅关注图像整体,还关注图像中的特定目标,并基于图像中特征目标的检测结果进行分类。
如图4所示,服务器从目标视频中提取RGB图像帧411后,进一步将RGB图像帧411输入目标检测网络416进行细粒度特征提取,并将提取到的细粒度图像特征输入细粒度分类器417A中得到第三图像分类结果417B。如图5所示,上述步骤302A之后还可以包括如下步骤。
步骤302E,通过第一分类模型中的目标检测网络和细粒度分类器对RGB图像进行分类,得到第三图像分类结果,目标检测网络用于提取RGB图像中 目标物体的细粒度图像特征,细粒度分类器用于基于细粒度图像特征进行分类。
本申请实施例中,第一分类模型中还包括目标检测网络和细粒度分类器,其中,目标检测网络可以是区域卷积神经网络(Regions with CNN,RCNN)(包括RCNN、Fast RCNN以及Faster RCNN)、YOLO(You Only Look Once)网络、单镜多核检测(Single Shot multiBox Detector,SSD)网络,本实施例并不对目标检测网络的具体类型进行限定。
在一种可能的实施方式中，利用目标检测网络进行目标检测时，目标检测网络能够检测出RGB图像帧中目标物体的类别、目标框位置、置信度等目标信息，从而根据目标信息确定RGB图像帧的细粒度特征。在一个实施例中，细粒度特征包括：
1、目标物体是否出现在RGB图像帧中:0表示目标物体未出现在RGB图像帧中,1表示目标物体出现在RGB图像帧中。
2、目标物体面积占比:目标物体对应目标框占RGB图像帧的最大比例。
3、目标物体相对位移:目标物体对应目标框的中心点在相邻两帧RGB图像中的位移。
在一个示意性的例子中,如图7所示,当目标检测网络采用Faster-RCNN时,RGB图像帧71首先在卷积层72经过卷积处理,输出用于表示图像特征的特征图73。特征图73输入区域预测网络74(用于预测目标物体在图像中的区域)后,由区域预测网络74输出预测图75。进一步的,对融合后的预测图75和特征图73进行兴趣区域池化(Region of Interest pooling,RoI pooling)处理,并确定RGN图像帧71的细粒度特征,进而通过细粒度分类器76对细粒度特征进行分类,得到第三图像分类结果。
本实施例中,服务器在提起图像整体特征的同时,通过目标检测网络对RGB图像帧进行细粒度图像特征提取,并融合提取到的细粒度图像特征进行分类,进一步提高了图像分类结果的准确性和全面性。
在一种可能的实施方式中,如图4所示,对目标视频的音频进行音频特征提取及分类时,服务器首先提取音频的MFCC特征421,然后通过VGGish网络422对MFCC特征421进行特征提取,得到VGGish特征423。进一步 的,服务器分别使用通用分类器425A和特定分类器426B对VGGish特征423进行分类,最终得到第一音频分类结果425A和第二音频分类结果425B。如图8所示,上述步骤303可以包括如下步骤。
步骤303A,提取音频的MFCC。
在一种可能的实施方式中,服务器从目标视频中分离出音频,然后按照预定采样频率对音频进行采样,从而对采样结果进行预加重、分帧、加窗、FFT、梅尔滤波器组以及DCT处理,得到音频的MFCC。
步骤303B,通过第二分类模型中的VGGish网络对MFCC进行特征提取,得到VGGish特征。
在一个实施例中,本申请实施例中的第二分类模型中包括用于进行特征提取的VGGish网络以及用于进行分类的双层分类器。对于提取到的MFCC,服务器将MFCC输入VGGish网络,从而得到VGGish网络输出的VGGish特征。
其中,VGGish网络可以采用已有的网络结构,本申请实施例并不对VGGish网络的具体结构进行限定。
示意性的,当目标视频的音频为60s,且采样频率为128Hz时,经过VGGish网络可以提取到128×60维的特征。
步骤303C,通过第二分类模型中的通用分类器对VGGish特征进行分类,得到第一音频分类结果。
步骤303D,通过第二分类模型中的至少一个特定分类器对VGGish特征进行分类,得到至少一条第二音频分类结果。
本申请实施例中,第二分类模型中包括通用分类器和至少一个特定分类器,其中,通用分类器的分类类别数量与视频的预设分类类别数量相同,特定分类器用于基于特定类别进行分类,特定类别属于视频的预设分类类别,且不同特定分类器对应的不同特定类别。
示意性的,当视频的预设分类类别包括5个类别(分别为自拍、游戏、体育、美妆、搞笑)时,该通用分类器的分类类别也为上述5个类别,而特定分类器则基于5个类别中的某一特定类别进行分类。比如,特定分类器用于基于“搞笑”这一类别进行分类,即将视频分类为搞笑和非搞笑。
在一个实施例中,特定类别为音频模态上存在明显区别的类别。
在一种可能的实施方式中,由于相较于非搞笑视频,搞笑视频的音频中通常包含笑声(即搞笑视频与非搞笑视频在音频模态上的区别在于是否包含笑声),因此,服务器预先训练用于分类搞笑和非搞笑的特定分类器,从而在利用通用分类器进行分类的同时,利用该特定分类器对视频是否搞笑进行分类。
需要说明的是,服务器可以在第二分类模型中设置多个特定分类器,从而进一步提高音频分类结果的准确性,本实施例仅以设置一个特定分类器为例进行示意性说明,当并不对此构成限定。
本实施例中,在通用分类器的基础上,通过增加用于对特定类别进行区分的特定分类器,丰富了音频分类时的维度,进而提高了音频分类结果的准确性和全面性。
在一种可能的实施方式中,如图4所示,对目标视频的文本描述信息进行文本特征提取及分类时,服务器首先获取目标视频的文本描述信息431,然后通过预处理模块432对文本描述信息431进行预处理。进一步的,通过Bi-LSTM 433并结合注意力机制434(self-attention)对于预处理后的文本描述信息431进行文本特征提取,进而通过文本分类器对文本特征进行分类,最终得到文本分类结果435。如图9所示,上述步骤304可以包括如下步骤。
步骤304A,获取目标视频对应的文本描述信息,文本描述信息包括视频标题、视频内容描述信息、视频背景音乐信息和视频发布者信息中的至少一种。
在一种可能的实施方式中,目标视频的视频数据与文本描述信息关联存储,服务器获取目标视频的同时,即从数据库中获取目标视频关联存储的文本描述信息,该文本描述信息中包括视频标题、视频内容描述信息、视频背景音乐信息和视频发布者信息中的至少一种。
示意性的,服务器获取到的文本描述信息中包含视频标题“挑战吃100个包子”、视频内容描述信息“大胃王主播今天超越自我,挑战速吃100个包子”、视频背景音乐“歌曲A”以及视频发布者信息“大胃王阿伟”。
步骤304B,对文本描述信息进行预处理,预处理方式包括去噪声、分词、实体词回捞和去停用词中的至少一种。
为了提高后续分类的准确性,服务器需要先对文本描述信息进行预处理,其中,对文本描述信息进行预处理可以包括如下方式:
1、去噪声:去除文本描述信息中干扰分类的噪声信息。比如,对于文本描述信息中的视频背景音乐信息,若该视频背景音乐信息不包含在背景音乐库中时,视频背景音乐信息中通常会加入“用户上传”字段,而这类字段会对后续分类造成干扰,因此需要对此类字段进行去除。
2、分词:将文本描述信息中的长句切分为细粒度的词,并在切分后对词的出现频率进行统计建模。
3、实体词回捞：基于预设的实体词列表，提取文本描述信息中的实体词，从而避免分词阶段将实体词误分为多个词。
4、去停用词:去除文本描述信息中无意义的语气词(比如“啊”,“哇”“呀”)、代词(比如“你”“我”“他”)、助词(比如“的”、“了”)、标点符号(比如“,”、“。”)以及数字。
步骤304C,通过第三分类模型中的Bi-LSTM和文本分类器对经过预处理的文本描述信息进行分类,得到文本分类结果。
在一种可能的实施方式中,将预处理的文本描述信息输入第三分类模型之前,服务器将预处理的文本描述信息(由词构成)转化为词向量,进而将词向量输入第三分类模型。其中,将词转化为词向量时可以采用为word2vec等成熟的词向量模型,本实施例对此不做限定。
示意性的,如图10所示,服务器将预处理后的视频标题1001、视频背景音乐信息1002以及视频内容描述信息1003输入Bi-LSTM 1004中。由Bi-LSTM 1004进行文本特征提取后,通过文本分类器1005输出文本分类结果1006。
为了进一步提高文本分类结果的准确性,本步骤可以包括如下步骤。
一、将经过预处理的文本描述信息输入Bi-LSTM。
二、通过注意力机制对Bi-LSTM的输出结果进行权重修正。
得到Bi-LSTM的输出结果后,服务器并非直接通过文本分类器对其进行分类,而是通过注意力机制,对输出结果中的权重进行修正,并在完成权重修正后,将修正后输出结果输入文本分类器。
其中,对权重修正的方式包括:提高关注结果的权重和减低非关注结果 的权重。
示意性的,如图10所示,Bi-LSTM 1004的输出结果在输入文本分类器1005前,注意力机制1007对输出结果进行权重修正。
三、通过文本分类器对修正后的Bi-LSTM的输出结果进行分类,得到文本分类结果。
通过引入注意力机制对Bi-LSTM的输出结果进行修正,进一步提高了输出结果的准确性,进而提高最终得到的文本分类结果的准确性。
在短视频领域,用户可以使用短视频应用程序拍摄并上传短视频,并由服务器对短视频进行分类,进而根据短视频的类别对用户进行视频推荐。然而,在实际分类过程中发现,用户上传的短视频中,存在部分视频画面相似,但是音频差异极大的短视频(比如不用用户拍摄的配音短视频),而单纯基于图像特征对此类短视频进行分类时,服务器无法区分画面相似但音频差异较大的短视频,导致视频分类效果不佳。
而将上述实施例提供的视频分类方法应用于短视频分类时,用户使用短视频应用程序拍摄并上传短视频后,服务器提取短视频中的图像帧,并基于图像帧的图像特征对短视频进行分类;同时,服务器提取短视频的音频,并基于音频的音频特征对短视频进行分类;此外,服务器获取用户上传短视频时添加的文本描述信息(比如短视频的标题、背景音乐名称和短视频内容描述),并基于文本描述信息的文本特征对短视频进行分类。
进一步的,服务器对图像维度、音频维度和文本维度下得到的视频分类结果进行融合,最终确定短视频的目标类别。后续向用户推荐短视频时,服务器即根据当前用户的用户画像,将视频类别与用户画像相匹配的短视频推荐给用户,提高推荐视频与用户喜好之间的契合度。
下述为本申请装置实施例,可以用于执行本申请方法实施例。对于本申请装置实施例中未披露的细节,请参照本申请方法实施例。
请参考图11,其示出了本申请一个实施例提供的视频分类装置的框图。该装置可以是图1所示实施环境中的服务器140,也可以设置在服务器140上。该装置可以包括:
视频获取模块1110,用于获取目标视频;
第一分类模块1120,用于通过第一分类模型对所述目标视频中的图像帧进行分类,得到图像分类结果,所述第一分类模型用于基于所述图像帧的图像特征进行分类;
第二分类模块1130,用于通过第二分类模型对所述目标视频中的音频进行分类,得到音频分类结果,所述第二分类模型用于基于所述音频的音频特征进行分类;
第三分类模块1140,用于通过第三分类模型对所述目标视频对应的文本描述信息进行分类,得到文本分类结果,所述第三分类模型用于基于所述文本描述信息的文本特征进行分类;
目标分类模块1150,用于根据图像分类结果、所述音频分类结果和所述文本分类结果,确定所述目标视频的目标分类结果。
在一个实施例中,第一分类模块1120,包括:
确定单元,用于将所述目标视频中提取到的原始图像帧确定为红绿蓝RGB图像帧;
第一图像分类单元,用于通过所述第一分类模型中的残差网络和RGB分类器对所述RGB图像帧进行分类,得到第一图像分类结果,所述RGB分类器用于基于静态图像特征进行分类;
生成单元,用于根据所述目标视频中相邻两帧原始图像帧生成所述RGB差异图像帧;
第二图像分类单元,用于通过所述第一分类模型中的残差网络和RGB差异分类器对所述RGB差异图像帧进行分类,得到第二图像分类结果,所述RGB差异分类器用于基于动态图像特征进行分类。
在一个实施例中,所述第一分类模块1120,还包括:
第三图像分类单元,用于通过所述第一分类模型中的目标检测网络和细粒度分类器对所述RGB图像进行分类,得到第三图像分类结果,所述目标检测网络用于提取所述RGB图像中目标物体的细粒度图像特征,所述细粒度分类器用于基于所述细粒度图像特征进行分类。
在一个实施例中,所述第二分类模块1130,包括:
第一提取单元,用于提取所述音频的梅尔频率倒谱系数MFCC;
第二提取单元,用于通过所述第二分类模型中的VGGish网络对所述MFCC进行特征提取,得到VGGish特征;
第一音频分类单元,用于通过所述第二分类模型中的通用分类器对所述VGGish特征进行分类,得到第一音频分类结果;
第二音频分类单元,用于通过所述第二分类模型中的至少一个特定分类器对所述VGGish特征进行分类,得到至少一条第二音频分类结果;
其中,所述通用分类器的分类类别数量与视频的预设分类类别数量相同,所述特定分类器用于基于特定类别进行分类,所述特定类别属于视频的预设分类类别,且不同特定分类器对应的不同特定类别。
在一个实施例中,所述第三分类模块1140,包括:
信息获取单元,用于获取所述目标视频对应的所述文本描述信息,所述文本描述信息包括视频标题、视频内容描述信息和视频背景音乐信息中的至少一种;
预处理单元,用于对所述文本描述信息进行预处理,预处理方式包括去噪声、分词、实体词回捞和去停用词中的至少一种;
文本分类单元,用于通过所述第三分类模型中的双向长短期记忆网络Bi-LSTM和文本分类器对经过预处理的所述文本描述信息进行分类,得到所述文本分类结果。
在一个实施例中,所述文本分类单元,用于:
将经过预处理的所述文本描述信息输入所述Bi-LSTM;
通过注意力机制对所述Bi-LSTM的输出结果进行权重修正;
通过所述文本分类器对修正后的所述Bi-LSTM的输出结果进行分类,得到所述文本分类结果。
在一个实施例中,所述目标分类模块1150,包括:
拼接单元,用于对所述图像分类结果、所述音频分类结果和所述文本分类结果中各个分类类别对应的概率进行拼接,生成分类特征向量;
目标分类单元,用于将所述分类特征向量输入目标分类器,得到所述目标分类结果,所述目标分类器基于softmax分类模型构建。
综上所述,本申请实施例中,获取到待分类的目标视频后,分别通过第一分类模型对图像帧进行分类得到图像分类结果,通过第二分类模型对音频 进行分类得到音频分类结果,通过第三分类模型对文本描述信息进行分类得到文本分类结果,从而根据图像分类结果、音频分类结果和文本分类结果,确定出目标视频的目标分类结果;相较于相关技术中仅基于视频的图像特征进行视频分类,本申请实施例中综合视频的图像特征、音频特征以及文本特征进行分类,充分考虑视频不同维度的特征,进而提高了视频分类的准确性。
请参考图12,其示出了本申请一个实施例提供的计算机设备的结构示意图。该服务器用于实施上述实施例提供的视频分类方法。具体来讲:
所述计算机设备1200包括中央处理单元(CPU)1201、包括随机存取存储器(RAM)1202和只读存储器(ROM)1203的系统存储器1204,以及连接系统存储器1204和中央处理单元1201的系统总线1205。所述计算机设备1200还包括帮助计算机内的各个器件之间传输信息的基本输入/输出系统(I/O系统)1206,和用于存储操作系统1213、应用程序1214和其他程序模块1215的大容量存储设备1207。
所述基本输入/输出系统1206包括有用于显示信息的显示器1208和用于用户输入信息的诸如鼠标、键盘之类的输入设备1209。其中所述显示器1208和输入设备1209都通过连接到系统总线1205的输入输出控制器1210连接到中央处理单元1201。所述基本输入/输出系统1206还可以包括输入输出控制器1210以用于接收和处理来自键盘、鼠标、或电子触控笔等多个其他设备的输入。类似地,输入输出控制器1210还提供输出到显示屏、打印机或其他类型的输出设备。
所述大容量存储设备1207通过连接到系统总线1205的大容量存储控制器(未示出)连接到中央处理单元1201。所述大容量存储设备1207及其相关联的计算机可读介质为计算机设备1200提供非易失性存储。也就是说,所述大容量存储设备1207可以包括诸如硬盘或者CD-ROM驱动器之类的计算机可读介质(未示出)。
不失一般性,所述计算机可读介质可以包括计算机存储介质和通信介质。计算机存储介质包括以用于存储诸如计算机可读指令、数据结构、程序模块或其他数据等信息的任何方法或技术实现的易失性和非易失性、可移动和不可移动介质。计算机存储介质包括RAM、ROM、EPROM、EEPROM、闪存 或其他固态存储其技术,CD-ROM、DVD或其他光学存储、磁带盒、磁带、磁盘存储或其他磁性存储设备。当然,本领域技术人员可知所述计算机存储介质不局限于上述几种。上述的系统存储器1204和大容量存储设备1207可以统称为存储器。
根据本发明的各种实施例,所述计算机设备1200还可以通过诸如因特网等网络连接到网络上的远程计算机运行。也即计算机设备1200可以通过连接在所述系统总线1205上的网络接口单元1211连接到网络1212,或者说,也可以使用网络接口单元1211来连接到其他类型的网络或远程计算机系统。
本申请实施例还提供一种计算机可读存储介质,所述存储介质中存储有计算机可读指令,计算机可读指令由所述处理器执行以实现上述各个实施例提供的视频分类方法。
本申请还提供了一种包含指令的计算机程序产品,当其在计算机上运行时,使得计算机执行上述各个实施例所述的视频分类方法。
上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。
本领域普通技术人员可以理解实现上述实施例的无线局域网的参数配置方法中全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,所述的程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。以上所述仅为本申请的较佳实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (20)

  1. 一种视频分类方法,由计算机设备执行,所述方法包括:
    获取目标视频;
    通过第一分类模型对所述目标视频中的图像帧进行分类,得到图像分类结果,所述第一分类模型用于基于所述图像帧的图像特征进行分类;
    通过第二分类模型对所述目标视频中的音频进行分类,得到音频分类结果,所述第二分类模型用于基于所述音频的音频特征进行分类;
    通过第三分类模型对所述目标视频对应的文本描述信息进行分类,得到文本分类结果,所述第三分类模型用于基于所述文本描述信息的文本特征进行分类;及
    根据所述图像分类结果、所述音频分类结果和所述文本分类结果,确定所述目标视频的目标分类结果。
  2. 根据权利要求1所述的方法,其特征在于,所述图像分类结果包括第一图像分类结果;所述通过第一分类模型对所述目标视频中的图像帧进行分类,得到图像分类结果,包括:
    将从所述目标视频中提取到的原始图像帧确定为RGB图像帧;及
    通过第一分类模型中的残差网络和RGB分类器对所述RGB图像帧进行分类,得到所述第一图像分类结果,所述RGB分类器用于基于所述RGB图像帧的静态图像特征进行分类。
  3. 根据权利要求1所述的方法,其特征在于,所述图像分类结果包括第二图像分类结果;所述通过第一分类模型对所述目标视频中的图像帧进行分类,得到图像分类结果,包括:
    根据所述目标视频中相邻两帧原始图像帧生成RGB差异图像帧;及
    通过所述第一分类模型中的残差网络和RGB差异分类器对所述RGB差异图像帧进行分类,得到所述第二图像分类结果,所述RGB差异分类器用于基于所述RGB差异图像帧的动态图像特征进行分类。
  4. 根据权利要求1所述的方法,其特征在于,所述图像分类结果包括第三图像分类结果;所述通过第一分类模型对所述目标视频中的图像帧进行分类,得到图像分类结果,包括:
    将从所述目标视频中提取到的原始图像帧确定为RGB图像帧;及
    通过第一分类模型中的目标检测网络和细粒度分类器对所述RGB图像进行分类,得到第三图像分类结果,所述目标检测网络用于提取所述RGB图像中目标物体的细粒度图像特征,所述细粒度分类器用于基于所述细粒度图像特征进行分类。
  5. 根据权利要求1所述的方法,其特征在于,所述音频分类结果包括第一音频分类结果和第二音频分类结果;所述通过第二分类模型对所述目标视频中的音频进行分类,得到音频分类结果,包括:
    提取所述音频的梅尔频率倒谱系数MFCC;
    通过第二分类模型中的VGGish网络对所述MFCC进行特征提取,得到VGGish特征;及
    通过所述第二分类模型中的通用分类器对所述VGGish特征进行分类,得到所述第一音频分类结果。
  6. 根据权利要求5所述的方法,其特征在于,所述音频分类结果还包括第二音频分类结果;所述方法还包括:
    通过所述第二分类模型中的至少一个特定分类器对所述VGGish特征进行分类,得到每个所述特定分类器输出的第二音频分类结果;
    其中,所述通用分类器的分类类别数量与视频的预设分类类别数量相同,所述特定分类器用于基于所述音频的特定类别进行分类,所述特定类别属于视频的预设分类类别,且不同特定分类器对应的不同特定类别。
  7. 根据权利要求1所述的方法,其特征在于,所述通过第三分类模型对所述目标视频对应的文本描述信息进行分类,得到文本分类结果,包括:
    获取所述目标视频对应的所述文本描述信息,所述文本描述信息包括视频标题、视频内容描述信息、视频背景音乐信息和视频发布者信息中的至少一种;
    对所述文本描述信息进行预处理,预处理方式包括去噪声、分词、实体词回捞和去停用词中的至少一种;及
    通过所述第三分类模型中的双向长短期记忆网络Bi-LSTM和文本分类器对经过预处理的所述文本描述信息进行分类,得到所述文本分类结果。
  8. 根据权利要求7所述的方法,其特征在于,所述通过所述第三分类模型中的Bi-LSTM和文本分类器对经过预处理的所述文本描述信息进行分类, 得到所述文本分类结果,包括:
    将经过预处理的所述文本描述信息输入所述Bi-LSTM;
    通过注意力机制对所述Bi-LSTM的输出结果进行权重修正;及
    通过所述文本分类器对修正后的所述Bi-LSTM的输出结果进行分类,得到所述文本分类结果。
  9. 根据权利要求1至8任一所述的方法,其特征在于,所述根据所述图像分类结果、所述音频分类结果和所述文本分类结果,确定所述目标视频的目标分类结果,包括:
    对所述图像分类结果、所述音频分类结果和所述文本分类结果中各个分类类别对应的概率进行拼接,生成分类特征向量;及
    将所述分类特征向量输入目标分类器,得到所述目标分类结果,所述目标分类器基于softmax分类模型构建。
  10. 一种视频分类装置,其特征在于,所述装置包括:
    视频获取模块,用于获取目标视频;
    第一分类模块,用于通过第一分类模型对所述目标视频中的图像帧进行分类,得到图像分类结果,所述第一分类模型用于基于所述图像帧的图像特征进行分类;
    第二分类模块,用于通过第二分类模型对所述目标视频中的音频进行分类,得到音频分类结果,所述第二分类模型用于基于所述音频的音频特征进行分类;
    第三分类模块,用于通过第三分类模型对所述目标视频对应的文本描述信息进行分类,得到文本分类结果,所述第三分类模型用于基于所述文本描述信息的文本特征进行分类;及
    目标分类模块,用于根据所述图像分类结果、所述音频分类结果和所述文本分类结果,确定所述目标视频的目标分类结果。
  11. 根据权利要求10所述的装置,其特征在于,所述图像分类结果包括第一图像分类结果;所述第一分类模块,包括:
    确定单元,用于将从所述目标视频中提取到的原始图像帧确定为RGB图像帧;
    第一图像分类单元,用于通过第一分类模型中的残差网络和RGB分类器对所述RGB图像帧进行分类,得到所述第一图像分类结果,所述RGB分类器用于基于所述RGB图像帧的静态图像特征进行分类。
  12. 根据权利要求10所述的装置,其特征在于,所述图像分类结果包括第二图像分类结果;所述第一分类模块,包括:
    生成单元,用于根据所述目标视频中相邻两帧原始图像帧生成所述RGB差异图像帧;及
    第二图像分类单元,用于通过所述第一分类模型中的残差网络和RGB差异分类器对所述RGB差异图像帧进行分类,得到所述第二图像分类结果,所述RGB差异分类器用于基于所述RGB差异图像帧的动态图像特征进行分类。
  13. 根据权利要求10所述的装置,其特征在于,所述图像分类结果包括第三图像分类结果;所述第一分类模块,包括:
    确定单元,用于将从所述目标视频中提取到的原始图像帧确定为RGB图像帧;
    第三图像分类单元,用于通过所述第一分类模型中的目标检测网络和细粒度分类器对所述RGB图像进行分类,得到第三图像分类结果,所述目标检测网络用于提取所述RGB图像中目标物体的细粒度图像特征,所述细粒度分类器用于基于所述细粒度图像特征进行分类。
  14. 根据权利要求10所述的装置,其特征在于,所述音频分类结果包括第一音频分类结果;所述第二分类模块,包括:
    第一提取单元,用于提取所述音频的梅尔频率倒谱系数MFCC;
    第二提取单元,用于通过所述第二分类模型中的VGGish网络对所述MFCC进行特征提取,得到VGGish特征;及
    第一音频分类单元,用于通过所述第二分类模型中的通用分类器对所述VGGish特征进行分类,得到所述第一音频分类结果。
  15. 根据权利要求14所述的装置,其特征在于,所述音频分类结果还包括第二音频分类结果;所述第二分类模块还包括:
    第二音频分类单元,用于通过所述第二分类模型中的至少一个特定分类器对所述VGGish特征进行分类,得到每个所述特定分类器输出的第二音频分类结果;
    其中,所述通用分类器的分类类别数量与视频的预设分类类别数量相同,所述特定分类器用于基于所述音频的特定类别进行分类,所述特定类别属于视频的预设分类类别,且不同特定分类器对应的不同特定类别。
  16. 根据权利要求10所述的装置,其特征在于,所述第三分类模块,包括:
    信息获取单元,用于获取所述目标视频对应的所述文本描述信息,所述文本描述信息包括视频标题、视频内容描述信息和视频背景音乐信息中的至少一种;
    预处理单元,用于对所述文本描述信息进行预处理,预处理方式包括去噪声、分词、实体词回捞和去停用词中的至少一种;及
    文本分类单元,用于通过所述第三分类模型中的双向长短期记忆网络Bi-LSTM和文本分类器对经过预处理的所述文本描述信息进行分类,得到所述文本分类结果。
  17. 根据权利要求16所述的装置,其特征在于,所述文本分类单元还用于:
    将经过预处理的所述文本描述信息输入所述Bi-LSTM;
    通过注意力机制对所述Bi-LSTM的输出结果进行权重修正;及
    通过所述文本分类器对修正后的所述Bi-LSTM的输出结果进行分类,得到所述文本分类结果。
  18. 根据权利要求10至17任一所述的装置,其特征在于,所述目标分类模块,包括:
    拼接单元,用于对所述图像分类结果、所述音频分类结果和所述文本分类结果中各个分类类别对应的概率进行拼接,生成分类特征向量;
    目标分类单元,用于将所述分类特征向量输入目标分类器,得到所述目标分类结果,所述目标分类器基于softmax分类模型构建。
  19. 一种计算机设备,其特征在于,所述计算机设备包括处理器和存储器,所述存储器中存储有计算机可读指令,所述计算机可读指令被所述处理器执行时,使得所述处理器执行如权利要求1至9中任一项所述的方法的步骤。
  20. 一种非易失性的计算机可读存储介质,存储有计算机可读指令,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器执行如权利要求1至9中任一项所述的方法的步骤。
PCT/CN2019/116660 2018-12-14 2019-11-08 视频分类方法、装置、计算机设备和存储介质 WO2020119350A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/192,580 US20210192220A1 (en) 2018-12-14 2021-03-04 Video classification method and apparatus, computer device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811535837.0 2018-12-14
CN201811535837.0A CN109359636B (zh) 2018-12-14 2018-12-14 视频分类方法、装置及服务器

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/192,580 Continuation US20210192220A1 (en) 2018-12-14 2021-03-04 Video classification method and apparatus, computer device, and storage medium

Publications (1)

Publication Number Publication Date
WO2020119350A1 true WO2020119350A1 (zh) 2020-06-18

Family

ID=65328892

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/116660 WO2020119350A1 (zh) 2018-12-14 2019-11-08 视频分类方法、装置、计算机设备和存储介质

Country Status (3)

Country Link
US (1) US20210192220A1 (zh)
CN (2) CN111428088B (zh)
WO (1) WO2020119350A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111860353A (zh) * 2020-07-23 2020-10-30 北京以萨技术股份有限公司 基于双流神经网络的视频行为预测方法、装置及介质
CN112070093A (zh) * 2020-09-22 2020-12-11 网易(杭州)网络有限公司 生成图像分类模型的方法、图像分类方法、装置和设备
CN112784111A (zh) * 2021-03-12 2021-05-11 有半岛(北京)信息科技有限公司 视频分类方法、装置、设备及介质
CN112989117A (zh) * 2021-04-14 2021-06-18 北京世纪好未来教育科技有限公司 视频分类的方法、装置、电子设备和计算机存储介质
CN115797943A (zh) * 2023-02-08 2023-03-14 广州数说故事信息科技有限公司 一种基于多模态的视频文本内容提取方法、系统及存储介质
US20240046644A1 (en) * 2019-09-12 2024-02-08 Xiamen Wangsu Co., Ltd. Video classification method, device and system

Families Citing this family (74)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111428088B (zh) * 2018-12-14 2022-12-13 腾讯科技(深圳)有限公司 视频分类方法、装置及服务器
CN110147711B (zh) * 2019-02-27 2023-11-14 腾讯科技(深圳)有限公司 视频场景识别方法、装置、存储介质和电子装置
CN109758756B (zh) * 2019-02-28 2021-03-23 国家体育总局体育科学研究所 基于3d相机的体操视频分析方法及系统
CN110059225B (zh) * 2019-03-11 2022-02-15 北京奇艺世纪科技有限公司 视频分类方法、装置、终端设备及存储介质
CN110019950A (zh) * 2019-03-22 2019-07-16 广州新视展投资咨询有限公司 视频推荐方法及装置
CN110020658B (zh) * 2019-03-28 2022-09-30 大连理工大学 一种基于多任务深度学习的显著目标检测方法
CN110084128B (zh) * 2019-03-29 2021-12-14 安徽艾睿思智能科技有限公司 基于语义空间约束和注意力机制的场景图生成方法
CN110162669B (zh) * 2019-04-04 2021-07-02 腾讯科技(深圳)有限公司 视频分类处理方法、装置、计算机设备及存储介质
CN110110143B (zh) * 2019-04-15 2021-08-03 厦门网宿有限公司 一种视频分类方法及装置
CN110046279B (zh) * 2019-04-18 2022-02-25 网易传媒科技(北京)有限公司 视频文件特征的预测方法、介质、装置和计算设备
CN110084180A (zh) * 2019-04-24 2019-08-02 北京达佳互联信息技术有限公司 关键点检测方法、装置、电子设备及可读存储介质
CN110163115B (zh) * 2019-04-26 2023-10-13 腾讯科技(深圳)有限公司 一种视频处理方法、装置和计算机可读存储介质
US11961300B2 (en) 2019-04-29 2024-04-16 Ecole Polytechnique Federale De Lausanne (Epfl) Dynamic media content categorization method
CN110099302B (zh) 2019-04-29 2020-11-24 北京达佳互联信息技术有限公司 视频分级方法、装置、设备及存储介质
CN111914120A (zh) * 2019-05-08 2020-11-10 阿里巴巴集团控股有限公司 视频分类方法、装置、电子设备以及计算机可读存储介质
CN110287788A (zh) * 2019-05-23 2019-09-27 厦门网宿有限公司 一种视频分类方法及装置
CN112000819B (zh) * 2019-05-27 2023-07-11 北京达佳互联信息技术有限公司 多媒体资源推荐方法、装置、电子设备及存储介质
CN110222234B (zh) * 2019-06-14 2021-07-23 北京奇艺世纪科技有限公司 一种视频分类方法和装置
CN110287371A (zh) * 2019-06-26 2019-09-27 北京字节跳动网络技术有限公司 端到端的视频推送方法、装置及电子设备
CN110516086B (zh) * 2019-07-12 2022-05-03 浙江工业大学 一种基于深度神经网络影视标签自动获取方法
CN110334689B (zh) 2019-07-16 2022-02-15 北京百度网讯科技有限公司 视频分类方法和装置
CN110489592B (zh) * 2019-07-18 2024-05-03 平安科技(深圳)有限公司 视频分类方法、装置、计算机设备和存储介质
CN110647804A (zh) * 2019-08-09 2020-01-03 中国传媒大学 一种暴力视频识别方法、计算机系统和存储介质
CN110489593B (zh) * 2019-08-20 2023-04-28 腾讯科技(深圳)有限公司 视频的话题处理方法、装置、电子设备及存储介质
CN110598620B (zh) * 2019-09-06 2022-05-06 腾讯科技(深圳)有限公司 基于深度神经网络模型的推荐方法和装置
CN110751171A (zh) * 2019-09-06 2020-02-04 平安医疗健康管理股份有限公司 图像数据分类方法、装置、计算机设备和存储介质
CN110674348B (zh) * 2019-09-27 2023-02-03 北京字节跳动网络技术有限公司 视频分类方法、装置及电子设备
CN110769267B (zh) * 2019-10-30 2022-02-08 北京达佳互联信息技术有限公司 一种视频的展示方法、装置、电子设备及存储介质
CN110796204B (zh) * 2019-11-01 2023-05-02 腾讯科技(深圳)有限公司 视频标签确定方法、装置和服务器
CN110839173A (zh) * 2019-11-18 2020-02-25 上海极链网络科技有限公司 一种音乐匹配方法、装置、终端及存储介质
CN111046943A (zh) * 2019-12-09 2020-04-21 国网智能科技股份有限公司 变电站隔离刀闸状态自动识别方法及系统
CN113127667A (zh) * 2019-12-30 2021-07-16 阿里巴巴集团控股有限公司 图像处理方法及装置、图像分类方法及装置
CN111163366B (zh) * 2019-12-30 2022-01-18 厦门市美亚柏科信息股份有限公司 一种视频处理方法及终端
CN111222011B (zh) * 2020-01-06 2023-11-14 腾讯科技(深圳)有限公司 一种视频向量确定方法和装置
CN111209970B (zh) * 2020-01-08 2023-04-25 Oppo(重庆)智能科技有限公司 视频分类方法、装置、存储介质及服务器
US11645505B2 (en) * 2020-01-17 2023-05-09 Servicenow Canada Inc. Method and system for generating a vector representation of an image
US11417330B2 (en) * 2020-02-21 2022-08-16 BetterUp, Inc. Determining conversation analysis indicators for a multiparty conversation
CN111246124B (zh) * 2020-03-09 2021-05-25 三亚至途科技有限公司 一种多媒体数字融合方法和装置
CN111461220B (zh) * 2020-04-01 2022-11-01 腾讯科技(深圳)有限公司 图像分析方法、图像分析装置及图像分析系统
CN111586473B (zh) * 2020-05-20 2023-01-17 北京字节跳动网络技术有限公司 视频的裁剪方法、装置、设备及存储介质
CN111651626B (zh) * 2020-05-25 2023-08-22 腾讯科技(深圳)有限公司 图像分类方法、装置及可读存储介质
CN111626049B (zh) * 2020-05-27 2022-12-16 深圳市雅阅科技有限公司 多媒体信息的标题修正方法、装置、电子设备及存储介质
CN111988663B (zh) * 2020-08-28 2022-09-06 北京百度网讯科技有限公司 视频播放节点的定位方法、装置、设备以及存储介质
CN114157906B (zh) * 2020-09-07 2024-04-02 北京达佳互联信息技术有限公司 视频检测方法、装置、电子设备及存储介质
CN112163122B (zh) * 2020-10-30 2024-02-06 腾讯科技(深圳)有限公司 确定目标视频的标签的方法、装置、计算设备及存储介质
CN112418215A (zh) * 2020-11-17 2021-02-26 峰米(北京)科技有限公司 一种视频分类识别方法、装置、存储介质和设备
CN112364810B (zh) * 2020-11-25 2024-08-02 深圳市欢太科技有限公司 视频分类方法及装置、计算机可读存储介质与电子设备
CN112738556B (zh) * 2020-12-22 2023-03-31 上海幻电信息科技有限公司 视频处理方法及装置
CN112738555B (zh) * 2020-12-22 2024-03-29 上海幻电信息科技有限公司 视频处理方法及装置
CN112714362B (zh) * 2020-12-25 2023-06-27 北京百度网讯科技有限公司 确定属性的方法、装置、电子设备和介质
CN115082930B (zh) * 2021-03-11 2024-05-28 腾讯科技(深圳)有限公司 图像分类方法、装置、电子设备和存储介质
CN113095194A (zh) * 2021-04-02 2021-07-09 北京车和家信息技术有限公司 图像分类方法、装置、存储介质及电子设备
CN113033707B (zh) * 2021-04-25 2023-08-04 北京有竹居网络技术有限公司 视频分类方法、装置、可读介质及电子设备
CN113240004B (zh) * 2021-05-11 2024-04-30 北京达佳互联信息技术有限公司 视频信息确定方法、装置、电子设备以及存储介质
CN113393643B (zh) * 2021-06-10 2023-07-21 上海安亭地平线智能交通技术有限公司 异常行为预警方法、装置、车载终端以及介质
CN113343921B (zh) * 2021-06-30 2024-04-09 北京达佳互联信息技术有限公司 视频识别方法、装置、电子设备及存储介质
CN113821675B (zh) * 2021-06-30 2024-06-07 腾讯科技(北京)有限公司 视频识别方法、装置、电子设备及计算机可读存储介质
CN113343936B (zh) * 2021-07-15 2024-07-12 北京达佳互联信息技术有限公司 视频表征模型的训练方法及训练装置
CN113473628B (zh) * 2021-08-05 2022-08-09 深圳市虎瑞科技有限公司 智能平台的通信方法以及系统
CN113628249B (zh) * 2021-08-16 2023-04-07 电子科技大学 基于跨模态注意力机制与孪生结构的rgbt目标跟踪方法
CN113469920B (zh) * 2021-09-02 2021-11-19 中国建筑第五工程局有限公司 用于智能化设备管理的图像处理方法及系统
CN113850162B (zh) * 2021-09-10 2023-03-24 北京百度网讯科技有限公司 一种视频审核方法、装置及电子设备
CN113837457A (zh) * 2021-09-14 2021-12-24 上海任意门科技有限公司 用于预测帖子互动行为状态的方法、计算设备和存储介质
CN113837576A (zh) * 2021-09-14 2021-12-24 上海任意门科技有限公司 用于内容推荐的方法、计算设备和计算机可读存储介质
CN113813053A (zh) * 2021-09-18 2021-12-21 长春理工大学 一种基于腹腔镜内窥影像的手术进程分析方法
CN113887446A (zh) * 2021-10-08 2022-01-04 黑龙江雨谷科技有限公司 一种基于神经网络的音视频联合行人意外跌倒监控方法
CN116150428B (zh) * 2021-11-16 2024-06-07 腾讯科技(深圳)有限公司 视频标签获取方法、装置、电子设备及存储介质
CN114238690A (zh) * 2021-12-08 2022-03-25 腾讯科技(深圳)有限公司 视频分类的方法、装置及存储介质
CN114241241A (zh) * 2021-12-16 2022-03-25 北京奇艺世纪科技有限公司 一种图像分类方法、装置、电子设备及存储介质
CN114979767B (zh) * 2022-05-07 2023-11-21 咪咕视讯科技有限公司 视频推荐方法、装置、设备及计算机可读存储介质
CN114882299B (zh) * 2022-07-11 2022-11-15 深圳市信润富联数字科技有限公司 水果分类方法、装置、采摘设备及存储介质
CN115776592A (zh) * 2022-11-03 2023-03-10 深圳创维-Rgb电子有限公司 显示方法、装置、电子设备及存储介质
CN115878804B (zh) * 2022-12-28 2023-06-20 郑州轻工业大学 基于ab-cnn模型的电商评论多分类情感分析方法
CN116567306B (zh) * 2023-05-09 2023-10-20 北京新东方迅程网络科技有限公司 一种视频的推荐方法、装置、电子设备及介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101937445A (zh) * 2010-05-24 2011-01-05 中国科学技术信息研究所 一种文件自动分类系统
CN103200463A (zh) * 2013-03-27 2013-07-10 天脉聚源(北京)传媒科技有限公司 一种视频摘要生成方法和装置
US20170308753A1 (en) * 2016-04-26 2017-10-26 Disney Enterprises, Inc. Systems and Methods for Identifying Activities and/or Events in Media Contents Based on Object Data and Scene Data
CN109359636A (zh) * 2018-12-14 2019-02-19 腾讯科技(深圳)有限公司 视频分类方法、装置及服务器

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9191626B2 (en) * 2005-10-26 2015-11-17 Cortica, Ltd. System and methods thereof for visual analysis of an image on a web-page and matching an advertisement thereto
CN201796362U (zh) * 2010-05-24 2011-04-13 中国科学技术信息研究所 一种文件自动分类系统
CN106779073B (zh) * 2016-12-27 2019-05-31 西安石油大学 基于深度神经网络的媒体信息分类方法及装置
CN108833973B (zh) * 2018-06-28 2021-01-19 腾讯科技(深圳)有限公司 视频特征的提取方法、装置和计算机设备
CN111507097B (zh) * 2020-04-16 2023-08-04 腾讯科技(深圳)有限公司 一种标题文本处理方法、装置、电子设备及存储介质

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101937445A (zh) * 2010-05-24 2011-01-05 中国科学技术信息研究所 一种文件自动分类系统
CN103200463A (zh) * 2013-03-27 2013-07-10 天脉聚源(北京)传媒科技有限公司 一种视频摘要生成方法和装置
US20170308753A1 (en) * 2016-04-26 2017-10-26 Disney Enterprises, Inc. Systems and Methods for Identifying Activities and/or Events in Media Contents Based on Object Data and Scene Data
CN109359636A (zh) * 2018-12-14 2019-02-19 腾讯科技(深圳)有限公司 视频分类方法、装置及服务器

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20240046644A1 (en) * 2019-09-12 2024-02-08 Xiamen Wangsu Co., Ltd. Video classification method, device and system
CN111860353A (zh) * 2020-07-23 2020-10-30 北京以萨技术股份有限公司 基于双流神经网络的视频行为预测方法、装置及介质
CN112070093A (zh) * 2020-09-22 2020-12-11 网易(杭州)网络有限公司 生成图像分类模型的方法、图像分类方法、装置和设备
CN112784111A (zh) * 2021-03-12 2021-05-11 有半岛(北京)信息科技有限公司 视频分类方法、装置、设备及介质
CN112989117A (zh) * 2021-04-14 2021-06-18 北京世纪好未来教育科技有限公司 视频分类的方法、装置、电子设备和计算机存储介质
CN115797943A (zh) * 2023-02-08 2023-03-14 广州数说故事信息科技有限公司 一种基于多模态的视频文本内容提取方法、系统及存储介质
CN115797943B (zh) * 2023-02-08 2023-05-05 广州数说故事信息科技有限公司 一种基于多模态的视频文本内容提取方法、系统及存储介质

Also Published As

Publication number Publication date
CN109359636B (zh) 2023-04-28
CN111428088B (zh) 2022-12-13
CN111428088A (zh) 2020-07-17
US20210192220A1 (en) 2021-06-24
CN109359636A (zh) 2019-02-19

Similar Documents

Publication Publication Date Title
WO2020119350A1 (zh) 视频分类方法、装置、计算机设备和存储介质
CN111062871B (zh) 一种图像处理方法、装置、计算机设备及可读存储介质
WO2020221278A1 (zh) 视频分类方法及其模型的训练方法、装置和电子设备
US11126853B2 (en) Video to data
WO2022116888A1 (zh) 一种视频数据处理方法、装置、设备以及介质
CN111522996B (zh) 视频片段的检索方法和装置
CN113010703B (zh) 一种信息推荐方法、装置、电子设备和存储介质
WO2020177673A1 (zh) 一种视频序列选择的方法、计算机设备及存储介质
US8634603B2 (en) Automatic media sharing via shutter click
EP3968179A1 (en) Place recognition method and apparatus, model training method and apparatus for place recognition, and electronic device
WO2021237570A1 (zh) 影像审核方法及装置、设备、存储介质
CN109871490B (zh) 媒体资源匹配方法、装置、存储介质和计算机设备
US20140198986A1 (en) System and method for image selection using multivariate time series analysis
CN110309795A (zh) 视频检测方法、装置、电子设备及存储介质
CN114666663A (zh) 用于生成视频的方法和装置
CN112163122A (zh) 确定目标视频的标签的方法、装置、计算设备及存储介质
US20150189384A1 (en) Presenting information based on a video
WO2020077999A1 (zh) 视频摘要生成方法和装置、电子设备、计算机存储介质
CN114339360B (zh) 一种视频处理的方法、相关装置及设备
KR102313338B1 (ko) 영상 검색 장치 및 방법
US11537636B2 (en) System and method for using multimedia content as search queries
US20130191368A1 (en) System and method for using multimedia content as search queries
Mery Face analysis: state of the art and ethical challenges
US20180189602A1 (en) Method of and system for determining and selecting media representing event diversity
Sheng et al. Embedded learning for computerized production of movie trailers

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19897519

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19897519

Country of ref document: EP

Kind code of ref document: A1