WO2020088216A1 - Audio and video processing method, apparatus, device, and medium - Google Patents

Audio and video processing method, apparatus, device, and medium

Info

Publication number
WO2020088216A1
WO2020088216A1 (PCT/CN2019/110735)
Authority
WO
WIPO (PCT)
Prior art keywords
information
video
audio
feature information
feature
Prior art date
Application number
PCT/CN2019/110735
Other languages
English (en)
French (fr)
Inventor
刘文奇
刘运
梁柱锦
Original Assignee
广州市百果园信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州市百果园信息技术有限公司
Publication of WO2020088216A1

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20: Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23: Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234: Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23418: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20: Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23: Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/239: Interfacing the upstream path of the transmission network, e.g. prioritizing client content requests
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439: Processing of audio elementary streams
    • H04N21/4394: Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45: Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/466: Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N21/4662: Learning process for intelligent management characterized by learning algorithms
    • H04N21/4665: Learning process for intelligent management characterized by learning algorithms involving classification methods, e.g. Decision trees

Definitions

  • This application relates to the field of computer technology, for example, to an audio and video processing method, apparatus, device, and medium.
  • Video classification mainly assigns videos to different tags, that is, classifies videos into different video categories, so that video tags can be set based on the video category to which a video belongs.
  • In short-video applications, users create and upload a large number of short videos every day.
  • The content of these short videos is diverse, and different viewers prefer different types of short videos.
  • Classifying short videos into videos with different tags makes it easier for users to search for the video categories they are interested in, and also makes it possible to recommend videos of interest to different users, which can increase the time viewers spend in short-video applications.
  • At present, short videos uploaded by users are usually first classified into tags by an algorithm and then manually reviewed.
  • However, the accuracy of short-video tag classification is limited by the performance of the algorithm. If the algorithm performs poorly, the accuracy of classifying short videos into different tags is low, a great deal of manpower is consumed in review work, and labor costs increase.
  • Embodiments of the present application provide an audio and video processing method, apparatus, device, and medium that combine the audio feature information in a video with the image feature information of video frames to classify the video, so as to improve the accuracy and recall of video classification and reduce the labor cost of video classification review.
  • In a first aspect, an embodiment of the present application provides an audio and video processing method, including: acquiring a video file; separating image frame information and audio information from the video file; extracting image feature information and audio feature information from the image frame information and the audio information, respectively; fusing the image feature information and the audio feature information into video content feature information; and determining a classification result corresponding to the video file according to the video content feature information.
  • In a second aspect, an embodiment of the present application further provides an audio and video processing apparatus, including: a video file acquisition module configured to acquire a video file; a video separation module configured to separate image frame information and audio information from the video file; a feature extraction module configured to extract image feature information and audio feature information from the image frame information and the audio information, respectively; a feature fusion module configured to fuse the image feature information and the audio feature information into video content feature information; and a video classification module configured to determine a classification result corresponding to the video file according to the video content feature information.
  • In a third aspect, an embodiment of the present application further provides a device, including a processor and a memory; at least one instruction is stored in the memory, and the instruction is executed by the processor so that the device performs the audio and video processing method described in the first aspect.
  • In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium; when instructions in the storage medium are executed by a processor of a device, the device is enabled to perform the audio and video processing method described in the first aspect.
  • FIG. 1 is a schematic flowchart of the steps of an embodiment of an audio and video processing method of the present application;
  • FIG. 2 is a schematic diagram of an audio and video processing method in another example of the present application;
  • FIG. 3 is a schematic structural block diagram of an audio and video processing apparatus in an embodiment of the present application;
  • FIG. 4 is a schematic structural block diagram of a device in an example of the present application.
  • In the related art, three-dimensional (3D) convolution is used for tag classification of video content, which requires converting a two-dimensional convolutional neural network that processes a single image into a 3D convolutional neural network that can process multiple images. However, 3D convolution makes the number of network parameters very large, which makes network training difficult. To avoid this problem, the optical flow information of the video and the image information of the video frames are usually combined: the two are extracted separately and fused to generate new video content feature information, and tag classification is performed on the newly generated features. This incorporates the motion information in the video and improves the recognition of actions, but extracting optical flow information is very time-consuming and affects the efficiency of video classification.
  • To avoid these problems, an embodiment of the present application proposes a new audio and video processing method that combines the audio feature information in the video with the image feature information of the video frames to classify the video, improving the accuracy of classifying videos into different tags, that is, improving the accuracy and recall of video classification, and thereby reducing the labor cost of video classification review.
  • Referring to FIG. 1, a schematic flowchart of the steps of an embodiment of an audio and video processing method of the present application is shown, including steps 110 to 150.
  • In step 110, a video file is acquired.
  • When classifying a video, a video file that needs to be classified can be acquired, so that video tag classification can be performed according to the audio and video information contained in the video file.
  • The audio and video information contained in the video file may include information related to video playback, such as the image frame information and audio information of the video, which is not limited in the embodiments of the present application.
  • It should be noted that one video file can be used to represent one video, and one video can include at least one video frame.
  • The image frame information may refer to the image information of the video frames and may be used to display the video picture, so that the user can see the objects, people, scenes, and the like in the video.
  • The audio information can be used to play the various sounds in the video, such as the speech in the video.
  • In step 120, image frame information and audio information are separated from the video file.
  • After the video file is acquired, the embodiment of the present application can demultiplex the video file to separate the image frame information and audio information contained in it, so that feature extraction can be performed on the image frame information and the audio information.
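
As an illustration of this separation step, the following is a minimal sketch that demultiplexes a file with the ffmpeg command-line tool. The application does not prescribe any particular demuxing tool; the file names and sampling settings below are assumptions made for the example only.

```python
import os
import subprocess

def separate_video(video_path: str) -> None:
    """Demultiplex a video file into image frame information and audio
    information. A minimal sketch using the ffmpeg CLI; output paths and
    sampling settings are illustrative assumptions."""
    os.makedirs("frames", exist_ok=True)
    # Image frame information: sample one frame per second as JPEG images.
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vf", "fps=1", "frames/frame_%05d.jpg"],
        check=True,
    )
    # Audio information: extract the audio track as a mono 16 kHz WAV file.
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", "audio.wav"],
        check=True,
    )

separate_video("input.mp4")  # "input.mp4" is a hypothetical input file
```
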
  • In step 130, image feature information and audio feature information are extracted from the image frame information and the audio information, respectively.
  • After the image frame information and audio information of the video are separated, image feature extraction and audio feature extraction can be performed on them respectively to obtain the image feature information and audio feature information corresponding to the video.
  • The audio feature information can be used to characterize the audio features of the video, and the image feature information can be used to characterize the image features of the video frames.
  • In step 140, the image feature information and the audio feature information are fused into video content feature information.
  • After the image feature information and audio feature information corresponding to a video file are extracted, they may be fused into video content feature information, so that video classification can subsequently be performed according to the video content feature information.
  • The video content feature information can be used to characterize the video content features corresponding to the video file. It can be seen that the embodiments of the present application can generate the video content feature information corresponding to a video file by fusing the image feature information and audio feature information of that file.
  • In step 150, the classification result corresponding to the video file is determined according to the video content feature information.
  • After the video content feature information is obtained, video classification can be performed based on it to obtain the classification result corresponding to the video file.
  • The classification result can be used for at least one of the following: determining the video category to which the video file belongs; and setting the video tag corresponding to the video file.
  • The video category and the video tag may have a one-to-one correspondence; for example, the video category may be equivalent to the video tag, that is, the video tag may be set through the video category.
  • In summary, after separating the image frame information and audio information from an acquired video file, the embodiments of the present application can extract image feature information and audio feature information from them respectively, fuse the two into video content feature information, and then perform video classification according to the video content feature information. This improves the accuracy and recall of video classification and avoids the low classification accuracy caused in the related art by using only image feature information, thereby reducing the labor cost of video classification review; videos of interest can also be recommended to users based on the classification results corresponding to video files, improving the user's experience of watching videos.
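
Read as a whole, steps 110 to 150 form a simple pipeline. The sketch below is purely illustrative; every function name in it is a hypothetical placeholder for a component described later in this document.

```python
def classify_video(video_path: str):
    # Step 110: acquire the video file (represented here by its path).
    # Step 120: separate image frame information and audio information.
    frames, audio = separate_video(video_path)
    # Step 130: extract features with the pre-trained extractors.
    image_feat = image_feature_extractor(frames)  # e.g. a 1024-d vector
    audio_feat = audio_feature_extractor(audio)   # e.g. a 1024-d vector
    # Step 140: fuse the two into video content feature information.
    content_feat = fuse(image_feat, audio_feat)   # e.g. a 2x1024 matrix
    # Step 150: determine the classification result.
    return classification_network(content_feat)
```
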
  • In actual processing, after acquiring the video file, an embodiment of the present application can decode the video to obtain the corresponding image frame information and audio information, and then perform feature extraction through a preset image feature extractor and a preset audio feature extractor respectively, so as to generate the video content feature information from the extracted image feature information and audio feature information.
  • In an embodiment, extracting the image feature information and the audio feature information from the image frame information and the audio information respectively includes: extracting the image feature information corresponding to the image frame information through a pre-trained image feature extractor; and extracting the audio feature information corresponding to the audio information through a pre-trained audio feature extractor.
  • After separating the image frame information and audio information contained in the video file, the image frame information can be input into the pre-trained image feature extractor for image feature extraction to obtain the image feature information of the video file, and the audio information can be input into the pre-trained audio feature extractor for audio feature extraction to obtain the audio feature information of the video file.
  • In one implementation, vectors may be used to represent the image feature information and the audio feature information, and the vector dimension of the image feature information is equal to the vector dimension of the audio feature information.
  • For example, fusing the image feature information and the audio feature information into the video content feature information may include: representing the image feature information and the audio feature information by vectors of equal dimension; and generating, based on the image vector elements in the image feature information and the audio vector elements in the audio feature information, a video content feature matrix that serves as the video content feature information.
  • In an embodiment, the image vector elements of the image feature information can be used as the matrix vector elements of the first dimension, and the audio vector elements of the audio feature information as the matrix vector elements of the second dimension; a video content feature matrix is then generated from these two sets of elements and used as the video content feature information.
  • For example, when the image feature information and the audio feature information are both 1024-dimensional vectors, fusing them yields a 2×1024 video content feature matrix, which can be used as the fused video content feature information; the matrix vector elements it contains represent the individual video content features contained in the video content feature information.
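
A minimal sketch of this fusion, assuming the 1024-dimensional feature vectors of the example above (NumPy is used here purely for illustration):

```python
import numpy as np

def fuse(image_feat: np.ndarray, audio_feat: np.ndarray) -> np.ndarray:
    """Stack an image feature vector and an audio feature vector of equal
    dimension into a 2 x D video content feature matrix (D = 1024 in the
    example above): row 0 holds the image vector elements, row 1 holds
    the audio vector elements."""
    assert image_feat.shape == audio_feat.shape  # equal vector dimensions
    return np.stack([image_feat, audio_feat], axis=0)

content = fuse(np.random.rand(1024), np.random.rand(1024))
print(content.shape)  # (2, 1024)
```
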
  • In actual processing, the audio feature extractor can be a neural network for audio feature extraction trained with deep learning techniques, such as a convolutional neural network; for example, it can be SoundNet, a network model for extracting audio features, or another neural network model such as InceptionNet or ResNet, network structure models for image classification, which is not limited in the embodiments of the present application.
  • An embodiment of the present application can pre-train the audio feature extractor so that the trained extractor can later be used for audio feature extraction; during training, the image frame information of the video can be used as supervision information to optimize the network parameters of the audio feature extractor. Therefore, the audio and video processing method provided in the embodiment of the present application may further include: acquiring video data from a preset input data set; extracting image frame information to be trained and audio information to be trained from the video data; separately extracting the image feature information of the image frame information to be trained and the audio feature information of the audio information to be trained; and using the image feature information of the image frame information to be trained as training supervision information while training with the audio feature information of the audio information to be trained, to obtain the audio feature extractor.
  • The preset input data set may refer to the video data set to be trained on, which may include a large amount of unlabeled video data; the image frame information to be trained refers to the frame information used for training, and the audio information to be trained refers to the audio used for training.
  • After the image frame information to be trained and the audio information to be trained are extracted from the video data, their features can be extracted separately through different network models. For example, the image frame information to be trained can be input into a preset first network model for image feature extraction to obtain its image feature information, and the audio information to be trained can be input into a preset second network model for audio feature extraction to obtain its audio feature information. The first network model is set to extract image features and may be, for example, a Visual Geometry Group (VGG) network model trained on the ImageNet and Places2 data sets; the second network model is set to extract audio features and may be, for example, the SoundNet network model, which is not limited in the embodiments of the present application.
  • As an example of this application, during the training of the audio feature extractor, video data can be extracted from the preset input data set, and the image frame information of the extracted video data can be input into a VGG network model pre-trained on the ImageNet and Places2 data sets for image feature extraction; the output of the VGG model can then be saved as the image feature information of the video data.
  • The audio information of the extracted video data is input into the SoundNet network model for audio feature extraction, and the image feature information of the video data is used as supervision information: Kullback-Leibler (KL) divergence loss data is determined from the audio feature information output by SoundNet and the image feature information of the video data, that is, the loss data corresponding to the audio feature information and the loss data corresponding to the image feature information are determined. The loss function value of the SoundNet model is then determined from the two, for example as the average of the loss data; when the loss function value satisfies a preset condition, the trained SoundNet model is used as the audio feature extractor.
  • It can be seen that the audio feature extractor in this example can be pre-trained on a large set of unlabeled video data, using the image frame information of the videos as supervision information to optimize the network parameters of the audio feature extractor, which improves the training efficiency of audio feature extraction.
  • During joint audio-video training, the image frame information can be used to optimize the network parameters of the audio feature extractor.
  • The image frame information includes the object information and scene information in the video frames; the object information can be used to characterize the objects in a video frame, and the scene information can be used to characterize the scene of a video frame. For example, in a video frame showing a child playing in a bedroom, the child belongs to the object information "person", while the bedroom belongs to the scene information.
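
The following is a compressed, illustrative PyTorch sketch of this supervision scheme. The teacher networks vgg_objects and vgg_scenes (standing in for VGG models pre-trained on ImageNet and Places2) and the two-headed SoundNet-style student are hypothetical placeholders; only the idea of supervising audio features with image features through a KL divergence loss comes from the description above.

```python
import torch.nn.functional as F

def soundnet_training_step(soundnet, vgg_objects, vgg_scenes,
                           frames, audio, optimizer):
    """One training step: the student (soundnet) predicts, from audio alone,
    the distributions that the frozen image teachers produce from the video
    frames; the loss is the average of the two KL divergences."""
    import torch
    with torch.no_grad():  # the image teachers serve as frozen supervision
        obj_target = F.softmax(vgg_objects(frames), dim=1)
        scene_target = F.softmax(vgg_scenes(frames), dim=1)
    obj_pred, scene_pred = soundnet(audio)  # two heads, one per teacher
    loss = 0.5 * (
        F.kl_div(F.log_softmax(obj_pred, dim=1), obj_target,
                 reduction="batchmean")
        + F.kl_div(F.log_softmax(scene_pred, dim=1), scene_target,
                   reduction="batchmean")
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
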
  • In an embodiment, the image feature extractor can be built from any of the various networks that have achieved good results on public video classification data sets, including but not limited to: C3D, a neural network for video classification based on three-dimensional (3D) convolution; I3D, a 3D-convolution-based neural network for video classification; TSN, a video classification neural network based on optical flow and the spatial domain; and various action recognition networks based on recurrent neural networks (RNN), which is not limited in the embodiments of the present application.
  • These networks can be pre-trained on large public data sets, such as the video classification data sets Kinetics or Youtube-8M.
  • In an embodiment of the present application, the above audio and video processing method may further include: acquiring image frame information to be trained; performing training based on the image frame information to be trained to obtain a video classification network; and generating the image feature extractor based on the non-output layers of the video classification network.
  • For example, video data can be extracted from a preset video classification data set, image frame information to be trained can be obtained from the extracted video data, and training can be performed on it according to a preset network structure, for example according to InceptionNet-V1, a preset network structure for image classification, to obtain the video classification network. The image feature extractor can then be generated based on the non-output layers of the video classification network, for example by removing the output layer used for classification so that the remaining layers serve as a video frame feature extraction network, which is determined to be the image feature extractor; image feature extraction can subsequently be performed through this extractor, as sketched below.
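
An illustrative sketch of deriving the image feature extractor by removing the output layer of a trained classification network, assuming PyTorch and torchvision. GoogLeNet is used here as a stand-in for the InceptionNet-V1 structure mentioned above, and the class count is an arbitrary assumption.

```python
import torch
import torch.nn as nn
from torchvision import models

# A classifier standing in for the trained video classification network.
classifier = models.googlenet(
    weights=None, aux_logits=False, num_classes=400, init_weights=True
)
classifier.eval()

# Remove the output layer used for classification; the remaining layers
# form the video frame feature extraction network, i.e. the image
# feature extractor.
classifier.fc = nn.Identity()
image_feature_extractor = classifier

with torch.no_grad():
    feat = image_feature_extractor(torch.randn(1, 3, 224, 224))
print(feat.shape)  # torch.Size([1, 1024]), a 1024-d image feature vector
```
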
  • As an example of the present application, as shown in FIG. 2, after a video file is decoded, the corresponding image frame information and audio information can be generated.
  • The image frame information can then be input into the pre-trained image feature extractor, and the audio information into the pre-trained audio feature extractor, so that the image feature information and the audio feature information are extracted through the two extractors respectively.
  • By fusing the extracted image feature information and audio feature information, the video content feature information can be generated, and the classification result corresponding to the video file can then be determined according to it.
  • In an embodiment of the present application, determining the classification result corresponding to the video file according to the video content feature information may include: generating feature map information according to the video content feature information; generating target feature information according to the feature map information and the video content feature information; and performing classification processing according to the target feature information to obtain the classification result.
  • An embodiment of the present application can perform attention feature extraction on the video content feature information based on a preset attention mechanism module; for example, the video content feature information is input into the preset attention mechanism module, which extracts attention feature information from it, so that the feature map information can be generated from the attention feature information and the video content feature information.
  • The attention feature information can represent the attention features, generated based on the attention mechanism module, that correspond to the video content feature information; the feature map information can be used to represent at least one feature map corresponding to the video file.
  • In one embodiment, generating the feature map information according to the video content feature information may include: inputting the video content feature information into the preset attention mechanism module for attention feature extraction; and generating the feature map information based on the attention feature information output by the attention mechanism module and the video content feature information.
  • That is, the obtained video content feature information can be used as the input of the attention mechanism module, the module extracts the attention feature information, and the corresponding feature map information is then generated based on the extracted attention feature information and the video content feature information; for example, the product of the attention feature information and the video content feature information can be used as the feature map information.
  • In actual processing, both the feature map information and the video content feature information can be recorded as matrices, and the matrix dimension of the feature map information is equal to the matrix dimension of the video content feature information.
  • In an embodiment, generating the target feature information according to the feature map information and the video content feature information includes: recording the feature map information and the video content feature information as matrices of equal dimensions; and generating, based on the matrix elements in the feature map information and the matrix elements in the video content feature information, a target feature matrix that serves as the target feature information.
  • The feature map information in an embodiment of the present application can be generated according to the attention feature information output by the preset attention mechanism module.
  • For example, when a 2×1024 matrix is used to record the video content feature information, the video content feature information can include 2×1024 matrix elements.
  • In actual processing, the video content feature information (that is, the 2×1024 matrix elements) can be used as the input of the attention mechanism module, as shown in FIG. 2, so that the module assigns a corresponding weight to each matrix element, yielding a 2×1024 attention feature matrix as the attention feature information. The matrix elements of the attention feature matrix and the matrix elements of the video content feature matrix can then be used to generate a 2×1024 feature map matrix, which is used as the feature map information; an element-wise multiplication of the feature map information and the video content feature information then generates a 2×1024 target feature matrix, which can be used as the final target feature information on which classification processing is performed.
  • The attention mechanism module can consist of one convolution module; the convolution module includes at least one convolutional layer, the convolution can be a 1×1 convolutional layer, and the module can take the video content feature information as input to generate 2×1024 attention feature map information.
  • It should be noted that the main purpose of the attention mechanism module is to learn weights for the input features and then, through element-wise multiplication, give each feature value a different weight. For example, after the video content feature information is input into the attention mechanism module, the module can learn the weight corresponding to each matrix element of the video content feature information, that is, generate a weight matrix corresponding to the video content feature information, which can contain one weight for each matrix element.
  • The attention mechanism module can then perform an element-wise multiplication based on the weight matrix and the input video content feature information, so as to extract the attention feature information from the video content feature information on the basis of the weight matrix; the attention feature information and the video content feature information can then be combined into feature map information and output as the feature map information corresponding to the video content features.
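
An illustrative PyTorch sketch of such an attention mechanism module, treating the 2×1024 video content feature matrix as a one-channel input. The 1×1 convolution that learns the weights and the element-wise products follow the description above, but the exact layer arrangement is an assumption.

```python
import torch
import torch.nn as nn

class AttentionModule(nn.Module):
    """Learns one weight per matrix element of the video content features
    (via a 1x1 convolution) and combines weights and features element-wise
    into feature map information."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size=1)  # 1x1 convolutional layer
        self.sigmoid = nn.Sigmoid()                 # nonlinearity for weights

    def forward(self, content):                     # content: (N, 1, 2, 1024)
        weights = self.sigmoid(self.conv(content))  # one weight per element
        return weights * content                    # element-wise multiplication

attn = AttentionModule()
content = torch.randn(1, 1, 2, 1024)     # fused 2x1024 content matrix
feature_map = attn(content)              # 2x1024 feature map information
target = feature_map * content           # element-wise product: target features
print(target.shape)                      # torch.Size([1, 1, 2, 1024])
```
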
  • In actual processing, the attention mechanism module is composed of neural network layers; for example, it can be composed of a convolutional-layer module including a convolutional layer, a nonlinear layer, and a batch normalization layer, or of a fully connected layer and a global pooling layer, which is not limited in the embodiments of the present application.
  • In an embodiment, the method of the embodiment of the present application may further include a step of pre-training the attention mechanism module.
  • For example, the above audio and video processing method further includes: acquiring video content feature information to be trained; performing training based on the video content feature information to be trained and preset weight information to obtain a network layer; and generating the attention mechanism module according to the network layer.
  • The video content feature information to be trained can include the various video content feature information obtained during the training process; the weight information can be preset according to the training requirements of the network layer and can be used, during training, to give each matrix element of the video content feature information a corresponding weight so as to generate the corresponding attention feature information.
  • During training, the acquired video content information can be used as the video content feature information to be trained, and neural network techniques can be used, based on the attention mechanism, to train with this information and the preset weight information so as to obtain a network layer for extracting attention features.
  • The trained network layer can then be used as the attention mechanism module, so that in subsequent audio and video processing the attention feature information can be extracted by this layer and then fused with the video content feature information to obtain the corresponding feature map information.
  • In an embodiment, the trained network layer may include at least one of the following: a fully connected layer, a global pooling layer, a convolutional layer, a nonlinear layer, a batch normalization layer, and the like, which is not limited in this embodiment of the present application.
  • In an embodiment of the present application, performing classification processing according to the target feature information to obtain the classification result may include: inputting the video content feature information into a preset classification network for classification processing; and using the category score output by the preset classification network as the classification result, where the category score is used to determine the video category to which the video file belongs.
  • It should be noted that the preset classification network can be a classification layer in a neural network and can be used to classify the video according to the target feature information corresponding to the video file and output the video classification result. For example, in combination with the above example and as shown in FIG. 2, after the fused video content feature information is obtained, it can be used as the final target feature information and input into the preset classification layer for classification processing; the category score output by the classification network is obtained and can be used as the classification result corresponding to the video file, so that the video category to which the video file belongs can be determined based on the category score.
  • In an embodiment, the preset classification network may include convolution modules, and each convolution module can be composed of a batch normalization layer, a nonlinear layer, and a convolutional layer.
  • For example, the classification network consists of three 1×1 convolution modules, in which the numbers of convolution kernels are 1024, 512, and the number of classification categories, respectively; each convolution module can be composed of a batch normalization layer, a nonlinear layer, and a 1×1 convolutional layer, as sketched below.
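
A minimal PyTorch sketch of such a classification network, following the three-module layout just described (kernel counts 1024, 512, and the number of categories). The one-channel input arrangement and the class count are illustrative assumptions.

```python
import torch
import torch.nn as nn

def conv_module(in_ch: int, out_ch: int) -> nn.Sequential:
    """One 1x1 convolution module: batch normalization, nonlinearity, 1x1 conv."""
    return nn.Sequential(
        nn.BatchNorm2d(in_ch),
        nn.ReLU(),
        nn.Conv2d(in_ch, out_ch, kernel_size=1),
    )

num_classes = 100  # illustrative number of video categories

classification_network = nn.Sequential(
    conv_module(1, 1024),           # 1024 convolution kernels
    conv_module(1024, 512),         # 512 convolution kernels
    conv_module(512, num_classes),  # one kernel per classification category
    nn.AdaptiveAvgPool2d(1),        # pool the 2x1024 map to per-class scores
    nn.Flatten(),
)

target = torch.randn(1, 1, 2, 1024)      # target feature matrix from above
scores = classification_network(target)  # category scores
print(scores.shape)                      # torch.Size([1, 100])
```
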
  • In actual processing, the video category can be equivalent to the video tag. After determining the video category to which a video file belongs, the embodiment of the present application can then classify the video into different video tags based on that category; for example, the image frame information and audio information in a short video file can be combined to determine the video category, and the short video can be classified into a tag on that basis, which improves the accuracy of classifying short videos into different tags and thereby reduces the cost of manual review.
  • In summary, the embodiments of the present application perform video tag classification by combining the attention mechanism, the image feature information, and the audio feature information. Compared with the related art, which uses only image feature information, this can substantially improve the accuracy and recall of video tag classification and thereby reduce the labor cost of video tag review; it also makes it convenient for users to search for videos of interest based on video tags and/or categories, and videos of interest can be recommended to different users based on video tags, which can effectively improve the experience of viewers watching videos, such as the viewing experience of users of short-video applications.
  • Referring to FIG. 3, the audio and video processing apparatus may include a video file acquisition module 310, a video separation module 320, a feature extraction module 330, a feature fusion module 340, and a video classification module 350.
  • The video file acquisition module 310 is configured to acquire a video file.
  • The video separation module 320 is configured to separate image frame information and audio information from the video file.
  • The feature extraction module 330 is configured to extract image feature information and audio feature information from the image frame information and the audio information, respectively.
  • The feature fusion module 340 is configured to fuse the image feature information and the audio feature information into video content feature information.
  • The video classification module 350 is configured to determine the classification result corresponding to the video file according to the video content feature information.
  • In an embodiment, the feature extraction module 330 may include an image feature extraction submodule and an audio feature extraction submodule.
  • The image feature extraction submodule is configured to extract the image feature information corresponding to the image frame information through a pre-trained image feature extractor.
  • The audio feature extraction submodule is configured to extract the audio feature information corresponding to the audio information through a pre-trained audio feature extractor.
  • In actual processing, the vector dimension of the image feature information is equal to the vector dimension of the audio feature information.
  • In an embodiment, the above feature fusion module is configured to represent the image feature information and the audio feature information by vectors of equal dimension, and to generate, based on the image vector elements in the image feature information and the audio vector elements in the audio feature information, a video content feature matrix as the video content feature information.
  • In an embodiment, the audio and video processing apparatus may further include a video data acquisition module, an information extraction module, a training feature extraction module, and an audio feature extractor training module.
  • The video data acquisition module is configured to acquire video data from a preset input data set.
  • The information extraction module is configured to extract the image frame information to be trained and the audio information to be trained from the video data.
  • The training feature extraction module is configured to separately extract the image feature information of the image frame information to be trained and the audio feature information of the audio information to be trained.
  • The audio feature extractor training module is configured to use the image feature information of the image frame information to be trained as training supervision information and to train with the audio feature information of the audio information to be trained, to obtain the audio feature extractor.
  • The audio and video processing apparatus in an embodiment of the present application may further include an image frame information acquisition module, a classification network training module, and an image feature extractor generation module.
  • The image frame information acquisition module is configured to acquire image frame information to be trained.
  • The classification network training module is configured to perform training based on the image frame information to be trained to obtain a video classification network.
  • The image feature extractor generation module is configured to generate the image feature extractor based on the non-output layers of the video classification network.
  • In an embodiment, the video classification module 350 may include a feature map generation submodule, a target feature generation submodule, and a classification processing submodule.
  • The feature map generation submodule is configured to generate feature map information according to the video content feature information.
  • The target feature generation submodule is configured to generate target feature information according to the feature map information and the video content feature information.
  • The classification processing submodule is configured to perform classification processing according to the target feature information to obtain the classification result.
  • In an embodiment, the feature map generation submodule may be configured to input the video content feature information into a preset attention mechanism module for attention feature extraction, and to generate the feature map information based on the attention feature information output by the attention mechanism module and the video content feature information.
  • In actual processing, the matrix dimension of the feature map information and the matrix dimension of the video content feature information are equal.
  • In an embodiment, the target feature generation submodule may be configured to record the feature map information and the video content feature information as matrices of equal dimensions, and to generate, based on the matrix elements in the feature map information and the matrix elements in the video content feature information, a target feature matrix as the target feature information.
  • In an embodiment, the classification processing submodule is configured to input the video content feature information into a preset classification network for classification processing and to use the category score output by the preset classification network as the classification result, where the category score is used to determine the video category to which the video file belongs.
  • In an embodiment, the audio and video processing apparatus in the embodiment of the present application may further include a video content feature acquisition module, a network layer training module, and an attention mechanism module generation module.
  • The video content feature acquisition module is configured to acquire video content feature information to be trained.
  • The network layer training module is configured to train based on the video content feature information to be trained and the preset weight information to obtain the network layer.
  • The attention mechanism module generation module is configured to generate the attention mechanism module according to the network layer.
  • In an embodiment, the network layer includes at least one of the following: a fully connected layer, a global pooling layer, a convolutional layer, a nonlinear layer, and a batch normalization layer.
  • In an embodiment, the image frame information includes the object information and scene information in the video frames.
  • In an embodiment, the preset classification network includes a convolution module, and the convolution module is composed of a batch normalization layer, a nonlinear layer, and a convolutional layer.
  • The audio and video processing apparatus provided by the embodiments of the present application can execute the audio and video processing method provided by any embodiment of the present application.
  • In an embodiment, the above audio and video processing apparatus may be integrated in a device.
  • The device can be composed of one physical entity or at least two physical entities; for example, the device can be a personal computer (PC), a computer, a mobile phone, a tablet device, a personal digital assistant, a server, a messaging device, a game console, and the like.
  • An embodiment of the present application further provides a device, including a processor and a memory. At least one instruction is stored in the memory, and the instruction is executed by the processor, so that the device executes the audio and video processing method described in the foregoing method embodiments.
  • As shown in FIG. 4, the device may include a processor 40, a memory 41, a display screen 42 with a touch function, an input device 43, an output device 44, and a communication device 45.
  • The number of processors 40 in the device may be at least one, and one processor 40 is taken as an example in FIG. 4.
  • The number of memories 41 in the device may be at least one, and one memory 41 is taken as an example in FIG. 4.
  • The processor 40, the memory 41, the display screen 42, the input device 43, the output device 44, and the communication device 45 of the device may be connected by a bus or by other means; in FIG. 4, connection by a bus is taken as an example.
  • The memory 41 is a computer-readable storage medium and is configured to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the audio and video processing method described in any embodiment of the present application (for example, the video file acquisition module 310, the video separation module 320, the feature extraction module 330, the feature fusion module 340, and the video classification module 350 in the above audio and video processing apparatus).
  • The memory 41 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required for at least one function, and the data storage area may store data created according to the use of the device and the like.
  • The memory 41 may include a high-speed random access memory and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • In some examples, the memory 41 includes memories provided remotely with respect to the processor 40, and these remote memories may be connected to the device through a network. Examples of the network include but are not limited to the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
  • The display screen 42 is a display screen with a touch function and may be a capacitive screen, an electromagnetic screen, or an infrared screen.
  • The display screen 42 is configured to display data according to the instructions of the processor 40 and is also configured to receive touch operations acting on the display screen 42 and send the corresponding signals to the processor 40 or other devices.
  • When the display screen 42 is an infrared screen, it further includes an infrared touch frame disposed around the display screen 42.
  • The infrared touch frame is configured to receive infrared signals and send the infrared signals to the processor 40 or other devices.
  • The communication device 45 is configured to establish a communication connection with other devices and may be at least one of a wired communication device and a wireless communication device.
  • The input device 43 is configured to receive input digital or character information and to generate key signal input related to user settings and function control of the device; it may also include a camera for acquiring images and a sound pickup device for acquiring audio data.
  • The output device 44 may include an audio device such as a speaker. It should be noted that the composition of the input device 43 and the output device 44 can be set according to actual conditions.
  • The processor 40 executes various functional applications and data processing of the device by running the software programs, instructions, and modules stored in the memory 41, that is, implements the above audio and video processing method.
  • When the processor 40 executes at least one program stored in the memory 41, the following operations are implemented: acquiring a video file; separating image frame information and audio information from the video file; extracting image feature information and audio feature information from the image frame information and the audio information, respectively; fusing the image feature information and the audio feature information into video content feature information; and determining the classification result corresponding to the video file according to the video content feature information.
  • Embodiments of the present application also provide a computer-readable storage medium. When instructions in the storage medium are executed by a processor of a device, the device is enabled to execute the audio and video processing method described in the foregoing method embodiments.
  • The audio and video processing method includes: acquiring a video file; separating image frame information and audio information from the video file; extracting image feature information and audio feature information from the image frame information and the audio information, respectively; fusing the image feature information and the audio feature information into video content feature information; and determining the classification result corresponding to the video file according to the video content feature information.
  • The technical solution of the present application, or the part of it contributing to the related technology, can essentially be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a floppy disk, a read-only memory (ROM), a random access memory (RAM), a flash memory (FLASH), a hard disk, or an optical disc, and includes several instructions for causing a computer device (which can be a robot, a personal computer, a server, or a network device, etc.) to execute the audio and video processing method described in any embodiment of the present application.
  • Each part of the present application may be implemented by hardware, software, firmware, or a combination thereof.
  • Multiple steps or methods may be implemented with software or firmware stored in a memory and executed by a suitable instruction execution device. For example, a hardware implementation may use any one of the following techniques known in the art, or a combination thereof: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit with suitable combinational logic gate circuits, a programmable gate array (PGA), and a field programmable gate array (FPGA).

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

This application discloses an audio and video processing method, apparatus, device, and medium, relating to the field of computer technology. The method includes: acquiring a video file; separating image frame information and audio information from the video file; extracting image feature information and audio feature information from the image frame information and the audio information, respectively; fusing the image feature information and the audio feature information into video content feature information; and determining a classification result corresponding to the video file according to the video content feature information.

Description

Audio and video processing method, apparatus, device, and medium
This application claims priority to Chinese patent application No. 201811293776.1, filed with the Chinese Patent Office on November 1, 2018, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of computer technology, for example, to an audio and video processing method, apparatus, device, and medium.
Background
With the rapid development of computer technology, deep learning has made great progress in many areas of image understanding; for example, it has been applied to tasks such as object classification, object detection, and object segmentation in images. By now, deep learning techniques in the field of image understanding are very mature, and they are gradually being applied to video content understanding tasks. Compared with image content understanding, however, video content understanding still has a long way to go. Within video content understanding, video classification is one of the most basic tasks, and the field of video classification has become a research hotspot for many researchers.
Video classification mainly assigns videos to different tags, that is, classifies videos into different video categories, so that video tags can be set based on the video category to which a video belongs. For example, in short-video applications, users create and upload a large number of short videos every day; the content of these short videos is diverse, and different viewers prefer different categories of short videos. Classifying short videos into videos with different tags makes it easier for users to search for the video categories they are interested in, and also makes it possible to recommend videos of interest to different users, which can increase the time viewers spend in short-video applications. At present, short videos uploaded by users are usually first classified into tags by an algorithm and then manually reviewed. However, the accuracy of short-video tag classification is limited by the performance of the algorithm; if the algorithm performs poorly, the accuracy of classifying short videos into different tags is low, a great deal of manpower is consumed in review work, and labor costs increase.
Summary
Embodiments of the present application provide an audio and video processing method, apparatus, device, and medium that combine the audio feature information in a video with the image feature information of video frames to classify the video, so as to improve the accuracy and recall of video classification and reduce the labor cost of video classification review.
In a first aspect, an embodiment of the present application provides an audio and video processing method, including: acquiring a video file; separating image frame information and audio information from the video file; extracting image feature information and audio feature information from the image frame information and the audio information, respectively; fusing the image feature information and the audio feature information into video content feature information; and determining a classification result corresponding to the video file according to the video content feature information.
In a second aspect, an embodiment of the present application further provides an audio and video processing apparatus, including: a video file acquisition module configured to acquire a video file; a video separation module configured to separate image frame information and audio information from the video file; a feature extraction module configured to extract image feature information and audio feature information from the image frame information and the audio information, respectively; a feature fusion module configured to fuse the image feature information and the audio feature information into video content feature information; and a video classification module configured to determine a classification result corresponding to the video file according to the video content feature information.
In a third aspect, an embodiment of the present application further provides a device, including a processor and a memory; at least one instruction is stored in the memory, and the instruction is executed by the processor so that the device performs the audio and video processing method described in the first aspect.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium; when instructions in the storage medium are executed by a processor of a device, the device is enabled to perform the audio and video processing method described in the first aspect.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of the steps of an embodiment of an audio and video processing method of the present application;
FIG. 2 is a schematic diagram of an audio and video processing method in another example of the present application;
FIG. 3 is a schematic structural block diagram of an audio and video processing apparatus in an embodiment of the present application;
FIG. 4 is a schematic structural block diagram of a device in an example of the present application.
Detailed Description
The related art uses three-dimensional convolution for tag classification of video content, which requires converting a two-dimensional convolutional neural network that processes a single image into a three-dimensional convolutional neural network that can process multiple images, so that a convolutional neural network originally used for image classification can be applied directly. However, three-dimensional convolution makes the network parameters very large and network training difficult; that is, there is the problem that large training parameters make network training difficult. To avoid this problem, the optical flow information of the video and the image information of the video frames are usually combined for tag classification of video content: the optical flow information of the video and the image information of the video frames are extracted separately and fused to generate new video content feature information, and tag classification of video content is performed based on the newly generated features. This makes it possible to use the motion information in the video for video tag classification and improves the recognition of actions in the video, but extracting the optical flow information of a video is very time-consuming and affects the efficiency of video classification.
To avoid these problems in classifying videos into different tags, an embodiment of the present application proposes a new audio and video processing method that combines the audio feature information in the video with the image feature information of the video frames to classify the video, improving the accuracy of classifying videos into different tags, that is, improving the accuracy and recall of video classification, and thereby reducing the labor cost of video classification review.
Referring to FIG. 1, a schematic flowchart of the steps of an embodiment of an audio and video processing method of the present application is shown, including steps 110 to 150.
In step 110, a video file is acquired.
When classifying a video, the embodiment of the present application can acquire a video file that needs to be classified, so that video tag classification can be performed according to the audio and video information contained in the video file. The audio and video information contained in the video file may include information related to video playback, such as the image frame information and audio information of the video, which is not limited in the embodiments of the present application.
It should be noted that one video file can be used to represent one video, and one video can include at least one video frame. The image frame information may refer to the image information of the video frames and may be used to display the video picture, so that the user can see the objects, people, scenes, and the like in the video. The audio information can be used to play the various sounds in the video, such as the speech in the video.
In step 120, image frame information and audio information are separated from the video file.
After acquiring the video file, the embodiment of the present application can demultiplex the video file to separate the image frame information and audio information contained in it, so that feature extraction can be performed on the image frame information and the audio information.
In step 130, image feature information and audio feature information are extracted from the image frame information and the audio information, respectively.
In the embodiment of the present application, after the image frame information and audio information of the video are separated, image feature extraction and audio feature extraction can be performed on them respectively to obtain the image feature information and audio feature information corresponding to the video. The audio feature information can be used to characterize the audio features of the video; the image feature information can be used to characterize the image features of the video frames.
In step 140, the image feature information and the audio feature information are fused into video content feature information.
After extracting the image feature information and audio feature information corresponding to the same video file, the embodiment of the present application can fuse them into video content feature information, so that video classification can subsequently be performed according to the video content feature information. The video content feature information can be used to characterize the video content features corresponding to the video file. It can be seen that the embodiments of the present application can generate the video content feature information corresponding to a video file by fusing the image feature information and audio feature information of that file.
In step 150, the classification result corresponding to the video file is determined according to the video content feature information.
After obtaining the video content feature information, the embodiment of the present application can perform video classification based on it to obtain the classification result corresponding to the video file. The classification result can be used for at least one of the following: determining the video category to which the video file belongs; and setting the video tag corresponding to the video file. The video category and the video tag may have a one-to-one correspondence; for example, the video category may be equivalent to the video tag, that is, the video tag may be set through the video category.
In summary, after separating the image frame information and audio information from an acquired video file, the embodiments of the present application can extract image feature information and audio feature information from them respectively, fuse the two into video content feature information, and then perform video classification according to the video content feature information. This improves the accuracy and recall of video classification and avoids the low classification accuracy caused in the related art by using only image feature information for video classification, thereby reducing the labor cost of video classification review; in addition, videos of interest can be recommended to users based on the classification results corresponding to video files, improving the user's experience of watching videos.
In actual processing, after acquiring the video file, an embodiment of the present application can decode the video to obtain the corresponding image frame information and audio information, and then perform feature extraction through a preset image feature extractor and a preset audio feature extractor respectively, so as to generate the video content feature information from the extracted image feature information and audio feature information. In an embodiment, extracting the image feature information and the audio feature information from the image frame information and the audio information respectively includes: extracting the image feature information corresponding to the image frame information through a pre-trained image feature extractor; and extracting the audio feature information corresponding to the audio information through a pre-trained audio feature extractor.
After separating the image frame information and audio information contained in the video file, the embodiment of the present application can input the image frame information into the pre-trained image feature extractor for image feature extraction to obtain the image feature information of the video file, and can input the audio information into the pre-trained audio feature extractor for audio feature extraction to obtain the audio feature information of the video file.
In one implementation, vectors may be used to represent the image feature information and the audio feature information, and the vector dimension of the image feature information is equal to the vector dimension of the audio feature information. For example, fusing the image feature information and the audio feature information into the video content feature information may include: representing the image feature information and the audio feature information by vectors of equal dimension; and generating, based on the image vector elements in the image feature information and the audio vector elements in the audio feature information, a video content feature matrix that serves as the video content feature information. In an embodiment, after obtaining the image feature information and the audio feature information, the image vector elements of the image feature information can be used as the matrix vector elements of the first dimension and the audio vector elements of the audio feature information as the matrix vector elements of the second dimension; a video content feature matrix is then generated from these two sets of elements and used as the video content feature information. For example, when the image feature information and the audio feature information are both 1024-dimensional vectors, fusing the image feature information (a 1024-dimensional vector) and the audio feature information (a 1024-dimensional vector) yields a 2×1024 video content feature matrix, which can be used as the fused video content feature information; the matrix vector elements it contains represent the individual video content features contained in the video content feature information.
In actual processing, the audio feature extractor can be a neural network for audio feature extraction trained with deep learning techniques, such as a convolutional neural network; for example, it can be SoundNet, a network model for extracting audio features, or another neural network model such as InceptionNet or ResNet, network structure models for image classification, which is not limited in the embodiments of the present application.
An embodiment of the present application can pre-train the audio feature extractor so that the trained extractor can later be used for audio feature extraction, and during training the image frame information of the video can be used as supervision information to optimize the network parameters of the audio feature extractor. Therefore, the audio and video processing method provided in the embodiment of the present application may further include: acquiring video data from a preset input data set; extracting image frame information to be trained and audio information to be trained from the video data; separately extracting the image feature information of the image frame information to be trained and the audio feature information of the audio information to be trained; and using the image feature information of the image frame information to be trained as training supervision information while training with the audio feature information of the audio information to be trained, to obtain the audio feature extractor. The preset input data set may refer to the video data set to be trained on, which may include a large amount of unlabeled video data; the image frame information to be trained refers to the frame information used for training, and the audio information to be trained refers to the audio used for training.
After extracting the image frame information to be trained and the audio information to be trained from the video data, their features can be extracted separately through different network models. For example, the image frame information to be trained can be input into a preset first network model for image feature extraction to obtain its image feature information, and the audio information to be trained can be input into a preset second network model for audio feature extraction to obtain its audio feature information. The first network model is set to extract image features and may be, for example, a Visual Geometry Group (VGG) network model trained on the ImageNet and Places2 data sets; the second network model is set to extract audio features and may be, for example, the SoundNet network model, which is not limited in the embodiments of the present application either.
As an example of this application, during the training of the audio feature extractor, video data can be extracted from the preset input data set, and the image frame information of the extracted video data can be input into a VGG network model pre-trained on the ImageNet and Places2 data sets so that image feature extraction is performed through the VGG model; the output of the VGG model can then be determined as the image feature information of the video data and saved. The audio information of the extracted video data can then be input into the SoundNet network model for audio feature extraction, with the image feature information of the video data used as supervision information: Kullback-Leibler (KL) divergence loss data is determined from the audio feature information output by SoundNet and the image feature information of the video data, that is, the loss data corresponding to the audio feature information and the loss data corresponding to the image feature information are determined, and the loss function value of the SoundNet model is then determined from the two, for example by taking the average of the loss data as the final loss of the SoundNet model. In other words, the image frame information of the video is used as supervision information for training to optimize the trained network parameters, and when the loss function value of the SoundNet model satisfies a preset condition, the trained SoundNet model can be used as the audio feature extractor.
It can be seen that the audio feature extractor in this example can be pre-trained on a large set of unlabeled video data, using the image frame information of the videos as supervision information to optimize the network parameters of the audio feature extractor, which improves the training efficiency of the extractor. During joint audio-video training, the image frame information can be used to optimize the network parameters of the audio feature extractor; the image frame information includes the object information and scene information in the video frames, among others. The object information can be used to characterize the objects in a video frame, and the scene information can be used to characterize the scene of a video frame; for example, in a video frame showing a child playing in a bedroom, the child belongs to the object information "person", while the bedroom belongs to the scene information.
In an embodiment, the image feature extractor can be built from any of the various networks that have achieved good results on public video classification data sets, including but not limited to: C3D, a neural network for video classification based on three-dimensional (3D) convolution; I3D, a 3D-convolution-based neural network for video classification; TSN, a video classification neural network based on optical flow and the spatial domain; and various action recognition networks based on recurrent neural networks (RNN), which is not limited in the embodiments of the present application. These networks can be pre-trained on large public data sets, for example on the video classification data sets Kinetics or Youtube-8M.
In an embodiment of the present application, the above audio and video processing method may further include: acquiring image frame information to be trained; performing training based on the image frame information to be trained to obtain a video classification network; and generating the image feature extractor based on the non-output layers of the video classification network. For example, the embodiment of the present application can extract video data from a preset video classification data set, obtain the image frame information to be trained from the extracted video data, and perform training on it according to a preset network structure, for example network training according to InceptionNet-V1, a preset network structure for image classification, to obtain the video classification network. The image feature extractor can then be generated based on the non-output layers of the video classification network, for example by removing the output layer used for classification so that the remaining layers of the video classification network serve as a video frame feature extraction network, which is determined to be the image feature extractor, so that image feature extraction can subsequently be performed through it.
作为本申请的一个示例,如图2所示,视频文件经视频解码后,可以生成对应的图像帧信息和音频信息。随后,图像帧信息可以输入到预先训练的图像特征提取器,且音频信息可以输入到预先训练的音频特征提取器,以分别通过图像特征提取器和音频特征提取器提取出图像特征信息和音频特征信息。通过融合提取出的图像特征信息和音频特征信息,可生成视频内容特征信息,进而可以依据该视频内容特征信息确定出视频文件对应的分类结果。
本申请一实施例中,上述依据所述视频内容特征信息确定所述视频文件对应的分类结果,可以包括:依据所述视频内容特征信息生成特征图信息;依据所述特征图信息和所述视频内容特征信息,生成目标特征信息;依据所述目标特征信息进行分类处理,得到所述分类结果。
One embodiment of the present application may perform attention feature extraction on the video content feature information based on a preset attention mechanism module: the video content feature information is input into the preset attention mechanism module, attention feature extraction is performed on it through that module to obtain attention feature information, and the feature map information can then be generated from the attention feature information and the video content feature information. The attention feature information can represent the attention features generated for the video content feature information based on the attention mechanism module; the feature map information can be used to represent at least one feature map corresponding to the video file.
In one embodiment, generating the feature map information according to the video content feature information may include: inputting the video content feature information into the preset attention mechanism module for attention feature extraction; and generating the feature map information based on the attention feature information output by the attention mechanism module and the video content feature information. That is, the obtained video content feature information can be taken as the input of the attention mechanism module, the attention feature information is extracted through that module, and the corresponding feature map information is then generated from the extracted attention feature information and the video content feature information, for example by taking the product of the attention feature information and the video content feature information as the feature map information.
In practice, both the feature map information and the video content feature information can be recorded as matrices, and the matrix dimensions of the feature map information equal those of the video content feature information. In one embodiment, generating the target feature information according to the feature map information and the video content feature information includes: recording the feature map information and the video content feature information as matrices of equal dimensions; and generating, based on the matrix elements in the feature map information and the matrix elements in the video content feature information, a target feature matrix serving as the target feature information.
In one embodiment of the present application, the feature map information may be generated according to the attention feature information output by the preset attention mechanism module. For example, when a 2×1024 matrix is used to record the video content feature information, the video content feature information contains 2×1024 matrix elements. In practice, the video content feature information (that is, the 2×1024 matrix elements) can be taken as the input of the attention mechanism module, as shown in FIG. 2, so that the module assigns a corresponding weight to each matrix element of the video content feature information, producing a 2×1024 attention feature matrix as the attention feature information. A 2×1024 feature map matrix can then be generated from the matrix elements of the attention feature matrix and the matrix elements of the video content features and taken as the feature map information, and an element-wise multiplication of the feature map information with the video content feature information yields a 2×1024 target feature matrix, which is then taken as the final target feature information on which classification processing is performed. The attention mechanism module may consist of one convolution module; that convolution module contains at least one convolution layer, which may be a 1×1 convolution layer, and takes the video content feature information as input to generate the 2×1024 attention feature map information.
It should be noted that the main purpose of the attention mechanism module is to learn weights for the input features and then, through element-wise multiplication, to assign a different weight to each feature value. For example, after the video content feature information is input into the attention mechanism module, the module can learn the weight corresponding to each matrix element of that information, that is, generate a weight matrix corresponding to the video content feature information, where the weight matrix contains a weight for every matrix element of the video content feature information. The module can then perform an element-wise multiplication between the weight matrix and the input video content feature information, thereby performing attention feature extraction on the video content feature information based on the weight matrix to obtain the attention feature information, and finally combine the attention feature information with the video content feature information into the feature map information that it outputs, as the feature map information corresponding to the video content features.
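A hedged sketch of this attention step over the 2×1024 video content feature matrix, treated as a one-dimensional signal with two channels. The single 1×1 convolution follows the composition described above; bounding the learned weights with a sigmoid is an illustrative assumption.

    import torch
    import torch.nn as nn

    class AttentionModule(nn.Module):
        def __init__(self, channels=2):
            super().__init__()
            # 1 x 1 convolution that learns one weight per input element.
            self.conv = nn.Conv1d(channels, channels, kernel_size=1)

        def forward(self, content):  # content: (B, 2, 1024)
            weights = torch.sigmoid(self.conv(content))  # weight matrix
            # Element-wise product of weights and input -> feature map.
            return weights * content

    attn = AttentionModule()
    content = torch.randn(8, 2, 1024)        # fused video content features
    feature_map = attn(content)              # (8, 2, 1024) feature map info
    target_features = feature_map * content  # element-wise product -> target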
In practice, the attention mechanism module is composed of network layers of a neural network: for example, it may consist of a convolution layer module comprising a convolution layer, a non-linear layer and a batch normalization layer, or of a fully connected layer and a global pooling layer; the embodiments of the present application place no restriction on this. In one embodiment, the method of the embodiments of the present application may further include a step of pre-training the attention mechanism module. Illustratively, the audio-video processing method further includes: acquiring to-be-trained video content feature information; training based on the to-be-trained video content feature information and preset weight information to obtain network layers; and generating the attention mechanism module according to the network layers. The to-be-trained video content feature information refers to the video content feature information to be used for training and may include the various pieces of video content feature information acquired during the training process; the weight information may be preset according to the training requirements of the network layers and may be used during training to assign a corresponding weight to each matrix element of the video content feature information so as to generate the corresponding attention feature information.
In one embodiment, during training, the acquired video content information may be taken as the to-be-trained video content feature information, and, based on the attention mechanism and using neural network techniques, training may be performed with the to-be-trained video content feature information and the preset weight information so as to train the network layers for extracting attention features. After the network layers are trained, they can be taken as the attention mechanism module, so that in subsequent audio-video processing the attention feature information can be extracted through those layers and then fused with the video content feature information to obtain the corresponding feature map information. In one embodiment, the trained network layers may include at least one of the following: a fully connected layer, a global pooling layer, a convolution layer, a non-linear layer, a batch normalization layer and so on; the embodiments of the present application place no restriction on this.
In one embodiment of the present application, performing classification processing according to the target feature information to obtain the classification result may include: inputting the video content feature information into a preset classification network for classification processing; and taking the category scores output by the preset classification network as the classification result, where the category scores are used to determine the video category to which the video file belongs. It should be noted that the preset classification network may be a classification layer in a neural network and may be used to perform video classification according to the target feature information corresponding to the video file and to output the classification result of the video. For example, continuing the example above and as shown in FIG. 2, after the video content feature information is obtained by fusion, it can be taken as the final target feature information and input into the preset classification layer for classification processing, so that the category scores output by the classification network are obtained and taken as the classification result corresponding to the video file, from which the video category to which the video file belongs can be determined.
In one embodiment, the preset classification network may include convolution modules, each of which may be composed of a batch normalization layer, a non-linear layer and a convolution layer. For example, the classification network may consist of three 1×1 convolution modules, in which the numbers of convolution kernels are 1024, 512 and the number of classification categories respectively; each convolution module may be composed of a batch normalization layer, a non-linear layer and a 1×1 convolution layer.
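A minimal sketch of such a classification network: three 1×1 convolution modules with 1024, 512 and num_classes kernels, each made of batch normalization, a non-linearity and the 1×1 convolution. Pooling over the length dimension to obtain one score per category is an assumption added so the sketch runs end to end.

    import torch
    import torch.nn as nn

    def conv_module(in_ch, out_ch):
        # One convolution module: batch normalization, non-linear layer,
        # then a 1 x 1 convolution layer.
        return nn.Sequential(
            nn.BatchNorm1d(in_ch),
            nn.ReLU(inplace=True),
            nn.Conv1d(in_ch, out_ch, kernel_size=1),
        )

    num_classes = 50  # illustrative number of video categories
    classifier = nn.Sequential(
        conv_module(2, 1024),
        conv_module(1024, 512),
        conv_module(512, num_classes),
        nn.AdaptiveAvgPool1d(1),  # collapse the length dimension
        nn.Flatten(),             # (B, num_classes) category scores
    )

    target_features = torch.randn(8, 2, 1024)  # target feature matrices
    scores = classifier(target_features)       # category scores
    print(scores.argmax(dim=1))                # predicted video category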
In practice, a video category can be equivalent to a video tag. In the embodiments of the present application, after the video category to which a video file belongs is determined, the video can be assigned on that basis to the algorithm models for different video tags; for example, the video category of a short video file can be determined by combining its image frame information and audio information, and the short video can then be assigned to the algorithm models of different tags according to that category, which improves the accuracy of assigning short videos to different tags and thereby reduces the cost of manual review.
In summary, the embodiments of the present application perform video tag classification by fusing the attention mechanism, the image feature information and the audio feature information. Compared with the related art, which uses only image feature information for video tag classification, this can considerably improve the precision and recall of video tag classification and thereby reduce the labor cost of tag review; it also makes it convenient for users to search for videos of interest by video tag and/or category, or to recommend, according to video tags, videos of interest to different users, which can effectively improve the viewing experience, for example for users of short-video applications.
It should be noted that, for simplicity of description, the method embodiments are all expressed as combinations of series of actions; however, those skilled in the art should appreciate that the embodiments of the present application are not limited by the described order of actions, because according to the embodiments of the present application, some steps may be performed in other orders or simultaneously.
Referring to FIG. 3, a structural block diagram of an audio-video processing apparatus according to an embodiment of the present application is shown. The apparatus may include a video file acquisition module 310, a video separation module 320, a feature extraction module 330, a feature fusion module 340 and a video classification module 350.
The video file acquisition module 310 is configured to acquire a video file.
The video separation module 320 is configured to separate image frame information and audio information from the video file.
The feature extraction module 330 is configured to extract image feature information and audio feature information from the image frame information and the audio information respectively.
The feature fusion module 340 is configured to fuse the image feature information and the audio feature information into video content feature information.
The video classification module 350 is configured to determine, according to the video content feature information, the classification result corresponding to the video file.
In one embodiment, the feature extraction module 330 may include an image feature extraction submodule and an audio feature extraction submodule.
The image feature extraction submodule is configured to extract, through a pre-trained image feature extractor, the image feature information corresponding to the image frame information.
The audio feature extraction submodule is configured to extract, through a pre-trained audio feature extractor, the audio feature information corresponding to the audio information.
In practice, the vector dimensionality of the image feature information equals the vector dimensionality of the audio feature information. In one embodiment, the feature fusion module is configured to represent the image feature information and the audio feature information as vectors of equal dimensionality, and to generate, based on the image vector elements in the image feature information and the audio vector elements in the audio feature information, a video content feature matrix serving as the video content feature information.
In one embodiment of the present application, the audio-video processing apparatus may further include a video data acquisition module, an information extraction module, a training feature extraction module and an audio feature extractor training module.
The video data acquisition module is configured to acquire video data from a preset input data set.
The information extraction module is configured to extract to-be-trained image frame information and to-be-trained audio information from the video data.
The training feature extraction module is configured to extract the image feature information of the to-be-trained image frame information and the audio feature information of the to-be-trained audio information respectively.
The audio feature extractor training module is configured to train with the image feature information of the to-be-trained image frame information as supervision and the audio feature information of the to-be-trained audio information as input, to obtain the audio feature extractor.
The audio-video processing apparatus in one embodiment of the present application may further include an image frame information acquisition module, a classification network training module and an image feature extractor generation module.
The image frame information acquisition module is configured to acquire to-be-trained image frame information.
The classification network training module is configured to train according to the to-be-trained image frame information to obtain a video classification network.
The image feature extractor generation module is configured to generate the image feature extractor based on the non-output layers of the video classification network.
In one embodiment of the present application, the video classification module 350 may include a feature map generation submodule, a target feature generation submodule and a classification processing submodule.
The feature map generation submodule is configured to generate feature map information according to the video content feature information.
The target feature generation submodule is configured to generate target feature information according to the feature map information and the video content feature information.
The classification processing submodule is configured to perform classification processing according to the target feature information to obtain the classification result.
In one embodiment of the present application, the feature map generation submodule may be configured to input the video content feature information into a preset attention mechanism module for attention feature extraction, and to generate the feature map information based on the attention feature information output by the attention mechanism module and the video content feature information.
In practice, the matrix dimensions of the feature map information equal the matrix dimensions of the video content feature information. In one embodiment, the target feature generation submodule may be configured to record the feature map information and the video content feature information as matrices of equal dimensions, and to generate, based on the matrix elements in the feature map information and the matrix elements in the video content feature information, a target feature matrix serving as the target feature information.
In one embodiment of the present application, the classification processing submodule is configured to input the video content feature information into a preset classification network for classification processing and to take the category scores output by the preset classification network as the classification result, where the category scores are used to determine the video category to which the video file belongs.
In one embodiment, the audio-video processing apparatus may further include a video content feature acquisition module, a network layer training module and an attention mechanism module generation module.
The video content feature acquisition module is configured to acquire to-be-trained video content feature information.
The network layer training module is configured to train based on the to-be-trained video content feature information and preset weight information to obtain network layers.
The attention mechanism module generation module is configured to generate the attention mechanism module according to the network layers.
In one embodiment of the present application, the network layers include at least one of the following: a fully connected layer, a global pooling layer, a convolution layer, a non-linear layer and a batch normalization layer. The image frame information includes object information and scene information in the video frames.
In one embodiment, the preset classification network includes a convolution module composed of a batch normalization layer, a non-linear layer and a convolution layer.
It should be noted that the audio-video processing apparatus provided by the embodiments of the present application can execute the audio-video processing method provided by any embodiment of the present application.
In one embodiment, the audio-video processing apparatus may be integrated in a device. The device may be composed of at least two physical entities or of a single physical entity; for example, the device may be a personal computer (Personal Computer, PC), a mobile phone, a tablet device, a personal digital assistant, a server, a messaging device, a game console or the like.
An embodiment of the present application further provides a device, including a processor and a memory. The memory stores at least one instruction, and the instruction is executed by the processor, causing the device to execute the audio-video processing method described in the method embodiments above.
Referring to FIG. 4, a structural block diagram of a device in an example of the present application is shown. As shown in FIG. 4, the device may include a processor 40, a memory 41, a display screen 42 with a touch function, an input apparatus 43, an output apparatus 44 and a communication apparatus 45. The number of processors 40 in the device may be at least one, and one processor 40 is taken as an example in FIG. 4; the number of memories 41 in the device may likewise be at least one, and one memory 41 is taken as an example in FIG. 4. The processor 40, memory 41, display screen 42, input apparatus 43, output apparatus 44 and communication apparatus 45 of the device may be connected by a bus or in other ways; connection by a bus is taken as the example in FIG. 4.
As a computer-readable storage medium, the memory 41 is configured to store software programs, computer-executable programs and modules, such as the program instructions/modules corresponding to the audio-video processing method described in any embodiment of the present application (for example, the video file acquisition module 310, video separation module 320, feature extraction module 330, feature fusion module 340 and video classification module 350 of the audio-video processing apparatus described above). The memory 41 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required by at least one function, and the data storage area may store data created according to the use of the device. In addition, the memory 41 may include a high-speed random access memory and may also include a non-volatile memory, such as at least one magnetic disk storage device, flash memory device or other non-volatile solid-state storage device. In some examples, the memory 41 includes memories remotely disposed relative to the processor 40, and these remote memories may be connected to the device through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks and combinations thereof.
The display screen 42 is a display screen with a touch function and may be a capacitive screen, an electromagnetic screen or an infrared screen. In general, the display screen 42 is configured to display data according to instructions from the processor 40, and is also configured to receive touch operations acting on the display screen 42 and send the corresponding signals to the processor 40 or other apparatuses. In one embodiment, when the display screen 42 is an infrared screen, it further includes an infrared touch frame disposed around the display screen 42, which is configured to receive infrared signals and send them to the processor 40 or other devices.
The communication apparatus 45 is configured to establish communication connections with other devices and may be at least one of a wired communication apparatus and a wireless communication apparatus.
The input apparatus 43 is configured to receive input digital or character information and to generate key signal input related to the user settings and function control of the device; it may be a camera for acquiring images and a sound pickup device for acquiring audio data. The output apparatus 44 may include audio devices such as a speaker. It should be noted that the composition of the input apparatus 43 and the output apparatus 44 may be set according to actual conditions.
By running the software programs, instructions and modules stored in the memory 41, the processor 40 executes the various functional applications and data processing of the device, that is, implements the audio-video processing method described above.
In one embodiment, when executing at least one program stored in the memory 41, the processor 40 implements the following operations: acquiring a video file; separating image frame information and audio information from the video file; extracting image feature information and audio feature information from the image frame information and the audio information respectively; fusing the image feature information and the audio feature information into video content feature information; and determining, according to the video content feature information, the classification result corresponding to the video file.
An embodiment of the present application further provides a computer-readable storage medium. When the instructions in the storage medium are executed by a processor of a device, the device is enabled to execute the audio-video processing method described in the method embodiments above. Illustratively, the audio-video processing method includes: acquiring a video file; separating image frame information and audio information from the video file; extracting image feature information and audio feature information from the image frame information and the audio information respectively; fusing the image feature information and the audio feature information into video content feature information; and determining, according to the video content feature information, the classification result corresponding to the video file.
It should be noted that, since the apparatus, device and storage medium embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the description of the method embodiments.
From the above description of the implementations, those skilled in the art can clearly understand that the present application may be implemented by means of software plus the necessary general-purpose hardware, or of course by hardware alone, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the related art, may be embodied in the form of a software product. Such a computer software product may be stored in a computer-readable storage medium, such as a floppy disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a flash memory (FLASH), a hard disk or an optical disc, and includes several instructions for causing a computer device (which may be a robot, a personal computer, a server, a network device or the like) to execute the audio-video processing method described in any embodiment of the present application.
It is worth noting that, in the audio-video processing apparatus described above, the included units and modules are divided merely according to functional logic, but the division is not limited to the above as long as the corresponding functions can be implemented; in addition, the names of the functional units are only for ease of mutual distinction and are not intended to limit the protection scope of the present application.
It should be understood that the parts of the present application may be implemented in hardware, software, firmware or a combination thereof. In the above implementations, multiple steps or methods may be implemented by software or firmware stored in a memory and executed by a suitable instruction execution apparatus. For example, if implemented in hardware, as in another implementation, any one or a combination of the following techniques known in the art may be used: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gate circuits, a programmable gate array (Programmable Gate Array, PGA), a field programmable gate array (Field Programmable Gate Array, FPGA), and so on.
In the description of this specification, references to the terms "one embodiment", "some embodiments", "an example", "a specific example" or "some examples" mean that the specific features, structures, materials or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the present application. In this specification, schematic expressions of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in a suitable manner in any at least one embodiment or example.

Claims (15)

  1. An audio-video processing method, comprising:
    acquiring a video file;
    separating image frame information and audio information from the video file;
    extracting image feature information and audio feature information from the image frame information and the audio information respectively;
    fusing the image feature information and the audio feature information into video content feature information; and
    determining, according to the video content feature information, a classification result corresponding to the video file.
  2. The method according to claim 1, wherein the extracting image feature information and audio feature information from the image frame information and the audio information respectively comprises:
    extracting, through a pre-trained image feature extractor, the image feature information corresponding to the image frame information; and
    extracting, through a pre-trained audio feature extractor, the audio feature information corresponding to the audio information.
  3. The method according to claim 1, wherein the fusing the image feature information and the audio feature information into video content feature information comprises:
    representing the image feature information and the audio feature information as vectors, wherein the vector dimensionality of the image feature information equals the vector dimensionality of the audio feature information; and
    generating, based on image vector elements in the image feature information and audio vector elements in the audio feature information, a video content feature matrix serving as the video content feature information.
  4. The method according to claim 2, further comprising:
    acquiring video data from a preset input data set;
    extracting to-be-trained image frame information and to-be-trained audio information from the video data;
    extracting image feature information of the to-be-trained image frame information and audio feature information of the to-be-trained audio information respectively; and
    training with the image feature information of the to-be-trained image frame information as supervision and the audio feature information of the to-be-trained audio information as input, to obtain the audio feature extractor.
  5. The method according to claim 2, further comprising:
    acquiring to-be-trained image frame information;
    training according to the to-be-trained image frame information to obtain a video classification network; and
    generating the image feature extractor based on non-output layers of the video classification network.
  6. The method according to any one of claims 1 to 5, wherein the determining, according to the video content feature information, the classification result corresponding to the video file comprises:
    generating feature map information according to the video content feature information;
    generating target feature information according to the feature map information and the video content feature information; and
    performing classification processing according to the target feature information to obtain the classification result.
  7. The method according to claim 6, wherein the generating feature map information according to the video content feature information comprises:
    inputting the video content feature information into a preset attention mechanism module for attention feature extraction; and
    generating the feature map information based on attention feature information output by the attention mechanism module and the video content feature information.
  8. The method according to claim 6, wherein the generating target feature information according to the feature map information and the video content feature information comprises:
    recording the feature map information and the video content feature information as matrices, wherein the matrix dimensions of the feature map information equal the matrix dimensions of the video content feature information; and
    generating, based on matrix elements in the feature map information and matrix elements in the video content feature information, a target feature matrix serving as the target feature information.
  9. The method according to claim 6, wherein the performing classification processing according to the target feature information to obtain the classification result comprises:
    inputting the video content feature information into a preset classification network for classification processing; and
    taking category scores output by the preset classification network as the classification result, wherein the category scores are used to determine a video category to which the video file belongs.
  10. The method according to claim 7, further comprising:
    acquiring to-be-trained video content feature information;
    training based on the to-be-trained video content feature information and preset weight information to obtain network layers; and
    generating the attention mechanism module according to the network layers.
  11. The method according to claim 10, wherein the network layers comprise at least one of the following: a fully connected layer, a global pooling layer, a convolution layer, a non-linear layer and a batch normalization layer; and
    the image frame information comprises object information and scene information in video frames.
  12. The method according to claim 9, wherein the preset classification network comprises a convolution module composed of a batch normalization layer, a non-linear layer and a convolution layer.
  13. An audio-video processing apparatus, comprising:
    a video file acquisition module, configured to acquire a video file;
    a video separation module, configured to separate image frame information and audio information from the video file;
    a feature extraction module, configured to extract image feature information and audio feature information from the image frame information and the audio information respectively;
    a feature fusion module, configured to fuse the image feature information and the audio feature information into video content feature information; and
    a video classification module, configured to determine, according to the video content feature information, a classification result corresponding to the video file.
  14. A device, comprising a processor and a memory,
    wherein the memory stores at least one instruction, and the instruction is executed by the processor, causing the device to execute the audio-video processing method according to any one of claims 1 to 12.
  15. A computer-readable storage medium, wherein, when instructions in the storage medium are executed by a processor of a device, the device is enabled to execute the audio-video processing method according to any one of claims 1 to 12.
PCT/CN2019/110735 2018-11-01 2019-10-12 Audio and video processing method, apparatus, device and medium WO2020088216A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811293776.1A CN109257622A (zh) 2018-11-01 2018-11-01 Audio and video processing method, apparatus, device and medium
CN201811293776.1 2018-11-01

Publications (1)

Publication Number Publication Date
WO2020088216A1 true WO2020088216A1 (zh) 2020-05-07

Family

ID=65044588

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/110735 WO2020088216A1 (zh) 2018-11-01 2019-10-12 Audio and video processing method, apparatus, device and medium

Country Status (2)

Country Link
CN (1) CN109257622A (zh)
WO (1) WO2020088216A1 (zh)

Also Published As

Publication number Publication date
CN109257622A (zh) 2019-01-22


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 19880013
    Country of ref document: EP
    Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established
    Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 06/09/2021)
122 Ep: pct application non-entry in european phase
    Ref document number: 19880013
    Country of ref document: EP
    Kind code of ref document: A1