CN113032627A - Video classification method and device, storage medium and terminal equipment - Google Patents

Video classification method and device, storage medium and terminal equipment

Info

Publication number
CN113032627A
Authority
CN
China
Prior art keywords
feature vector
target video
video
sample
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110321242.0A
Other languages
Chinese (zh)
Inventor
王栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd and Beijing Xiaomi Pinecone Electronic Co Ltd
Priority to CN202110321242.0A
Publication of CN113032627A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval of video data
    • G06F16/75 - Clustering; Classification
    • G06F16/78 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 - Retrieval using metadata automatically derived from the content
    • G06F16/7834 - Retrieval using audio features
    • G06F16/7844 - Retrieval using original textual content or text extracted from visual content or transcript of audio data
    • G06F16/7847 - Retrieval using low-level visual features of the video content
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 - Classification techniques based on distances to training or reference patterns
    • G06F18/24133 - Distances to prototypes
    • G06F18/24137 - Distances to cluster centroïds
    • G06F18/2414 - Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure relates to a video classification method, a video classification apparatus, a storage medium and a terminal device. The method includes: acquiring a target video through a terminal; acquiring an image feature vector, an audio feature vector and a text feature vector corresponding to the target video; splicing the image feature vector, the audio feature vector and the text feature vector according to a first preset splicing sequence to obtain a first feature vector corresponding to the target video; fusing the image feature vector, the audio feature vector and the text feature vector through a pre-trained feature fusion model to obtain a second feature vector corresponding to the target video; splicing the first feature vector and the second feature vector according to a second preset splicing sequence to obtain a third feature vector corresponding to the target video; and determining a category corresponding to the target video according to the third feature vector and a pre-trained video classification model. In this way, loss of information in the target video can be avoided, and the accuracy of video classification is improved.

Description

Video classification method and device, storage medium and terminal equipment
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a video classification method and apparatus, a storage medium, and a terminal device.
Background
With the rapid development of network multimedia technology, multimedia information of all kinds is constantly emerging, and more and more users are accustomed to watching videos online. To help users select the content they want to watch from a large number of videos, videos are generally classified. Video classification therefore plays an important role in video management and interest-based recommendation, and classification results are widely used in fields such as surveillance, retrieval and human-computer interaction.
In the related art, the image features and audio features of a video may be obtained and input into a Recurrent Neural Network (RNN), and the output of the RNN may be fed into a Logistic Regression (LR) model to obtain the category of the video. However, this approach classifies videos only according to separate image features and audio features and cannot extract deeper, more expressive features, which results in low classification accuracy.
Disclosure of Invention
In order to overcome the problems in the related art, the present disclosure provides a video classification method, apparatus, storage medium, and terminal device.
According to a first aspect of the embodiments of the present disclosure, there is provided a video classification method, including: acquiring a target video through a terminal; acquiring an image feature vector, an audio feature vector and a text feature vector corresponding to the target video; splicing the image feature vector, the audio feature vector and the text feature vector according to a first preset splicing sequence to obtain a first feature vector corresponding to the target video; fusing the image feature vector, the audio feature vector and the text feature vector through a pre-trained feature fusion model to obtain a second feature vector corresponding to the target video; splicing the first feature vector and the second feature vector according to a second preset splicing sequence to obtain a third feature vector corresponding to the target video; and determining a category corresponding to the target video according to the third feature vector and a pre-trained video classification model.
Optionally, the obtaining of the image feature vector, the audio feature vector and the text feature vector corresponding to the target video includes: determining a preset frame extraction interval corresponding to the target video according to the playing duration corresponding to the target video; extracting a plurality of target images and a plurality of target audios corresponding to the target video from the target video according to the preset frame extraction interval; acquiring the image feature vector corresponding to the target video according to the plurality of target images; acquiring the audio feature vector corresponding to the target video according to the plurality of target audios; and generating the text feature vector corresponding to the target video according to text description information corresponding to the target video.
Optionally, the obtaining, according to the plurality of target images, of the image feature vector corresponding to the target video includes: inputting the plurality of target images into a pre-trained image feature acquisition model to obtain a plurality of local image feature vectors corresponding to the target video; and inputting the plurality of local image feature vectors into a pre-trained feature aggregation model to obtain the image feature vector corresponding to the target video. The obtaining, according to the plurality of target audios, of the audio feature vector corresponding to the target video includes: inputting the plurality of target audios into a pre-trained audio feature acquisition model to obtain a plurality of local audio feature vectors corresponding to the target video; and inputting the plurality of local audio feature vectors into the feature aggregation model to obtain the audio feature vector corresponding to the target video.
Optionally, the determining, according to the third feature vector and a pre-trained video classification model, a category corresponding to the target video includes: and taking the third feature vector as the input of the video classification model to obtain the category corresponding to the target video.
Optionally, the determining, according to the third feature vector and a pre-trained video classification model, a category corresponding to the target video includes: taking the third feature vector as the input of the video classification model to obtain the probability of each preset category corresponding to the target video; and taking the preset category with the highest probability as the category corresponding to the target video and outputting the category.
Optionally, the video classification model is trained by: acquiring a plurality of sample videos; for each sample video in the plurality of sample videos, obtaining a sample image feature vector, a sample audio feature vector and a sample text feature vector corresponding to the sample video; splicing the sample image feature vector, the sample audio feature vector and the sample text feature vector according to the first preset splicing sequence to obtain a first sample feature vector corresponding to the sample video; fusing the sample image feature vector, the sample audio feature vector and the sample text feature vector through a pre-trained feature fusion model to obtain a second sample feature vector corresponding to the sample video; splicing the first sample feature vector and the second sample feature vector according to the second preset splicing sequence to obtain a third sample feature vector corresponding to the sample video; and training a target neural network model according to third sample feature vectors corresponding to the plurality of sample videos to obtain the video classification model.
According to a second aspect of the embodiments of the present disclosure, there is provided a video classification apparatus including: the video acquisition module is configured to acquire a target video through a terminal; the feature vector acquisition module is configured to acquire an image feature vector, an audio feature vector and a text feature vector corresponding to the target video; the first feature vector splicing module is configured to splice the image feature vector, the audio feature vector and the text feature vector according to a first preset splicing sequence to obtain a first feature vector corresponding to the target video; the feature vector fusion module is configured to fuse the image feature vector, the audio feature vector and the text feature vector through a pre-trained feature fusion model to obtain a second feature vector corresponding to the target video; the second feature vector splicing module is configured to splice the first feature vector and the second feature vector according to a second preset splicing sequence to obtain a third feature vector corresponding to the target video; and the category determining module is configured to determine a category corresponding to the target video according to the third feature vector and a pre-trained video classification model.
Optionally, the feature vector obtaining module includes: the interval acquisition submodule is configured to determine a preset frame extraction interval corresponding to the target video according to the playing duration corresponding to the target video; the extraction sub-module is configured to extract a plurality of target images and a plurality of target audios corresponding to the target video from the target video according to the preset frame extraction interval; the image feature vector acquisition sub-module is configured to acquire an image feature vector corresponding to the target video according to the plurality of target images; the audio characteristic vector acquisition sub-module is configured to acquire an audio characteristic vector corresponding to the target video according to the plurality of target audios; and the text characteristic vector acquisition submodule is configured to generate a text characteristic vector corresponding to the target video according to the text description information corresponding to the target video.
Optionally, the image feature vector obtaining sub-module is further configured to: inputting a plurality of target images into a pre-trained image feature acquisition model to obtain a plurality of local image feature vectors corresponding to the target video; inputting a plurality of local image feature vectors into a pre-trained feature aggregation model to obtain the image feature vectors corresponding to the target video; the audio feature vector acquisition sub-module is further configured to: inputting a plurality of target audios into a pre-trained audio feature acquisition model to obtain a plurality of local audio feature vectors corresponding to the target videos; and inputting a plurality of local audio feature vectors into the feature aggregation model to obtain the audio feature vectors corresponding to the target video.
Optionally, the category determining module further includes: and the first class determination submodule is configured to take the third feature vector as the input of the video classification model to obtain a class corresponding to the target video.
Optionally, the category determination module includes: a probability obtaining submodule configured to take the third feature vector as an input of the video classification model to obtain a probability of each preset category corresponding to the target video; and the second category determination submodule is configured to take the preset category with the highest probability as the category corresponding to the target video and output the category.
Optionally, the video classification model is trained by: acquiring a plurality of sample videos; for each sample video in the plurality of sample videos, obtaining a sample image feature vector, a sample audio feature vector and a sample text feature vector corresponding to the sample video; splicing the sample image feature vector, the sample audio feature vector and the sample text feature vector according to the first preset splicing sequence to obtain a first sample feature vector corresponding to the sample video; fusing the sample image feature vector, the sample audio feature vector and the sample text feature vector through a pre-trained feature fusion model to obtain a second sample feature vector corresponding to the sample video; splicing the first sample feature vector and the second sample feature vector according to the second preset splicing sequence to obtain a third sample feature vector corresponding to the sample video; and training a target neural network model according to third sample feature vectors corresponding to the plurality of sample videos to obtain the video classification model.
According to a third aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the video classification method provided by the first aspect of the present disclosure.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a terminal device, including: a memory having a computer program stored thereon; a processor for executing the computer program in the memory to implement the steps of the video classification method mentioned in the first aspect of the present disclosure.
The technical solution provided by the embodiments of the present disclosure may have the following beneficial effects: a target video is acquired through a terminal; an image feature vector, an audio feature vector and a text feature vector corresponding to the target video are acquired; the image feature vector, the audio feature vector and the text feature vector are spliced according to a first preset splicing sequence to obtain a first feature vector corresponding to the target video; the image feature vector, the audio feature vector and the text feature vector are fused through a pre-trained feature fusion model to obtain a second feature vector corresponding to the target video; the first feature vector and the second feature vector are spliced according to a second preset splicing sequence to obtain a third feature vector corresponding to the target video; and the category corresponding to the target video is determined according to the third feature vector and a pre-trained video classification model. That is to say, compared with the first feature vector or the second feature vector alone, the third feature vector contains both the original features of the target video and a deeper fusion feature vector obtained by fully interacting the image feature vector, the audio feature vector and the text feature vector of the target video, so loss of information in the target video can be avoided and the accuracy of video classification is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a flow diagram illustrating a method of video classification in accordance with an exemplary embodiment;
FIG. 2 is a flow diagram illustrating another method of video classification according to an exemplary embodiment;
FIG. 3 is a schematic diagram illustrating the structure of a NeXtVLAD model in accordance with an exemplary embodiment;
FIG. 4 is a schematic diagram illustrating a neural network model in accordance with an exemplary embodiment;
FIG. 5 is a schematic diagram illustrating the structure of a video classification apparatus according to an exemplary embodiment;
FIG. 6 is a block diagram illustrating a terminal device according to an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
First, an application scenario of the present disclosure will be explained. As more and more users shoot videos themselves, the number of videos keeps growing and their content becomes increasingly rich, so the requirements on the efficiency and accuracy of video classification are ever higher. In the related art, the image features and audio features of a video can be obtained and spliced, and the video can be classified according to the spliced features. Although this improves classification efficiency, it merely concatenates the image features and audio features of the video: the spliced features cannot reflect the interaction between the image features and the audio features and cannot express deeper features, so the accuracy of video classification is low.
In order to solve the above problem, the present disclosure provides a video classification method, an apparatus, a storage medium and a terminal device. An image feature vector, an audio feature vector and a text feature vector corresponding to a target video are spliced according to a first preset splicing order to obtain a first feature vector corresponding to the target video; the image feature vector, the audio feature vector and the text feature vector are fused to obtain a second feature vector corresponding to the target video; and the first feature vector and the second feature vector are spliced according to a second preset splicing order to obtain a third feature vector corresponding to the target video. Compared with the first feature vector or the second feature vector alone, the third feature vector is a deeper fusion feature vector in which the image feature vector, the audio feature vector and the text feature vector fully interact, so loss of information in the target video can be avoided and the accuracy of video classification is improved.
The present disclosure is described below with reference to specific examples.
FIG. 1 is a flow chart illustrating a video classification method according to an exemplary embodiment. As shown in FIG. 1, the method includes the following steps:
S101, acquiring a target video through a terminal.
The terminal may include a mobile device, a wearable device, a home appliance and the like, for example a mobile phone, a tablet computer, a notebook computer, a smart watch or a smart television, which is not limited in this disclosure.
S102, obtaining an image feature vector, an audio feature vector and a text feature vector corresponding to the target video.
It should be noted that, after the target video is obtained, an image, an audio and text description information corresponding to the target video may be obtained first, where the text description information may be a title of the target video, or information obtained from a subtitle of the target video through an OCR (Optical Character Recognition) technology, and the disclosure is not limited thereto.
In this step, after obtaining the image, the audio and the text description information corresponding to the target video, the image corresponding to the target video may be input into an EfficientNet-B7 model to obtain an image feature vector corresponding to the target video, the audio corresponding to the target video is input into a VGGish model to obtain an audio feature vector corresponding to the target video, and the text description information corresponding to the target video is input into a Bert model to obtain a text feature vector corresponding to the target video. The above-mentioned model for obtaining the image feature vector, the audio feature vector and the text feature vector corresponding to the target video is only an example, and the image feature vector, the audio feature vector and the text feature vector of the target video may also be obtained by other models of related technologies, which is not limited in this disclosure.
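A minimal sketch in Python of how three modality-specific encoders could produce the three vectors described above. The placeholder linear encoders and all dimensions are assumptions standing in for the EfficientNet-B7, VGGish and Bert models named in the text, not those models themselves.

```python
import torch
import torch.nn as nn

# Placeholder encoders standing in for EfficientNet-B7 (image), VGGish (audio)
# and Bert (text); dimensions are illustrative assumptions.
image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 2048))
audio_encoder = nn.Sequential(nn.Flatten(), nn.Linear(96 * 64, 128))
text_encoder = nn.Sequential(nn.Flatten(), nn.Linear(32 * 300, 768))

frame = torch.randn(1, 3, 224, 224)    # one decoded target image
logmel = torch.randn(1, 96, 64)        # one target audio patch (e.g. a log-mel spectrogram)
tokens = torch.randn(1, 32, 300)       # embedded text description information (e.g. the title)

image_vec = image_encoder(frame)       # image feature vector, shape (1, 2048)
audio_vec = audio_encoder(logmel)      # audio feature vector, shape (1, 128)
text_vec = text_encoder(tokens)        # text feature vector, shape (1, 768)
```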
S103, splicing the image feature vector, the audio feature vector and the text feature vector according to a first preset splicing sequence to obtain a first feature vector corresponding to the target video.
The first preset splicing order may be the order of the image feature vector, the audio feature vector and the text feature vector, or the order of the audio feature vector, the image feature vector and the text feature vector, which is not limited in this disclosure.
In this step, after the image feature vector, the audio feature vector, and the text feature vector corresponding to the target video are obtained, the image feature vector, the audio feature vector, and the text feature vector may be spliced according to the first preset splicing order through a Concat algorithm to obtain a first feature vector corresponding to the target video.
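Continuing the sketch above, the Concat step in the first preset splicing order could look as follows; the particular order is an illustrative choice.

```python
# First preset splicing order: image, then audio, then text; the same order
# must be reused whenever the video classification model is trained or applied.
first_feature_vec = torch.cat([image_vec, audio_vec, text_vec], dim=-1)  # shape (1, 2944)
```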
S104, fusing the image feature vector, the audio feature vector and the text feature vector through a pre-trained feature fusion model to obtain a second feature vector corresponding to the target video.
In this step, after obtaining the image feature vector, the audio feature vector, and the text feature vector corresponding to the target video, the image feature vector, the audio feature vector, and the text feature vector may be used as the input of the feature fusion model to obtain a second feature vector corresponding to the target video.
S105, splicing the first feature vector and the second feature vector according to a second preset splicing sequence to obtain a third feature vector corresponding to the target video.
The second preset splicing order may be the order of the first feature vector and the second feature vector, or the order of the second feature vector and the first feature vector, which is not limited in this disclosure.
In this step, after the first feature vector and the second feature vector are obtained, the first feature vector and the second feature vector may be spliced according to the second preset splicing order through a Concat algorithm, so as to obtain a third feature vector corresponding to the target video.
S106, determining the category corresponding to the target video according to the third feature vector and a pre-trained video classification model.
In this step, after obtaining the third feature vector corresponding to the target video, the third feature vector may be used as an input of the video classification model to obtain a category corresponding to the target video.
By adopting the method, the image feature vector, the audio feature vector and the text feature vector corresponding to the target video can be spliced according to a first preset splicing sequence to obtain a first feature vector corresponding to the target video; the image feature vector, the audio feature vector and the text feature vector can be fused to obtain a second feature vector corresponding to the target video; and the first feature vector and the second feature vector can be spliced according to a second preset splicing sequence to obtain a third feature vector corresponding to the target video. The third feature vector is a deeper feature vector in which the image feature vector, the audio feature vector and the text feature vector fully interact, so loss of information in the target video can be avoided and the accuracy of video classification is improved.
FIG. 2 is a flow diagram illustrating another video classification method according to an example embodiment. As shown in FIG. 2, the method includes the following steps:
S201, acquiring a target video through a terminal.
S202, determining a preset frame extraction interval corresponding to the target video according to the playing time length corresponding to the target video.
In this step, after the target video is obtained, the playing duration corresponding to the target video may be obtained, and the preset frame extraction interval corresponding to the playing duration may be determined according to a preset frame extraction interval association relationship, where the association relationship may include correspondences between different playing durations and preset frame extraction intervals. For example, a larger preset frame extraction interval, such as 2 s, may be set for a target video with a longer playing duration, and a smaller preset frame extraction interval, such as 500 ms, may be set for a target video with a shorter playing duration; the setting manner of the preset frame extraction interval is not limited by the present disclosure.
It should be noted that, considering that different types of target video differ in how their images and audio change over time, the preset frame extraction interval may include a preset image frame extraction interval and a preset audio frame extraction interval: a plurality of target images are extracted from the target video according to the preset image frame extraction interval, and a plurality of target audios are extracted from the target video according to the preset audio frame extraction interval. For example, for a landscape-type target video whose images change little while its audio changes a lot, a larger preset image frame extraction interval and a smaller preset audio frame extraction interval may be set.
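As an illustration of such an association relationship, a small helper could map the playing duration to preset image and audio frame extraction intervals; the thresholds and interval values are assumptions for the example, not values fixed by this disclosure.

```python
def preset_extraction_intervals(duration_s: float) -> dict:
    """Map the playing duration to preset image/audio frame extraction intervals.
    The thresholds and interval values are illustrative assumptions."""
    if duration_s >= 300:                                        # long video: sample sparsely
        return {"image_interval_s": 2.0, "audio_interval_s": 1.0}
    if duration_s >= 60:
        return {"image_interval_s": 1.0, "audio_interval_s": 1.0}
    return {"image_interval_s": 0.5, "audio_interval_s": 0.5}    # short video: sample densely
```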
S203, extracting a plurality of target images and a plurality of target audios corresponding to the target video from the target video according to the preset frame extraction interval.
In this step, after the preset frame extraction interval is obtained, a plurality of target images and a plurality of target audios corresponding to the target video may be extracted from the target video according to the preset frame extraction interval, for example, in a case that the preset frame extraction interval is 1s, one target image and one target audio may be collected from the target video every 1 s. In a case where the preset frame extraction interval includes a preset image frame extraction interval and a preset audio frame extraction interval, a plurality of target images may be extracted from the target video according to the preset image frame extraction interval, and a plurality of target audios may be extracted from the target video according to the preset audio frame extraction interval, for example, in a case where the preset image frame extraction interval is 2s and the preset audio frame extraction interval is 1s, one target image may be captured from the target video every 2s, and one target audio may be captured from the target video every 1 s.
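A rough sketch of extracting target images and target audios at preset intervals, assuming OpenCV for frame grabbing and an already-decoded mono waveform for the audio; error handling and decoding details are omitted.

```python
import cv2
import numpy as np

def sample_frames(video_path: str, interval_s: float):
    """Grab one frame every interval_s seconds with OpenCV (sketch only)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0     # fall back if the container reports no FPS
    step = max(int(round(fps * interval_s)), 1)
    frames, idx = [], 0
    ok, frame = cap.read()
    while ok:
        if idx % step == 0:
            frames.append(frame)
        ok, frame = cap.read()
        idx += 1
    cap.release()
    return frames

def sample_audio_clips(waveform: np.ndarray, sample_rate: int,
                       interval_s: float, clip_s: float = 1.0):
    """Cut a short clip every interval_s seconds from a decoded mono waveform."""
    hop, length = int(sample_rate * interval_s), int(sample_rate * clip_s)
    return [waveform[start:start + length]
            for start in range(0, len(waveform) - length + 1, hop)]
```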
S204, acquiring an image feature vector corresponding to the target video according to the plurality of target images.
In this step, after a plurality of target images corresponding to the target video are acquired, the plurality of target images may be input into a pre-trained image feature acquisition model to obtain a plurality of local image feature vectors corresponding to the target video, and the plurality of local image feature vectors represent original image features of the target video. The image feature acquisition model may be a model trained based on the EfficientNet-B7 model, or a model trained based on other related technologies, which is not limited in this disclosure.
Further, after obtaining a plurality of local image feature vectors corresponding to the target video, the plurality of local image feature vectors may be input into a pre-trained feature aggregation model to obtain the image feature vector corresponding to the target video. The feature aggregation model may be a model trained based on a NeXtVLAD model, or a model trained based on models of other related technologies, which is not limited in this disclosure. FIG. 3 is a schematic structural diagram of a NeXtVLAD model according to an exemplary embodiment; as shown in FIG. 3, x is a local feature vector input into the NeXtVLAD model, and the global feature is the global feature vector output by the NeXtVLAD model. For example, after the local image feature vectors corresponding to the target video are input into the NeXtVLAD model, the image feature vector corresponding to the target video may be obtained.
It should be noted that the training mode of the image feature obtaining model and the feature aggregation model may refer to a model training mode of the related art, and is not described herein again.
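For illustration, a greatly simplified VLAD-style aggregation module is sketched below as a stand-in for the NeXtVLAD feature aggregation model; the cluster count, dimensions and normalization choices are assumptions and do not reproduce NeXtVLAD itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleVLAD(nn.Module):
    """Simplified NetVLAD-style aggregation used only as a stand-in for the
    NeXtVLAD feature aggregation model: N local feature vectors are soft-assigned
    to K clusters and their residuals are pooled into one global descriptor."""
    def __init__(self, dim: int, clusters: int = 8):
        super().__init__()
        self.assign = nn.Linear(dim, clusters)           # soft-assignment logits
        self.centroids = nn.Parameter(torch.randn(clusters, dim))

    def forward(self, x):                                # x: (N, dim) local features
        a = F.softmax(self.assign(x), dim=-1)            # (N, K) soft assignments
        residuals = x.unsqueeze(1) - self.centroids      # (N, K, dim)
        vlad = (a.unsqueeze(-1) * residuals).sum(0)      # (K, dim) pooled residuals
        vlad = F.normalize(vlad, dim=-1).flatten()       # (K * dim,) global descriptor
        return F.normalize(vlad, dim=0)

local_image_feats = torch.randn(20, 2048)                # 20 frames x 2048-d local vectors
image_feature_vector = SimpleVLAD(2048)(local_image_feats)   # video-level image feature vector
```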
S205, acquiring an audio feature vector corresponding to the target video according to the plurality of target audios.
In this step, after a plurality of target audios corresponding to the target video are collected, the plurality of target audios may be input into a pre-trained audio feature acquisition model to obtain a plurality of local audio feature vectors corresponding to the target video, the plurality of local audio feature vectors represent original audio features of the target video, and the plurality of local audio feature vectors are input into the feature aggregation model to obtain the audio feature vector corresponding to the target video. For example, the audio feature obtaining model may be trained based on a VGGish model, or may be trained based on models of other related technologies, which is not limited in this disclosure, and the training mode of the audio feature obtaining model may refer to a model training method of the related technology, which is not described herein again.
S206, generating a text feature vector corresponding to the target video according to the text description information corresponding to the target video.
The text description information may be a title of the target video, or may be information obtained from a subtitle of the target video through an OCR technology, which is not limited in this disclosure.
In this step, the text description information corresponding to the target video may be input into a pre-trained text feature acquisition model, so as to obtain a text feature vector corresponding to the target video. The text feature obtaining model may be trained based on a Bert model, or may be trained based on models of other related technologies, which is not limited in this disclosure, and the training mode of the text feature obtaining model may refer to a model training method of the related technology, which is not described herein again.
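A sketch of obtaining the text feature vector, assuming the Hugging Face transformers implementation of Bert; the checkpoint name and the use of the [CLS] token as the sentence vector are assumptions for the example, not choices named by this disclosure.

```python
import torch
from transformers import BertTokenizer, BertModel

# Assumed checkpoint; the patent does not name one.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

title = "video title or OCR subtitle text"   # text description information of the target video
inputs = tokenizer(title, return_tensors="pt", truncation=True, max_length=64)
with torch.no_grad():
    outputs = bert(**inputs)
text_feature_vector = outputs.last_hidden_state[:, 0]   # [CLS] token used as the text feature vector
```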
S207, splicing the image feature vector, the audio feature vector and the text feature vector according to a first preset splicing sequence to obtain a first feature vector corresponding to the target video.
The first preset splicing order may be the order of the image feature vector, the audio feature vector and the text feature vector, or the order of the audio feature vector, the image feature vector and the text feature vector, which is not limited in this disclosure.
In this step, after the image feature vector, the audio feature vector and the text feature vector corresponding to the target video are obtained, the image feature vector, the audio feature vector and the text feature vector may be spliced according to the first preset splicing order by using a Concat algorithm.
It should be noted that, the image feature vector, the audio feature vector, and the text feature vector may also be spliced according to the first preset splicing order by using another splicing algorithm in the related art, so as to obtain the first feature vector corresponding to the target video, which is not limited in this disclosure.
S208, fusing the image feature vector, the audio feature vector and the text feature vector through a pre-trained feature fusion model to obtain a second feature vector corresponding to the target video.
In this step, after the image feature vector, the audio feature vector, and the text feature vector corresponding to the target video are obtained, the image feature vector, the audio feature vector, and the text feature vector may be used as the input of the feature fusion model to obtain a second feature vector corresponding to the target video. The feature fusion model may be obtained based on the MUTAN model training, or may be obtained based on model training of other related technologies, which is not limited in this disclosure, and the training mode of the feature fusion model may refer to the model training method of the related technologies, which is not described herein again.
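The sketch below shows one possible low-rank multimodal fusion in the spirit of MUTAN, followed by the second splicing step; it is an illustrative stand-in for the pre-trained feature fusion model, with all dimensions and the exact interaction form assumed.

```python
import torch
import torch.nn as nn

class SimpleFusion(nn.Module):
    """Much-simplified low-rank multimodal fusion in the spirit of MUTAN: each
    modality is projected to a shared space, the projections interact through an
    element-wise product, and the result is projected again. A stand-in for the
    pre-trained feature fusion model, not its actual architecture."""
    def __init__(self, dims=(2048, 128, 768), hidden=512, out_dim=1024):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, hidden) for d in dims)
        self.out = nn.Linear(hidden, out_dim)

    def forward(self, image_vec, audio_vec, text_vec):
        h = torch.ones(image_vec.shape[0], self.out.in_features)
        for p, v in zip(self.proj, (image_vec, audio_vec, text_vec)):
            h = h * torch.tanh(p(v))            # multiplicative interaction between modalities
        return self.out(h)                      # second ("fused") feature vector

image_vec, audio_vec, text_vec = torch.randn(1, 2048), torch.randn(1, 128), torch.randn(1, 768)
first_feature_vec = torch.cat([image_vec, audio_vec, text_vec], dim=-1)          # first preset splicing order
second_feature_vec = SimpleFusion()(image_vec, audio_vec, text_vec)
third_feature_vec = torch.cat([first_feature_vec, second_feature_vec], dim=-1)   # second preset splicing order
```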
S209, splicing the first feature vector and the second feature vector according to a second preset splicing sequence to obtain a third feature vector corresponding to the target video.
The second preset splicing order may be the order of the first feature vector and the second feature vector, or the order of the second feature vector and the first feature vector, which is not limited in this disclosure.
In this step, after the first feature vector and the second feature vector are obtained, the first feature vector and the second feature vector may be spliced according to the second preset splicing order through a Concat algorithm, so as to obtain a third feature vector corresponding to the target video.
It should be noted that, through another splicing algorithm in the related art, the first feature vector and the second feature vector may also be spliced according to the second preset splicing order to obtain a third feature vector corresponding to the target video, which is not limited in this disclosure.
S210, taking the third feature vector as the input of the video classification model to obtain the probability of each preset category corresponding to the target video.
The preset categories may include education, entertainment, news, and dining, and may also include other categories, which are not limited in this disclosure.
In this step, since the preset categories may include a plurality of categories, after the third feature vector corresponding to the target video is obtained, the third feature vector may be input into the video classification model, so as to obtain the probability of each preset category corresponding to the target video. For example, if the preset categories include education, entertainment and news, the probability a that the target video is an education category, the probability B that the target video is an entertainment category and the probability C that the target video is a news category can be obtained by inputting the third feature vector into the video classification model.
The video classification model can be obtained by training in the following way:
S1, obtaining a plurality of sample videos;
S2, acquiring a sample image feature vector, a sample audio feature vector and a sample text feature vector corresponding to each sample video in the plurality of sample videos;
S3, splicing the sample image feature vector, the sample audio feature vector and the sample text feature vector according to a first preset splicing sequence to obtain a first sample feature vector corresponding to the sample video;
S4, fusing the sample image feature vector, the sample audio feature vector and the sample text feature vector through a pre-trained feature fusion model to obtain a second sample feature vector corresponding to the sample video;
S5, splicing the first sample feature vector and the second sample feature vector according to a second preset splicing sequence to obtain a third sample feature vector corresponding to the sample video;
it should be noted that, in the method for obtaining the third sample feature vector corresponding to the sample video in the steps S2 to S5, reference may be made to the method for obtaining the third feature vector corresponding to the target video in the steps S202 to S209, and details are not repeated here.
S6, training a target neural network model according to the third sample feature vectors corresponding to the plurality of sample videos to obtain the video classification model.
After the third sample feature vectors corresponding to the multiple sample videos are obtained, the multiple sample videos may be divided into a training set and a verification set according to a preset ratio, where the preset ratio may be 8:2 or another ratio, which is not limited in this disclosure. For example, in the case that the number of sample videos is 100, 80 of the sample videos may be used as the training set and the other 20 sample videos may be used as the verification set. Then, the third sample feature vectors corresponding to the sample videos of the training set and the labels corresponding to those sample videos may be input into the target neural network model, where a label may be the category of a sample video.
Further, parameters of the target neural network model may be adjusted according to a loss function of the target neural network model, and the target neural network model may be verified through the sample videos of the verification set. For example, the third sample feature vectors corresponding to the sample videos of the verification set may be input into the target neural network model to obtain an F1 value of the target neural network model; when the F1 value reaches its maximum, the target neural network model is the optimal model, and the optimal target neural network model may be used as the video classification model.
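A compact training sketch along the lines of steps S1 to S6: an 8:2 split, a stand-in target neural network model, and selection of the checkpoint with the highest F1 value on the verification set. All sizes, the number of epochs and the use of scikit-learn's f1_score are assumptions for the example.

```python
import torch
import torch.nn as nn
from sklearn.metrics import f1_score

# Illustrative stand-ins: `features` are third sample feature vectors, `labels`
# the sample-video categories; 3968 = 2944 (first vector) + 1024 (second vector).
features, labels = torch.randn(100, 3968), torch.randint(0, 4, (100,))
split = int(0.8 * len(features))                       # 8:2 training/verification split
x_train, y_train = features[:split], labels[:split]
x_val, y_val = features[split:], labels[split:]

model = nn.Sequential(nn.Linear(3968, 512), nn.ReLU(), nn.Linear(512, 4))  # stand-in target network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

best_f1, best_state = 0.0, None
for epoch in range(20):
    model.train()
    optimizer.zero_grad()
    loss = loss_fn(model(x_train), y_train)            # adjust parameters via the loss function
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        preds = model(x_val).argmax(dim=-1)
    f1 = f1_score(y_val.numpy(), preds.numpy(), average="macro")
    if f1 > best_f1:                                   # keep the checkpoint with the highest F1 value
        best_f1, best_state = f1, {k: v.clone() for k, v in model.state_dict().items()}
```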
It should be noted that the video classification model may include a fully connected layer (FC), an SE Context Gating block and a logistic classifier, and in the process of training the video classification model, the FC, the SE Context Gating block and the logistic classifier may be trained simultaneously.
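A hedged sketch of such a classification head, with a fully connected layer, an SE-style context gating block and a per-class logistic classifier; the layer sizes and the exact gating formulation are assumptions rather than the trained model of this disclosure.

```python
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    """Sketch of a head with the three pieces named in the text: an FC layer,
    an SE-style context gating block and a logistic (per-class sigmoid) classifier.
    Layer sizes and the gating formulation are assumptions."""
    def __init__(self, in_dim=3968, hidden=1024, num_classes=4, squeeze=8):
        super().__init__()
        self.fc = nn.Linear(in_dim, hidden)
        self.gate = nn.Sequential(                      # SE-style context gating
            nn.Linear(hidden, hidden // squeeze), nn.ReLU(),
            nn.Linear(hidden // squeeze, hidden), nn.Sigmoid())
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, third_feature_vec):
        h = self.fc(third_feature_vec)
        h = h * self.gate(h)                            # re-weight channels by learned context
        return torch.sigmoid(self.classifier(h))        # probability of each preset category

probs = ClassifierHead()(torch.randn(1, 3968))
category = probs.argmax(dim=-1)                          # preset category with the highest probability
```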
S211, taking the preset category with the highest probability as the category corresponding to the target video and outputting the category.
In this step, after the probability of each preset category corresponding to the target video is obtained, the probability of each preset category may be compared, and the preset category with the highest probability is used as the category corresponding to the target video. Continuing with the example in step S210, if the probability a is 20%, the probability B is 2%, and the probability C is 78%, it may be determined that the news category corresponding to the probability C is the category corresponding to the target video.
In the above steps S210 to S211, the probability of each preset category corresponding to the target video is obtained first, and then the category of the target video is determined according to the probability.
It should be noted that multiple neural network models are used in the embodiment shown in FIG. 2. FIG. 4 is a schematic diagram of a neural network model shown according to an exemplary embodiment. As shown in FIG. 4, the image feature acquisition model is an EfficientNet-B7 model, the audio feature acquisition model is a VGGish model, the text feature acquisition model is a Bert model, the feature aggregation model is a NeXtVLAD model, the splicing manner is Concat, the feature fusion model is a MUTAN model, and the video classification model includes the FC, the SE Context Gating block and the logistic classifier.
A method for training the video classification model is disclosed above (steps S1 to S6); the other models in the embodiment shown in FIG. 4 may also be trained with reference to this method. For example, after a target sample image, a target sample audio and sample description information corresponding to a sample video are obtained, the target sample image may be input into the EfficientNet-B7 model, the target sample audio into the VGGish model, and the sample description information into the Bert model according to the flow of the embodiment shown in FIG. 4; after the category corresponding to the sample video is finally output through the logistic classifier, all of the models may be optimized through a cross entropy loss function, or only some of the models may be optimized, which is not limited by the present disclosure.
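As a sketch of the end-to-end option, a single optimizer can cover the parameters of every module and be driven by a cross entropy loss on the final category prediction; the placeholder modules and dimensions below are assumptions, not the pipeline of FIG. 4.

```python
import itertools
import torch
import torch.nn as nn

# Placeholder modules standing in for the per-modality models and the classifier head.
image_enc, audio_enc, text_enc = nn.Linear(2048, 256), nn.Linear(128, 256), nn.Linear(768, 256)
fusion_head = nn.Sequential(nn.Linear(3 * 256, 128), nn.ReLU(), nn.Linear(128, 4))

modules = [image_enc, audio_enc, text_enc, fusion_head]
optimizer = torch.optim.Adam(itertools.chain(*(m.parameters() for m in modules)), lr=1e-4)

img, aud, txt = torch.randn(1, 2048), torch.randn(1, 128), torch.randn(1, 768)
target = torch.tensor([2])                               # ground-truth category of the sample video
logits = fusion_head(torch.cat([image_enc(img), audio_enc(aud), text_enc(txt)], dim=-1))
loss = nn.CrossEntropyLoss()(logits, target)             # cross entropy on the final category output
loss.backward()                                          # gradients flow into every module
optimizer.step()
```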
It should be noted that the first preset splicing sequence and the second preset splicing sequence used when the video classification model is applied are the same as the first preset splicing sequence and the second preset splicing sequence used when the video classification model is trained. For example, if the first preset splicing sequence during training is the image feature vector, the audio feature vector and then the text feature vector, and the second preset splicing sequence is the first feature vector and then the second feature vector, the same sequences are used when the video classification model is applied. In this way, the third feature vector input into the video classification model has the same form as the third sample feature vectors used to train the video classification model, which can further improve the accuracy of video classification.
By adopting the method, the first feature vector and the second feature vector corresponding to the target video are obtained first, where the first feature vector is the original feature vector of the target video and the second feature vector is the fused feature vector of the target video, and the first feature vector and the second feature vector are then spliced according to a second preset splicing sequence to obtain a third feature vector corresponding to the target video. The third feature vector contains both the original features of the target video and the fused features of the target video, so that when the category of the target video is obtained according to the third feature vector, loss of information in the target video can be avoided and the accuracy of video classification can be improved.
FIG. 5 is a schematic structural diagram illustrating a video classification apparatus according to an exemplary embodiment. As shown in FIG. 5, the apparatus includes:
a video acquisition module 501 configured to acquire a target video through a terminal;
a feature vector obtaining module 502 configured to obtain an image feature vector, an audio feature vector, and a text feature vector corresponding to the target video;
a first feature vector splicing module 503, configured to splice the image feature vector, the audio feature vector and the text feature vector according to a first preset splicing order to obtain a first feature vector corresponding to the target video;
a feature vector fusion module 504 configured to fuse the image feature vector, the audio feature vector, and the text feature vector through a pre-trained feature fusion model to obtain a second feature vector corresponding to the target video;
a second feature vector splicing module 505, configured to splice the first feature vector and the second feature vector according to a second preset splicing order to obtain a third feature vector corresponding to the target video;
a category determining module 506 configured to determine a category corresponding to the target video according to the third feature vector and a pre-trained video classification model.
Optionally, the feature vector obtaining module 502 includes:
the interval acquisition submodule is configured to determine a preset frame extraction interval corresponding to the target video according to the playing time length corresponding to the target video;
the extraction submodule is configured to extract a plurality of target images and a plurality of target audios corresponding to the target video from the target video according to the preset frame extraction interval;
the image feature vector acquisition sub-module is configured to acquire image feature vectors corresponding to the target video according to a plurality of target images;
the audio characteristic vector acquisition submodule is configured to acquire an audio characteristic vector corresponding to the target video according to a plurality of target audios;
and the text characteristic vector acquisition submodule is configured to generate a text characteristic vector corresponding to the target video according to the text description information corresponding to the target video.
Optionally, the image feature vector obtaining sub-module is further configured to: inputting a plurality of target images into a pre-trained image feature acquisition model to obtain a plurality of local image feature vectors corresponding to the target video; inputting a plurality of local image feature vectors into a pre-trained feature aggregation model to obtain the image feature vectors corresponding to the target video; the audio feature vector acquisition sub-module is further configured to: inputting a plurality of target audios into a pre-trained audio feature acquisition model to obtain a plurality of local audio feature vectors corresponding to the target videos; and inputting a plurality of local audio feature vectors into the feature aggregation model to obtain the audio feature vector corresponding to the target video.
Optionally, the category determining module 506 further includes:
and the first class determination submodule is configured to take the third feature vector as the input of the video classification model to obtain the class corresponding to the target video.
Optionally, the category determining module 506 includes:
the probability obtaining submodule is configured to take the third feature vector as the input of the video classification model to obtain the probability of each preset category corresponding to the target video;
and the second category determination submodule is configured to take the preset category with the highest probability as the category corresponding to the target video and output the category. Optionally, the video classification model is trained by: acquiring a plurality of sample videos; for each sample video in the plurality of sample videos, obtaining a sample image feature vector, a sample audio feature vector and a sample text feature vector corresponding to the sample video; splicing the sample image feature vector, the sample audio feature vector and the sample text feature vector according to a first preset splicing sequence to obtain a first sample feature vector corresponding to the sample video; fusing the sample image feature vector, the sample audio feature vector and the sample text feature vector through a pre-trained feature fusion model to obtain a second sample feature vector corresponding to the sample video; splicing the first sample feature vector and the second sample feature vector according to a second preset splicing sequence to obtain a third sample feature vector corresponding to the sample video; and training a target neural network model according to third sample feature vectors corresponding to the plurality of sample videos to obtain the video classification model.
With the apparatus, the image feature vector, the audio feature vector and the text feature vector corresponding to the target video can be spliced according to a first preset splicing sequence to obtain a first feature vector corresponding to the target video; the image feature vector, the audio feature vector and the text feature vector can be fused to obtain a second feature vector corresponding to the target video; and the first feature vector and the second feature vector can be spliced according to a second preset splicing sequence to obtain a third feature vector corresponding to the target video. The third feature vector is a deeper feature vector in which the image feature vector, the audio feature vector and the text feature vector fully interact, so loss of information in the target video can be avoided and the accuracy of video classification is improved.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
The present disclosure also provides a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the method of video classification provided by the present disclosure.
Fig. 6 is a block diagram illustrating a terminal device 600 according to an example embodiment. For example, the terminal device 600 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, and the like.
Referring to FIG. 6, the terminal device 600 may include one or more of the following components: a processing component 602, a memory 604, a power component 606, a multimedia component 608, an audio component 610, an input/output (I/O) interface 612, a sensor component 614, and a communication component 616.
The processing component 602 generally controls overall operations of the terminal device 600, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 602 may include one or more processors 620 to execute instructions to perform all or a portion of the steps of the method of video classification described above. Further, the processing component 602 can include one or more modules that facilitate interaction between the processing component 602 and other components. For example, the processing component 602 can include a multimedia module to facilitate interaction between the multimedia component 608 and the processing component 602.
The memory 604 is configured to store various types of data to support operation at the terminal device 600. Examples of such data include instructions for any application or method operating on the terminal device 600, contact data, phonebook data, messages, pictures, videos, and the like. The memory 604 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power component 606 provides power to the various components of terminal device 600. Power components 606 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for terminal device 600.
The multimedia component 608 comprises a screen providing an output interface between the terminal device 600 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 608 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the terminal device 600 is in an operation mode, such as a photographing mode or a video mode. Each of the front camera and the rear camera may be a fixed optical lens system or have focus and optical zoom capabilities.
The audio component 610 is configured to output and/or input audio signals. For example, the audio component 610 includes a Microphone (MIC) configured to receive an external audio signal when the terminal device 600 is in an operation mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 604 or transmitted via the communication component 616. In some embodiments, audio component 610 further includes a speaker for outputting audio signals.
The I/O interface 612 provides an interface between the processing component 602 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 614 includes one or more sensors for providing various aspects of status assessment for the terminal device 600. For example, the sensor component 614 may detect an open/closed state of the terminal device 600, relative positioning of components such as a display and keypad of the terminal device 600, a change in position of the terminal device 600 or a component of the terminal device 600, presence or absence of user contact with the terminal device 600, orientation or acceleration/deceleration of the terminal device 600, and a change in temperature of the terminal device 600. The sensor assembly 614 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 614 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 614 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 616 is configured to facilitate communications between the terminal device 600 and other devices in a wired or wireless manner. The terminal device 600 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 616 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 616 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the terminal device 600 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described video classification method.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as the memory 604 comprising instructions, executable by the processor 620 of the terminal device 600 to perform the above-described method of video classification is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In another exemplary embodiment, a computer program product is also provided, which comprises a computer program executable by a programmable apparatus, the computer program having code portions for performing the above-mentioned video classification method when executed by the programmable apparatus.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method of video classification, comprising:
acquiring a target video through a terminal;
acquiring an image feature vector, an audio feature vector and a text feature vector corresponding to the target video;
splicing the image feature vector, the audio feature vector and the text feature vector according to a first preset splicing sequence to obtain a first feature vector corresponding to the target video;
fusing the image feature vector, the audio feature vector and the text feature vector through a pre-trained feature fusion model to obtain a second feature vector corresponding to the target video;
splicing the first feature vector and the second feature vector according to a second preset splicing sequence to obtain a third feature vector corresponding to the target video;
and determining the corresponding category of the target video according to the third feature vector and a pre-trained video classification model.
2. The method of claim 1, wherein the obtaining image feature vectors, audio feature vectors and text feature vectors corresponding to the target video comprises:
determining a preset frame extraction interval corresponding to the target video according to the playing duration corresponding to the target video;
extracting a plurality of target images and a plurality of target audios corresponding to the target video from the target video according to the preset frame extraction interval;
acquiring image feature vectors corresponding to the target video according to the plurality of target images;
acquiring audio feature vectors corresponding to the target video according to the plurality of target audios;
and generating a text feature vector corresponding to the target video according to the text description information corresponding to the target video.
3. The method according to claim 2, wherein the obtaining, according to the plurality of target images, image feature vectors corresponding to the target video comprises:
inputting a plurality of target images into a pre-trained image feature acquisition model to obtain a plurality of local image feature vectors corresponding to the target video;
inputting a plurality of local image feature vectors into a pre-trained feature aggregation model to obtain the image feature vectors corresponding to the target video;
the obtaining, according to the plurality of target audios, an audio feature vector corresponding to the target video includes:
inputting a plurality of target audios into a pre-trained audio feature acquisition model to obtain a plurality of local audio feature vectors corresponding to the target video;
and inputting a plurality of local audio feature vectors into the feature aggregation model to obtain the audio feature vectors corresponding to the target video.
4. The method of claim 1, wherein the determining the corresponding category of the target video according to the third feature vector and a pre-trained video classification model comprises:
and taking the third feature vector as the input of the video classification model to obtain and output the category corresponding to the target video.
5. The method according to claim 1 or 4, wherein the determining the corresponding category of the target video according to the third feature vector and a pre-trained video classification model comprises:
taking the third feature vector as the input of the video classification model to obtain the probability of each preset category corresponding to the target video;
and taking the preset category with the highest probability as the category corresponding to the target video and outputting the category.
6. The method of claim 1, wherein the video classification model is trained by:
acquiring a plurality of sample videos;
for each sample video in the plurality of sample videos, obtaining a sample image feature vector, a sample audio feature vector and a sample text feature vector corresponding to the sample video;
splicing the sample image feature vector, the sample audio feature vector and the sample text feature vector according to the first preset splicing sequence to obtain a first sample feature vector corresponding to the sample video;
fusing the sample image feature vector, the sample audio feature vector and the sample text feature vector through a pre-trained feature fusion model to obtain a second sample feature vector corresponding to the sample video;
splicing the first sample feature vector and the second sample feature vector according to the second preset splicing sequence to obtain a third sample feature vector corresponding to the sample video;
and training a target neural network model according to third sample feature vectors corresponding to the plurality of sample videos to obtain the video classification model.
7. A video classification apparatus, comprising:
the video acquisition module is configured to acquire a target video through a terminal;
the feature vector acquisition module is configured to acquire an image feature vector, an audio feature vector and a text feature vector corresponding to the target video;
the first feature vector splicing module is configured to splice the image feature vector, the audio feature vector and the text feature vector according to a first preset splicing sequence to obtain a first feature vector corresponding to the target video;
the feature vector fusion module is configured to fuse the image feature vector, the audio feature vector and the text feature vector through a pre-trained feature fusion model to obtain a second feature vector corresponding to the target video;
the second feature vector splicing module is configured to splice the first feature vector and the second feature vector according to a second preset splicing sequence to obtain a third feature vector corresponding to the target video;
and the category determining module is configured to determine a category corresponding to the target video according to the third feature vector and a pre-trained video classification model.
8. The apparatus of claim 7, wherein the category determination module comprises:
a probability obtaining submodule configured to take the third feature vector as an input of the video classification model to obtain a probability of each preset category corresponding to the target video;
and the first category determination submodule is configured to take the preset category with the highest probability as the category corresponding to the target video.
9. A computer-readable storage medium, on which computer program instructions are stored, which program instructions, when executed by a processor, carry out the steps of the method according to any one of claims 1 to 6.
10. A terminal device, comprising:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to carry out the steps of the method of any one of claims 1 to 6.
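As an illustration of the feature extraction described in claims 2 and 3 above, the sketch below samples a video at an interval derived from its playing duration and aggregates per-frame local features into a single image feature vector. The interval rule, the stand-in extractor and the mean-pooling aggregation are assumptions made for illustration; the claims only require pre-trained feature acquisition and feature aggregation models.

```python
import numpy as np

def frame_interval_seconds(duration_s: float) -> float:
    # Hypothetical rule: longer videos are sampled more sparsely.
    if duration_s <= 60:
        return 1.0
    if duration_s <= 600:
        return 5.0
    return 10.0

def extract_local_image_features(frames):
    # Stand-in for the pre-trained image feature acquisition model.
    return [np.random.rand(1024) for _ in frames]

def aggregate(local_vectors):
    # Stand-in for the pre-trained feature aggregation model (mean pooling here).
    return np.mean(np.stack(local_vectors), axis=0)

duration = 300.0                                    # playing duration in seconds
interval = frame_interval_seconds(duration)         # preset frame extraction interval
timestamps = np.arange(0.0, duration, interval)     # sampling points for frames/clips
frames = [f"frame@{t:.0f}s" for t in timestamps]    # placeholders for decoded frames
image_feature_vector = aggregate(extract_local_image_features(frames))
print(interval, image_feature_vector.shape)         # 5.0 (1024,)
```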
CN202110321242.0A 2021-03-25 2021-03-25 Video classification method and device, storage medium and terminal equipment Pending CN113032627A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110321242.0A CN113032627A (en) 2021-03-25 2021-03-25 Video classification method and device, storage medium and terminal equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110321242.0A CN113032627A (en) 2021-03-25 2021-03-25 Video classification method and device, storage medium and terminal equipment

Publications (1)

Publication Number Publication Date
CN113032627A true CN113032627A (en) 2021-06-25

Family

ID=76473801

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110321242.0A Pending CN113032627A (en) 2021-03-25 2021-03-25 Video classification method and device, storage medium and terminal equipment

Country Status (1)

Country Link
CN (1) CN113032627A (en)



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109710800A (en) * 2018-11-08 2019-05-03 北京奇艺世纪科技有限公司 Model generating method, video classification methods, device, terminal and storage medium
CN110232340A (en) * 2019-05-30 2019-09-13 北京百度网讯科技有限公司 Establish the method, apparatus of video classification model and visual classification

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113821675A (en) * 2021-06-30 2021-12-21 腾讯科技(北京)有限公司 Video identification method and device, electronic equipment and computer readable storage medium
CN113821675B (en) * 2021-06-30 2024-06-07 腾讯科技(北京)有限公司 Video identification method, device, electronic equipment and computer readable storage medium
CN113569092A (en) * 2021-07-29 2021-10-29 北京达佳互联信息技术有限公司 Video classification method and device, electronic equipment and storage medium
CN113569092B (en) * 2021-07-29 2023-09-05 北京达佳互联信息技术有限公司 Video classification method and device, electronic equipment and storage medium
CN115510193A (en) * 2022-10-10 2022-12-23 北京百度网讯科技有限公司 Query result vectorization method, query result determination method and related device
CN115510193B (en) * 2022-10-10 2024-04-16 北京百度网讯科技有限公司 Query result vectorization method, query result determination method and related devices

Similar Documents

Publication Publication Date Title
KR101910346B1 (en) Picture processing method and apparatus
CN107025419B (en) Fingerprint template inputting method and device
CN110517185B (en) Image processing method, device, electronic equipment and storage medium
CN109257645B (en) Video cover generation method and device
CN107944447B (en) Image classification method and device
EP3996379A1 (en) Video cover determining method and device, and storage medium
CN108038102B (en) Method and device for recommending expression image, terminal and storage medium
CN109934275B (en) Image processing method and device, electronic equipment and storage medium
CN113032627A (en) Video classification method and device, storage medium and terminal equipment
EP3147802B1 (en) Method and apparatus for processing information
CN106331328B (en) Information prompting method and device
CN111523346B (en) Image recognition method and device, electronic equipment and storage medium
CN110717399A (en) Face recognition method and electronic terminal equipment
CN110633715B (en) Image processing method, network training method and device and electronic equipment
CN109756783B (en) Poster generation method and device
CN106447747B (en) Image processing method and device
CN111062407B (en) Image processing method and device, electronic equipment and storage medium
CN110019965B (en) Method and device for recommending expression image, electronic equipment and storage medium
CN111507131B (en) Living body detection method and device, electronic equipment and storage medium
CN111832455A (en) Method, device, storage medium and electronic equipment for acquiring content image
CN113506324B (en) Image processing method and device, electronic equipment and storage medium
CN113506325B (en) Image processing method and device, electronic equipment and storage medium
CN113761275A (en) Video preview moving picture generation method, device and equipment and readable storage medium
CN112784700A (en) Method, device and storage medium for displaying face image
CN112149653A (en) Information processing method, information processing device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination