CN113220940B - Video classification method, device, electronic equipment and storage medium - Google Patents
Video classification method, device, electronic equipment and storage medium
- Publication number
- CN113220940B (application CN202110524503.9A)
- Authority
- CN
- China
- Prior art keywords
- video
- frame sequence
- classified
- feature vector
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/75—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/73—Querying
- G06F16/735—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7834—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7844—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7847—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Library & Information Science (AREA)
- Databases & Information Systems (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computational Linguistics (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Biomedical Technology (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Health & Medical Sciences (AREA)
- Mathematical Physics (AREA)
- Biophysics (AREA)
- Probability & Statistics with Applications (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The disclosure relates to a video classification method, apparatus, electronic device, and storage medium that address the problem of low video classification accuracy in the related art. The method comprises: acquiring a video to be classified; splitting the video to be classified to obtain text information, a video frame sequence, and an audio frame sequence of the video to be classified; obtaining a target feature vector according to the video frame sequence and the audio frame sequence, and obtaining a text feature vector according to the text information; splicing the target feature vector and the text feature vector to obtain a full connection vector; and inputting the full connection vector into a classifier to obtain the classification result of the video to be classified output by the classifier. In this way, the video classification accuracy can be improved, which in turn improves the accuracy of recommending videos to users and the efficiency of video promotion.
Description
Technical Field
The disclosure relates to the technical field of video processing, and in particular relates to a video classification method, a video classification device, electronic equipment and a storage medium.
Background
Short video platforms and video applications often recommend videos to users, for example videos that correspond to a user's interests. Videos therefore need to be classified so that content of interest can be accurately recommended to users according to the classification, thereby promoting video works such as movies and television series.
In video classification, a classification label is generally added to the video. In one related approach, the difference loss between the private-domain feature matrix and the public-domain feature matrix of the audio modality and the corresponding difference loss of the visual modality are calculated and combined into a first objective function; a second objective function is obtained from the difference between the predicted label and the real label of a video data set; the similarity loss between the public-domain features of the audio modality and those of the video modality of the video data set is taken as a third objective function; the first to third objective functions are weighted to obtain a total objective function, and the network parameters of the deep network are iterated until the objective function value converges, yielding the video classification.
Disclosure of Invention
In order to overcome the problems in the related art, the present disclosure provides a video classification method, apparatus, electronic device, and storage medium.
According to a first aspect of an embodiment of the present disclosure, there is provided a video classification method, including:
acquiring videos to be classified;
splitting the video to be classified to obtain text information, a video frame sequence and an audio frame sequence of the video to be classified;
obtaining a target feature vector according to the video frame sequence and the audio frame sequence, and obtaining a text feature vector according to the text information;
splicing the target feature vector and the text feature vector to obtain a full connection vector;
and inputting the full connection vector into a classifier to obtain a classification result of the video to be classified, which is output by the classifier.
Optionally, the obtaining the target feature vector according to the video frame sequence and the audio frame sequence includes:
extracting a Mel frequency cepstrum coefficient of the audio frame sequence, inputting the Mel frequency cepstrum coefficient into a VGGish model, and extracting local features of the Mel frequency cepstrum coefficient to obtain VGGish features;
inputting the video frame sequence into a TSN model to perform action recognition on the video frame sequence to obtain action characteristics;
splicing the VGGish characteristic and the action characteristic to obtain a characteristic to be input;
and inputting the feature to be input into a NeXtVLA model to obtain the target feature vector.
Optionally, the TSN model generates the action feature by:
according to the video frame sequence, single-frame image information and optical flow image information of the video frame sequence are determined;
extracting according to the single-frame image information and the optical flow image information at intervals to obtain a sparse sampling result;
and obtaining the action characteristic according to the sparse sampling result.
Optionally, the obtaining the text feature vector according to the text information includes:
and inputting the text information into a BERT model to extract text features of the text information so as to obtain the text feature vector.
Optionally, the classification model is trained by:
taking the seed video with the classification labels as a training sample;
freezing models except the classifier in the classification model, and training parameters of the classifier according to the training sample, wherein the models except the classifier in the classification model comprise: VGGish model, TSN model, NeXtVLA model.
According to a second aspect of embodiments of the present disclosure, there is provided a video classification apparatus, comprising:
the acquisition module is configured to acquire videos to be classified;
the splitting module is configured to split the video to be classified to obtain text information, a video frame sequence and an audio frame sequence of the video to be classified;
the execution module is configured to obtain a target feature vector according to the video frame sequence and the audio frame sequence and obtain a text feature vector according to the text information;
the splicing module is configured to splice the target feature vector and the text feature vector to obtain a full connection vector;
the determining module is configured to input the full connection vector into a classifier to obtain a classification result of the video to be classified, which is output by the classifier.
Optionally, the execution module includes:
the extraction submodule is configured to extract a Mel frequency cepstrum coefficient of the audio frame sequence, input the Mel frequency cepstrum coefficient into a VGGish model, and extract local features of the Mel frequency cepstrum coefficient to obtain VGGish features;
the identification sub-module is configured to input the video frame sequence into a TSN model so as to conduct action identification on the video frame sequence to obtain action characteristics;
the determining submodule is configured to splice the VGGish characteristic and the action characteristic to obtain a characteristic to be input;
and the input sub-module is configured to input the feature to be input into a NeXtVLA model to obtain the target feature vector.
Optionally, the TSN model generates the action feature by:
according to the video frame sequence, single-frame image information and optical flow image information of the video frame sequence are determined;
extracting according to the single-frame image information and the optical flow image information at intervals to obtain a sparse sampling result;
and obtaining the action characteristic according to the sparse sampling result.
Optionally, the execution module is configured to: and inputting the text information into a BERT model to extract text features of the text information so as to obtain the text feature vector.
Optionally, the classification model is trained by:
taking the seed video with the classification labels as a training sample;
freezing models except the classifier in the classification model, and training parameters of the classifier according to the training sample, wherein the models except the classifier in the classification model comprise: VGGish model, TSN model, NeXtVLA model.
According to a third aspect of embodiments of the present disclosure, there is provided an electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
acquiring videos to be classified;
splitting the video to be classified to obtain text information, a video frame sequence and an audio frame sequence of the video to be classified;
obtaining a target feature vector according to the video frame sequence and the audio frame sequence, and obtaining a text feature vector according to the text information;
splicing the target feature vector and the text feature vector to obtain a full connection vector;
and inputting the full connection vector into a classifier to obtain a classification result of the video to be classified, which is output by the classifier.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the video classification method provided by the first aspect of the present disclosure.
The technical scheme provided by the embodiment of the disclosure can comprise the following beneficial effects:
obtaining videos to be classified; splitting the video to be classified to obtain text information, a video frame sequence and an audio frame sequence of the video to be classified; obtaining a target feature vector according to the video frame sequence and the audio frame sequence, and obtaining a text feature vector according to the text information; splicing the target feature vector and the text feature vector to obtain a full connection vector; and inputting the full connection vector into a classifier to obtain a classification result of the video to be classified, which is output by the classifier. Therefore, the video classification accuracy can be improved, the accuracy of recommending videos to users is further improved, and the video popularization efficiency is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a flow chart illustrating a video classification method according to an exemplary embodiment.
Fig. 2 is a flow chart illustrating one implementation of step S13 in fig. 1 according to an exemplary embodiment.
Fig. 3 is a flow chart illustrating one implementation of step S132 of fig. 2, according to an exemplary embodiment.
Fig. 4 is a flow chart illustrating another video classification method according to an exemplary embodiment.
Fig. 5 is a block diagram illustrating a video classification device according to an exemplary embodiment.
Fig. 6 is a block diagram illustrating an apparatus for video classification according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
It should be noted that, in the present disclosure, the terms "S131", "S132", and the like in the specification and claims and drawings are used for distinguishing steps, and are not necessarily to be construed as performing the method steps in a particular order or sequence.
Before introducing the video classification method, apparatus, electronic device, and storage medium provided by the disclosure, an application scenario of the disclosure is first introduced. The video classification method provided by the disclosure can be applied to an electronic device, which may be, for example, a server. The server may be a single server or a server cluster, and may be communicatively connected to a terminal device in a wired or wireless manner in order to obtain the video to be classified from the terminal device. The terminal device may be, for example, a smartphone or a PC (Personal Computer). The terminal device is used by a user to shoot videos through an application program or to clip videos from TV series and movies, so as to obtain the video to be classified, and to upload the video to be classified to the server in a wired or wireless manner.
The inventors found that, in the related art, the video classification is determined only from the audio modality and the video modality, without considering text information of the video such as subtitles and dialogue lines, so the accuracy of video classification is lower, the accuracy of video recommendation is lower, and the efficiency of video promotion is reduced. Moreover, when the audio, the text, and the video are classified independently and the per-category probabilities of the classification results are then spliced to form the feature vector input into the classifier, much effective information is lost, which also lowers the video classification accuracy.
In order to solve the above technical problems, the present disclosure provides a video classification method. Fig. 1 is a flow chart illustrating a video classification method according to an exemplary embodiment, including the following steps, as shown in fig. 1.
In step S11, a video to be classified is acquired.
In step S12, the video to be classified is split, so as to obtain text information, a video frame sequence and an audio frame sequence of the video to be classified.
In step S13, a target feature vector is obtained according to the video frame sequence and the audio frame sequence, and a text feature vector is obtained according to the text information.
In step S14, the target feature vector and the text feature vector are spliced to obtain a full connection vector.
In step S15, the full connection vector is input into a classifier, and a classification result of the video to be classified output by the classifier is obtained.
In a specific implementation, the video to be classified in step S11 may be uploaded actively by the terminal device, or the server may, in response to the terminal device completing creation of a video, send the terminal device an instruction requiring upload, the video to be classified being acquired when the terminal device responds to the instruction and permits the upload.
For example, when the terminal device shares a video to be classified to a friend circle, the server responds to the successful sharing action by sending an upload instruction to the terminal device, and obtains the shared video to be classified once the terminal device responds to the upload instruction and permits the upload.
Further, after the server obtains the video to be classified, text extraction, audio extraction, and video extraction are performed on the video to be classified to complete the splitting and obtain text information, video information, and audio information. A video frame sequence is obtained from the video information, and an audio frame sequence corresponding to the video frame sequence is obtained from the audio information, the video frame sequence and the audio frame sequence having the same number of frames. The text information may be bullet-screen information and subtitle information, and the subtitle information may be title information, a cast and crew list, lyrics, dialogue, descriptive text, character introductions, place names, dates, and the like. For example, the subtitle information may be dialogue displayed below the playing interface, titles displayed on the sides of the playing interface, a voice-over displayed above the playing interface, or the text description attached when sharing the video to be classified. The speech information may include dialogue lines and commentary, for example the spoken content accompanying the video to be classified when it is shared.
Taking a 3-frame video to be classified as an example: the acquired video is split to extract the text information, a 3-frame video frame sequence, and the corresponding 3-frame audio frame sequence. The 3 video frames are input into the TSN model in chronological order to obtain the action features of the video frame sequence, each action feature being a 1×1024 vector; the 3 audio frames are input into the VGGish model in chronological order to obtain the VGGish features of the audio frames, each VGGish feature being a 1×128 vector; and the extracted text information is input into the BERT model to obtain the text feature vector corresponding to the text information, which is a 1×1024 vector.
Further, based on vector splicing, each action feature is spliced with the corresponding VGGish feature, that is, a 1×128 VGGish feature is appended to each 1×1024 action feature, giving a spliced vector of size 1×(1024+128) = 1×1152 for each action feature. The spliced vector of each action feature is recorded into a PKL file, yielding three 1×1152 vectors, from which a two-dimensional matrix of 3 rows and 1152 columns is obtained.
Further, the 3×1152 two-dimensional matrix is input into the NeXtVLA model to obtain a 1×2048 target feature vector, completing the splicing of the video frame sequence and the audio frame sequence. Based on vector splicing, the 1×1024 text feature vector is appended to the 1×2048 target feature vector to obtain a 1×3072 vector, that is, the 1×3072 full connection vector corresponding to the 3-frame video to be classified. The 1×3072 full connection vector is input into the classifier to obtain the classification result of the 3-frame video to be classified output by the classifier.
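The shape bookkeeping in the 3-frame example above can be summarized in a short sketch. It illustrates only the splicing steps; the per-model feature extraction is left out, and all arrays are zero placeholders with the dimensions stated in this example.

```python
import numpy as np

# Placeholder features with the dimensions used in the example above:
action_feats = [np.zeros((1, 1024)) for _ in range(3)]  # TSN output, one per video frame
vggish_feats = [np.zeros((1, 128)) for _ in range(3)]   # VGGish output, one per audio frame
text_feat = np.zeros((1, 1024))                          # BERT output for the text information

# Splice each action feature with its VGGish feature: 1 x (1024 + 128) = 1 x 1152
spliced = [np.concatenate([a, v], axis=1) for a, v in zip(action_feats, vggish_feats)]
matrix = np.concatenate(spliced, axis=0)                 # two-dimensional matrix, 3 rows x 1152 columns

# The 3 x 1152 matrix goes through the NeXtVLA model to give a 1 x 2048 target feature vector
target_feat = np.zeros((1, 2048))

# Full connection vector: 1 x (2048 + 1024) = 1 x 3072, which is fed to the classifier
full_connection = np.concatenate([target_feat, text_feat], axis=1)
assert full_connection.shape == (1, 3072)
```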
Compared with the related art, in which the audio, the text, and the video are classified independently and the per-category probabilities of those classification results are spliced to form the feature vector input into the classifier, the method and apparatus of the present disclosure reduce the loss of effective information.
It should be noted that the classification result of the video to be classified may be one type of video, or may be multiple types of videos. For example, the video to be classified may be of the entertainment type only, or may be of both the entertainment type and the star type.
Further, after the classification result of the video to be classified is obtained, the classified video to be classified can be recommended on other terminal equipment according to the preference of other users according to the classification result. For example, the classified video to be classified is displayed on a terminal device which plays the video of the same type.
The technical scheme is that videos to be classified are obtained; splitting the video to be classified to obtain text information, a video frame sequence and an audio frame sequence of the video to be classified; obtaining a target feature vector according to the video frame sequence and the audio frame sequence, and obtaining a text feature vector according to the text information; splicing the target feature vector and the text feature vector to obtain a full connection vector; and inputting the full connection vector into a classifier to obtain a classification result of the video to be classified, which is output by the classifier. Therefore, the video classification accuracy can be improved, the accuracy of recommending videos to users is further improved, and the video popularization efficiency is improved.
Alternatively, fig. 2 is a flow chart illustrating one implementation of step S13 in fig. 1 according to an exemplary embodiment. In step S13, the target feature vector is obtained according to the video frame sequence and the audio frame sequence, including the following steps.
In step S131, mel-frequency cepstrum coefficients of the audio frame sequence are extracted, and the mel-frequency cepstrum coefficients are input into the VGGish model to perform local feature extraction on the mel-frequency cepstrum coefficients to obtain VGGish features.
In step S132, the video frame sequence is input into the TSN model to perform motion recognition on the video frame sequence to obtain motion features.
In step S133, the VGGish feature and the action feature are spliced to obtain the feature to be input.
In step S134, the feature to be input is input into the NeXtVLA model to obtain the target feature vector.
The NeXtVLA model performs audio feature clustering on VGGish features in the features to be input to obtain audio feature vectors, performs motion feature clustering on motion features in the features to be input to obtain image feature vectors, and obtains target feature vectors according to the audio feature vectors and the image feature vectors.
In a specific implementation, the audio frame sequence is input to an encoder, and the encoder output is input to a decoder to obtain the mel-frequency cepstrum coefficients corresponding to the audio frame sequence. The mel-frequency cepstrum coefficients are then input into the VGGish model (an audio feature extractor derived from the Visual Geometry Group (VGG) network), which is obtained by training on the audio features of manually annotated seed videos; during manual annotation a classification label is added to the audio of each seed video, and a plurality of classification labels may be added to the audio of each seed video. Similarly, the TSN (Temporal Segment Networks) model is obtained by training on the video features of the manually annotated seed videos.
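As one possible realization of the audio branch described above, the sketch below computes mel-frequency cepstrum coefficients with librosa and hands them to a VGGish-style feature extractor. Only the librosa calls are standard; `vggish_model` is an assumed placeholder for any pretrained VGGish implementation with a compatible input layout.

```python
import librosa
import numpy as np

def audio_to_mfcc(wav_path: str, sr: int = 16000, n_mfcc: int = 64) -> np.ndarray:
    """Load the audio frame sequence and compute its mel-frequency cepstrum coefficients."""
    y, sr = librosa.load(wav_path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, num_audio_frames)
    return mfcc.T.astype(np.float32)                         # one row per audio frame

# Hypothetical: vggish_model maps the MFCC matrix to a 128-dim VGGish feature, e.g.
# vggish_feature = vggish_model(audio_to_mfcc("clip.wav"))   # -> shape (1, 128)
```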
It is worth noting that the classification labels added to the audio of each seed video may not be identical to those added to the video. For example, the label added to the video may be "LeBron James" while the label added to the audio may be "American professional basketball game".
By adopting the technical scheme, VGGish characteristics can be obtained based on the VGGish model, action characteristics can be obtained based on the TSN model, and audio characteristic vectors and video characteristic vectors can be obtained and spliced based on the NeXtVLA model, so that loss of effective information is reduced. And further, the accuracy of video classification can be improved.
Alternatively, fig. 3 is a flow chart illustrating one implementation of step S132 in fig. 2 according to an exemplary embodiment. In step S132, the TSN model generates the action feature by:
in step S1321, single-frame image information and optical flow image information of the video frame sequence are determined from the video frame sequence.
In step S1322, a sparse sampling result is obtained by extracting the single-frame image information and the optical flow image information at time intervals.
In step S1323, an action feature is obtained from the sparse sampling result.
In a specific implementation, RGB image extraction may be performed on the video frame sequence, so that single-frame image information and optical flow image information are extracted from the RGB-processed images. The single-frame image information and the optical flow image information may be extracted by random sampling.
For example, an RGB image and an RGB difference in a video frame sequence are extracted, where the RGB image may represent a certain frame of image in an action feature, and the RGB difference is a difference between video frame images of two adjacent frames, so as to obtain single frame image information based on the RGB image and the RGB difference.
As yet another example, the optical flow image information in the video frame sequence may be extracted by a region-based method: similar regions are localized in the RGB-processed images, and for the localized similar regions the optical flow is calculated from the displacement of those regions, resulting in the optical flow image information.
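For illustration, the sketch below derives optical flow images and RGB differences from a list of decoded frames. It uses OpenCV's dense Farneback flow as a simple stand-in for the region-based method described above, which is an assumption rather than the specific algorithm of this disclosure.

```python
import cv2

def optical_flow_images(frames):
    """Compute an optical flow image between each pair of consecutive frames (BGR arrays)."""
    flows = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        curr = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Per-pixel displacement field (dx, dy) between the two frames
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)
        prev = curr
    return flows

def rgb_differences(frames):
    """RGB difference between adjacent frames, part of the single-frame image information."""
    return [cv2.absdiff(frames[i + 1], frames[i]) for i in range(len(frames) - 1)]
```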
Specifically, the network part of the TSN model is composed of two paths of CNNs (Convolutional Neural Networks convolutional neural networks), one path of which takes single-frame image information as input, and the other path of which takes optical flow image information as input.
Further, the TSN model samples the input single-frame image information and optical flow image information at time intervals. Shared or associated feature extraction is performed on the sparsely sampled single-frame image information, and the corresponding category judgment is made to obtain single-frame feature vectors; likewise, shared or associated feature extraction is performed on the sparsely sampled optical flow image information, and the corresponding category judgment is made to obtain optical flow image feature vectors. The single-frame feature vectors and the optical flow image feature vectors are then combined, for example by weighting and averaging, to obtain the action features of the video to be classified.
For example, the TSN model samples the input single-frame image information and optical flow image information at intervals of 5 seconds, computes for each sparsely sampled single-frame feature vector and optical flow image feature vector a score for each category, and averages the scores belonging to the same category, thereby obtaining a frame image score for the single-frame image vectors and an optical flow image score for the optical flow image feature vectors.
Further, a combined score is computed from the frame image score and the optical flow image score by weighted summation. Finally, based on a softmax function, the probability of each category to which the video frame sequence may belong is computed from the combined score, the category with the highest probability is taken as the target category of the video frame sequence, and the action features of the video to be classified are obtained according to the target category.
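The segment-level fusion described in the last two paragraphs can be written down directly. This is a sketch of the fusion arithmetic only (per-stream averaging, weighted summation, softmax); the per-segment scores would come from the two CNN streams of the TSN model, and the stream weights are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def tsn_consensus(rgb_scores, flow_scores, w_rgb=1.0, w_flow=1.5):
    """Fuse per-segment category scores from the single-frame and optical-flow streams.

    rgb_scores, flow_scores: arrays of shape (num_segments, num_classes) for the
    sparsely sampled segments. The stream weights are assumed values.
    """
    rgb_avg = rgb_scores.mean(axis=0)            # average same-category scores per stream
    flow_avg = flow_scores.mean(axis=0)
    combined = w_rgb * rgb_avg + w_flow * flow_avg   # weighted summation of the two scores
    probs = softmax(combined)                    # probability of each category
    return probs, int(np.argmax(probs))          # target category = highest probability

# Example with 3 sparsely sampled segments and 5 candidate categories
probs, target_class = tsn_consensus(np.random.rand(3, 5), np.random.rand(3, 5))
```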
By adopting the above technical scheme, the sparse frame extraction performed on the video frame sequence removes redundant information and reduces the amount of computation, while the TSN model effectively limits the loss of effective information, so the accuracy of video classification can be improved.
Optionally, in step S13, the obtaining a text feature vector according to the text information includes:
and inputting the text information into a BERT model to extract text features of the text information so as to obtain the text feature vector.
Specifically, text information is input to an encoder of a BERT model, the encoder of the BERT model is used for converting the input text information into feature vectors, the feature vectors output by the encoder and the predicted results are input to a decoder of the BERT model, and the decoder of the BERT model is used for outputting conditional probabilities of the final results and converting the conditional probabilities into text feature vectors.
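A minimal sketch of the text branch using the Hugging Face transformers library is given below. The choice of the `bert-base-chinese` checkpoint and the use of the [CLS] hidden state as the text feature vector are assumptions; the worked example earlier uses a 1024-dimensional vector, which would correspond to a larger BERT variant or an extra projection layer.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
bert = AutoModel.from_pretrained("bert-base-chinese")
bert.eval()

def text_feature_vector(text_information: str) -> torch.Tensor:
    """Encode the subtitle/bullet-screen text into a single text feature vector."""
    inputs = tokenizer(text_information, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        outputs = bert(**inputs)
    # Take the [CLS] token's final hidden state as the text feature vector
    return outputs.last_hidden_state[:, 0, :]   # shape (1, 768) for the base model
```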
By adopting the above technical scheme, the BERT model fuses context from both directions in all of its layers over the training samples, which improves text feature extraction and thus further improves the accuracy of video classification.
Optionally, the method further comprises:
inputting the text information, the video frame sequence, and the audio frame sequence into a classification model, so that the target feature vector is obtained from the video frame sequence and the audio frame sequence, the text feature vector is obtained from the text information, and the full connection vector is input into the classifier to obtain the classification result of the video to be classified output by the classifier, wherein the classification model comprises the classifier.
Wherein the classifier may be a softmax regression classification model.
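A softmax regression classifier over the 3,072-dimensional full connection vector can be as small as a single linear layer. The sketch below is one such head; the number of classes is an assumed parameter, not a value given in this disclosure.

```python
import torch
from torch import nn

class SoftmaxClassifier(nn.Module):
    """Softmax regression head over the full connection vector."""
    def __init__(self, in_dim: int = 3072, num_classes: int = 20):  # num_classes is assumed
        super().__init__()
        self.fc = nn.Linear(in_dim, num_classes)

    def forward(self, full_connection_vector: torch.Tensor) -> torch.Tensor:
        logits = self.fc(full_connection_vector)
        return torch.softmax(logits, dim=-1)  # per-category probabilities
```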
Optionally, the training of the classification model includes:
taking the seed video with the classification labels as a training sample;
freezing models except the classifier in the classification model, and training parameters of the classifier according to the training sample, wherein the models except the classifier in the classification model comprise: VGGish model, TSN model, NeXtVLA model.
The classification labels of the seed videos are marked manually, and the same seed video can be provided with a plurality of classification labels according to text information, video frame sequences and audio frame sequences. The loss function of the classifier uses a cross entropy loss function.
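The training procedure — freeze everything except the classifier, then fit the classifier on labeled seed videos with a cross-entropy loss — might look like the sketch below in PyTorch. The frozen model list, the data loader of precomputed full connection vectors, and the hyperparameters are placeholders, and the `.fc` attribute is assumed to expose raw logits as in the head sketched above.

```python
import torch
from torch import nn

def train_classifier(classifier, frozen_models, seed_loader, epochs=5, lr=1e-3):
    """Train only the classifier; frozen_models is e.g. [vggish, tsn, nextvla, bert]."""
    # Freeze every model except the classifier
    for model in frozen_models:
        for p in model.parameters():
            p.requires_grad = False
        model.eval()

    criterion = nn.CrossEntropyLoss()   # cross-entropy loss of the classifier
    optimizer = torch.optim.Adam(classifier.parameters(), lr=lr)

    classifier.train()
    for _ in range(epochs):
        for full_connection_vector, label in seed_loader:
            logits = classifier.fc(full_connection_vector)  # raw logits; CrossEntropyLoss applies softmax
            loss = criterion(logits, label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return classifier
```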
Fig. 4 is a flow chart illustrating another video classification method according to an exemplary embodiment. As shown in fig. 4, the video classification method includes the steps of:
and acquiring the video to be classified, splitting the video to be classified, and obtaining corresponding text information, a video frame sequence and an audio frame sequence.
Further, inputting the text information into a pre-trained BERT model to obtain text characteristic information, wherein the pre-trained BERT model is in a frozen state in the application process. Meanwhile, the video frame sequence is input into a pre-trained TSN model to obtain action characteristics, and the audio frame sequence is input into a pre-trained VGGish model to obtain VGGish characteristics, wherein the pre-trained TSN model and the VGGish model are also in a frozen state in the application process.
Further, the obtained motion feature and VGGish feature are input into a pre-trained NeXtVLA model, the NeXtVLA model performs audio feature clustering on the VGGish feature to obtain an audio feature vector, the motion feature is clustered to obtain an image feature vector, and the audio feature vector and the image feature vector are spliced to obtain a target feature vector. The pre-trained NeXtVLA model is also in a frozen state during the application process.
Further, the text feature information and the target feature information are spliced to obtain full connection vectors, and the full connection vectors are input into a classifier, for example, a softmax regression classification model, so that a classification result of the video to be classified is obtained. And meanwhile, updating and training the softmax regression classification model by using a classification result obtained by the video to be classified. Wherein, the loss function of the classification model adopts a cross entropy loss function.
Compared with the related art, in which the text, the video, and the audio are classified separately and the video class is then determined from their per-category probabilities, this scheme reduces the information loss in processing the video to be classified. The video classification accuracy can therefore be improved, which further improves the accuracy of recommending videos to users and improves the efficiency of video promotion.
Based on the same inventive concept, a video classification apparatus 500 is further provided according to an embodiment of the present disclosure, for performing the steps of the video classification method provided in the above method embodiment, where the apparatus 500 may implement the video classification method in a manner of software, hardware, or a combination of both. Fig. 5 is a block diagram illustrating a video classification apparatus 500 according to an exemplary embodiment. As shown in fig. 5, the apparatus 500 includes: the system comprises an acquisition module 510, a splitting module 520, an execution module 530, a stitching module 540 and a determination module 550.
Wherein the obtaining module 510 is configured to obtain a video to be classified;
the splitting module 520 is configured to split the video to be classified to obtain text information, a video frame sequence and an audio frame sequence of the video to be classified;
the execution module 530 is configured to obtain a target feature vector from the video frame sequence and the audio frame sequence, and obtain a text feature vector from the text information;
the stitching module 540 is configured to stitch the target feature vector and the text feature vector to obtain a full connection vector;
the determining module 550 is configured to input the full connection vector into a classifier, and obtain a classification result of the video to be classified output by the classifier.
The device can improve the accuracy of video classification, further improve the accuracy of video recommendation to users and improve the video popularization efficiency.
Optionally, the executing module 530 includes: the device comprises an extraction sub-module, an identification sub-module, a determination sub-module and an input sub-module.
The extraction submodule is configured to extract a mel frequency cepstrum coefficient of the audio frame sequence, input the mel frequency cepstrum coefficient into a VGGish model, and extract local features of the mel frequency cepstrum coefficient to obtain VGGish features;
the identification sub-module is configured to input the video frame sequence into a TSN model so as to conduct action identification on the video frame sequence to obtain action characteristics;
the determining submodule is configured to splice the VGGish characteristic and the action characteristic to obtain a characteristic to be input;
and the input sub-module is configured to input the feature to be input into a NeXtVLA model to obtain the target feature vector.
Optionally, the TSN model generates the action feature by:
according to the video frame sequence, single-frame image information and optical flow image information of the video frame sequence are determined;
extracting according to the single-frame image information and the optical flow image information at intervals to obtain a sparse sampling result;
and obtaining the action characteristic according to the sparse sampling result.
Optionally, the execution module is configured to: and inputting the text information into a BERT model to extract text features of the text information so as to obtain the text feature vector.
Optionally, the classification model is trained by:
taking the seed video with the classification labels as a training sample;
freezing models except the classifier in the classification model, and training parameters of the classifier according to the training sample, wherein the models except the classifier in the classification model comprise: VGGish model, TSN model, NeXtVLA model.
The specific manner in which the various modules perform the operations in the apparatus of the above embodiments have been described in detail in connection with the embodiments of the method, and will not be described in detail herein.
It should be noted that, for convenience and brevity, the embodiments described in the specification are all preferred embodiments, and the parts related to the embodiments are not necessarily essential to the present invention, for example, the obtaining module 510 and the splitting module 520 may be separate devices or the same device when implemented, which is not limited by the present disclosure.
There is also provided, in accordance with an embodiment of the present disclosure, an electronic device including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
acquiring videos to be classified;
splitting the video to be classified to obtain text information, a video frame sequence and an audio frame sequence of the video to be classified;
obtaining a target feature vector according to the video frame sequence and the audio frame sequence, and obtaining a text feature vector according to the text information;
splicing the target feature vector and the text feature vector to obtain a full connection vector;
and inputting the full connection vector into a classifier to obtain a classification result of the video to be classified, which is output by the classifier.
The disclosed embodiments also provide a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the video classification method provided by the present disclosure.
Fig. 6 is a block diagram illustrating an apparatus 1900 for video classification according to an example embodiment. For example, the apparatus 1900 may be provided as a server. Referring to fig. 6, the apparatus 1900 includes a processing component 1922 that further includes one or more processors and memory resources represented by memory 1932 for storing instructions, such as application programs, that can be executed by the processing component 1922. The application programs stored in memory 1932 may include one or more modules each corresponding to a set of instructions. Further, processing component 1922 is configured to execute instructions to perform the video classification method described above.
The apparatus 1900 may further include a power component 1926 configured to perform power management of the apparatus 1900, a wired or wireless network interface 1950 configured to connect the apparatus 1900 to a network, and an input/output (I/O) interface 1958. The apparatus 1900 may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
Claims (10)
1. A method of video classification, comprising:
acquiring videos to be classified;
splitting the video to be classified to obtain text information, a video frame sequence and an audio frame sequence of the video to be classified;
obtaining a target feature vector according to the video frame sequence and the audio frame sequence, and obtaining a text feature vector according to the text information;
splicing the target feature vector and the text feature vector to obtain a full connection vector;
and inputting the full connection vector into a classifier to obtain a classification result of the video to be classified, which is output by the classifier.
2. The method of claim 1, wherein the deriving the target feature vector from the sequence of video frames and the sequence of audio frames comprises:
extracting a Mel frequency cepstrum coefficient of the audio frame sequence, inputting the Mel frequency cepstrum coefficient into a VGGish model, and extracting local features of the Mel frequency cepstrum coefficient to obtain VGGish features;
inputting the video frame sequence into a TSN model to perform action recognition on the video frame sequence to obtain action characteristics;
splicing the VGGish characteristic and the action characteristic to obtain a characteristic to be input;
and inputting the feature to be input into a NeXtVLA model to obtain the target feature vector.
3. The method of claim 2, wherein the TSN model generates the action feature by:
according to the video frame sequence, single-frame image information and optical flow image information of the video frame sequence are determined;
extracting according to the single-frame image information and the optical flow image information at intervals to obtain a sparse sampling result;
and obtaining the action characteristic according to the sparse sampling result.
4. The method of claim 1, wherein said deriving text feature vectors from said text information comprises:
and inputting the text information into a BERT model to extract text features of the text information so as to obtain the text feature vector.
5. The method according to any one of claims 2-4, wherein the classification model is trained by:
taking the seed video with the classification labels as a training sample;
freezing models except the classifier in the classification model, and training parameters of the classifier according to the training sample, wherein the models except the classifier in the classification model comprise: VGGish model, TSN model, NeXtVLA model.
6. A video classification apparatus, comprising:
the acquisition module is configured to acquire videos to be classified;
the splitting module is configured to split the video to be classified to obtain text information, a video frame sequence and an audio frame sequence of the video to be classified;
the execution module is configured to obtain a target feature vector according to the video frame sequence and the audio frame sequence and obtain a text feature vector according to the text information;
the splicing module is configured to splice the target feature vector and the text feature vector to obtain a full connection vector;
the determining module is configured to input the full connection vector into a classifier to obtain a classification result of the video to be classified, which is output by the classifier.
7. The apparatus of claim 6, wherein the execution module comprises:
the extraction submodule is configured to extract a Mel frequency cepstrum coefficient of the audio frame sequence, input the Mel frequency cepstrum coefficient into a VGGish model, and extract local features of the Mel frequency cepstrum coefficient to obtain VGGish features;
the identification sub-module is configured to input the video frame sequence into a TSN model so as to conduct action identification on the video frame sequence to obtain action characteristics;
the determining submodule is configured to splice the VGGish characteristic and the action characteristic to obtain a characteristic to be input;
and the input sub-module is configured to input the feature to be input into a NeXtVLA model to obtain the target feature vector.
8. The apparatus of claim 7, wherein the TSN model generates the action feature by:
according to the video frame sequence, single-frame image information and optical flow image information of the video frame sequence are determined;
extracting according to the single-frame image information and the optical flow image information at intervals to obtain a sparse sampling result;
and obtaining the action characteristic according to the sparse sampling result.
9. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
acquiring videos to be classified;
splitting the video to be classified to obtain text information, a video frame sequence and an audio frame sequence of the video to be classified;
obtaining a target feature vector according to the video frame sequence and the audio frame sequence, and obtaining a text feature vector according to the text information;
splicing the target feature vector and the text feature vector to obtain a full connection vector;
and inputting the full connection vector into a classifier to obtain a classification result of the video to be classified, which is output by the classifier.
10. A computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the steps of the video classification method of any of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110524503.9A CN113220940B (en) | 2021-05-13 | 2021-05-13 | Video classification method, device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110524503.9A CN113220940B (en) | 2021-05-13 | 2021-05-13 | Video classification method, device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113220940A CN113220940A (en) | 2021-08-06 |
CN113220940B true CN113220940B (en) | 2024-02-09 |
Family
ID=77095560
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110524503.9A Active CN113220940B (en) | 2021-05-13 | 2021-05-13 | Video classification method, device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113220940B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117132939A (en) * | 2023-09-11 | 2023-11-28 | 深圳科腾飞宇科技有限公司 | Object analysis method and system based on video processing |
CN118172713A (en) * | 2024-05-13 | 2024-06-11 | 腾讯科技(深圳)有限公司 | Video tag identification method, device, computer equipment and storage medium |
CN118400575B (en) * | 2024-06-24 | 2024-09-10 | 湖南快乐阳光互动娱乐传媒有限公司 | Video processing method and related device |
CN118470717B (en) * | 2024-07-09 | 2024-10-01 | 苏州元脑智能科技有限公司 | Method, device, computer program product, equipment and medium for generating annotation text |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111626251A (en) * | 2020-06-02 | 2020-09-04 | Oppo广东移动通信有限公司 | Video classification method, video classification device and electronic equipment |
WO2020221278A1 (en) * | 2019-04-29 | 2020-11-05 | 北京金山云网络技术有限公司 | Video classification method and model training method and apparatus thereof, and electronic device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10204274B2 (en) * | 2016-06-29 | 2019-02-12 | Cellular South, Inc. | Video to data |
- 2021-05-13: Application CN202110524503.9A filed in China; patent CN113220940B (status: active)
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020221278A1 (en) * | 2019-04-29 | 2020-11-05 | 北京金山云网络技术有限公司 | Video classification method and model training method and apparatus thereof, and electronic device |
CN111626251A (en) * | 2020-06-02 | 2020-09-04 | Oppo广东移动通信有限公司 | Video classification method, video classification device and electronic equipment |
Non-Patent Citations (1)
Title |
---|
Sports video classification based on genre-indicative shots and a bag-of-words model; Zhu Yingying, Zhu Yanyan, et al.; Journal of Computer-Aided Design & Computer Graphics (09); full text *
Also Published As
Publication number | Publication date |
---|---|
CN113220940A (en) | 2021-08-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113220940B (en) | Video classification method, device, electronic equipment and storage medium | |
CN109145784B (en) | Method and apparatus for processing video | |
CN109587554B (en) | Video data processing method and device and readable storage medium | |
CN110582025B (en) | Method and apparatus for processing video | |
CN109117777B (en) | Method and device for generating information | |
WO2019242222A1 (en) | Method and device for use in generating information | |
CN109871736B (en) | Method and device for generating natural language description information | |
US12001479B2 (en) | Video processing method, video searching method, terminal device, and computer-readable storage medium | |
CN111611436A (en) | Label data processing method and device and computer readable storage medium | |
CN110390033A (en) | Training method, device, electronic equipment and the storage medium of image classification model | |
US20140257995A1 (en) | Method, device, and system for playing video advertisement | |
US11302361B2 (en) | Apparatus for video searching using multi-modal criteria and method thereof | |
CN110309795A (en) | Video detecting method, device, electronic equipment and storage medium | |
CN109640112B (en) | Video processing method, device, equipment and storage medium | |
CN113766299A (en) | Video data playing method, device, equipment and medium | |
CN114095749A (en) | Recommendation and live interface display method, computer storage medium and program product | |
CN114302174A (en) | Video editing method and device, computing equipment and storage medium | |
CN116665083A (en) | Video classification method and device, electronic equipment and storage medium | |
CN110347869B (en) | Video generation method and device, electronic equipment and storage medium | |
CN114186074A (en) | Video search word recommendation method and device, electronic equipment and storage medium | |
CN113573097A (en) | Video recommendation method and device, server and storage medium | |
CN113395584A (en) | Video data processing method, device, equipment and medium | |
KR101674310B1 (en) | System and method for matching advertisement for providing advertisement associated with video contents | |
CN113472834A (en) | Object pushing method and device | |
CN116437114A (en) | Live caption transfer method, device, equipment and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |