CN113220940A - Video classification method and device, electronic equipment and storage medium - Google Patents

Video classification method and device, electronic equipment and storage medium

Info

Publication number
CN113220940A
CN113220940A (application CN202110524503.9A)
Authority
CN
China
Prior art keywords
video
frame sequence
classified
model
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110524503.9A
Other languages
Chinese (zh)
Other versions
CN113220940B (en)
Inventor
王铭喜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaomi Mobile Software Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd filed Critical Beijing Xiaomi Mobile Software Co Ltd
Priority to CN202110524503.9A priority Critical patent/CN113220940B/en
Publication of CN113220940A publication Critical patent/CN113220940A/en
Application granted granted Critical
Publication of CN113220940B publication Critical patent/CN113220940B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/75 Clustering; Classification
    • G06F 16/73 Querying
    • G06F 16/735 Filtering based on additional data, e.g. user or group profiles
    • G06F 16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783 Retrieval characterised by using metadata automatically derived from the content
    • G06F 16/7834 Retrieval using metadata automatically derived from the content, using audio features
    • G06F 16/7844 Retrieval using metadata automatically derived from the content, using original textual content or text extracted from visual content or transcript of audio data
    • G06F 16/7847 Retrieval using metadata automatically derived from the content, using low-level visual features of the video content
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure relates to a video classification method and apparatus, an electronic device and a storage medium, which address the low accuracy of video classification in the related art. The method includes: acquiring a video to be classified; splitting the video to be classified to obtain text information, a video frame sequence and an audio frame sequence of the video to be classified; obtaining a target feature vector from the video frame sequence and the audio frame sequence, and obtaining a text feature vector from the text information; splicing the target feature vector and the text feature vector to obtain a full-connection vector; and inputting the full-connection vector into a classifier to obtain the classification result of the video to be classified output by the classifier. In this way, the accuracy of video classification can be improved, which in turn improves the accuracy of recommending videos to users and the efficiency of video promotion.

Description

Video classification method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of video processing technologies, and in particular, to a video classification method and apparatus, an electronic device, and a storage medium.
Background
Short-video platforms and video applications often recommend videos to users, for example based on the users' interests. Videos therefore need to be classified so that content of interest can be accurately recommended to users according to the classification, which serves the purpose of promoting video works such as movies and television series.
In video classification, a classification label is generally added to a video. For example, in one related-art scheme, the difference loss between the private-domain feature matrix and the public-domain feature matrix of the audio modality and the difference loss between the private-domain feature matrix and the public-domain feature matrix of the visual modality are calculated, and the two difference losses are combined into a first objective function; a second objective function is obtained from the difference between the predicted labels and the real labels of a video data set; the similarity loss between the common-domain features of the audio modality and the video modality of the video data set is taken as a third objective function; the first to third objective functions are then weighted to obtain a total objective function, and the network parameters of a deep network are iterated until the objective function values converge, thereby obtaining the video classification.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a video classification method, apparatus, electronic device, and storage medium.
According to a first aspect of the embodiments of the present disclosure, there is provided a video classification method, including:
acquiring a video to be classified;
splitting the video to be classified to obtain text information, a video frame sequence and an audio frame sequence of the video to be classified;
obtaining a target characteristic vector according to the video frame sequence and the audio frame sequence, and obtaining a text characteristic vector according to the text information;
splicing the target characteristic vector and the text characteristic vector to obtain a full-connection vector;
and inputting the full-connection vector into a classifier to obtain a classification result of the video to be classified output by the classifier.
Optionally, the obtaining a target feature vector according to the sequence of video frames and the sequence of audio frames includes:
extracting a Mel frequency cepstrum coefficient of the audio frame sequence, inputting the Mel frequency cepstrum coefficient into a VGGish model, and extracting local features of the Mel frequency cepstrum coefficient to obtain VGGish features;
inputting the video frame sequence into a TSN (Temporal Segment Networks) model to perform action recognition on the video frame sequence and obtain action features;
splicing the VGGish feature and the action feature to obtain a feature to be input;
and inputting the features to be input into a NeXtVLA model to obtain the target feature vector.
Optionally, the TSN model generates the action feature by:
determining single-frame image information and optical flow image information of the video frame sequence according to the video frame sequence;
sampling the single-frame image information and the optical flow image information at time intervals to obtain a sparse sampling result;
and obtaining the action characteristics according to the sparse sampling result.
Optionally, the obtaining a text feature vector according to the text information includes:
and inputting the text information into a BERT model so as to extract text features of the text information to obtain the text feature vector.
Optionally, the classification model is trained by:
taking the seed video with the classification label as a training sample;
freezing the models except the classifier in the classification model, and training the parameters of the classifier according to the training samples, wherein the models except the classifier in the classification model comprise: VGGish model, TSN model, NeXtVLA model.
According to a second aspect of the embodiments of the present disclosure, there is provided a video classification apparatus including:
the acquisition module is configured to acquire a video to be classified;
the splitting module is configured to split the video to be classified to obtain text information, a video frame sequence and an audio frame sequence of the video to be classified;
the execution module is configured to obtain a target feature vector according to the video frame sequence and the audio frame sequence and obtain a text feature vector according to the text information;
the splicing module is configured to splice the target characteristic vector and the text characteristic vector to obtain a full-connection vector;
and the determining module is configured to input the full-connection vector into a classifier to obtain a classification result of the video to be classified output by the classifier.
Optionally, the execution module includes:
the extraction sub-module is configured to extract a Mel frequency cepstrum coefficient of the audio frame sequence, and input the Mel frequency cepstrum coefficient into a VGGish model so as to perform local feature extraction on the Mel frequency cepstrum coefficient to obtain a VGGish feature;
the identification submodule is configured to input the video frame sequence into a TSN model so as to perform action recognition on the video frame sequence to obtain action features;
the determining submodule is configured to splice the VGGish feature and the action feature to obtain a feature to be input;
and the input submodule is configured to input the features to be input into a NeXtVLA model to obtain the target feature vector.
Optionally, the TSN model generates the action feature by:
determining single-frame image information and optical flow image information of the video frame sequence according to the video frame sequence;
sampling the single-frame image information and the optical flow image information at time intervals to obtain a sparse sampling result;
and obtaining the action characteristics according to the sparse sampling result.
Optionally, the execution module is configured to: and inputting the text information into a BERT model so as to extract text features of the text information to obtain the text feature vector.
Optionally, the classification model is trained by:
taking the seed video with the classification label as a training sample;
freezing the models except the classifier in the classification model, and training the parameters of the classifier according to the training samples, wherein the models except the classifier in the classification model comprise: VGGish model, TSN model, NeXtVLA model.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
acquiring a video to be classified;
splitting the video to be classified to obtain text information, a video frame sequence and an audio frame sequence of the video to be classified;
obtaining a target characteristic vector according to the video frame sequence and the audio frame sequence, and obtaining a text characteristic vector according to the text information;
splicing the target characteristic vector and the text characteristic vector to obtain a full-connection vector;
and inputting the full-connection vector into a classifier to obtain a classification result of the video to be classified output by the classifier.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the video classification method provided by the first aspect of the present disclosure.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
obtaining a video to be classified; splitting a video to be classified to obtain text information, a video frame sequence and an audio frame sequence of the video to be classified; obtaining a target characteristic vector according to the video frame sequence and the audio frame sequence, and obtaining a text characteristic vector according to text information; splicing the target characteristic vector and the text characteristic vector to obtain a full-connection vector; and inputting the full-connection vector into a classifier to obtain a classification result of the video to be classified output by the classifier. Therefore, the video classification accuracy can be improved, the accuracy of recommending videos to users is improved, and the video popularization efficiency is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a flow diagram illustrating a method of video classification according to an example embodiment.
Fig. 2 is a flowchart illustrating one implementation of step S13 in fig. 1, according to an example embodiment.
Fig. 3 is a flowchart illustrating an implementation of step S132 in fig. 2 according to an exemplary embodiment.
Fig. 4 is a flow diagram illustrating another method of video classification according to an example embodiment.
Fig. 5 is a block diagram illustrating a video classification device according to an exemplary embodiment.
Fig. 6 is a block diagram illustrating an apparatus for video classification according to an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
It should be noted that in the present disclosure, the terms "S131", "S132", and the like in the description and claims and the drawings are used for distinguishing the steps, and are not necessarily construed as performing method steps in a specific order or sequence.
Before introducing the video classification method, apparatus, electronic device and storage medium provided by the present disclosure, an application scenario of the present disclosure is first introduced. The video classification method provided by the present disclosure can be applied to an electronic device, which may be, for example, a server. The server may be a server cluster formed by one or more servers, and it may be communicatively connected to a terminal device in a wired or wireless manner so as to obtain the video to be classified from the terminal device. The terminal device may be, for example, a smartphone or a PC (Personal Computer). Through an application program, a user of the terminal device shoots a video or clips one from a TV series or movie to obtain the video to be classified, and uploads it to the server in a wired or wireless manner.
The inventor has found that in the related art, video classification is determined only from the audio modality and the video modality, while caption information such as title information and dialog lines is not considered, so the accuracy of video classification is low; this in turn lowers the accuracy of video recommendation and reduces the efficiency of video promotion. Moreover, in schemes where the audio, the text and the video are classified separately and the per-category probabilities of the separate classification results are spliced into the feature vector fed to the classifier, a large amount of effective information is lost, which also keeps the classification accuracy low.
To solve the above technical problem, the present disclosure provides a video classification method. Fig. 1 is a flow chart illustrating a method of video classification, as shown in fig. 1, according to an exemplary embodiment, the method including the following steps.
In step S11, a video to be classified is acquired.
In step S12, the video to be classified is split to obtain text information, a video frame sequence, and an audio frame sequence of the video to be classified.
In step S13, a target feature vector is obtained according to the video frame sequence and the audio frame sequence, and a text feature vector is obtained according to the text information.
In step S14, the target feature vector and the text feature vector are spliced to obtain a full-connected vector.
In step S15, the full join vector is input into the classifier, and the classification result of the video to be classified output by the classifier is obtained.
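For orientation, the following minimal Python sketch mirrors the control flow of steps S11 to S15 with placeholder feature extractors and a placeholder classifier; all function names, dimensions and random outputs are hypothetical stand-ins and are not part of the disclosure.

import numpy as np

def extract_target_feature(video_frames, audio_frames):
    # Stand-in for the TSN + VGGish + NeXtVLA branch; returns a random 1 x 2048 vector here.
    return np.random.rand(2048)

def extract_text_feature(text_info):
    # Stand-in for the BERT text branch; returns a random 1 x 1024 vector here.
    return np.random.rand(1024)

def classify(full_connection_vec, num_classes=10):
    # Stand-in for the trained classifier (step S15).
    logits = np.random.rand(num_classes)
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.argmax(probs))

# Steps S11/S12 are assumed to have produced these three parts of the video.
video_frames, audio_frames, text_info = [], [], "caption text"

target_vec = extract_target_feature(video_frames, audio_frames)   # step S13
text_vec = extract_text_feature(text_info)                        # step S13
full_connection_vec = np.concatenate([target_vec, text_vec])      # step S14: splice
print(classify(full_connection_vec))                              # step S15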
In a specific implementation, the video to be classified in step S11 may be actively uploaded by the terminal device; alternatively, in response to the terminal device finishing producing a video, the server actively sends an instruction requesting the upload, and acquires the video to be classified when the terminal device responds to the instruction and allows the upload.
For example, when the terminal device shares a video to be classified to a friends' feed, the server, in response to the successful sharing action, sends an upload instruction to the terminal device and obtains the shared video to be classified when the terminal device responds to the instruction and allows the upload.
Further, after acquiring the video to be classified, the server performs text extraction, audio extraction and video extraction on it to complete the splitting and obtain text information, video information and audio information; a video frame sequence is obtained from the video information, and an audio frame sequence corresponding to the video frame sequence is obtained from the audio information, the two sequences having the same number of frames. The text information may be bullet-screen information and subtitle information, and the subtitle information may be title information, credits, lyrics, dialog, captions, character introductions, place names, years, and the like; for example, it may be dialog displayed below the playing interface, character introductions or the title of a film or television work displayed at the sides of the playing interface, narration displayed above the playing interface, or a caption attached when the video to be classified is shared. The voice information may include dialog lines and commentary, and the commentary may be the voice content recorded when the video to be classified is shared.
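As a purely illustrative sketch (the disclosure does not prescribe any particular toolkit), the splitting in step S12 could be performed on the server with ffmpeg and OpenCV as shown below; subtitle and bullet-screen extraction is omitted because it depends on how the text is embedded in the video.

import subprocess
import cv2  # OpenCV, assumed to be available on the server

def split_video(video_path, audio_path="audio.wav"):
    """Illustrative splitting of a video file into a frame sequence and an audio track."""
    # Extract the audio track with ffmpeg (assumes ffmpeg is installed).
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", audio_path],
        check=True,
    )

    # Decode the video frame sequence.
    frames = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    return frames, audio_path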
Taking a 3-frame video to be classified as an example, the acquired video is split, and text information, a 3-frame video frame sequence and a corresponding 3-frame audio frame sequence are extracted. The 3-frame video frame sequence is then input to the TSN model in chronological order to obtain the action features of the 3 video frames, each action feature being a 1 × 1024 vector; the 3-frame audio frame sequence is input to the VGGish model in chronological order to obtain the VGGish features of the 3 audio frames, each VGGish feature being a 1 × 128 vector; and the extracted text information is input to the BERT model to obtain the corresponding text feature vector, which is a 1 × 1024 vector.
Further, based on vector splicing, the VGGish feature of the corresponding frame is appended to each of the 3 action features; that is, a 1 × 128 VGGish feature is concatenated onto each 1 × 1024 action feature, giving a spliced vector of 1 × (1024 + 128) = 1 × 1152 per frame. The spliced vector corresponding to each action feature is recorded in a PKL file, yielding three 1 × 1152 vectors, which together form a two-dimensional matrix with 3 rows and 1152 columns.
Further, a two-dimensional matrix of 3 rows and 1152 columns is input into the NeXtVLA model, and a target feature vector of 1 × 2048 is obtained, so that the splicing of the video frame sequence and the audio frame sequence is completed. And based on vector splicing, splicing 1 × 1024 text feature vectors on the basis of 1 × 2048 target feature vectors to obtain 1 × 3072 column vectors, and obtaining 1 × 3072 full-connection vectors corresponding to 3 frames of videos to be classified. And inputting the full-connected vector of 1 × 3072 into a classifier to obtain a classification result of 3 frames of videos to be classified output by the classifier.
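The dimensions in this example can be reproduced with a short NumPy sketch; the random arrays below merely stand in for the real TSN, VGGish, NeXtVLA and BERT outputs.

import numpy as np

frames = 3
action_feats = [np.random.rand(1024) for _ in range(frames)]   # per-frame TSN action features (1 x 1024)
vggish_feats = [np.random.rand(128) for _ in range(frames)]    # per-frame VGGish features (1 x 128)

# Per-frame splice: 1 x (1024 + 128) = 1 x 1152
spliced = np.stack([np.concatenate([a, v]) for a, v in zip(action_feats, vggish_feats)])
assert spliced.shape == (3, 1152)        # 3-row, 1152-column matrix fed to the NeXtVLA model

target_vec = np.random.rand(2048)        # stand-in for the NeXtVLA output (1 x 2048)
text_vec = np.random.rand(1024)          # stand-in for the BERT text feature (1 x 1024)

full_connection_vec = np.concatenate([target_vec, text_vec])
assert full_connection_vec.shape == (3072,)   # 1 x 3072 vector passed to the classifier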
Compared with the related art, in which the audio, the text and the video are classified separately and the per-category probabilities of the classification results are spliced to form the feature vector fed to the classifier, the embodiment of the present disclosure reduces the loss of effective information.
It should be noted that the classification result of the video to be classified may belong to one type of video or may belong to multiple types of videos. For example, the video to be classified may be of entertainment type only, or may be of entertainment type and star type at the same time.
Further, after the classification result of the video to be classified is obtained, the classified video can be recommended on other terminal devices according to the classification result and the preferences of other users. For example, the classified video is displayed on terminal devices that are playing videos of the same type.
According to the technical scheme, videos to be classified are obtained; splitting a video to be classified to obtain text information, a video frame sequence and an audio frame sequence of the video to be classified; obtaining a target characteristic vector according to the video frame sequence and the audio frame sequence, and obtaining a text characteristic vector according to text information; splicing the target characteristic vector and the text characteristic vector to obtain a full-connection vector; and inputting the full-connection vector into a classifier to obtain a classification result of the video to be classified output by the classifier. Therefore, the video classification accuracy can be improved, the accuracy of recommending videos to users is improved, and the video popularization efficiency is improved.
Alternatively, fig. 2 is a flowchart illustrating one implementation of step S13 in fig. 1 according to an example embodiment. In step S13, the obtaining the target feature vector according to the video frame sequence and the audio frame sequence includes the following steps.
In step S131, mel-frequency cepstrum coefficients of the sequence of audio frames are extracted, and the mel-frequency cepstrum coefficients are input into the VGGish model, so as to perform local feature extraction on the mel-frequency cepstrum coefficients to obtain the VGGish feature.
In step S132, the video frame sequence is input into the TSN model, so as to perform action recognition on the video frame sequence and obtain action features.
In step S133, the VGGish feature and the action feature are spliced to obtain a feature to be input.
In step S134, the feature to be input is input into the NeXtVLA model to obtain the target feature vector.
The NeXtVLA model clusters the VGGish features contained in the feature to be input to obtain an audio feature vector, clusters the action features contained in the feature to be input to obtain an image feature vector, and obtains the target feature vector from the audio feature vector and the image feature vector.
In a specific implementation, the audio frame sequence is input to an encoder, and the encoder output is input to a decoder to obtain the Mel frequency cepstrum coefficients corresponding to the audio frame sequence. The Mel frequency cepstrum coefficients are then input to the VGGish model (an audio embedding model derived from the VGG architecture of the Visual Geometry Group). The VGGish model is obtained by training on the audio features of manually labeled seed videos; during manual labeling, classification labels are added to the audio of each seed video, and the audio of one seed video may carry multiple classification labels. Similarly, the TSN (Temporal Segment Networks) model is obtained by training on the video features of the manually labeled seed videos.
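For illustration, the audio front end could be computed with a library such as librosa before being fed to a VGGish-style network; the sampling rate, coefficient count and patch length below are assumptions of this sketch, and the disclosure does not prescribe a particular toolkit.

import librosa
import numpy as np

def audio_to_vggish_input(audio_path, n_mfcc=13, patch_len=96):
    """Sketch of preparing Mel frequency cepstrum coefficients for a VGGish-style model."""
    y, sr = librosa.load(audio_path, sr=16000, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)      # shape: (n_mfcc, time)
    # Split into fixed-length patches so each audio frame maps to one model input.
    patches = [mfcc[:, i:i + patch_len]
               for i in range(0, mfcc.shape[1] - patch_len + 1, patch_len)]
    return np.stack(patches) if patches else np.empty((0, n_mfcc, patch_len))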
It is worth noting that the classification labels added to the audio of a seed video may not coincide with the classification labels added to its video. For example, the classification label added to the video may be "LeBron James" while the classification label added to the audio may be "American professional basketball game".
By adopting this technical solution, the VGGish features can be obtained from the VGGish model, the action features can be obtained from the TSN model, and the audio feature vector and the image feature vector can be obtained and spliced by the NeXtVLA model, which reduces the loss of effective information and further improves the accuracy of video classification.
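The audio-feature and action-feature clustering performed by the NeXtVLA model in step S134 can be pictured with a simplified NetVLAD-style aggregation; the sketch below is only illustrative, since NeXtVLAD-style models additionally use channel grouping and dimensionality reduction, and the cluster count of 8 and feature size of 1152 are assumptions rather than values from the disclosure.

import numpy as np

def vlad_aggregate(descriptors, centers):
    """Simplified NetVLAD-style aggregation of per-frame features into one video-level vector.

    descriptors: (N, D) per-frame spliced features
    centers:     (K, D) learned cluster centers
    returns:     (K * D,) aggregated vector
    """
    # Soft assignment of each descriptor to each cluster (softmax over similarities).
    sims = descriptors @ centers.T                                  # (N, K)
    sims = sims - sims.max(axis=1, keepdims=True)
    assign = np.exp(sims) / np.exp(sims).sum(axis=1, keepdims=True)

    # Weighted sum of residuals between descriptors and cluster centers.
    residuals = descriptors[:, None, :] - centers[None, :, :]       # (N, K, D)
    vlad = (assign[:, :, None] * residuals).sum(axis=0)             # (K, D)
    vlad = vlad / (np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-8)
    return vlad.reshape(-1)

video_level = vlad_aggregate(np.random.rand(3, 1152), np.random.rand(8, 1152))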
Alternatively, fig. 3 is a flowchart illustrating an implementation of step S132 in fig. 2 according to an exemplary embodiment. In step S132, the TSN model generates the action feature by the following steps:
in step S1321, single-frame image information and optical flow image information of the video frame sequence are determined from the video frame sequence.
In step S1322, a sparse sampling result is obtained by extracting at time intervals from the single-frame image information and the optical flow image information.
In step S1323, an action feature is obtained from the sparse sampling result.
In specific implementation, RGB image extraction may be performed on the video frame sequence, so as to extract single-frame image information and optical flow image information in the RGB-processed image. The method for extracting the single-frame image information and the optical flow image information can adopt a random sampling mode.
For example, an RGB image and an RGB difference in the video frame sequence are extracted, where the RGB image may represent a certain frame image contributing to the action features, and the RGB difference is the difference between two adjacent video frame images; the single-frame image information is then obtained based on the RGB image and the RGB difference.
For another example, the optical flow image information in the video frame sequence may be extracted by a region-based method: similar regions are located in the RGB-processed images, and the optical flow of each located similar region is calculated from its displacement, thereby obtaining the optical flow image information.
Specifically, the network part of the TSN model is composed of a two-stream CNN (Convolutional Neural Network), in which one stream takes the single-frame image information as input and the other stream takes the optical flow image information as input.
Furthermore, the TSN model samples the input single-frame image information and optical flow image information at time intervals; it extracts common or associated features from the sparsely sampled single-frame image information and performs the corresponding category determination to obtain a single-frame feature vector, extracts common or associated features from the sparsely sampled optical flow image information and performs the corresponding category determination to obtain an optical flow image feature vector, and then merges the single-frame feature vector and the optical flow image feature vector, for example by weighted averaging, to obtain the action features of the video to be classified.
Illustratively, the TSN model samples the input single-frame image information and optical flow image information at a time interval of 5 seconds, computes the score of each sparsely sampled single-frame feature vector and each optical flow image feature vector for every category, and averages the scores belonging to the same category, thereby obtaining a frame image score for the single-frame feature vectors and an optical flow image score for the optical flow image feature vectors.
The frame image scores and the optical flow image scores are further combined by weighted summation; the probability of each category to which the video frame sequence may belong is then computed from the combined scores with a softmax function, the category with the largest probability is taken as the target category of the video frame sequence, and the action features of the video to be classified are obtained according to the target category.
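The segment-level score fusion and softmax step described above can be sketched as follows; the segment count, number of categories and stream weights are illustrative assumptions and are not specified by the disclosure.

import numpy as np

def fuse_segment_scores(rgb_scores, flow_scores, rgb_weight=1.0, flow_weight=1.5):
    """Average per-segment category scores of the RGB and optical-flow streams,
    combine them by weighted summation, and apply softmax to pick the target category."""
    rgb_avg = rgb_scores.mean(axis=0)                    # average over sampled segments
    flow_avg = flow_scores.mean(axis=0)
    fused = rgb_weight * rgb_avg + flow_weight * flow_avg
    fused = fused - fused.max()
    probs = np.exp(fused) / np.exp(fused).sum()          # softmax over categories
    return int(np.argmax(probs)), probs                  # target category and probabilities

# e.g. 3 sparsely sampled segments, 5 candidate categories
category, probs = fuse_segment_scores(np.random.rand(3, 5), np.random.rand(3, 5))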
By adopting this technical solution, sparse frame sampling of the video frame sequence removes redundant information and reduces the amount of computation, while the TSN model keeps the loss of effective information low, which further improves the accuracy of video classification.
Optionally, in step S13, the obtaining a text feature vector according to the text information includes:
and inputting the text information into a BERT model so as to extract text features of the text information to obtain the text feature vector.
Specifically, the text information is input into the encoder of the BERT model, which converts the input text information into feature vectors; the feature vectors output by the encoder, together with the results already predicted, are input into the decoder of the BERT model, which outputs the conditional probability of the final result and converts it into the text feature vector.
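A hedged sketch of how the subtitle text could be turned into a single text feature vector with a publicly available BERT implementation is given below; the checkpoint name, [CLS]-token pooling and Hugging Face Transformers API are assumptions of this sketch and are not specified by the disclosure (a larger BERT variant or an extra projection layer would be needed to reach the 1 × 1024 size used in the earlier example).

import torch
from transformers import BertModel, BertTokenizer  # Hugging Face Transformers

# "bert-base-chinese" is an assumed checkpoint; the disclosure does not name one.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese").eval()

def text_feature(text_info: str) -> torch.Tensor:
    """Encode subtitle/caption text into one text feature vector via the [CLS] embedding."""
    inputs = tokenizer(text_info, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state[:, 0, :].squeeze(0)  # [CLS] token embedding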
By adopting this technical solution, the BERT model fuses bidirectional context in all of its layers over the training samples, which improves text feature extraction and further improves the accuracy of video classification.
Optionally, the method further comprises:
and inputting the text information, the video frame sequence and the audio frame sequence into a classification model to obtain a target feature vector according to the video frame sequence and the audio frame sequence, obtain a text feature vector according to the text information, and input the full-connection vector into a classifier to obtain a classification result of the video to be classified output by the classifier, wherein the classification model comprises the classifier.
Wherein, the classifier can be a softmax regression classification model.
Optionally, the training of the classification model comprises:
taking the seed video with the classification label as a training sample;
freezing the models except the classifier in the classification model, and training the parameters of the classifier according to the training samples, wherein the models except the classifier in the classification model comprise: VGGish model, TSN model, NeXtVLA model.
The classification labels of the seed videos are manually labeled, and the same seed video can have a plurality of classification labels according to the text information, the video frame sequence and the audio frame sequence. The loss function of the classifier adopts a cross entropy loss function.
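As an illustration of this training procedure, the following PyTorch-style sketch freezes the feature-extraction models and trains only the classifier with a cross-entropy loss on labeled seed videos; the module names, dimensions and hyperparameters are assumptions rather than part of the disclosure.

import torch
import torch.nn as nn

class VideoClassifier(nn.Module):
    """Classifier head over the 1 x 3072 full-connection vector (dimension is illustrative)."""
    def __init__(self, feature_dim=3072, num_classes=20):
        super().__init__()
        self.classifier = nn.Linear(feature_dim, num_classes)

    def forward(self, full_connection_vec):
        return self.classifier(full_connection_vec)

def train_classifier(model, frozen_backbones, loader, epochs=5, lr=1e-3):
    """Train only the classifier; the VGGish, TSN and NeXtVLA models are kept frozen."""
    for backbone in frozen_backbones:
        for p in backbone.parameters():
            p.requires_grad = False        # freeze everything except the classifier

    criterion = nn.CrossEntropyLoss()      # the loss function named in the disclosure
    optimizer = torch.optim.Adam(model.classifier.parameters(), lr=lr)

    for _ in range(epochs):
        for full_connection_vec, label in loader:   # features precomputed by the frozen models
            optimizer.zero_grad()
            loss = criterion(model(full_connection_vec), label)
            loss.backward()
            optimizer.step()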
Fig. 4 is a flow diagram illustrating another method of video classification according to an example embodiment. As shown in fig. 4, the video classification method includes the following steps:
and acquiring a video to be classified, and splitting the video to be classified to obtain corresponding text information, a video frame sequence and an audio frame sequence.
Further, the text information is input into a pre-trained BERT model to obtain text characteristic information, and the pre-trained BERT model is in a frozen state in the application process. Meanwhile, the video frame sequence is input into a pre-trained TSN model to obtain the action characteristic, the audio frame sequence is input into a pre-trained VGGish model to obtain the VGGish characteristic, and the pre-trained TSN model and the pre-trained VGGish model are also in a frozen state in the application process.
And further, inputting the obtained action characteristic and the VGGish characteristic into a pre-trained NeXtVLA model, carrying out audio characteristic clustering on the VGGish characteristic by the NeXtVLA model to obtain an audio characteristic vector, carrying out action characteristic clustering on the action characteristic to obtain an image characteristic vector, and splicing the audio characteristic vector and the image characteristic vector to obtain a target characteristic vector. The pre-trained NeXtVLA model is also frozen during the application process.
Further, the text feature information and the target feature vector are spliced to obtain the full-connection vector, and the full-connection vector is input into a classifier, for example a softmax regression classification model, to obtain the classification result of the video to be classified. Meanwhile, the softmax regression classification model is updated and trained with the classification results obtained for the videos to be classified. The loss function of the classification model adopts a cross-entropy loss function.
Compared with the related art, in which text, video and audio are classified separately and the video class is then determined from the per-category probabilities of each modality, this technical solution reduces the information lost while processing the video to be classified, so the accuracy of video classification can be improved, which in turn improves the accuracy of recommending videos to users and the efficiency of video promotion.
Based on the same inventive concept, there is also provided a video classification apparatus 500 according to the embodiments of the present disclosure, for performing the steps of the video classification method provided by the above method embodiments, and the apparatus 500 may implement the video classification method in software, hardware or a combination of both. Fig. 5 is a block diagram illustrating a video classification apparatus 500 according to an exemplary embodiment. As shown in fig. 5, the apparatus 500 includes: an acquisition module 510, a splitting module 520, an execution module 530, a splicing module 540, and a determination module 550.
Wherein the obtaining module 510 is configured to obtain a video to be classified;
the splitting module 520 is configured to split the video to be classified to obtain text information, a video frame sequence and an audio frame sequence of the video to be classified;
the execution module 530 is configured to derive a target feature vector from the sequence of video frames and the sequence of audio frames, and derive a text feature vector from the text information;
the stitching module 540 is configured to stitch the target feature vector and the text feature vector to obtain a full-connected vector;
the determining module 550 is configured to input the full join vector into a classifier, and obtain a classification result of the video to be classified output by the classifier.
The device can improve the accuracy of video classification, further improve the accuracy of recommending videos to users, and improve the efficiency of video popularization.
Optionally, the executing module 530 includes: the device comprises an extraction submodule, an identification submodule, a determination submodule and an input submodule.
Wherein the extraction sub-module is configured to extract mel frequency cepstrum coefficients of the audio frame sequence and input the mel frequency cepstrum coefficients into a VGGish model so as to perform local feature extraction on the mel frequency cepstrum coefficients to obtain VGGish features;
the identification submodule is configured to input the video frame sequence into a TSN model so as to perform action recognition on the video frame sequence to obtain action features;
the determining submodule is configured to splice the VGGish feature and the action feature to obtain a feature to be input;
and the input submodule is configured to input the features to be input into a NeXtVLA model to obtain the target feature vector.
Optionally, the TSN model generates the action feature by:
determining single-frame image information and optical flow image information of the video frame sequence according to the video frame sequence;
sampling the single-frame image information and the optical flow image information at time intervals to obtain a sparse sampling result;
and obtaining the action characteristics according to the sparse sampling result.
Optionally, the execution module is configured to: and inputting the text information into a BERT model so as to extract text features of the text information to obtain the text feature vector.
Optionally, the classification model is trained by:
taking the seed video with the classification label as a training sample;
freezing the models except the classifier in the classification model, and training the parameters of the classifier according to the training samples, wherein the models except the classifier in the classification model comprise: VGGish model, TSN model, NeXtVLA model.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
It should be noted that, for convenience and brevity of description, the embodiments described in the specification all belong to preferred embodiments, and the parts involved are not necessarily essential to the present disclosure; for example, the obtaining module 510 and the splitting module 520 may be independent devices or the same device in a specific implementation, which is not limited by the present disclosure.
There is also provided, in accordance with an embodiment of the present disclosure, an electronic device, including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
acquiring a video to be classified;
splitting the video to be classified to obtain text information, a video frame sequence and an audio frame sequence of the video to be classified;
obtaining a target characteristic vector according to the video frame sequence and the audio frame sequence, and obtaining a text characteristic vector according to the text information;
splicing the target characteristic vector and the text characteristic vector to obtain a full-connection vector;
and inputting the full-connection vector into a classifier to obtain a classification result of the video to be classified output by the classifier.
Embodiments of the present disclosure also provide a computer-readable storage medium on which computer program instructions are stored, which when executed by a processor implement the steps of the video classification method provided by the present disclosure.
Fig. 6 is a block diagram illustrating an apparatus 1900 for video classification according to an example embodiment. For example, the apparatus 1900 may be provided as a server. Referring to FIG. 6, the device 1900 includes a processing component 1922 further including one or more processors and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by the processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the video classification method described above.
The device 1900 may also include a power component 1926 configured to perform power management of the device 1900, a wired or wireless network interface 1950 configured to connect the device 1900 to a network, and an input/output (I/O) interface 1958. The device 1900 may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method of video classification, comprising:
acquiring a video to be classified;
splitting the video to be classified to obtain text information, a video frame sequence and an audio frame sequence of the video to be classified;
obtaining a target characteristic vector according to the video frame sequence and the audio frame sequence, and obtaining a text characteristic vector according to the text information;
splicing the target characteristic vector and the text characteristic vector to obtain a full-connection vector;
and inputting the full-connection vector into a classifier to obtain a classification result of the video to be classified output by the classifier.
2. The method of claim 1, wherein deriving the target feature vector from the sequence of video frames and the sequence of audio frames comprises:
extracting a Mel frequency cepstrum coefficient of the audio frame sequence, inputting the Mel frequency cepstrum coefficient into a VGGish model, and extracting local features of the Mel frequency cepstrum coefficient to obtain VGGish features;
inputting the video frame sequence into a TSN model so as to perform action recognition on the video frame sequence to obtain action features;
splicing the VGGish feature and the action feature to obtain a feature to be input;
and inputting the features to be input into a NeXtVLA model to obtain the target feature vector.
3. The method of claim 2, wherein the TSN model generates the action features by:
determining single-frame image information and optical flow image information of the video frame sequence according to the video frame sequence;
sampling the single-frame image information and the optical flow image information at time intervals to obtain a sparse sampling result;
and obtaining the action characteristics according to the sparse sampling result.
4. The method of claim 1, wherein the deriving a text feature vector according to the text information comprises:
and inputting the text information into a BERT model so as to extract text features of the text information to obtain the text feature vector.
5. The method according to any of claims 2-4, wherein the classification model is trained by:
taking the seed video with the classification label as a training sample;
freezing the models except the classifier in the classification model, and training the parameters of the classifier according to the training samples, wherein the models except the classifier in the classification model comprise: VGGish model, TSN model, NeXtVLA model.
6. A video classification apparatus, comprising:
the acquisition module is configured to acquire a video to be classified;
the splitting module is configured to split the video to be classified to obtain text information, a video frame sequence and an audio frame sequence of the video to be classified;
the execution module is configured to obtain a target feature vector according to the video frame sequence and the audio frame sequence and obtain a text feature vector according to the text information;
the splicing module is configured to splice the target characteristic vector and the text characteristic vector to obtain a full-connection vector;
and the determining module is configured to input the full-connection vector into a classifier to obtain a classification result of the video to be classified output by the classifier.
7. The apparatus of claim 6, wherein the execution module comprises:
the extraction sub-module is configured to extract a Mel frequency cepstrum coefficient of the audio frame sequence, and input the Mel frequency cepstrum coefficient into a VGGish model so as to perform local feature extraction on the Mel frequency cepstrum coefficient to obtain a VGGish feature;
the identification submodule is configured to input the video frame sequence into a TSN model so as to perform action recognition on the video frame sequence to obtain action features;
the determining submodule is configured to splice the VGGish feature and the action feature to obtain a feature to be input;
and the input submodule is configured to input the features to be input into a NeXtVLA model to obtain the target feature vector.
8. The apparatus of claim 7, wherein the TSN model generates the action features by:
determining single-frame image information and optical flow image information of the video frame sequence according to the video frame sequence;
sampling the single-frame image information and the optical flow image information at time intervals to obtain a sparse sampling result;
and obtaining the action characteristics according to the sparse sampling result.
9. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
acquiring a video to be classified;
splitting the video to be classified to obtain text information, a video frame sequence and an audio frame sequence of the video to be classified;
obtaining a target characteristic vector according to the video frame sequence and the audio frame sequence, and obtaining a text characteristic vector according to the text information;
splicing the target characteristic vector and the text characteristic vector to obtain a full-connection vector;
and inputting the full-connection vector into a classifier to obtain a classification result of the video to be classified output by the classifier.
10. A computer-readable storage medium, on which computer program instructions are stored, which program instructions, when executed by a processor, carry out the steps of the video classification method according to any one of claims 1 to 5.
CN202110524503.9A 2021-05-13 2021-05-13 Video classification method, device, electronic equipment and storage medium Active CN113220940B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110524503.9A CN113220940B (en) 2021-05-13 2021-05-13 Video classification method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110524503.9A CN113220940B (en) 2021-05-13 2021-05-13 Video classification method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113220940A true CN113220940A (en) 2021-08-06
CN113220940B CN113220940B (en) 2024-02-09

Family

ID=77095560

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110524503.9A Active CN113220940B (en) 2021-05-13 2021-05-13 Video classification method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113220940B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180005037A1 (en) * 2016-06-29 2018-01-04 Cellular South, Inc. Dba C Spire Wireless Video to data
CN111626251A (en) * 2020-06-02 2020-09-04 Oppo广东移动通信有限公司 Video classification method, video classification device and electronic equipment
WO2020221278A1 (en) * 2019-04-29 2020-11-05 北京金山云网络技术有限公司 Video classification method and model training method and apparatus thereof, and electronic device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180005037A1 (en) * 2016-06-29 2018-01-04 Cellular South, Inc. Dba C Spire Wireless Video to data
WO2020221278A1 (en) * 2019-04-29 2020-11-05 北京金山云网络技术有限公司 Video classification method and model training method and apparatus thereof, and electronic device
CN111626251A (en) * 2020-06-02 2020-09-04 Oppo广东移动通信有限公司 Video classification method, video classification device and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHU Yingying, ZHU Yanyan, et al.: "Sports video classification based on genre-indicative shots and a bag-of-words model", Journal of Computer-Aided Design & Computer Graphics, no. 09 *

Also Published As

Publication number Publication date
CN113220940B (en) 2024-02-09

Similar Documents

Publication Publication Date Title
CN109145784B (en) Method and apparatus for processing video
Zang et al. Attention-based temporal weighted convolutional neural network for action recognition
CN108830235B (en) Method and apparatus for generating information
WO2022022152A1 (en) Video clip positioning method and apparatus, and computer device and storage medium
CN110839173A (en) Music matching method, device, terminal and storage medium
CN111611436A (en) Label data processing method and device and computer readable storage medium
US11302361B2 (en) Apparatus for video searching using multi-modal criteria and method thereof
CN114342353A (en) Method and system for video segmentation
CN110287375B (en) Method and device for determining video tag and server
CN110347866B (en) Information processing method, information processing device, storage medium and electronic equipment
CN110740389A (en) Video positioning method and device, computer readable medium and electronic equipment
CN113297891A (en) Video information processing method and device and electronic equipment
CN113766299B (en) Video data playing method, device, equipment and medium
CN111836118B (en) Video processing method, device, server and storage medium
CN115269913A (en) Video retrieval method based on attention fragment prompt
CN116050496A (en) Determination method and device, medium and equipment of picture description information generation model
CN110347869B (en) Video generation method and device, electronic equipment and storage medium
CN116665083A (en) Video classification method and device, electronic equipment and storage medium
CN114302174A (en) Video editing method and device, computing equipment and storage medium
CN114363695A (en) Video processing method, video processing device, computer equipment and storage medium
CN112328834A (en) Video association method and device, electronic equipment and storage medium
WO2021162803A1 (en) Systems for authenticating digital contents
CN114845149A (en) Editing method of video clip, video recommendation method, device, equipment and medium
CN113220940B (en) Video classification method, device, electronic equipment and storage medium
CN113822045B (en) Multi-mode data-based film evaluation quality identification method and related device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant