CN113220940B - Video classification method, device, electronic equipment and storage medium - Google Patents
Video classification method, device, electronic equipment and storage medium
- Publication number
- CN113220940B (application CN202110524503.9A)
- Authority
- CN
- China
- Prior art keywords
- video
- frame sequence
- classified
- feature vector
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/75—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/73—Querying
- G06F16/735—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7834—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7844—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7847—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Library & Information Science (AREA)
- Databases & Information Systems (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computational Linguistics (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Biomedical Technology (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Health & Medical Sciences (AREA)
- Mathematical Physics (AREA)
- Biophysics (AREA)
- Probability & Statistics with Applications (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The disclosure relates to a video classification method, apparatus, electronic device, and storage medium that address the problem of low video classification accuracy in the related art. The method comprises: acquiring a video to be classified; splitting the video to be classified to obtain text information, a video frame sequence, and an audio frame sequence of the video to be classified; obtaining a target feature vector according to the video frame sequence and the audio frame sequence, and obtaining a text feature vector according to the text information; splicing the target feature vector and the text feature vector to obtain a full connection vector; and inputting the full connection vector into a classifier to obtain the classification result of the video to be classified output by the classifier. In this way, the video classification accuracy can be improved, which in turn improves the accuracy of recommending videos to users and the efficiency of video promotion.
Description
Technical Field
The disclosure relates to the technical field of video processing, and in particular relates to a video classification method, a video classification device, electronic equipment and a storage medium.
Background
Short video platforms and video applications often recommend videos to users, for example videos that correspond to a user's interests. Videos therefore need to be classified so that content of interest can be accurately recommended to users according to the classification, thereby promoting video works such as movies and television series.
In video classification, a classification label is generally added to the video. In one related approach, the difference loss between the private-domain feature matrix and the public-domain feature matrix of the audio modality and the corresponding difference loss of the visual modality are calculated and combined into a first objective function; a second objective function is obtained from the difference between the predicted label and the real label of a video data set; the similarity loss between the public-domain features of the audio modality and those of the video modality of the video data set is taken as a third objective function; the first to third objective functions are weighted to obtain a total objective function, and the network parameters of the deep network are iterated until the objective function value converges, yielding the video classification.
Disclosure of Invention
In order to overcome the problems in the related art, the present disclosure provides a video classification method, apparatus, electronic device, and storage medium.
According to a first aspect of an embodiment of the present disclosure, there is provided a video classification method, including:
acquiring videos to be classified;
splitting the video to be classified to obtain text information, a video frame sequence and an audio frame sequence of the video to be classified;
obtaining a target feature vector according to the video frame sequence and the audio frame sequence, and obtaining a text feature vector according to the text information;
splicing the target feature vector and the text feature vector to obtain a full connection vector;
and inputting the full connection vector into a classifier to obtain a classification result of the video to be classified, which is output by the classifier.
Optionally, the obtaining the target feature vector according to the video frame sequence and the audio frame sequence includes:
extracting a Mel frequency cepstrum coefficient of the audio frame sequence, inputting the Mel frequency cepstrum coefficient into a VGGish model, and extracting local features of the Mel frequency cepstrum coefficient to obtain VGGish features;
inputting the video frame sequence into a TSN model to perform action recognition on the video frame sequence to obtain action characteristics;
splicing the VGGish characteristic and the action characteristic to obtain a characteristic to be input;
and inputting the feature to be input into a NeXtVLA model to obtain the target feature vector.
Optionally, the TSN model generates the action feature by:
according to the video frame sequence, single-frame image information and optical flow image information of the video frame sequence are determined;
extracting according to the single-frame image information and the optical flow image information at intervals to obtain a sparse sampling result;
and obtaining the action characteristic according to the sparse sampling result.
Optionally, the obtaining the text feature vector according to the text information includes:
and inputting the text information into a BERT model to extract text features of the text information so as to obtain the text feature vector.
Optionally, the classification model is trained by:
taking the seed video with the classification labels as a training sample;
freezing models except the classifier in the classification model, and training parameters of the classifier according to the training sample, wherein the models except the classifier in the classification model comprise: VGGish model, TSN model, NeXtVLA model.
According to a second aspect of embodiments of the present disclosure, there is provided a video classification apparatus, comprising:
the acquisition module is configured to acquire videos to be classified;
the splitting module is configured to split the video to be classified to obtain text information, a video frame sequence and an audio frame sequence of the video to be classified;
the execution module is configured to obtain a target feature vector according to the video frame sequence and the audio frame sequence and obtain a text feature vector according to the text information;
the splicing module is configured to splice the target feature vector and the text feature vector to obtain a full connection vector;
the determining module is configured to input the full connection vector into a classifier to obtain a classification result of the video to be classified, which is output by the classifier.
Optionally, the execution module includes:
the extraction submodule is configured to extract a Mel frequency cepstrum coefficient of the audio frame sequence, input the Mel frequency cepstrum coefficient into a VGGish model, and extract local features of the Mel frequency cepstrum coefficient to obtain VGGish features;
the identification sub-module is configured to input the video frame sequence into a TSN model so as to conduct action identification on the video frame sequence to obtain action characteristics;
the determining submodule is configured to splice the VGGish characteristic and the action characteristic to obtain a characteristic to be input;
and the input sub-module is configured to input the feature to be input into a NeXtVLA model to obtain the target feature vector.
Optionally, the TSN model generates the action feature by:
according to the video frame sequence, single-frame image information and optical flow image information of the video frame sequence are determined;
extracting according to the single-frame image information and the optical flow image information at intervals to obtain a sparse sampling result;
and obtaining the action characteristic according to the sparse sampling result.
Optionally, the execution module is configured to: and inputting the text information into a BERT model to extract text features of the text information so as to obtain the text feature vector.
Optionally, the classification model is trained by:
taking the seed video with the classification labels as a training sample;
freezing models except the classifier in the classification model, and training parameters of the classifier according to the training sample, wherein the models except the classifier in the classification model comprise: VGGish model, TSN model, NeXtVLA model.
According to a third aspect of embodiments of the present disclosure, there is provided an electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
acquiring videos to be classified;
splitting the video to be classified to obtain text information, a video frame sequence and an audio frame sequence of the video to be classified;
obtaining a target feature vector according to the video frame sequence and the audio frame sequence, and obtaining a text feature vector according to the text information;
splicing the target feature vector and the text feature vector to obtain a full connection vector;
and inputting the full connection vector into a classifier to obtain a classification result of the video to be classified, which is output by the classifier.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the video classification method provided by the first aspect of the present disclosure.
The technical scheme provided by the embodiment of the disclosure can comprise the following beneficial effects:
obtaining videos to be classified; splitting the video to be classified to obtain text information, a video frame sequence and an audio frame sequence of the video to be classified; obtaining a target feature vector according to the video frame sequence and the audio frame sequence, and obtaining a text feature vector according to the text information; splicing the target feature vector and the text feature vector to obtain a full connection vector; and inputting the full connection vector into a classifier to obtain a classification result of the video to be classified, which is output by the classifier. Therefore, the video classification accuracy can be improved, the accuracy of recommending videos to users is further improved, and the video popularization efficiency is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a flow chart illustrating a video classification method according to an exemplary embodiment.
Fig. 2 is a flow chart illustrating one implementation of step S13 in fig. 1 according to an exemplary embodiment.
Fig. 3 is a flow chart illustrating one implementation of step S132 of fig. 2, according to an exemplary embodiment.
Fig. 4 is a flow chart illustrating another video classification method according to an exemplary embodiment.
Fig. 5 is a block diagram illustrating a video classification device according to an exemplary embodiment.
Fig. 6 is a block diagram illustrating an apparatus for video classification according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
It should be noted that, in the present disclosure, the terms "S131", "S132", and the like in the specification and claims and drawings are used for distinguishing steps, and are not necessarily to be construed as performing the method steps in a particular order or sequence.
Before introducing the video classification method, apparatus, electronic device, and storage medium provided by the disclosure, an application scenario of the disclosure is first introduced. The video classification method provided by the disclosure can be applied to an electronic device, which may be, for example, a server. The server may be a single server or a server cluster, and may be communicatively connected to a terminal device in a wired or wireless manner in order to obtain the video to be classified from the terminal device. The terminal device may be, for example, a smartphone or a PC (Personal Computer). The terminal device is used by a user to shoot videos through an application program or to clip videos from TV series and movies, so as to obtain the video to be classified, and to upload the video to be classified to the server in a wired or wireless manner.
The inventors found that, in the related art, the video classification is determined only from the audio modality and the video modality, without considering text information of the video such as subtitles and dialogue lines, so the accuracy of video classification is lower, the accuracy of video recommendation is lower, and the efficiency of video promotion is reduced. Moreover, when the audio, the text, and the video are classified independently and the per-category probabilities of the classification results are then spliced to form the feature vector input into the classifier, much effective information is lost, which also lowers the video classification accuracy.
In order to solve the above technical problems, the present disclosure provides a video classification method. Fig. 1 is a flow chart illustrating a video classification method according to an exemplary embodiment, including the following steps, as shown in fig. 1.
In step S11, a video to be classified is acquired.
In step S12, the video to be classified is split, so as to obtain text information, a video frame sequence and an audio frame sequence of the video to be classified.
In step S13, a target feature vector is obtained according to the video frame sequence and the audio frame sequence, and a text feature vector is obtained according to the text information.
In step S14, the target feature vector and the text feature vector are spliced to obtain a full connection vector.
In step S15, the full connection vector is input into a classifier, and a classification result of the video to be classified output by the classifier is obtained.
In a specific implementation, the video to be classified in step S11 may be uploaded actively by the terminal device, or the server may, in response to the terminal device completing creation of a video, send the terminal device an instruction requiring upload, the video to be classified being acquired when the terminal device responds to the instruction and permits the upload.
For example, when the terminal device shares a video to be classified to a friend circle, the server responds to the successful sharing action by sending an upload instruction to the terminal device, and obtains the shared video to be classified once the terminal device responds to the upload instruction and permits the upload.
Further, after the server obtains the video to be classified, text extraction, audio extraction, and video extraction are performed on the video to be classified to complete the splitting and obtain text information, video information, and audio information. A video frame sequence is obtained from the video information, and an audio frame sequence corresponding to the video frame sequence is obtained from the audio information, the video frame sequence and the audio frame sequence having the same number of frames. The text information may be bullet-screen information and subtitle information, and the subtitle information may be title information, a cast and crew list, lyrics, dialogue, descriptive text, character introductions, place names, dates, and the like. For example, the subtitle information may be dialogue displayed below the playing interface, titles displayed on the sides of the playing interface, a voice-over displayed above the playing interface, or the text description attached when sharing the video to be classified. The speech information may include dialogue lines and commentary, for example the spoken content accompanying the video to be classified when it is shared.
Taking a 3-frame video to be classified as an example: the acquired video is split to extract the text information, a 3-frame video frame sequence, and the corresponding 3-frame audio frame sequence. The 3 video frames are input into the TSN model in chronological order to obtain the action features of the video frame sequence, each action feature being a 1×1024 vector; the 3 audio frames are input into the VGGish model in chronological order to obtain the VGGish features of the audio frames, each VGGish feature being a 1×128 vector; and the extracted text information is input into the BERT model to obtain the text feature vector corresponding to the text information, which is a 1×1024 vector.
Further, based on vector splicing, each action feature is spliced with the corresponding VGGish feature, that is, a 1×128 VGGish feature is appended to each 1×1024 action feature, giving a spliced vector of size 1×(1024+128) = 1×1152 for each action feature. The spliced vector of each action feature is recorded into a PKL file, yielding three 1×1152 vectors, from which a two-dimensional matrix of 3 rows and 1152 columns is obtained.
Further, the 3×1152 two-dimensional matrix is input into the NeXtVLA model to obtain a 1×2048 target feature vector, completing the splicing of the video frame sequence and the audio frame sequence. Based on vector splicing, the 1×1024 text feature vector is appended to the 1×2048 target feature vector to obtain a 1×3072 vector, that is, the 1×3072 full connection vector corresponding to the 3-frame video to be classified. The 1×3072 full connection vector is input into the classifier to obtain the classification result of the 3-frame video to be classified output by the classifier.
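The shape bookkeeping in the 3-frame example above can be summarized in a short sketch. It illustrates only the splicing steps; the per-model feature extraction is left out, and all arrays are zero placeholders with the dimensions stated in this example.

```python
import numpy as np

# Placeholder features with the dimensions used in the example above:
action_feats = [np.zeros((1, 1024)) for _ in range(3)]  # TSN output, one per video frame
vggish_feats = [np.zeros((1, 128)) for _ in range(3)]   # VGGish output, one per audio frame
text_feat = np.zeros((1, 1024))                          # BERT output for the text information

# Splice each action feature with its VGGish feature: 1 x (1024 + 128) = 1 x 1152
spliced = [np.concatenate([a, v], axis=1) for a, v in zip(action_feats, vggish_feats)]
matrix = np.concatenate(spliced, axis=0)                 # two-dimensional matrix, 3 rows x 1152 columns

# The 3 x 1152 matrix goes through the NeXtVLA model to give a 1 x 2048 target feature vector
target_feat = np.zeros((1, 2048))

# Full connection vector: 1 x (2048 + 1024) = 1 x 3072, which is fed to the classifier
full_connection = np.concatenate([target_feat, text_feat], axis=1)
assert full_connection.shape == (1, 3072)
```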
Compared with the related art, in which the audio, the text, and the video are classified independently and the per-category probabilities of those classification results are spliced to form the feature vector input into the classifier, the method and apparatus of the present disclosure reduce the loss of effective information.
It should be noted that the classification result of the video to be classified may be one type of video, or may be multiple types of videos. For example, the video to be classified may be of the entertainment type only, or may be of both the entertainment type and the star type.
Further, after the classification result of the video to be classified is obtained, the classified video to be classified can be recommended on other terminal equipment according to the preference of other users according to the classification result. For example, the classified video to be classified is displayed on a terminal device which plays the video of the same type.
The technical scheme is that videos to be classified are obtained; splitting the video to be classified to obtain text information, a video frame sequence and an audio frame sequence of the video to be classified; obtaining a target feature vector according to the video frame sequence and the audio frame sequence, and obtaining a text feature vector according to the text information; splicing the target feature vector and the text feature vector to obtain a full connection vector; and inputting the full connection vector into a classifier to obtain a classification result of the video to be classified, which is output by the classifier. Therefore, the video classification accuracy can be improved, the accuracy of recommending videos to users is further improved, and the video popularization efficiency is improved.
Alternatively, fig. 2 is a flow chart illustrating one implementation of step S13 in fig. 1 according to an exemplary embodiment. In step S13, the target feature vector is obtained according to the video frame sequence and the audio frame sequence, including the following steps.
In step S131, mel-frequency cepstrum coefficients of the audio frame sequence are extracted, and the mel-frequency cepstrum coefficients are input into the VGGish model to perform local feature extraction on the mel-frequency cepstrum coefficients to obtain VGGish features.
In step S132, the video frame sequence is input into the TSN model to perform motion recognition on the video frame sequence to obtain motion features.
In step S133, the VGGish feature and the action feature are spliced to obtain the feature to be input.
In step S134, the feature to be input is input into the NeXtVLA model to obtain the target feature vector.
The NeXtVLA model performs audio feature clustering on VGGish features in the features to be input to obtain audio feature vectors, performs motion feature clustering on motion features in the features to be input to obtain image feature vectors, and obtains target feature vectors according to the audio feature vectors and the image feature vectors.
In a specific implementation, the audio frame sequence is input to an encoder, and the encoder output is input to a decoder to obtain the mel-frequency cepstrum coefficients corresponding to the audio frame sequence. The mel-frequency cepstrum coefficients are then input into the VGGish model (an audio feature extractor derived from the Visual Geometry Group (VGG) network), which is obtained by training on the audio features of manually annotated seed videos; during manual annotation a classification label is added to the audio of each seed video, and a plurality of classification labels may be added to the audio of each seed video. Similarly, the TSN (Temporal Segment Networks) model is obtained by training on the video features of the manually annotated seed videos.
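As one possible realization of the audio branch described above, the sketch below computes mel-frequency cepstrum coefficients with librosa and hands them to a VGGish-style feature extractor. Only the librosa calls are standard; `vggish_model` is an assumed placeholder for any pretrained VGGish implementation with a compatible input layout.

```python
import librosa
import numpy as np

def audio_to_mfcc(wav_path: str, sr: int = 16000, n_mfcc: int = 64) -> np.ndarray:
    """Load the audio frame sequence and compute its mel-frequency cepstrum coefficients."""
    y, sr = librosa.load(wav_path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, num_audio_frames)
    return mfcc.T.astype(np.float32)                         # one row per audio frame

# Hypothetical: vggish_model maps the MFCC matrix to a 128-dim VGGish feature, e.g.
# vggish_feature = vggish_model(audio_to_mfcc("clip.wav"))   # -> shape (1, 128)
```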
It is worth noting that the classification labels added to the audio of each seed video may not be identical to those added to the video. For example, the label added to the video may be "LeBron James" while the label added to the audio may be "American professional basketball game".
By adopting the technical scheme, VGGish characteristics can be obtained based on the VGGish model, action characteristics can be obtained based on the TSN model, and audio characteristic vectors and video characteristic vectors can be obtained and spliced based on the NeXtVLA model, so that loss of effective information is reduced. And further, the accuracy of video classification can be improved.
Alternatively, fig. 3 is a flow chart illustrating one implementation of step S132 in fig. 2 according to an exemplary embodiment. In step S132, the TSN model generates the action feature by:
in step S1321, single-frame image information and optical flow image information of the video frame sequence are determined from the video frame sequence.
In step S1322, a sparse sampling result is obtained by extracting the single-frame image information and the optical flow image information at time intervals.
In step S1323, an action feature is obtained from the sparse sampling result.
In a specific implementation, RGB image extraction may be performed on the video frame sequence, so that single-frame image information and optical flow image information are extracted from the RGB-processed images. The single-frame image information and the optical flow image information may be extracted by random sampling.
For example, an RGB image and an RGB difference in a video frame sequence are extracted, where the RGB image may represent a certain frame of image in an action feature, and the RGB difference is a difference between video frame images of two adjacent frames, so as to obtain single frame image information based on the RGB image and the RGB difference.
As yet another example, the optical flow image information in the video frame sequence may be extracted by a region-based method: similar regions are localized in the RGB-processed images, and for the localized similar regions the optical flow is calculated from the displacement of those regions, resulting in the optical flow image information.
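For illustration, the sketch below derives optical flow images and RGB differences from a list of decoded frames. It uses OpenCV's dense Farneback flow as a simple stand-in for the region-based method described above, which is an assumption rather than the specific algorithm of this disclosure.

```python
import cv2

def optical_flow_images(frames):
    """Compute an optical flow image between each pair of consecutive frames (BGR arrays)."""
    flows = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        curr = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Per-pixel displacement field (dx, dy) between the two frames
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)
        prev = curr
    return flows

def rgb_differences(frames):
    """RGB difference between adjacent frames, part of the single-frame image information."""
    return [cv2.absdiff(frames[i + 1], frames[i]) for i in range(len(frames) - 1)]
```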
Specifically, the network part of the TSN model is composed of two paths of CNNs (Convolutional Neural Networks convolutional neural networks), one path of which takes single-frame image information as input, and the other path of which takes optical flow image information as input.
Further, the TSN model samples the input single-frame image information and optical flow image information at time intervals. Shared or associated feature extraction is performed on the sparsely sampled single-frame image information, and the corresponding category judgment is made to obtain single-frame feature vectors; likewise, shared or associated feature extraction is performed on the sparsely sampled optical flow image information, and the corresponding category judgment is made to obtain optical flow image feature vectors. The single-frame feature vectors and the optical flow image feature vectors are then combined, for example by weighting and averaging, to obtain the action features of the video to be classified.
For example, the TSN model samples the input single-frame image information and optical flow image information at intervals of 5 seconds, computes for each sparsely sampled single-frame feature vector and optical flow image feature vector a score for each category, and averages the scores belonging to the same category, thereby obtaining a frame image score for the single-frame image vectors and an optical flow image score for the optical flow image feature vectors.
Further, a combined score is computed from the frame image score and the optical flow image score by weighted summation. Finally, based on a softmax function, the probability of each category to which the video frame sequence may belong is computed from the combined score, the category with the highest probability is taken as the target category of the video frame sequence, and the action features of the video to be classified are obtained according to the target category.
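The segment-level fusion described in the last two paragraphs can be written down directly. This is a sketch of the fusion arithmetic only (per-stream averaging, weighted summation, softmax); the per-segment scores would come from the two CNN streams of the TSN model, and the stream weights are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def tsn_consensus(rgb_scores, flow_scores, w_rgb=1.0, w_flow=1.5):
    """Fuse per-segment category scores from the single-frame and optical-flow streams.

    rgb_scores, flow_scores: arrays of shape (num_segments, num_classes) for the
    sparsely sampled segments. The stream weights are assumed values.
    """
    rgb_avg = rgb_scores.mean(axis=0)            # average same-category scores per stream
    flow_avg = flow_scores.mean(axis=0)
    combined = w_rgb * rgb_avg + w_flow * flow_avg   # weighted summation of the two scores
    probs = softmax(combined)                    # probability of each category
    return probs, int(np.argmax(probs))          # target category = highest probability

# Example with 3 sparsely sampled segments and 5 candidate categories
probs, target_class = tsn_consensus(np.random.rand(3, 5), np.random.rand(3, 5))
```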
By adopting the above technical scheme, the sparse frame extraction performed on the video frame sequence removes redundant information and reduces the amount of computation, while the TSN model effectively limits the loss of effective information, so the accuracy of video classification can be improved.
Optionally, in step S13, the obtaining a text feature vector according to the text information includes:
and inputting the text information into a BERT model to extract text features of the text information so as to obtain the text feature vector.
Specifically, text information is input to an encoder of a BERT model, the encoder of the BERT model is used for converting the input text information into feature vectors, the feature vectors output by the encoder and the predicted results are input to a decoder of the BERT model, and the decoder of the BERT model is used for outputting conditional probabilities of the final results and converting the conditional probabilities into text feature vectors.
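A minimal sketch of the text branch using the Hugging Face transformers library is given below. The choice of the `bert-base-chinese` checkpoint and the use of the [CLS] hidden state as the text feature vector are assumptions; the worked example earlier uses a 1024-dimensional vector, which would correspond to a larger BERT variant or an extra projection layer.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
bert = AutoModel.from_pretrained("bert-base-chinese")
bert.eval()

def text_feature_vector(text_information: str) -> torch.Tensor:
    """Encode the subtitle/bullet-screen text into a single text feature vector."""
    inputs = tokenizer(text_information, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        outputs = bert(**inputs)
    # Take the [CLS] token's final hidden state as the text feature vector
    return outputs.last_hidden_state[:, 0, :]   # shape (1, 768) for the base model
```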
By adopting the above technical scheme, the BERT model fuses context from both directions in all of its layers over the training samples, which improves text feature extraction and thus further improves the accuracy of video classification.
Optionally, the method further comprises:
inputting the text information, the video frame sequence, and the audio frame sequence into a classification model, so that the target feature vector is obtained from the video frame sequence and the audio frame sequence, the text feature vector is obtained from the text information, and the full connection vector is input into the classifier to obtain the classification result of the video to be classified output by the classifier, wherein the classification model comprises the classifier.
Wherein the classifier may be a softmax regression classification model.
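A softmax regression classifier over the 3,072-dimensional full connection vector can be as small as a single linear layer. The sketch below is one such head; the number of classes is an assumed parameter, not a value given in this disclosure.

```python
import torch
from torch import nn

class SoftmaxClassifier(nn.Module):
    """Softmax regression head over the full connection vector."""
    def __init__(self, in_dim: int = 3072, num_classes: int = 20):  # num_classes is assumed
        super().__init__()
        self.fc = nn.Linear(in_dim, num_classes)

    def forward(self, full_connection_vector: torch.Tensor) -> torch.Tensor:
        logits = self.fc(full_connection_vector)
        return torch.softmax(logits, dim=-1)  # per-category probabilities
```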
Optionally, the training of the classification model includes:
taking the seed video with the classification labels as a training sample;
freezing models except the classifier in the classification model, and training parameters of the classifier according to the training sample, wherein the models except the classifier in the classification model comprise: VGGish model, TSN model, NeXtVLA model.
The classification labels of the seed videos are marked manually, and the same seed video can be provided with a plurality of classification labels according to text information, video frame sequences and audio frame sequences. The loss function of the classifier uses a cross entropy loss function.
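The training procedure — freeze everything except the classifier, then fit the classifier on labeled seed videos with a cross-entropy loss — might look like the sketch below in PyTorch. The frozen model list, the data loader of precomputed full connection vectors, and the hyperparameters are placeholders, and the `.fc` attribute is assumed to expose raw logits as in the head sketched above.

```python
import torch
from torch import nn

def train_classifier(classifier, frozen_models, seed_loader, epochs=5, lr=1e-3):
    """Train only the classifier; frozen_models is e.g. [vggish, tsn, nextvla, bert]."""
    # Freeze every model except the classifier
    for model in frozen_models:
        for p in model.parameters():
            p.requires_grad = False
        model.eval()

    criterion = nn.CrossEntropyLoss()   # cross-entropy loss of the classifier
    optimizer = torch.optim.Adam(classifier.parameters(), lr=lr)

    classifier.train()
    for _ in range(epochs):
        for full_connection_vector, label in seed_loader:
            logits = classifier.fc(full_connection_vector)  # raw logits; CrossEntropyLoss applies softmax
            loss = criterion(logits, label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return classifier
```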
Fig. 4 is a flow chart illustrating another video classification method according to an exemplary embodiment. As shown in fig. 4, the video classification method includes the steps of:
and acquiring the video to be classified, splitting the video to be classified, and obtaining corresponding text information, a video frame sequence and an audio frame sequence.
Further, inputting the text information into a pre-trained BERT model to obtain text characteristic information, wherein the pre-trained BERT model is in a frozen state in the application process. Meanwhile, the video frame sequence is input into a pre-trained TSN model to obtain action characteristics, and the audio frame sequence is input into a pre-trained VGGish model to obtain VGGish characteristics, wherein the pre-trained TSN model and the VGGish model are also in a frozen state in the application process.
Further, the obtained motion feature and VGGish feature are input into a pre-trained NeXtVLA model, the NeXtVLA model performs audio feature clustering on the VGGish feature to obtain an audio feature vector, the motion feature is clustered to obtain an image feature vector, and the audio feature vector and the image feature vector are spliced to obtain a target feature vector. The pre-trained NeXtVLA model is also in a frozen state during the application process.
Further, the text feature information and the target feature information are spliced to obtain full connection vectors, and the full connection vectors are input into a classifier, for example, a softmax regression classification model, so that a classification result of the video to be classified is obtained. And meanwhile, updating and training the softmax regression classification model by using a classification result obtained by the video to be classified. Wherein, the loss function of the classification model adopts a cross entropy loss function.
Compared with the related art, in which the text, the video, and the audio are classified separately and the video class is then determined from their per-category probabilities, this scheme reduces the information loss in processing the video to be classified. The video classification accuracy can therefore be improved, which further improves the accuracy of recommending videos to users and improves the efficiency of video promotion.
Based on the same inventive concept, a video classification apparatus 500 is further provided according to an embodiment of the present disclosure, for performing the steps of the video classification method provided in the above method embodiment, where the apparatus 500 may implement the video classification method in a manner of software, hardware, or a combination of both. Fig. 5 is a block diagram illustrating a video classification apparatus 500 according to an exemplary embodiment. As shown in fig. 5, the apparatus 500 includes: the system comprises an acquisition module 510, a splitting module 520, an execution module 530, a stitching module 540 and a determination module 550.
Wherein the obtaining module 510 is configured to obtain a video to be classified;
the splitting module 520 is configured to split the video to be classified to obtain text information, a video frame sequence and an audio frame sequence of the video to be classified;
the execution module 530 is configured to obtain a target feature vector from the video frame sequence and the audio frame sequence, and obtain a text feature vector from the text information;
the stitching module 540 is configured to stitch the target feature vector and the text feature vector to obtain a full connection vector;
the determining module 550 is configured to input the full connection vector into a classifier, and obtain a classification result of the video to be classified output by the classifier.
The device can improve the accuracy of video classification, further improve the accuracy of video recommendation to users and improve the video popularization efficiency.
Optionally, the executing module 530 includes: the device comprises an extraction sub-module, an identification sub-module, a determination sub-module and an input sub-module.
The extraction submodule is configured to extract a mel frequency cepstrum coefficient of the audio frame sequence, input the mel frequency cepstrum coefficient into a VGGish model, and extract local features of the mel frequency cepstrum coefficient to obtain VGGish features;
the identification sub-module is configured to input the video frame sequence into a TSN model so as to conduct action identification on the video frame sequence to obtain action characteristics;
the determining submodule is configured to splice the VGGish characteristic and the action characteristic to obtain a characteristic to be input;
and the input sub-module is configured to input the feature to be input into a NeXtVLA model to obtain the target feature vector.
Optionally, the TSN model generates the action feature by:
according to the video frame sequence, single-frame image information and optical flow image information of the video frame sequence are determined;
extracting according to the single-frame image information and the optical flow image information at intervals to obtain a sparse sampling result;
and obtaining the action characteristic according to the sparse sampling result.
Optionally, the execution module is configured to: and inputting the text information into a BERT model to extract text features of the text information so as to obtain the text feature vector.
Optionally, the classification model is trained by:
taking the seed video with the classification labels as a training sample;
freezing models except the classifier in the classification model, and training parameters of the classifier according to the training sample, wherein the models except the classifier in the classification model comprise: VGGish model, TSN model, NeXtVLA model.
The specific manner in which the various modules perform the operations in the apparatus of the above embodiments have been described in detail in connection with the embodiments of the method, and will not be described in detail herein.
It should be noted that, for convenience and brevity, the embodiments described in the specification are all preferred embodiments, and the parts related to the embodiments are not necessarily essential to the present invention, for example, the obtaining module 510 and the splitting module 520 may be separate devices or the same device when implemented, which is not limited by the present disclosure.
There is also provided, in accordance with an embodiment of the present disclosure, an electronic device including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
acquiring videos to be classified;
splitting the video to be classified to obtain text information, a video frame sequence and an audio frame sequence of the video to be classified;
obtaining a target feature vector according to the video frame sequence and the audio frame sequence, and obtaining a text feature vector according to the text information;
splicing the target feature vector and the text feature vector to obtain a full connection vector;
and inputting the full connection vector into a classifier to obtain a classification result of the video to be classified, which is output by the classifier.
The disclosed embodiments also provide a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the video classification method provided by the present disclosure.
Fig. 6 is a block diagram illustrating an apparatus 1900 for video classification according to an example embodiment. For example, the apparatus 1900 may be provided as a server. Referring to fig. 6, the apparatus 1900 includes a processing component 1922 that further includes one or more processors and memory resources represented by memory 1932 for storing instructions, such as application programs, that can be executed by the processing component 1922. The application programs stored in memory 1932 may include one or more modules each corresponding to a set of instructions. Further, processing component 1922 is configured to execute instructions to perform the video classification method described above.
The apparatus 1900 may further include a power component 1926 configured to perform power management of the apparatus 1900, a wired or wireless network interface 1950 configured to connect the apparatus 1900 to a network, and an input/output (I/O) interface 1958. The apparatus 1900 may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
Claims (10)
1. A method of video classification, comprising:
acquiring videos to be classified;
splitting the video to be classified to obtain text information, a video frame sequence and an audio frame sequence of the video to be classified;
obtaining a target feature vector according to the video frame sequence and the audio frame sequence, and obtaining a text feature vector according to the text information;
splicing the target feature vector and the text feature vector to obtain a full connection vector;
and inputting the full connection vector into a classifier to obtain a classification result of the video to be classified, which is output by the classifier.
2. The method of claim 1, wherein the deriving the target feature vector from the sequence of video frames and the sequence of audio frames comprises:
extracting a Mel frequency cepstrum coefficient of the audio frame sequence, inputting the Mel frequency cepstrum coefficient into a VGGish model, and extracting local features of the Mel frequency cepstrum coefficient to obtain VGGish features;
inputting the video frame sequence into a TSN model to perform action recognition on the video frame sequence to obtain action characteristics;
splicing the VGGish characteristic and the action characteristic to obtain a characteristic to be input;
and inputting the feature to be input into a NeXtVLA model to obtain the target feature vector.
3. The method of claim 2, wherein the TSN model generates the action feature by:
according to the video frame sequence, single-frame image information and optical flow image information of the video frame sequence are determined;
extracting according to the single-frame image information and the optical flow image information at intervals to obtain a sparse sampling result;
and obtaining the action characteristic according to the sparse sampling result.
4. The method of claim 1, wherein said deriving text feature vectors from said text information comprises:
and inputting the text information into a BERT model to extract text features of the text information so as to obtain the text feature vector.
5. The method according to any one of claims 2-4, wherein the classification model is trained by:
taking the seed video with the classification labels as a training sample;
freezing models except the classifier in the classification model, and training parameters of the classifier according to the training sample, wherein the models except the classifier in the classification model comprise: VGGish model, TSN model, NeXtVLA model.
6. A video classification apparatus, comprising:
the acquisition module is configured to acquire videos to be classified;
the splitting module is configured to split the video to be classified to obtain text information, a video frame sequence and an audio frame sequence of the video to be classified;
the execution module is configured to obtain a target feature vector according to the video frame sequence and the audio frame sequence and obtain a text feature vector according to the text information;
the splicing module is configured to splice the target feature vector and the text feature vector to obtain a full connection vector;
the determining module is configured to input the full connection vector into a classifier to obtain a classification result of the video to be classified, which is output by the classifier.
7. The apparatus of claim 6, wherein the execution module comprises:
the extraction submodule is configured to extract a Mel frequency cepstrum coefficient of the audio frame sequence, input the Mel frequency cepstrum coefficient into a VGGish model, and extract local features of the Mel frequency cepstrum coefficient to obtain VGGish features;
the identification sub-module is configured to input the video frame sequence into a TSN model so as to conduct action identification on the video frame sequence to obtain action characteristics;
the determining submodule is configured to splice the VGGish characteristic and the action characteristic to obtain a characteristic to be input;
and the input sub-module is configured to input the feature to be input into a NeXtVLA model to obtain the target feature vector.
8. The apparatus of claim 7, wherein the TSN model generates the action feature by:
according to the video frame sequence, single-frame image information and optical flow image information of the video frame sequence are determined;
extracting according to the single-frame image information and the optical flow image information at intervals to obtain a sparse sampling result;
and obtaining the action characteristic according to the sparse sampling result.
9. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
acquiring videos to be classified;
splitting the video to be classified to obtain text information, a video frame sequence and an audio frame sequence of the video to be classified;
obtaining a target feature vector according to the video frame sequence and the audio frame sequence, and obtaining a text feature vector according to the text information;
splicing the target feature vector and the text feature vector to obtain a full connection vector;
and inputting the full connection vector into a classifier to obtain a classification result of the video to be classified, which is output by the classifier.
10. A computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the steps of the video classification method of any of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110524503.9A CN113220940B (en) | 2021-05-13 | 2021-05-13 | Video classification method, device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110524503.9A CN113220940B (en) | 2021-05-13 | 2021-05-13 | Video classification method, device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113220940A CN113220940A (en) | 2021-08-06 |
CN113220940B true CN113220940B (en) | 2024-02-09 |
Family
ID=77095560
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110524503.9A Active CN113220940B (en) | 2021-05-13 | 2021-05-13 | Video classification method, device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113220940B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117132939A (en) * | 2023-09-11 | 2023-11-28 | 深圳科腾飞宇科技有限公司 | Object analysis method and system based on video processing |
CN118172713A (en) * | 2024-05-13 | 2024-06-11 | 腾讯科技(深圳)有限公司 | Video tag identification method, device, computer equipment and storage medium |
CN118400575B (en) * | 2024-06-24 | 2024-09-10 | 湖南快乐阳光互动娱乐传媒有限公司 | Video processing method and related device |
CN118470717B (en) * | 2024-07-09 | 2024-10-01 | 苏州元脑智能科技有限公司 | Method, device, computer program product, equipment and medium for generating annotation text |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111626251A (en) * | 2020-06-02 | 2020-09-04 | Oppo广东移动通信有限公司 | Video classification method, video classification device and electronic equipment |
WO2020221278A1 (en) * | 2019-04-29 | 2020-11-05 | 北京金山云网络技术有限公司 | Video classification method and model training method and apparatus thereof, and electronic device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10204274B2 (en) * | 2016-06-29 | 2019-02-12 | Cellular South, Inc. | Video to data |
- 2021-05-13: Application CN202110524503.9A filed in China; patent CN113220940B (status: active)
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020221278A1 (en) * | 2019-04-29 | 2020-11-05 | 北京金山云网络技术有限公司 | Video classification method and model training method and apparatus thereof, and electronic device |
CN111626251A (en) * | 2020-06-02 | 2020-09-04 | Oppo广东移动通信有限公司 | Video classification method, video classification device and electronic equipment |
Non-Patent Citations (1)
Title |
---|
Sports video classification based on genre-indicative shots and a bag-of-words model; Zhu Yingying, Zhu Yanyan, et al.; Journal of Computer-Aided Design & Computer Graphics (09); full text *
Also Published As
Publication number | Publication date |
---|---|
CN113220940A (en) | 2021-08-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113220940B (en) | Video classification method, device, electronic equipment and storage medium | |
CN109145784B (en) | Method and apparatus for processing video | |
CN109587554B (en) | Video data processing method and device and readable storage medium | |
CN110582025B (en) | Method and apparatus for processing video | |
CN109117777B (en) | Method and device for generating information | |
WO2019242222A1 (en) | Method and device for use in generating information | |
CN109871736B (en) | Method and device for generating natural language description information | |
US12001479B2 (en) | Video processing method, video searching method, terminal device, and computer-readable storage medium | |
CN111611436A (en) | Label data processing method and device and computer readable storage medium | |
CN110390033A (en) | Training method, device, electronic equipment and the storage medium of image classification model | |
US20140257995A1 (en) | Method, device, and system for playing video advertisement | |
US11302361B2 (en) | Apparatus for video searching using multi-modal criteria and method thereof | |
CN110309795A (en) | Video detecting method, device, electronic equipment and storage medium | |
CN109640112B (en) | Video processing method, device, equipment and storage medium | |
CN113766299A (en) | Video data playing method, device, equipment and medium | |
CN114095749A (en) | Recommendation and live interface display method, computer storage medium and program product | |
CN114302174A (en) | Video editing method and device, computing equipment and storage medium | |
CN116665083A (en) | Video classification method and device, electronic equipment and storage medium | |
CN110347869B (en) | Video generation method and device, electronic equipment and storage medium | |
CN114186074A (en) | Video search word recommendation method and device, electronic equipment and storage medium | |
CN113573097A (en) | Video recommendation method and device, server and storage medium | |
CN113395584A (en) | Video data processing method, device, equipment and medium | |
KR101674310B1 (en) | System and method for matching advertisement for providing advertisement associated with video contents | |
CN113472834A (en) | Object pushing method and device | |
CN116437114A (en) | Live caption transfer method, device, equipment and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |