CN113449148B - Video classification method, device, electronic equipment and storage medium - Google Patents

Video classification method, device, electronic equipment and storage medium

Info

Publication number
CN113449148B
Authority
CN
China
Prior art keywords
image
audio
video
audio pair
feature extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110707843.5A
Other languages
Chinese (zh)
Other versions
CN113449148A (en)
Inventor
吴文灏
夏博洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110707843.5A
Publication of CN113449148A
Application granted
Publication of CN113449148B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/75 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques

Abstract

The disclosure provides a video classification method and apparatus, an electronic device, and a storage medium, and relates to the technical fields of computer vision and deep learning. The specific implementation scheme is as follows: acquiring a video to be classified, and acquiring at least one video clip from the video; acquiring an image-audio pair corresponding to each video clip; acquiring feature information of each image-audio pair, the feature information being used to characterize the video clip corresponding to the image-audio pair; screening at least one salient video segment from the at least one video clip according to the feature information of each image-audio pair; and classifying the video according to the at least one salient video segment. The scheme greatly reduces the amount of computation required for video classification and achieves a balance between classification accuracy and computational cost.

Description

Video classification method, device, electronic equipment and storage medium
Technical Field
The disclosure relates to the field of artificial intelligence, in particular to the technical fields of computer vision and deep learning, which can be applied in video analysis scenarios, and specifically relates to a video classification method and apparatus, an electronic device, and a storage medium.
Background
With the proliferation of video data in recent years, video classification technology has shown great application potential in video surveillance, recommendation, retrieval, and the like. Video classification technology has two main evaluation indicators: computational cost and classification accuracy. At present, video classification technology has made great progress in recognition accuracy, but its application in real-world scenarios still faces challenges due to its high computational cost.
Disclosure of Invention
The present disclosure provides a video classification method, apparatus, electronic device, and storage medium, which can greatly reduce the computation cost while not reducing the video classification accuracy.
According to a first aspect of the present disclosure, there is provided a video classification method, comprising:
acquiring a video to be classified, and acquiring at least one video clip from the video;
acquiring an image-audio pair corresponding to each video clip;
acquiring feature information of each image-audio pair, wherein the feature information is used to characterize the video clip corresponding to the image-audio pair;
screening at least one salient video segment from the at least one video clip according to the feature information of each image-audio pair; and
classifying the video according to the at least one salient video segment.
According to a second aspect of the present disclosure, there is provided a video classification apparatus comprising:
a first acquisition module, configured to acquire a video to be classified and acquire at least one video clip from the video;
a second acquisition module, configured to acquire an image-audio pair corresponding to each video clip;
a third acquisition module, configured to acquire feature information of each image-audio pair, wherein the feature information is used to characterize the video clip corresponding to the image-audio pair;
a screening module, configured to screen at least one salient video segment from the at least one video clip according to the feature information of each image-audio pair; and
a classification module, configured to classify the video according to the at least one salient video segment.
According to a third aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect described above.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of the first aspect.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of the first aspect described above.
According to the technical scheme of the disclosure, the video is divided into a plurality of image-audio pairs, and the corresponding video clips are represented by the feature information of the image-audio pairs, so that the video clips that are useful for video classification can be screened out from the video to be classified, and only the screened video clips are classified. This greatly reduces the amount of computation and improves the inference speed of video classification, so that a balance between classification accuracy and computational cost can be achieved, which facilitates deployment in practical scenarios.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of a video classification method according to an embodiment of the present disclosure;
FIG. 2 is a flow chart of acquiring feature information for each image-audio pair according to an embodiment of the present disclosure;
FIG. 3 is a flow chart of a video clip screening process according to an embodiment of the present disclosure;
FIG. 4 is a flow chart of generating a prominence score as set forth in an embodiment of the disclosure;
FIG. 5 is a schematic diagram of a model structure of video segment screening in an embodiment of the disclosure;
FIG. 6 is a flow chart of video clip screening according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of another model structure for video clip screening in an embodiment of the present disclosure;
FIG. 8 is a flow chart of constructing an image-audio pair feature extraction model in accordance with an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of a knowledge distillation training model in an embodiment of the disclosure;
fig. 10 is a block diagram of a video classification apparatus according to an embodiment of the present disclosure;
FIG. 11 is a block diagram of another video classification apparatus according to an embodiment of the disclosure;
FIG. 12 is a block diagram of still another video classification apparatus according to an embodiment of the disclosure;
Fig. 13 is a block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
With the proliferation of video data in recent years, video classification technology has shown great application potential in video surveillance, recommendation, retrieval, and the like. Video classification technology has two main evaluation indicators: computational cost and classification accuracy. At present, video classification technology has made great progress in recognition accuracy, but its application in real-world scenarios still faces challenges due to its high computational cost.
In existing schemes, there are three main ideas for improving video classification efficiency. The first is to skip or select key frames; most such methods use reinforcement learning to model sequential decisions, such as how many frames to skip forward from each frame or which informative frames to select, but they often still require reference to the information of all video frames. The second is to use the idea of conditional computation to dynamically change the resolution, the model capacity, or the network route according to different network inputs, thereby saving computation. The third is to save computation with lightweight 3D convolutional neural networks.
The inventors of the present disclosure have found that existing methods rely only on the visual modality of video, which contains considerable redundancy both within and among video segments. At the segment level, there is redundancy between consecutive frames due to frame-to-frame continuity, while at the video level, important, salient segments are sparse. Using only the visual modality therefore inevitably introduces much redundant information, and a great deal of computation in the video classification process is wasted.
To address the above problems, the present disclosure provides a video classification method, a video classification apparatus, an electronic device, and a storage medium. The technical scheme minimizes the use of the visual modality on the one hand and makes maximal use of the temporal information in the audio modality on the other hand, thereby achieving a balance between video classification accuracy and computational cost.
Fig. 1 is a flowchart of a video classification method according to an embodiment of the disclosure. It should be noted that the video classification method according to the embodiments of the present disclosure may be applied to the video classification apparatus according to the embodiments of the present disclosure, where the video classification apparatus may be configured in an electronic device. As shown in fig. 1, an implementation of the video classification method may include:
Step 101, obtaining a video to be classified, and obtaining at least one video clip from the video to be classified.
In the embodiment of the disclosure, the video to be classified may be a short video, a movie video, or a video to be classified in other aspects. It will be appreciated that each video to be classified is of a different size and contains a different amount of information in each frame. At least one video segment is obtained from the video to be classified, which is equivalent to dividing the video to be classified into a plurality of video segments with consistent length, so that the extraction and classification of the subsequent features can be facilitated.
Specifically, the at least one video clip may be obtained by dividing the video to be classified into clips every fixed number of frames according to its timing information, so as to obtain a plurality of consecutive video clips with a fixed number of frames. The number of frames per clip may be 8, 16, or another positive integer power of 2, which is not limited in this disclosure. A minimal sketch of this division is given below.
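As an illustrative sketch only (the function name and the 16-frame default are assumptions, not values fixed by the disclosure), the division into fixed-length clips might look as follows:

```python
# Illustrative sketch: split a decoded frame sequence into consecutive
# fixed-length clips. Tail frames that do not fill a whole clip are dropped
# here; padding or looping them would be an equally valid choice.
from typing import List, Sequence


def split_into_clips(frames: Sequence, frames_per_clip: int = 16) -> List[Sequence]:
    n_clips = len(frames) // frames_per_clip
    return [
        frames[i * frames_per_clip:(i + 1) * frames_per_clip]
        for i in range(n_clips)
    ]
```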
Step 102, obtaining an image-audio pair corresponding to each video clip.
It will be appreciated that each video clip is composed of audio and corresponding images per frame, the images within each video clip may represent the content of the corresponding video clip, and the audio within each video clip may represent the temporal dynamics of the corresponding video clip, such that in the disclosed embodiment each video clip is represented using its corresponding image and audio.
It should be noted that, to reduce the use of the visual modality, the number of images in the image-audio pair corresponding to each video clip should be as small as possible. Because the first frame of a video clip often contains most of the clip's content, the image of the first frame may be used to represent each video clip. That is, each image-audio pair may contain only one image, which minimizes the use of the visual modality and maximizes the use of the temporal information in the audio modality, so that subsequent feature learning takes place in a lower dimension, which in turn reduces the computational cost. A sketch of this pairing is given after this paragraph.
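An illustrative sketch of building such image-audio pairs is given below; the ImageAudioPair container, the samples_per_clip parameter, and the exact audio alignment are assumptions rather than details specified by the disclosure.

```python
# Illustrative sketch: one image-audio pair per clip, using the clip's first
# frame as the image and the time-aligned audio samples as the audio part.
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ImageAudioPair:
    image: object   # first frame of the clip (e.g. an H x W x 3 array)
    audio: object   # audio samples covering the same time span as the clip


def build_image_audio_pairs(clips: Sequence[Sequence],
                            audio_samples: Sequence,
                            samples_per_clip: int) -> List[ImageAudioPair]:
    # samples_per_clip = audio sample rate / frame rate * frames per clip,
    # i.e. the number of audio samples that span one video clip (assumption).
    pairs = []
    for i, clip in enumerate(clips):
        span = audio_samples[i * samples_per_clip:(i + 1) * samples_per_clip]
        pairs.append(ImageAudioPair(image=clip[0], audio=span))
    return pairs
```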
And step 103, acquiring characteristic information of each image-audio pair. Wherein the characteristic information of the image-audio pair is used to characterize the video clip corresponding to the image-audio pair.
That is, the feature information of each image-audio pair is acquired to characterize the video clip corresponding to that image-audio pair. Because only the feature information of the image-audio pair is used to characterize the corresponding video clip, both the amount of computation for feature extraction and the amount of computation for the subsequent screening of video clips can be reduced.
As an example, the implementation of acquiring the feature information of each image-audio pair may be: and respectively inputting each image-audio pair into a preset image-audio pair feature extraction model to obtain feature information of each image-audio pair. The image-audio pair feature extraction model is trained by video clip samples and image-audio pair sample data corresponding to the video clip samples.
Step 104, at least one salient video segment is selected from at least one video segment according to the characteristic information of each image-audio pair.
It will be appreciated that the saliency information of the corresponding video clip may be determined from the feature information of each image-audio pair; that is, whether the corresponding video clip is useful for the video classification may be determined from the feature information of each image-audio pair. Video clips that are not useful for the video classification can thus be filtered out, and at least one video clip that is useful for the video classification, namely a salient video segment, can be screened out from the at least one video clip.
For example, whether a video clip is useful for video classification may be determined from the feature information of its image-audio pair as follows: the context information of the video clips can be obtained from the feature information of each image-audio pair, so as to further mine the saliency information of the corresponding clip and determine whether it is useful for classification; alternatively, representation learning may be performed on the feature information of each image-audio pair to determine the amount of information contained in the corresponding video clip, where clips containing more information are more useful for classification. Other screening methods not mentioned in this disclosure may also be used, and the disclosure is not limited in this regard.
Step 105, classifying the video according to the at least one salient video segment.
It can be understood that the above operations are equivalent to preprocessing the video to be classified, so that the video can be classified using only the salient video segments, which reduces the amount of computation for video classification and improves classification efficiency.
As an example, the video may be classified as follows: input the at least one salient video segment into a preset video classification model to obtain a classification result for each salient video segment, and determine the category of the video according to these classification results. Each salient video segment corresponds to one classification result; the classification results of the salient video segments can be aggregated, and the category of the video determined from the aggregated result. For example, the classification results of the salient video segments may be averaged, and the averaged result taken as the category of the video, as in the sketch below.
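An illustrative sketch of this averaging step, assuming the preset video classification model outputs one logit vector per salient segment, might look as follows:

```python
# Illustrative sketch: aggregate per-segment classification results by
# averaging their class probabilities and taking the highest-scoring class.
import torch


def classify_video(segment_logits: torch.Tensor) -> int:
    # segment_logits: (num_salient_segments, num_classes) raw scores produced
    # by the preset video classification model for each salient segment.
    probs = segment_logits.softmax(dim=-1)   # per-segment class probabilities
    video_probs = probs.mean(dim=0)          # average over salient segments
    return int(video_probs.argmax().item())  # predicted video category
```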
According to the video classification method provided by the embodiments of the disclosure, the video is divided into a plurality of image-audio pairs, and the corresponding video clips are represented by the feature information of the image-audio pairs, so that the video clips that are useful for video classification can be screened out from the video to be classified, and only the screened video clips are classified. This greatly reduces the amount of computation and improves the inference speed of video classification, so that a balance between classification accuracy and computational cost can be achieved, which facilitates deployment in practical scenarios.
Based on the above-described embodiments, in order to further describe the implementation process of acquiring the feature information of each image-audio pair, the present disclosure proposes another embodiment.
Fig. 2 is a flowchart of acquiring the feature information of each image-audio pair according to an embodiment of the present disclosure. This embodiment is described by taking a preset image-audio pair feature extraction model as an example. As shown in fig. 2, the feature information of the image-audio pairs may be obtained as follows:
in step 201, each image-audio pair is input to a preset image-audio pair feature extraction model. Wherein the image-audio pair feature extraction model comprises an image feature extraction layer and an audio feature extraction layer.
It should be noted that, in the embodiment of the present disclosure, the image-audio pair feature extraction model is trained by video clip samples, and image audio pair sample data corresponding to the video clip samples. Wherein the characteristic information of the image-audio pair includes: image feature information and audio feature information of the image-audio pair.
It can be understood that the image-audio pair feature extraction model extracts the image feature information and the audio feature information of an image-audio pair separately, so that the feature information expresses the information of the corresponding video clip as fully as possible, and the result obtained by classifying the video using the feature information of the image-audio pair can be consistent with the result of classifying the video clip directly. The image feature extraction layer is used for extracting features from the image in each image-audio pair, and the audio feature extraction layer is used for extracting features from the audio in each image-audio pair.
Step 202, extracting features of the images in each image-audio pair through an image feature extraction layer to obtain image feature information of each image-audio pair.
And 203, performing feature extraction on the audio in each image-audio pair through an audio feature extraction layer to obtain audio feature information of each image-audio pair.
According to the video classification method provided by this embodiment of the disclosure, the image-audio pair feature extraction model is used to extract the features of each image-audio pair, and the corresponding video clips are represented by the extracted feature information, so that the subsequent screening process can learn on a low-dimensional representation, which reduces the amount of computation for video classification. In addition, the audio feature information and the image feature information of each image-audio pair are extracted separately, and the extracted feature information can express the corresponding video clip, so the accuracy of the screening process, and hence of the video classification, is preserved. A minimal sketch of such a two-branch extractor is given below.
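An illustrative sketch of such a two-branch extractor is given below; the small CNN backbones, the spectrogram input of the audio branch, and the feature dimension are assumptions, since the disclosure does not prescribe specific architectures.

```python
# Illustrative sketch: a two-branch feature extractor for image-audio pairs.
import torch
import torch.nn as nn


class ImageAudioPairFeatureExtractor(nn.Module):
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        # Image branch: consumes the clip's single RGB frame.
        self.image_branch = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Audio branch: consumes a 1-channel spectrogram of the clip's audio.
        self.audio_branch = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, image: torch.Tensor, audio_spec: torch.Tensor):
        # Returns the image feature and the audio feature separately,
        # corresponding to steps 202 and 203 above.
        return self.image_branch(image), self.audio_branch(audio_spec)
```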
Based on the above embodiments, for a detailed description of the video clip screening process, the present disclosure proposes yet another embodiment.
Fig. 3 is a flowchart of a video clip screening process according to an embodiment of the present disclosure. As shown in fig. 3, the screening of the video clip may include the steps of:
Step 301, for each image-audio pair, performing fusion processing on image feature information and audio feature information in the feature information of the image-audio pair, so as to obtain fusion features of the image-audio pair.
It can be understood that the feature information of each image-audio pair includes image feature information and audio feature information. The fusion processing splices the features of the visual modality and the audio modality, so that the fused feature carries both the visual information and the temporal feature information of the video, providing a basis for effectively screening the video clips.
Step 302, generating a saliency score for the image-audio pair based on the fusion features.
That is, the importance of each image-audio pair, i.e. its usefulness for video classification, is determined from the fusion feature and represented by a saliency score. A larger saliency score indicates that the video clip corresponding to the image-audio pair contains more information, that is, the clip is more important for the video classification. Conversely, a smaller saliency score indicates that the corresponding video clip contains less information, that is, it contributes less to the video classification.
As one example, a method of generating a prominence score may be: according to the fusion characteristics and in combination with the video context information, the saliency score of each image-audio pair is obtained through calculation of a classifier. Wherein the saliency score for each image-audio pair may be a number between 0 and 1.
At step 303, at least one salient video segment is selected from the at least one video segment according to the salient score.
It can be appreciated that the importance of the corresponding video clips can be determined according to the saliency scores, so that the video clips that are useful for the video classification can be screened out from the at least one video clip, the useless ones can be filtered out, and the screened clips can be used as the salient video segments.
As an example, a threshold may be preset for the saliency score of an image-audio pair. If the saliency score of the image-audio pair exceeds the threshold, the video clip corresponding to the image-audio pair is useful for classifying the video; otherwise, the video clip corresponding to the image-audio pair is treated as a redundant frame.
According to the video classification method provided by the embodiment of the disclosure, the feature information of each image-audio pair is fused and a saliency score is then generated, which is equivalent to determining the importance of the corresponding video clip from the feature information of each image-audio pair. Salient video segments that are useful for video classification can thus be screened out of the video clips of the video to be classified according to the importance of each clip, which on the one hand reduces the computation spent on redundant frames in video classification and on the other hand preserves the accuracy of video classification.
For the generation of the saliency score of each image-audio pair, the present disclosure proposes yet another embodiment.
Fig. 4 is a flowchart illustrating the generation of a saliency score according to an embodiment of the present disclosure. As shown in fig. 4, the generation of the saliency score of each image-audio pair includes the following steps:
in step 401, for each image-audio pair, contextual characteristic information of the image-audio pair is acquired.
It will be appreciated that the video to be classified is divided into at least one video clip and that there is continuity between video clips, so the information contained in each clip can be further determined with the help of context information.
As one example, a long short-term memory (LSTM) network may be used to obtain the context feature information of the image-audio pairs. For example, the video clips are ordered according to the timing information of the video to be classified, the fusion feature of each image-audio pair is input into the LSTM in that order, and the hidden state output by the LSTM for each image-audio pair is used as the context feature information of the next adjacent video clip.
Step 402, generating hidden state information of the image-audio pair according to the context feature information and the fusion feature.
It can be understood that the hidden state information of the image-audio pair, obtained from the context feature information and the fusion feature, incorporates both the image and audio features of the pair and the context information of the corresponding video clip at the level of the whole video to be classified, which improves the effectiveness of the saliency score and hence the accuracy of the video clip screening.
Continuing the above example, the fusion features of the image-audio pairs are fed into the LSTM in order, and the context feature information of each image-audio pair is also fed into the LSTM; that is, the hidden state of each image-audio pair serves as an input of the model for the next adjacent image-audio pair.
Step 403, determining the saliency score of the image-audio pair according to the hidden state information.
As one example, hidden state information may be entered into the fully connected layer, generating a saliency score between 0-1.
In order to make the implementation of the above steps more intuitive, the screening process for video clips is described below with reference to a network model structure. Fig. 5 is a schematic diagram of a model structure for video segment screening in an embodiment of the disclosure. As shown in fig. 5, the model structure includes a fusion layer 510 and a saliency score evaluation layer 520, where the saliency score evaluation layer 520 includes a long short-term memory (LSTM) network 521 and a fully connected layer 522. For the image-audio pair (i-1), the image-audio pair (i), and the image-audio pair (i+1), ordered according to the video timing, the image feature information and the audio feature information are input into the fusion layer 510 for fusion, and the fusion feature together with the context feature information is input into the LSTM 521 to obtain the corresponding hidden state information. The context feature information of the image-audio pair (i-1) is the hidden state (i-2), i.e. the output of the LSTM for the image-audio pair (i-2), and the LSTM 521 then produces the hidden state (i-1). The context feature information of the image-audio pair (i) is the hidden state (i-1), i.e. the output of the LSTM for the image-audio pair (i-1), and the LSTM 521 then produces the hidden state (i). The context feature information of the image-audio pair (i+1) is the hidden state (i), i.e. the output of the LSTM for the image-audio pair (i). After the hidden state information of each image-audio pair is input into the fully connected layer 522, the saliency score of each image-audio pair is obtained, and the video clips to be classified are screened according to the saliency scores. A minimal sketch of this structure is given below.
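An illustrative sketch mirroring the structure of fig. 5 (fusion layer, LSTM, fully connected layer) is given below; the feature and hidden dimensions, the ReLU after fusion, and the single-layer LSTM are assumptions, not details fixed by the disclosure.

```python
# Illustrative sketch: fuse image and audio features, carry context across
# image-audio pairs with an LSTM, and score each pair with a fully
# connected layer followed by a sigmoid.
import torch
import torch.nn as nn


class SaliencyScorer(nn.Module):
    def __init__(self, feat_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.fusion = nn.Linear(2 * feat_dim, hidden_dim)              # fusion layer 510
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)  # LSTM 521
        self.fc = nn.Linear(hidden_dim, 1)                             # fully connected layer 522

    def forward(self, image_feats: torch.Tensor, audio_feats: torch.Tensor) -> torch.Tensor:
        # image_feats, audio_feats: (batch, num_pairs, feat_dim), ordered by
        # the video's timing information.
        fused = torch.relu(self.fusion(torch.cat([image_feats, audio_feats], dim=-1)))
        hidden, _ = self.lstm(fused)   # the hidden state of pair i is context for pair i+1
        return torch.sigmoid(self.fc(hidden)).squeeze(-1)  # saliency scores in (0, 1)
```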
According to the video classification method provided by the embodiment of the disclosure, the context feature information of each image-audio pair is taken into account before the saliency score is generated, and the saliency score is generated from the hidden state obtained from the fusion feature and the context feature information. The generated saliency scores are therefore more effective and the screening of video clips more accurate, so that video classification can be performed only on the important video clips and a balance between classification accuracy and computational cost is achieved.
Based on the above embodiment, discrete processing may be performed on the obtained saliency score, and screening of video clips may be performed according to the discrete values.
Fig. 6 is a flowchart of another video clip filtering according to an embodiment of the present disclosure. As shown in fig. 6, the process of screening video clips further includes, based on the above embodiment:
Step 603, performing discretization on the saliency score of each image-audio pair to obtain a discrete value for each image-audio pair.
It will be appreciated that the saliency score of an image-audio pair may be discrete processed to convert the saliency score into discrete values, with different discrete values indicating whether the video clip is valid for video classification.
As an example, the discrete value obtained after discretization may be in {0, 1}, where 0 indicates that the corresponding video clip is a redundant frame and is not useful for video classification, and 1 indicates that the corresponding video clip carries a large amount of information and is useful for video classification. Further, the discretization may use a Gumbel-Softmax style procedure: for example, Gumbel noise is added to the saliency score, the result is passed through an activation function (sigmoid), and the final decision in {0, 1} is obtained by comparison with a threshold. A minimal sketch is given below.
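An illustrative sketch of one common Gumbel-sigmoid formulation of this step is given below; converting the score to a logit, the temperature, and the threshold value are assumptions, and during training a straight-through estimator would typically replace the hard threshold to keep the step differentiable.

```python
# Illustrative sketch: discretize saliency scores into {0, 1} decisions by
# adding Gumbel noise, applying a sigmoid, and thresholding.
import torch


def discretize_scores(scores: torch.Tensor,
                      temperature: float = 1.0,
                      threshold: float = 0.5) -> torch.Tensor:
    eps = 1e-6
    s = scores.clamp(eps, 1.0 - eps)
    logit = torch.log(s) - torch.log1p(-s)                  # score -> logit
    u = torch.rand_like(s).clamp(eps, 1.0 - eps)
    gumbel = -torch.log(-torch.log(u))                      # Gumbel(0, 1) noise
    noisy = torch.sigmoid((logit + gumbel) / temperature)   # sigmoid activation
    return (noisy > threshold).float()                      # 1 = salient, 0 = redundant
```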
At step 604, at least one salient video segment is selected from the at least one video segment based on the discrete values.
It should be noted that 601 to 602 in fig. 6 are consistent with the implementation of 301 to 302 in fig. 3, and will not be described here again.
In order to facilitate understanding of the screening process of video clips in the embodiments of the present disclosure, a description is given below with reference to a network model. As an example, fig. 7 is a schematic diagram of another network model structure for video segment screening according to an embodiment of the present disclosure. Compared with fig. 5, it further includes a discretization module 730, which further processes the obtained saliency scores to obtain discrete values such as {0, 1}. When screening video clips, the clips with a discrete value of 1 are kept as salient video segments, the clips with a discrete value of 0 are filtered out as redundant video segments, and the resulting salient video segments are used for video classification. It should be noted that 710 and 720 in fig. 7 have the same model structures as 510 and 520 in fig. 5, and are not described here again.
According to the video classification method provided by the embodiment of the disclosure, discretization is applied to the obtained saliency scores, so that the scores are further processed and the gap between different image-audio pairs is widened. This further improves the accuracy of video clip screening, makes the video classification more accurate, and at the same time further reduces the computational cost.
In the above embodiment, the image-audio pair feature extraction model is trained by video clip samples, and image-audio pair sample data corresponding to the video clip samples. The present disclosure will next propose yet another embodiment for the construction of the model.
Fig. 8 is a flowchart of a method for constructing an image-audio pair feature extraction model according to an embodiment of the present disclosure. As shown in fig. 8, the construction process of the model includes:
step 801, a knowledge distillation training model is constructed. The knowledge distillation training model comprises an original network and a learning network; the original network is a preset video classification model and comprises a video feature extraction layer, a first intermediate feature extraction layer and a classification head; the learning network comprises an image-audio pair feature extraction layer, a fusion layer, a second intermediate feature extraction layer and a classification head.
It should be noted that knowledge distillation transfers knowledge from one network to another; the two networks may be homogeneous or heterogeneous. In general, knowledge distillation can be used to compress a large network into a small network while preserving performance close to that of the large network, or to transfer knowledge learned by multiple networks into a single network.
In the embodiment of the disclosure, a knowledge distillation training model is constructed by adopting a knowledge distillation mode. As shown in fig. 9, the model includes an original network and a learning network, wherein the original network is a video classification model, that is, an image-audio pair feature extraction model is trained by means of a preset video classification network. In addition, since the image-audio pair feature extraction model extracts features of the image and the audio respectively, in order to enable features extracted by the image-audio pair feature extraction model to have a basis for comparison with features extracted by an original network, in the embodiment of the present disclosure, a fusion layer and an intermediate feature extraction layer are constructed in a learning network, so that dimensions of the output features and video features extracted by the original network are consistent.
Step 802, obtaining a video clip sample and an image-audio pair sample corresponding to the video clip sample.
Step 803, taking the image-audio pair sample as input of a learning network, respectively carrying out feature extraction on the image and the audio in the image-audio pair sample through an image-audio pair feature extraction layer, carrying out fusion processing on the extracted image features and audio features through a fusion layer, and carrying out feature extraction on an output result of the fusion layer through a first intermediate feature extraction layer to obtain a first intermediate feature of the image-audio pair sample.
As shown in fig. 9, the image-audio pair feature extraction layer includes an image feature extraction layer for feature extraction of an image in an image-audio pair sample and an audio feature extraction layer for feature extraction of audio in the image-audio pair sample. That is, the image in the image-audio pair sample is input to the image feature extraction layer to obtain the corresponding image feature, and the audio in the image-audio pair sample is input to the audio feature extraction layer to obtain the corresponding audio feature. Before the audio is input to the audio feature extraction layer, the audio can be converted into an audio spectrogram form, and feature extraction is performed on the audio spectrogram, so that deviation caused by noise is avoided. In addition, the fusion layer is used for carrying out fusion processing on the image characteristic information and the audio characteristic information of the obtained image-audio pair, namely splicing the image characteristic and the audio characteristic to form a combined characteristic. And then the fused characteristic information can be further processed through a first intermediate characteristic extraction layer to obtain a first intermediate characteristic consistent with the dimension of the characteristic information output by the original network.
Step 804, performing classification processing on the first intermediate feature through the classification head to obtain a first classification result of the image-audio pair sample.
The classification head is used to compute the classification result from the learned feature information, and may consist of a fully connected layer.
Step 805, taking the video clip sample as an input of the original network, performing feature extraction on the video clip sample through the video feature extraction layer, and performing feature extraction on an output result of the video feature extraction layer through the second intermediate feature extraction layer, so as to obtain a second intermediate feature of the video clip sample.
Step 806, performing classification processing on the second intermediate features through the classification head, so as to obtain a second classification result of the video segment sample.
In step 807, a first loss value is obtained based on the first intermediate feature and the second intermediate feature.
It can be understood that, in order for the image-audio pair feature extraction layer to express the features of an image-audio pair strongly, the first intermediate features obtained by the learning network are compared with the second intermediate features obtained from the corresponding video clips through the original network to obtain the first loss value. The knowledge distillation training model is then trained continuously according to the first loss value, so that the first intermediate features obtained through the image-audio pair feature extraction layer are as consistent as possible with the second intermediate features obtained from the corresponding video clips through the original network.
Step 808, obtaining a second loss value according to the first classification result and the second classification result.
It can be understood that if the second classification result of the video clip sample obtained by the video classification model is consistent with the first classification result of the corresponding image-audio pair sample obtained by the learning network, the feature information obtained by the image-audio pair sample through the image-audio pair feature extraction layer can fully represent the corresponding video clip sample, so that the final classification result can be consistent. Therefore, in embodiments of the present disclosure, the knowledge distillation training model is trained using a second loss value derived from the first classification result and the second classification result. Wherein the calculation of the second loss value may use a KL-divergence loss.
It should be noted that, when computing the second loss value, the first classification result and the second classification result must correspond to the same input: for example, the first classification result is obtained by inputting a certain image-audio pair into the learning network, and the second classification result is obtained by inputting the video clip corresponding to that image-audio pair into the original network.
Step 809, training a knowledge distillation training model based on the first loss value and the second loss value.
In the embodiment of the disclosure, the knowledge distillation training model is jointly trained according to the first loss value and the second loss value, and parameters of the knowledge distillation training model are continuously adjusted until the first loss value and the second loss value meet expectations, and then the trained knowledge distillation training model is obtained.
Step 810, taking the image-audio pair feature extraction layer in the trained knowledge distillation training model as the image-audio pair feature extraction model.
It can be understood that, in the trained knowledge distillation training model, the feature information that the image-audio pair feature extraction layer extracts from an image-audio pair is almost consistent with the video features that the video classification model extracts from the corresponding video clip. This indicates that the image-audio pair feature extraction layer in the knowledge distillation training model can be used to extract image-audio pair feature information, and that this feature information can express the corresponding video clip. Thus, the image-audio pair feature extraction layer in the trained knowledge distillation training model can be used as the image-audio pair feature extraction model. A minimal sketch of this training setup is given below.
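An illustrative sketch of one training step of such a distillation setup is given below. The mean-squared error used for the first (feature) loss, the loss weights, the optimizer interface, and keeping the original network frozen are assumptions; the disclosure only specifies that the second loss may use a KL divergence and that both losses jointly train the model.

```python
# Illustrative sketch: one joint training step of the knowledge distillation
# training model. Both networks are assumed to return
# (intermediate_feature, classification_logits).
import torch
import torch.nn.functional as F


def distillation_step(learning_net, original_net, optimizer,
                      images, audio_specs, clips,
                      feat_weight: float = 1.0, kl_weight: float = 1.0) -> float:
    # Original network: video clip -> second intermediate feature + logits.
    # Keeping it frozen is an assumption (it is a preset classification model).
    with torch.no_grad():
        teacher_feat, teacher_logits = original_net(clips)

    # Learning network: image-audio pair -> first intermediate feature + logits.
    student_feat, student_logits = learning_net(images, audio_specs)

    # First loss: align the first intermediate feature with the second one.
    feat_loss = F.mse_loss(student_feat, teacher_feat)

    # Second loss: KL divergence between the two classification results.
    kl_loss = F.kl_div(F.log_softmax(student_logits, dim=-1),
                       F.softmax(teacher_logits, dim=-1),
                       reduction="batchmean")

    loss = feat_weight * feat_loss + kl_weight * kl_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```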
According to the video classification method provided by the embodiment of the disclosure, the image-audio frequency pair feature extraction model is trained in a knowledge distillation mode, which is equivalent to training the image-audio frequency pair feature extraction model by using a complex video classification model, so that the model training efficiency can be improved, the calculated amount in the model training process can be reduced, and the generalization effect of the model can be improved.
In order to implement the above embodiments, the present disclosure proposes a video classification apparatus.
Fig. 10 is a block diagram of a video classification device according to an embodiment of the disclosure. As shown in fig. 10, the apparatus includes:
a first obtaining module 1010, configured to obtain a video to be classified, and obtain at least one video segment from the video;
a second obtaining module 1020, configured to obtain an image-audio pair corresponding to each video clip;
a third acquiring module 1030, configured to acquire feature information of each image-audio pair; the feature information is used to characterize the video clip corresponding to the image-audio pair;
a screening module 1040, configured to screen at least one salient video segment from at least one video segment according to the feature information of each image-audio pair;
the classification module 1050 is configured to classify the video according to at least one salient video segment.
In some embodiments of the present disclosure, the third obtaining module 1030 is specifically configured to:
inputting each image-audio pair into a preset image-audio pair feature extraction model respectively to obtain feature information of each image-audio pair; the image-audio pair feature extraction model is trained by video clip samples and image-audio pair sample data corresponding to the video clip samples.
Wherein the feature information includes image feature information and audio feature information of an image-audio pair; the third obtaining module 1030 is specifically configured to:
inputting each image-audio pair into a preset image-audio pair feature extraction model respectively; the image-audio pair feature extraction model comprises an image feature extraction layer and an audio feature extraction layer;
respectively carrying out feature extraction on the images in each image-audio pair through an image feature extraction layer to obtain image feature information of each image-audio pair;
and respectively carrying out feature extraction on the audio in each image-audio pair through an audio feature extraction layer to obtain the audio feature information of each image-audio pair.
In some embodiments of the present disclosure, the screening module 1040 includes:
a fusion unit 1041, configured to, for each image-audio pair, perform fusion processing on image feature information and audio feature information in the feature information of the image-audio pair, so as to obtain fusion features of the image-audio pair;
a generating unit 1042 for generating a saliency score of the image-audio pair according to the fusion feature;
a screening unit 1043, configured to screen at least one salient video segment from at least one video segment according to the saliency score.
The generating unit 1042 specifically is configured to:
for each image-audio pair, acquiring context feature information of the image-audio pair;
generating hidden state information of the image-audio pair according to the context feature information and the fusion feature;
a saliency score for the image-audio pair is determined based on the hidden state information.
Furthermore, in some embodiments of the present disclosure, classification module 1050 is specifically configured to:
inputting at least one salient video segment into a preset video classification model to obtain a classification result of each salient video segment;
and determining the category of the video according to the classification result of each salient video segment.
According to the video classification apparatus provided by the embodiment of the disclosure, the video is divided into a plurality of image-audio pairs, and the corresponding video clips are represented by the feature information of the image-audio pairs, so that the video clips that are useful for video classification can be screened out from the video to be classified, and only the screened video clips are classified. This greatly reduces the amount of computation and improves the inference speed of video classification, so that a balance between classification accuracy and computational cost can be achieved, which facilitates deployment in practical scenarios.
In order to further improve accuracy of video clip screening, another video classification device is provided in an embodiment of the disclosure.
Fig. 11 is a block diagram illustrating another video classification apparatus according to an embodiment of the disclosure. On the basis of the above embodiment, as shown in fig. 11, the screening module 1140 in the apparatus further includes:
a discrete unit 1144, configured to perform discrete processing on the saliency score of each image-audio pair, so as to obtain a discrete value of each image-audio pair.
Wherein the screening unit 1143 is further configured to:
screening out, from the at least one video clip, the video clips corresponding to the image-audio pairs whose discrete value is a preset value, according to the discrete values of the image-audio pairs;
and taking the video segment corresponding to the image-audio pair with the discrete value being the preset value as the salient video segment.
Wherein 1110-1130 in fig. 11 and 1010-1030 in fig. 10 have the same functions and structures, and 1141-1143 in fig. 11 and 1041-1043 in fig. 10 have the same functions and structures, which are not described here again.
According to the video classification apparatus provided by the embodiment of the disclosure, discretization is applied to the obtained saliency scores, so that the scores are further processed and the gap between different image-audio pairs is widened. This further improves the accuracy of video clip screening, makes the video classification more accurate, and at the same time further reduces the computational cost.
In order to describe the construction of image-audio pairs in detail, further video classification apparatus are presented in embodiments of the present disclosure.
Fig. 12 is a block diagram of still another video classification apparatus according to an embodiment of the disclosure. As shown in fig. 12, the apparatus further includes a model training module 1260 based on the above embodiment, wherein the model training module 1260 is configured to:
constructing a knowledge distillation training model; the knowledge distillation training model comprises an original network and a learning network; the original network is a preset video classification network and comprises a video feature extraction layer, a second intermediate feature extraction layer and a classification head; the learning network comprises an image-audio pair feature extraction layer, a fusion layer, a first intermediate feature extraction layer and a classification head;
acquiring a video fragment sample and an image-audio pair sample corresponding to the video fragment sample;
taking an image-audio pair sample as input of a learning network, respectively extracting the characteristics of the image and the audio in the image-audio pair sample through an image-audio pair characteristic extraction layer, carrying out fusion processing on the extracted image characteristics and audio characteristics through a fusion layer, and carrying out characteristic extraction on an output result of the fusion layer through a first intermediate characteristic extraction layer to obtain first intermediate characteristics of the image-audio pair sample;
Classifying the first intermediate features through a classifying head to obtain a first classifying result of the image-audio pair sample;
taking the video segment sample as an input of an original network, carrying out feature extraction on the video segment sample through a video feature extraction layer, and carrying out feature extraction on an output result of the video feature extraction layer through a second intermediate feature extraction layer to obtain a second intermediate feature of the video segment sample;
classifying the second intermediate features through the classifying head to obtain a second classifying result of the video segment sample;
acquiring a first loss value according to the first intermediate feature and the second intermediate feature;
obtaining a second loss value according to the first classification result and the second classification result;
training a knowledge distillation training model according to the first loss value and the second loss value;
and taking the image-audio pair feature extraction layer in the trained knowledge distillation training model as the image-audio pair feature extraction model.
Note that 1210 to 1250 in fig. 12 have the same functions and structures as 1110 to 1150 in fig. 11, and are not described here again.
According to the video classification apparatus provided by the embodiment of the disclosure, the image-audio pair feature extraction model is trained by means of knowledge distillation, which is equivalent to training the image-audio pair feature extraction model with a complex video classification model. This improves model training efficiency, reduces the amount of computation in the model training process, and improves the generalization of the model.
The specific manner in which the various modules perform the operations in the apparatus of the above embodiments have been described in detail in connection with the embodiments of the method, and will not be described in detail herein.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
Fig. 13 is a block diagram of an electronic device for the video classification method according to an embodiment of the disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 13, the electronic device includes: one or more processors 1301, a memory 1302, and interfaces for connecting the components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to the interface. In other embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple electronic devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 1301 is illustrated in fig. 13.
Memory 1302 is a non-transitory computer-readable storage medium provided by the present disclosure. Wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the video classification method provided by the present disclosure. The non-transitory computer readable storage medium of the present disclosure stores computer instructions for causing a computer to perform the video classification method provided by the present disclosure.
The memory 1302, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as the program instructions/modules corresponding to the video classification method in the embodiments of the disclosure (e.g., the first acquisition module 1010, the second acquisition module 1020, the third acquisition module 1030, the screening module 1040, and the classification module 1050 shown in fig. 10). The processor 1301 executes various functional applications and data processing of the server, i.e., implements the video classification method in the above method embodiments, by running the non-transitory software programs, instructions, and modules stored in the memory 1302. A computer program product according to an embodiment of the present disclosure includes a computer program that, when executed by a processor, implements the video classification method in the above method embodiments.
The memory 1302 may include a program storage area and a data storage area; the program storage area may store an operating system and an application required for at least one function, and the data storage area may store data created according to the use of the electronic device for the video classification method, and the like. In addition, the memory 1302 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or another non-transitory solid-state storage device. In some embodiments, the memory 1302 may optionally include memory located remotely from the processor 1301, and such remote memory may be connected to the electronic device for the video classification method via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device for the video classification method may further include an input device 1303 and an output device 1304. The processor 1301, the memory 1302, the input device 1303, and the output device 1304 may be connected by a bus or in other ways; connection by a bus is taken as an example in fig. 13.
The input device 1303 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device for the video classification method; examples include a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, and a joystick. The output device 1304 may include a display device, an auxiliary lighting device (e.g., an LED), a haptic feedback device (e.g., a vibration motor), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor and which can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also referred to as programs, software, software applications, or code) include machine instructions for a programmable processor and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic disks, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as machine-readable signals. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), the internet, and blockchain networks.
The computer system may include a client and a server. The client and the server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in the cloud computing service system and overcomes the defects of difficult management and weak service scalability found in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system or a server combined with a blockchain.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (16)

1. A method of video classification, comprising:
acquiring videos to be classified, and acquiring at least one video clip from the videos;
acquiring an image-audio pair corresponding to each video clip;
acquiring characteristic information of each image-audio pair; the characteristic information is used for representing a video clip corresponding to the image-audio pair;
screening at least one salient video segment from the at least one video segment according to the characteristic information of each image-audio pair;
classifying the video according to the at least one salient video segment;
wherein said screening at least one salient video segment from said at least one video segment according to the characteristic information of each said image-audio pair comprises:
for each image-audio pair, carrying out fusion processing on image characteristic information and audio characteristic information in the characteristic information of the image-audio pair to obtain fusion characteristics of the image-audio pair;
for each image-audio pair, acquiring context feature information of the image-audio pair;
generating hidden state information of the image-audio pair according to the context characteristic information and the fusion characteristic;
determining a saliency score for the image-audio pair based on the hidden state information;
and screening at least one salient video segment from the at least one video segment according to the saliency score.
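By way of illustration only, the screening steps above (fusion, context, hidden state, saliency score) could be sketched as follows in PyTorch; the use of a GRU for the context modelling, the layer sizes, and all names are assumptions rather than the patent's stated implementation.

```python
import torch
import torch.nn as nn


class SaliencyScreener(nn.Module):
    """Hypothetical screener: fuse per-pair image and audio features, model context
    with a GRU over the segment sequence, and score each image-audio pair."""

    def __init__(self, image_dim: int = 512, audio_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.fuse = nn.Linear(image_dim + audio_dim, hidden_dim)          # fusion of the two modalities
        self.context = nn.GRU(hidden_dim, hidden_dim, batch_first=True)   # context over the segments
        self.score = nn.Linear(hidden_dim, 1)                             # saliency score per pair

    def forward(self, image_feats: torch.Tensor, audio_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (B, N, image_dim); audio_feats: (B, N, audio_dim)
        fused = torch.relu(self.fuse(torch.cat([image_feats, audio_feats], dim=-1)))
        hidden, _ = self.context(fused)          # hidden state information per image-audio pair
        return self.score(hidden).squeeze(-1)    # (B, N) saliency scores


# Usage: keep the top-k segments by saliency score.
screener = SaliencyScreener()
img = torch.randn(1, 8, 512)    # dummy image features for 8 segments
aud = torch.randn(1, 8, 128)    # matching dummy audio features
scores = screener(img, aud)
topk = scores.topk(k=3, dim=1).indices   # indices of the salient video segments
```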
2. The method of claim 1, wherein the acquiring feature information for each of the image-audio pairs comprises:
inputting each image-audio pair into a preset image-audio pair feature extraction model respectively to obtain feature information of each image-audio pair; the image-audio pair feature extraction model is trained through video clip samples and image-audio pair sample data corresponding to the video clip samples.
3. The method of claim 2, wherein the characteristic information includes image characteristic information and audio characteristic information of the image-audio pair; the inputting each image-audio pair into a preset image-audio pair feature extraction model to obtain feature information of each image-audio pair comprises the following steps:
inputting each image-audio pair into a preset image-audio pair feature extraction model respectively; the image-audio pair feature extraction model comprises an image feature extraction layer and an audio feature extraction layer;
respectively carrying out feature extraction on images in each image-audio pair through the image feature extraction layer to obtain image feature information of each image-audio pair;
and respectively carrying out feature extraction on the audio in each image-audio pair through the audio feature extraction layer to obtain the audio feature information of each image-audio pair.
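The two-branch feature extraction model described in claim 3 could, for example, look like the sketch below. The small convolutional branches and all dimensions are assumptions for illustration; a pretrained image backbone or a spectrogram-based audio network would be equally valid choices.

```python
import torch
import torch.nn as nn


class ImageAudioFeatureExtractor(nn.Module):
    """Sketch of an image-audio pair feature extraction model with a separate
    image feature extraction layer and audio feature extraction layer."""

    def __init__(self, image_dim: int = 512, audio_dim: int = 128):
        super().__init__()
        # Image branch: a small CNN over the representative frame.
        self.image_branch = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, image_dim),
        )
        # Audio branch: a small 1-D CNN over the raw waveform.
        self.audio_branch = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=9, stride=4, padding=4), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=9, stride=4, padding=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, audio_dim),
        )

    def forward(self, image: torch.Tensor, audio: torch.Tensor):
        # image: (B, 3, H, W); audio: (B, 1, T)
        return self.image_branch(image), self.audio_branch(audio)


extractor = ImageAudioFeatureExtractor()
img_feat, aud_feat = extractor(torch.randn(2, 3, 112, 112), torch.randn(2, 1, 16000))
```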
4. The method of claim 1, further comprising:
and carrying out discrete processing on the saliency score of each image-audio pair to obtain a discrete value of each image-audio pair.
5. The method of claim 4, wherein said screening at least one salient video segment from said at least one video segment according to said saliency score comprises:
screening out, from the at least one video segment according to the discrete value of each image-audio pair, the video segment corresponding to the image-audio pair whose discrete value is a preset value;
and taking the video segment corresponding to the image-audio pair whose discrete value is the preset value as a salient video segment.
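Claims 4 and 5 describe discretizing the saliency scores and keeping the segments whose discrete value equals a preset value. A minimal sketch, assuming sigmoid thresholding as the discretization and 1 as the preset value (both are assumptions, not what the patent specifies):

```python
import torch


def discretize_and_screen(saliency_scores: torch.Tensor, threshold: float = 0.5):
    """Hypothetical discretization: map each saliency score to 0/1 and keep the
    segments whose discrete value equals the preset value (here, 1)."""
    discrete = (torch.sigmoid(saliency_scores) > threshold).long()    # discrete value per pair
    salient_indices = torch.nonzero(discrete == 1, as_tuple=True)[0]  # segments to keep
    return discrete, salient_indices


scores = torch.tensor([1.3, -0.7, 0.2, 2.1])
discrete, keep = discretize_and_screen(scores)
print(discrete.tolist(), keep.tolist())   # [1, 0, 1, 1] and [0, 2, 3]
```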
6. The method of claim 1, wherein the classifying the video according to the at least one salient video segment comprises:
inputting the at least one salient video segment into a preset video classification model to obtain a classification result of each salient video segment;
and determining the category of the video according to the classification result of each salient video segment.
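The video-level decision in claim 6 could be realized, for instance, by averaging the per-segment predictions; the stand-in linear classifier over pre-extracted clip features below is an assumption for illustration only.

```python
import torch
import torch.nn as nn


def classify_video(salient_clips: torch.Tensor, classifier: nn.Module) -> int:
    """Hypothetical aggregation: classify each salient segment, then average the
    per-segment probabilities to decide the category of the whole video."""
    logits = classifier(salient_clips)               # (num_salient, num_classes)
    probs = torch.softmax(logits, dim=-1)
    return int(probs.mean(dim=0).argmax().item())    # video-level category index


# Usage with a stand-in classifier over pre-extracted clip features.
clip_classifier = nn.Linear(256, 10)                 # 10 hypothetical categories
salient_features = torch.randn(3, 256)               # 3 salient segments
print(classify_video(salient_features, clip_classifier))
```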
7. The method of claim 1, wherein the image-audio pair feature extraction model is pre-constructed by:
constructing a knowledge distillation training model; wherein the knowledge distillation training model comprises an original network and a learning network; the original network is a preset video classification model and comprises a video feature extraction layer, a second intermediate feature extraction layer and a classification head; the learning network comprises an image-audio pair feature extraction layer, a fusion layer, a first intermediate feature extraction layer and the classification head;
acquiring a video fragment sample and an image-audio pair sample corresponding to the video fragment sample;
taking the image-audio pair sample as the input of the learning network, respectively performing feature extraction on the image and the audio in the image-audio pair sample through the image-audio pair feature extraction layer, fusing the extracted image features and audio features through the fusion layer, and performing feature extraction on the output of the fusion layer through the first intermediate feature extraction layer to obtain a first intermediate feature of the image-audio pair sample;
classifying the first intermediate feature through the classification head to obtain a first classification result of the image-audio pair sample;
taking the video segment sample as the input of the original network, carrying out feature extraction on the video segment sample through the video feature extraction layer, and carrying out feature extraction on the output result of the video feature extraction layer through the second intermediate feature extraction layer to obtain the second intermediate feature of the video segment sample;
classifying the second intermediate feature through the classification head to obtain a second classification result of the video segment sample;
acquiring a first loss value according to the first intermediate feature and the second intermediate feature;
obtaining a second loss value according to the first classification result and the second classification result;
training the knowledge distillation training model according to the first loss value and the second loss value;
and taking the image-audio pair feature extraction layer in the trained knowledge distillation training model as the image-audio pair feature extraction model.
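The distillation procedure of claim 7 can be pictured with the toy training loop below. It assumes the original (teacher) network is kept frozen, an MSE loss between intermediate features as the first loss, and a KL divergence between classification results as the second loss; the module shapes and the specific loss functions are illustrative assumptions rather than the patent's stated implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-ins for the networks named above; shapes are illustrative.
video_feature_layer = nn.Linear(1024, 256)   # original network: video feature extraction layer
second_intermediate = nn.Linear(256, 128)    # original network: second intermediate feature layer
pair_feature_layer  = nn.Linear(640, 256)    # learning network: pair extraction + fusion, collapsed for brevity
first_intermediate  = nn.Linear(256, 128)    # learning network: first intermediate feature layer
classification_head = nn.Linear(128, 10)     # classification head shared by both branches

# Only the learning-network branch (and the shared head) is updated in this sketch.
optimizer = torch.optim.Adam(
    list(pair_feature_layer.parameters())
    + list(first_intermediate.parameters())
    + list(classification_head.parameters()),
    lr=1e-4,
)

video_sample = torch.randn(4, 1024)   # dummy video-segment sample features
pair_sample = torch.randn(4, 640)     # dummy fused image-audio pair sample features

for step in range(10):
    # Learning network: image-audio pair sample -> first intermediate feature -> first result.
    first_feat = first_intermediate(torch.relu(pair_feature_layer(pair_sample)))
    first_result = classification_head(first_feat)

    # Original (teacher) network: video segment sample -> second intermediate feature -> second result.
    with torch.no_grad():
        second_feat = second_intermediate(torch.relu(video_feature_layer(video_sample)))
        second_result = classification_head(second_feat)

    # First loss: match the intermediate features; second loss: match the classification results.
    first_loss = F.mse_loss(first_feat, second_feat)
    second_loss = F.kl_div(F.log_softmax(first_result, dim=-1),
                           F.softmax(second_result, dim=-1), reduction="batchmean")
    loss = first_loss + second_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After training, only the image-audio pair branch would be kept as the feature extraction model.
```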
8. A video classification apparatus comprising:
the first acquisition module is used for acquiring videos to be classified and acquiring at least one video fragment from the videos;
the second acquisition module is used for acquiring an image-audio pair corresponding to each video clip;
the third acquisition module is used for acquiring the characteristic information of each image-audio pair; the characteristic information is used for representing a video clip corresponding to the image-audio pair;
the screening module is used for screening at least one salient video segment from the at least one video segment according to the characteristic information of each image-audio pair;
the classification module is used for classifying the video according to the at least one salient video segment;
wherein, the screening module includes:
the fusion unit is used for carrying out fusion processing on image characteristic information and audio characteristic information in the characteristic information of the image-audio pairs aiming at each image-audio pair to obtain fusion characteristics of the image-audio pairs;
a generating unit, configured to acquire, for each of the image-audio pairs, context feature information of the image-audio pair;
generating hidden state information of the image-audio pair according to the context characteristic information and the fusion characteristic;
determining a saliency score for the image-audio pair based on the hidden state information;
and the screening unit is used for screening at least one salient video segment from the at least one video segment according to the saliency score.
9. The apparatus of claim 8, wherein the third acquisition module is specifically configured to:
inputting each image-audio pair into a preset image-audio pair feature extraction model respectively to obtain feature information of each image-audio pair; the image-audio pair feature extraction model is trained through video clip samples and image-audio pair sample data corresponding to the video clip samples.
10. The apparatus of claim 9, wherein the characteristic information comprises image characteristic information and audio characteristic information of the image-audio pair; the third acquisition module is specifically configured to:
inputting each image-audio pair into a preset image-audio pair feature extraction model respectively; the image-audio pair feature extraction model comprises an image feature extraction layer and an audio feature extraction layer;
respectively carrying out feature extraction on images in each image-audio pair through the image feature extraction layer to obtain image feature information of each image-audio pair;
and respectively carrying out feature extraction on the audio in each image-audio pair through the audio feature extraction layer to obtain the audio feature information of each image-audio pair.
11. The apparatus of claim 8, wherein the screening module further comprises:
the discrete unit is used for performing discrete processing on the saliency score of each image-audio pair to obtain a discrete value of each image-audio pair.
12. The apparatus of claim 11, wherein the screening unit is further configured to:
screening out, from the at least one video segment according to the discrete value of each image-audio pair, the video segment corresponding to the image-audio pair whose discrete value is a preset value;
and taking the video segment corresponding to the image-audio pair whose discrete value is the preset value as a salient video segment.
13. The apparatus of claim 8, wherein the classification module is specifically configured to:
inputting the at least one salient video segment into a preset video classification model to obtain a classification result of each salient video segment;
and determining the category of the video according to the classification result of each salient video segment.
14. The apparatus of claim 8, further comprising a model training module, wherein the model training module is configured to:
constructing a knowledge distillation training model; wherein the knowledge distillation training model comprises an original network and a learning network; the original network is a preset video classification network and comprises a video feature extraction layer, a second intermediate feature extraction layer and a classification head; the learning network comprises an image-audio pair feature extraction layer, a fusion layer, a first intermediate feature extraction layer and the classification head;
acquiring a video fragment sample and an image-audio pair sample corresponding to the video fragment sample;
taking the image-audio pair sample as the input of the learning network, respectively performing feature extraction on the image and the audio in the image-audio pair sample through the image-audio pair feature extraction layer, fusing the extracted image features and audio features through the fusion layer, and performing feature extraction on the output of the fusion layer through the first intermediate feature extraction layer to obtain a first intermediate feature of the image-audio pair sample;
classifying the first intermediate feature through the classification head to obtain a first classification result of the image-audio pair sample;
taking the video segment sample as the input of the original network, carrying out feature extraction on the video segment sample through the video feature extraction layer, and carrying out feature extraction on the output result of the video feature extraction layer through the second intermediate feature extraction layer to obtain the second intermediate feature of the video segment sample;
classifying the second intermediate feature through the classification head to obtain a second classification result of the video segment sample;
acquiring a first loss value according to the first intermediate feature and the second intermediate feature;
obtaining a second loss value according to the first classification result and the second classification result;
training the knowledge distillation training model according to the first loss value and the second loss value;
and taking the image-audio pair feature extraction layer in the trained knowledge distillation training model as the image-audio pair feature extraction model.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 9.
16. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 9.
CN202110707843.5A 2021-06-24 2021-06-24 Video classification method, device, electronic equipment and storage medium Active CN113449148B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110707843.5A CN113449148B (en) 2021-06-24 2021-06-24 Video classification method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110707843.5A CN113449148B (en) 2021-06-24 2021-06-24 Video classification method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113449148A CN113449148A (en) 2021-09-28
CN113449148B true CN113449148B (en) 2023-10-20

Family

ID=77812675

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110707843.5A Active CN113449148B (en) 2021-06-24 2021-06-24 Video classification method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113449148B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114528762B (en) * 2022-02-17 2024-02-20 腾讯科技(深圳)有限公司 Model training method, device, equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017107188A1 (en) * 2015-12-25 2017-06-29 中国科学院深圳先进技术研究院 Method and apparatus for rapidly recognizing video classification
CN109063615A (en) * 2018-07-20 2018-12-21 中国科学技术大学 A kind of sign Language Recognition Method and system
CN110929622A (en) * 2019-11-15 2020-03-27 腾讯科技(深圳)有限公司 Video classification method, model training method, device, equipment and storage medium
CN111209883A (en) * 2020-01-13 2020-05-29 南京大学 Time sequence self-adaptive video classification method based on multi-source motion feature fusion
CN111814817A (en) * 2019-04-12 2020-10-23 北京京东尚科信息技术有限公司 Video classification method and device, storage medium and electronic equipment
WO2021082743A1 (en) * 2019-10-31 2021-05-06 北京金山云网络技术有限公司 Video classification method and apparatus, and electronic device
CN112765403A (en) * 2021-01-11 2021-05-07 北京达佳互联信息技术有限公司 Video classification method and device, electronic equipment and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017107188A1 (en) * 2015-12-25 2017-06-29 中国科学院深圳先进技术研究院 Method and apparatus for rapidly recognizing video classification
CN109063615A (en) * 2018-07-20 2018-12-21 中国科学技术大学 A kind of sign Language Recognition Method and system
CN111814817A (en) * 2019-04-12 2020-10-23 北京京东尚科信息技术有限公司 Video classification method and device, storage medium and electronic equipment
WO2021082743A1 (en) * 2019-10-31 2021-05-06 北京金山云网络技术有限公司 Video classification method and apparatus, and electronic device
CN110929622A (en) * 2019-11-15 2020-03-27 腾讯科技(深圳)有限公司 Video classification method, model training method, device, equipment and storage medium
CN111209883A (en) * 2020-01-13 2020-05-29 南京大学 Time sequence self-adaptive video classification method based on multi-source motion feature fusion
CN112765403A (en) * 2021-01-11 2021-05-07 北京达佳互联信息技术有限公司 Video classification method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Special video classification based on multimodal feature fusion and multi-task learning; Wu Xiaoyu; Gu Chaonan; Wang Shengjin; Optics and Precision Engineering (Issue 05); full text *

Also Published As

Publication number Publication date
CN113449148A (en) 2021-09-28

Similar Documents

Publication Publication Date Title
CN112560912B (en) Classification model training method and device, electronic equipment and storage medium
CN111107392B (en) Video processing method and device and electronic equipment
CN112035683A (en) User interaction information processing model generation method and user interaction information processing method
CN110659600B (en) Object detection method, device and equipment
CN111667056B (en) Method and apparatus for searching model structures
CN111460384B (en) Policy evaluation method, device and equipment
CN111695698B (en) Method, apparatus, electronic device, and readable storage medium for model distillation
CN110675954A (en) Information processing method and device, electronic equipment and storage medium
CN110427436B (en) Method and device for calculating entity similarity
JP7242994B2 (en) Video event identification method, apparatus, electronic device and storage medium
CN111225236A (en) Method and device for generating video cover, electronic equipment and computer-readable storage medium
CN112765452B (en) Search recommendation method and device and electronic equipment
US20220027854A1 (en) Data processing method and apparatus, electronic device and storage medium
CN112508004A (en) Character recognition method and device, electronic equipment and storage medium
CN111209439B (en) Video clip retrieval method, device, electronic equipment and storage medium
CN114386503A (en) Method and apparatus for training a model
CN113449148B (en) Video classification method, device, electronic equipment and storage medium
CN110909136B (en) Satisfaction degree estimation model training method and device, electronic equipment and storage medium
CN113360683B (en) Method for training cross-modal retrieval model and cross-modal retrieval method and device
CN112016524B (en) Model training method, face recognition device, equipment and medium
CN111027195B (en) Simulation scene generation method, device and equipment
CN111160552B (en) News information recommendation processing method, device, equipment and computer storage medium
CN111949820B (en) Video associated interest point processing method and device and electronic equipment
CN111274443B (en) Video clip description generation method and device, electronic equipment and storage medium
CN112560772A (en) Face recognition method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant