CN112464857A - Video classification model training and video classification method, device, medium and equipment - Google Patents

Video classification model training and video classification method, device, medium and equipment

Info

Publication number
CN112464857A
Authority
CN
China
Prior art keywords
classification
video
time
features
time dimension
Prior art date
Legal status
Pending
Application number
CN202011431606.2A
Other languages
Chinese (zh)
Inventor
李昊鑫
Current Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Shenzhen Huantai Technology Co Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Shenzhen Huantai Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd, Shenzhen Huantai Technology Co Ltd
Priority to CN202011431606.2A
Publication of CN112464857A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F 18/2135 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features

Abstract

The disclosure provides a video classification model training method, a video classification model training device, a video classification device, a computer readable medium and an electronic device, and relates to the technical field of computer vision. The method comprises the following steps: inputting a sample video into a video classification model to execute a training process so as to obtain a model output result, wherein the sample video includes time dimension data and non-time dimension data; and adjusting parameters in the video classification model by using the model output result until the accuracy of the video classification model, determined based on a verification video, reaches a preset threshold. The method and the device reduce the cost of model training and avoid the data noise caused by classification based on discontinuous features in the related art; meanwhile, various kinds of information in the video are fully utilized, and the accuracy of the classification result is improved.

Description

Video classification model training and video classification method, device, medium and equipment
Technical Field
The present disclosure relates to the field of computer vision processing technologies, and in particular, to a video classification model training method, a video classification model training apparatus, a video classification apparatus, a computer-readable medium, and an electronic device.
Background
With the advent of the information age, information has emerged in various forms. Among them, video information is becoming a main carrier of information dissemination, and the amount of video information has grown rapidly, especially since the rise of applications such as short video and self-media video. Because the amount of video information is huge, it is particularly necessary to solve problems such as how to accurately determine, from a large amount of video information, the video information that should be recommended to a user, and how to audit content efficiently. To solve the above problems, the video information often needs to be classified for subsequent processing.
Disclosure of Invention
The present disclosure is directed to a video classification model training method, a video classification model training apparatus, a video classification apparatus, a computer readable medium, and an electronic device, so as to reduce the parameter amount of the video classification model at least to a certain extent, so that a video classification model with a smaller parameter amount can be applied to longer videos while the training overhead is reduced.
According to a first aspect of the present disclosure, there is provided a video classification model training method, including: inputting a sample video into a video classification model to execute a training process so as to obtain a model output result; wherein the sample video includes time dimension data and non-time dimension data; adjusting parameters in the video classification model by using the model output result until the accuracy of the video classification model determined based on the verification video reaches a preset threshold; the training process comprises the following steps: respectively extracting the characteristics of the time dimension data and the non-time dimension data to obtain time dimension characteristics and non-time dimension characteristics; performing time-dimension slicing sampling on the time-dimension features to obtain at least one continuous feature segment, and performing feature fusion on each continuous feature segment in the at least one continuous feature segment and the non-time-dimension features to obtain at least one fusion feature; and determining a classification result corresponding to the sample video by combining the non-time dimension characteristic based on the fusion characteristic corresponding to the sample video.
According to a second aspect of the present disclosure, there is provided a video classification method, including: inputting a video to be classified into a video classification model to obtain a classification result corresponding to the video to be classified; wherein the video classification model is trained by the method of the first aspect.
According to a third aspect of the present disclosure, there is provided a video classification model training apparatus, including: the data processing module is used for inputting the sample video into the video classification model to execute a training process so as to obtain a model output result; wherein the sample video includes time dimension data and non-time dimension data; the model training module is used for adjusting parameters in the video classification model by using the model output result until the accuracy of the video classification model determined based on the verification video reaches a preset threshold; the training process comprises the following steps: respectively extracting the characteristics of the time dimension data and the non-time dimension data to obtain time dimension characteristics and non-time dimension characteristics; performing time-dimension slicing sampling on the time-dimension features to obtain at least one continuous feature segment, and performing feature fusion on each continuous feature segment in the at least one continuous feature segment and the non-time-dimension features to obtain at least one fusion feature; and determining a classification result corresponding to the sample video by combining the non-time dimension characteristic based on the fusion characteristic corresponding to the sample video.
According to a fourth aspect of the present disclosure, there is provided a video classification apparatus, comprising: a video classification module, configured to input a video to be classified into a video classification model to obtain a classification result corresponding to the video to be classified; wherein the video classification model is trained by the method of the first aspect.
According to a fifth aspect of the present disclosure, a computer-readable medium is provided, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the above-mentioned method.
According to a sixth aspect of the present disclosure, there is provided an electronic apparatus, comprising:
a processor; and
a memory for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the above-described method.
The video classification model training method provided by the embodiment of the disclosure performs feature extraction on time dimension data and non-time dimension data in video data, then performs time dimension sampling on the time dimension features to obtain continuous feature segments, performs feature fusion on the continuous feature segments and the non-time dimension features to obtain fusion features, and determines a classification result according to the time dimension features, the non-time dimension features and the fusion features.
According to the technical scheme of the embodiment of the disclosure, on one hand, in the training process, the time dimension characteristic is subjected to time dimension slice sampling, so that the video classification model with smaller parameter quantity can be applied to the video with longer time, and the model training overhead is reduced; on the other hand, because the segments obtained when the time dimension features are subjected to slice sampling are continuous feature segments, data noise caused by classification based on discontinuous features in the related art can be avoided; meanwhile, the classification result corresponding to the sample video is determined based on the fusion characteristic and the non-time dimension characteristic, so that various information of the video is fully utilized, and the accuracy of the classification result is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty. In the drawings:
FIG. 1 illustrates a schematic diagram of an exemplary system architecture to which embodiments of the present disclosure may be applied;
FIG. 2 shows a schematic diagram of an electronic device to which embodiments of the present disclosure may be applied;
FIG. 3 schematically illustrates a flow chart of a method of video classification model training in an exemplary embodiment of the disclosure;
FIG. 4 schematically illustrates a flow chart of a training process in an exemplary embodiment of the present disclosure;
FIG. 5 schematically illustrates an architecture diagram of a video classification model in an exemplary embodiment of the present disclosure;
FIG. 6 schematically illustrates a schematic diagram of a model training process in an exemplary embodiment of the disclosure;
fig. 7 schematically illustrates a composition diagram of a video classification model training apparatus in an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
Fig. 1 is a schematic diagram illustrating a system architecture of an exemplary application environment to which a video classification model training method, a video classification method, and an apparatus according to the embodiments of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include one or more of terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, server 105 may be a server cluster comprised of multiple servers, or the like.
The video classification model training method and the video classification method provided by the embodiments of the present disclosure are generally executed by the server 105, and accordingly, the video classification model training apparatus and the video classification apparatus may be disposed in the server 105. However, it is easily understood by those skilled in the art that the video classification model training method and the video classification method provided in the embodiments of the present disclosure may also be executed by the terminal devices 101, 102, 103, and accordingly, the video classification model training apparatus and the video classification apparatus may also be disposed in the terminal devices 101, 102, 103. In addition, the video classification model training method and the video classification method may also be executed by the server 105 and the terminal devices 101, 102, 103 respectively, and the corresponding apparatuses may also be disposed at both ends respectively, which is not particularly limited in this exemplary embodiment. For example, in an exemplary embodiment, the server 105 may train the video classification model through the video classification model training method and then transmit the trained model to the terminal devices 101, 102, 103, so that the terminal devices 101, 102, 103 can execute the video classification method, and the like.
The exemplary embodiment of the present disclosure provides an electronic device for implementing a video classification model training method and a video classification method, which may be the terminal device 101, 102, 103 or the server 105 in fig. 1. The electronic device includes at least a processor and a memory for storing executable instructions of the processor, the processor configured to perform a video classification model training method and a video classification method via execution of the executable instructions.
The following takes the mobile terminal 200 in fig. 2 as an example, and exemplifies the configuration of the electronic device. It will be appreciated by those skilled in the art that the configuration of figure 2 can also be applied to fixed type devices, in addition to components specifically intended for mobile purposes. In other embodiments, mobile terminal 200 may include more or fewer components than shown, or some components may be combined, some components may be split, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware. The interfacing relationship between the components is only schematically illustrated and does not constitute a structural limitation of the mobile terminal 200. In other embodiments, the mobile terminal 200 may also interface differently than shown in fig. 2, or a combination of multiple interfaces.
As shown in fig. 2, the mobile terminal 200 may specifically include: a processor 210, an internal memory 221, an external memory interface 222, a Universal Serial Bus (USB) interface 230, a charging management module 240, a power management module 241, a battery 242, an antenna 1, an antenna 2, a mobile communication module 250, a wireless communication module 260, an audio module 270, a speaker 271, a receiver 272, a microphone 273, an earphone interface 274, a sensor module 280, a display 290, a camera module 291, an indicator 292, a motor 293, a button 294, and a Subscriber Identity Module (SIM) card interface 295. The sensor module 280 may include a depth sensor 2801, a pressure sensor 2802, a gyroscope sensor 2803, and the like.
Processor 210 may include one or more processing units, such as: the Processor 210 may include an Application Processor (AP), a modem Processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a video codec, a Digital Signal Processor (DSP), a baseband Processor, and/or a Neural-Network Processing Unit (NPU), and the like. The different processing units may be separate devices or may be integrated into one or more processors.
The NPU is a Neural-Network (NN) computing processor, which processes input information quickly by drawing on the structure of biological neural networks, for example the mode of transfer between neurons of the human brain, and can also learn continuously by itself. The NPU can implement applications such as intelligent recognition on the mobile terminal 200, for example image recognition, face recognition, speech recognition, text understanding, and the like. In some embodiments, the video classification model may be trained by the NPU to obtain a trained video classification model.
A memory is provided in the processor 210. The memory may store instructions for implementing six modular functions: detection instructions, connection instructions, information management instructions, analysis instructions, data transmission instructions, and notification instructions, and execution is controlled by processor 210.
The mobile terminal 200 may implement video processing through a video codec, a GPU, a display screen 290, an application processor, and the like. Where the video codec is used to compress or decompress digital video, the mobile terminal 200 may also support one or more video codecs. In some embodiments, the videos may be deframed, decoded, and the like by a video codec to obtain time dimension data and non-time dimension data corresponding to each video.
In the related art, methods of classifying videos can generally be divided into single-modality methods and multi-modality methods. Since a single-modality method often struggles to distinguish categories in real services, a multi-modality method is usually adopted for video classification. However, related multi-modality classification methods frequently suffer from problems such as excessively high feature dimensionality, a large parameter count and cost of the classification model, and data noise.
For example, a related technology proposes passing the image frames, audio information and text information of a video through corresponding feature extraction units, reducing the dimension of each feature using an attention mechanism or another dimension reduction method, aggregating each of the three kinds of features into one feature, and using a classifier to output multi-class labels. Although this scheme uses multi-modal information, all image features or audio features are reduced in dimension separately, so the correlation between the audio and image features in the time dimension is lost; meanwhile, because the feature dimension of a single image is high, once all images are input, the dimension of the feature obtained after fusing the audio and text features remains high even after dimension reduction, and the parameter count and cost of the classification model are correspondingly high.
For another example, a related technology proposes extracting partial information, such as images, audio, caption text, OCR (Optical Character Recognition) text, cover pictures and face detection information, from the input video, extracting features with corresponding models, then combining the multiple features, applying an attention mechanism and concatenating them to obtain a fused feature, and connecting each feature to a classifier; the output of the model is the sum of multiple classification results, and the corresponding class is determined by the relationship between the classification probability and a threshold. This scheme involves many modalities and necessarily faces the problem of excessively high feature dimensionality; meanwhile, the acquisition cost of information such as OCR text, cover pictures and face detection information is too high, so the scheme is not practical.
For another example, a related technology proposes sampling video frames to obtain sampled video frames, segmenting the sampled video frames, combining the segmented sub-segments with labels to generate new training data, and increasing the number of training samples to improve the classification effect. Because this scheme combines sampling with segmentation, the segmented sub-segments can become logically incoherent in content and are therefore likely to introduce data noise.
Based on one or more of the problems described above, the present example embodiment provides a video classification model training method and a video classification method. The video classification model training method and the video classification method may be applied to the server 105, or to one or more of the terminal devices 101, 102, and 103, which is not particularly limited in this exemplary embodiment. Referring to fig. 3, the video classification model training method may include the following steps S31 and S32:
in step S31, the sample video is input into the video classification model to execute a training process, so as to obtain a model output result.
In an exemplary embodiment, the sample video is a video that includes time dimension data and non-time dimension data. The time dimension data may include data in the video that changes over time. For example, each frame of image in a video changes over time, and the audio data corresponding to each second of the video also changes over time. The non-time dimension data may include data that does not change with time in the video, i.e., global data of the video, such as the text title of the video and the total duration of the video. In addition, the present disclosure does not limit the amount of time dimension data and non-time dimension data: there may be one piece of time dimension data and one piece of non-time dimension data, or any number of each.
For convenience of processing by the video classification model, before a video is input into the video classification model, the video needs to be preprocessed to obtain a sample video including time dimension data and non-time dimension data.
For example, when the time dimension data includes video frame data, the sample video may be preprocessed in advance by sampling, picture enhancement and normalization. For example, a video may first be decoded and sampled at 1 frame per second, image enhancement may then be performed, and the enhanced images may be scaled and have their color values adjusted, so that the image size is fixed and the color values all fall within a certain range. When the time dimension data includes audio data, the video may be decoded to obtain the audio data, the time dimension is kept fixed, and the audio data is sampled at a specific frequency to obtain a time-frequency spectrogram of the audio data.
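The following minimal sketch illustrates one way such preprocessing might be implemented; it is not part of the patent text, and the 1 fps rate, 224×224 frame size, [0, 1] color range and Mel-spectrogram settings are assumptions chosen for the example (librosa needs an audio-capable backend, such as ffmpeg via audioread, to read the audio track of a video file).

```python
import cv2
import numpy as np
import librosa

def preprocess_video(path, fps_out=1, size=(224, 224), sr=16000):
    """Decode a video into roughly 1-fps normalized frames plus an audio spectrogram."""
    cap = cv2.VideoCapture(path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(int(round(src_fps / fps_out)), 1)

    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:                                    # keep about fps_out frames per second
            frame = cv2.resize(frame, size)                    # fixed image size
            frames.append(frame.astype(np.float32) / 255.0)    # color values into [0, 1]
        idx += 1
    cap.release()

    # Audio track at a fixed sampling rate -> time-frequency (Mel) spectrogram.
    audio, _ = librosa.load(path, sr=sr)
    spectrogram = librosa.power_to_db(librosa.feature.melspectrogram(y=audio, sr=sr))

    return np.stack(frames), spectrogram
```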
It should be noted that, when sampling the time dimension data, the same frequency needs to be adopted to ensure the consistency of the time dimension data and to avoid mismatching of the time dimension data during subsequent processing.
In addition, in order for the video classification model to be able to process the time dimension data and the non-time dimension data, the time dimension data and the non-time dimension data need to be converted into a form that the video classification model can process. For example, text information needs to be converted into dictionary codes and positions so that the video classification model can process it.
The training process performed by the video classification model is shown in fig. 4, and may include the following steps S311 to S313:
in step S311, the features of the time dimension data and the non-time dimension data are extracted, respectively, to obtain a time dimension feature and a non-time dimension feature.
In an exemplary embodiment, a pre-feature extraction network may be used to extract features from the input time dimension data and non-time dimension data. Specifically, different networks may be used when feature extraction is performed on different types of data. For example, when the time dimension data includes video frame data, features of each frame of video image may be extracted using a mainstream network architecture, such as a ResNet50 architecture or an Inception V4 architecture, or using a 3D convolutional network structure, a two-stream architecture, or the like; when the time dimension data includes audio data and the audio data is converted into a time-frequency spectrogram, an image feature extraction network, such as a ResNet18 or MobileNet architecture, may also be used for extraction; when the non-time dimension data includes a title text, text features can be extracted using a model architecture such as BERT. It should be noted that none of the above architectures includes a classification layer.
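As a sketch of how such a pre-feature extraction network might be set up (an assumption for illustration, not a required implementation), a backbone can be used with its classification layer removed so that it outputs one feature vector per frame; an audio spectrogram network (e.g. ResNet18) and a text encoder would be wrapped in the same way.

```python
import torch.nn as nn
import torchvision.models as models

class FrameFeatureExtractor(nn.Module):
    """Per-frame image features from a ResNet50 backbone with the classification layer removed."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet50()    # pre-trained weights could be loaded here
        # Keep everything up to (and including) global average pooling; drop the final fc layer.
        self.features = nn.Sequential(*list(backbone.children())[:-1])

    def forward(self, frames):          # frames: (T, 3, 224, 224), one row per sampled frame
        feats = self.features(frames)   # (T, 2048, 1, 1)
        return feats.flatten(1)         # (T, 2048): a time-dimension feature sequence
```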
In step S312, time-dimension slicing sampling is performed on the time-dimension features to obtain at least one continuous feature segment, and feature fusion is performed on each of the at least one continuous feature segment and the non-time-dimension features to obtain at least one fusion feature.
In an exemplary embodiment, temporal dimension features included in a video typically have contextual logic. For example, in video frame data, the contents of video data in front and back frames are generally similar, and correspondingly, the features of the video frames are also relatively similar. In order to avoid data noise caused by frame extraction, the time dimension characteristics can be acquired by adopting a slice sampling mode. Slice sampling includes any sampling that can slice a continuous segment of data.
Specifically, a preset duration set including at least one preset duration may be obtained, and then, according to each preset duration in the preset duration set, a continuous feature segment of the preset duration is cut out from the time dimension features. The term "continuous" in the continuous feature segment means that the feature segment is a continuous segment in the overall feature in the time dimension.
For example, assuming that the time dimension features include the features of N frames of images, m segments with a preset duration k may be divided in the time dimension from the N frames of image features. For example, if the head and tail portions are each spaced by an interval t0, the duration of each segment is the preset duration k, and the interval between segments is t, then 2 × t0 + m × k + (m - 1) × t = N. One of the m segments with the preset duration k is then extracted as the cut continuous feature segment of the preset duration. In addition, other slice sampling methods may also be adopted for sampling, which is not particularly limited in this disclosure.
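A small sketch of this slice sampling is given below; how the leftover length is split between the head/tail margin t0 and the inter-segment interval t is an example choice, not something mandated by the text.

```python
def slice_sample(time_features, m, k):
    """Cut m continuous segments of preset duration k from a time-dimension feature
    sequence of length N, following 2*t0 + m*k + (m-1)*t = N; the leftover length is
    split here into head/tail margins t0 and inter-segment intervals t."""
    N = len(time_features)
    leftover = N - m * k
    assert leftover >= 0, "segments do not fit into the sequence"
    t0 = leftover // (m + 1)                                  # head/tail margin
    t = (leftover - 2 * t0) // (m - 1) if m > 1 else 0        # interval between segments

    segments, start = [], t0
    for _ in range(m):
        segments.append(time_features[start:start + k])       # continuous in the time dimension
        start += k + t
    return segments
```

For example, `slice_sample(frame_features, m=4, k=8)` would return four continuous feature segments dispersed across the whole sample video; applying the same start positions to the audio features keeps the two modalities aligned.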
In the above slice sampling process, in order to characterize the features of the entire video, the video frames corresponding to the cut continuous feature segments can be dispersed in the entire sample video through controlling the slice sampling.
In addition, when the time dimension features include two features at the same time, it is also necessary to ensure that the continuous feature segments obtained in the slicing process are two features in the same time period. For example, when the time dimension feature includes both a video frame feature and an audio feature, a set of consecutive feature segments is obtained that includes the video frame feature from time a to time b, and correspondingly, the audio feature included in the set of consecutive feature segments should also be the audio feature from time a to time b.
In another exemplary embodiment, in addition to slice sampling of the time-dimensional features, the time-dimensional features may be uniformly sampled simultaneously. Specifically, a preset interval set may be set in advance, and according to each time interval included in the preset interval set, the time interval is used as a sampling interval, and sampling is performed in the time dimension feature, so as to obtain a non-continuous feature segment corresponding to each time. It should be noted that the obtained non-continuous feature segment may also be used for feature fusion with a non-time dimension feature to obtain a fusion feature. The discontinuous feature segment is formed by combining features corresponding to a plurality of discontinuous time points in a time dimension.
For example, assume that one time interval in the preset interval set is yi; sampling may then be performed in the time dimension with yi as the sampling interval. For example, if the time dimension feature has a length of N, a non-continuous feature segment of length N/yi can be obtained after sampling with yi as the sampling interval.
In addition, when the time dimension features include two kinds of features at the same time, the non-continuous feature segment, like the continuous feature segment, must contain the two kinds of features corresponding to the same time points. For example, when the time dimension features include both video frame features and audio features, and the obtained set of non-continuous feature segments includes the video frame features corresponding to time point 1, time point 5 and time point 11, then the audio features included in that set of non-continuous feature segments should also be the audio features corresponding to time point 1, time point 5 and time point 11.
In addition, in consideration of the reusability of the subsequent classification network, when the continuous feature segments and the non-continuous feature segments are sampled, the segment length obtained for each preset interval in the preset interval set can be made to correspond to a preset duration in the preset duration set. For example, assume that a preset duration in the preset duration set is ki; correspondingly, the preset interval in the preset interval set can be yi = N/ki, where N is the overall length of the time dimension feature.
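A sketch of this uniform sampling is shown below, with the interval yi chosen as N // ki so that the non-continuous segment has roughly the same length as a continuous segment of preset duration ki (an illustrative assumption); for multi-modal time dimension features, the same indices would be applied to the frame and audio features to keep them time-aligned.

```python
def uniform_sample(time_features, k_i):
    """Strided (uniform) sampling over the time dimension.

    Takes every y_i-th element with y_i = N // k_i, so the resulting non-continuous
    feature segment has length close to k_i and the later classification network
    can be reused across continuous and non-continuous segments.
    """
    N = len(time_features)
    y_i = max(N // k_i, 1)                 # sampling interval y_i
    return time_features[::y_i][:k_i]      # features of non-continuous time points
```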
In an exemplary embodiment, when the non-time dimension features include text features, since the number of characters in the text information often falls short of the maximum input length of the text model, a fully connected layer may be added after the feature layer of each character position, so that the length of the text feature output by the text feature model is fixed.
In an exemplary embodiment, each continuous feature segment or non-continuous feature segment needs to be fused with the non-time dimension features respectively. Specifically, the fusion can be performed by direct splicing or by an attention mechanism, so as to obtain a fusion feature with a fixed length. In addition, when feature fusion is performed, operations such as mean pooling and max pooling can also be introduced; continuous feature segments or non-continuous feature segments and non-time dimension features can also be screened before fusion, with only part of the features selected for fusion, which is not limited by the present disclosure.
For example, when 5 continuous feature segments are obtained, each continuous feature segment is respectively fused with a non-time dimension feature, so that 5 fused features can be obtained; similarly, when 5 non-continuous feature segments are obtained, 5 fused features can be obtained correspondingly.
In addition, each group of continuous feature segments or non-continuous feature segments contains time dimension features of a certain length, so the continuous feature segments or non-continuous feature segments can be reduced in dimension along the time dimension before the fusion. For example, when the time dimension features include video frame features and audio features, attention-based feature dimension reduction may first be applied to the video frame features and the audio features in the time dimension respectively, or dimension reduction may be performed by principal component analysis after splicing.
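One possible form of this fusion step is sketched below: each segment is first reduced along the time dimension (mean pooling is used here as the simplest of the reduction options mentioned above) and then spliced with the fixed-length text feature; attention-based reduction or principal component analysis after splicing would replace the pooling and concatenation at the same point.

```python
import torch

def fuse_segment(frame_seg, audio_seg, text_feat):
    """Fuse one time-aligned (frame, audio) feature segment with the text feature.

    frame_seg: (k, Dv) per-frame features of a continuous or non-continuous segment
    audio_seg: (k, Da) audio features for the same time positions
    text_feat: (Dt,)   fixed-length title-text feature (non-time dimension feature)
    """
    frame_vec = frame_seg.mean(dim=0)   # reduce the time dimension by mean pooling
    audio_vec = audio_seg.mean(dim=0)
    # Direct splicing gives a fixed-length fusion feature of size Dv + Da + Dt.
    return torch.cat([frame_vec, audio_vec, text_feat], dim=0)
```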
In practical applications, videos to be classified vary in length and often contain many frames. Directly using the features of all frames for classification greatly increases the parameters of the classification network and makes the training overhead excessive, while using only a portion of the video frames may fail to fully utilize the frame features of the entire video. In contrast, in the present scheme the original high-dimensional feature vectors are converted into several groups of fixed-length fusion features during sampling, and each fusion feature is formed from multiple modalities, so the hidden-layer parameters of the subsequent classification network are greatly reduced and the multi-modal features are fully utilized.
In step S313, based on the fusion features corresponding to the sample video, the classification result corresponding to the sample video is determined by combining the non-time dimension features.
In an exemplary embodiment, after the fusion features are obtained, the fusion features may be classified according to a fusion feature classification network, and a classification result corresponding to the sample video is determined by combining a result of classifying the non-time dimension features by the non-time dimension feature classification network. The classification network may be a conventional softmax multi-class classification network, or may be a classification network in other forms, which is not limited in this disclosure.
In an exemplary embodiment, the classification may be based on a voting classification. Specifically, the fusion features and the non-time dimension features can be classified according to the classification network corresponding to each feature to obtain the classification probability of the classification result, and then voting classification is performed based on the obtained classification probability to determine the classification result corresponding to the sample video. For example, assuming that there are 3 types of classification results, voting is performed based on the classification probabilities, and the classification probability of the classification result 2 is the largest among the classification probabilities of the 3 classification results obtained, so that the sample video can be classified as the classification result 2.
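A minimal sketch of this voting step, assuming each classifier outputs a probability vector over the same classification results and the vote is a simple sum followed by an arg-max (other voting rules fit the same interface):

```python
import numpy as np

def vote(probabilities):
    """Voting classification over per-classifier probability vectors.

    `probabilities` is a list of 1-D arrays over the same classification results,
    one per classifier (each fusion-feature classifier plus the non-time-dimension
    feature classifier). The vote here is a simple sum followed by an arg-max.
    """
    total = np.sum(np.stack(probabilities), axis=0)
    return int(np.argmax(total))
```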
In addition, among the time dimension features, image modality features such as video frame features generally have a higher dimension, while non-image modality features such as audio features have a lower dimension. Therefore, when the time dimension features include non-image modality features, the classification probability obtained when the non-image modality features are classified by the non-image modality feature classification network may be used as part of the voting. It should be noted that, because the non-image modality features in the time dimension features may still be relatively high-dimensional, they may be reduced in dimension through a fully connected layer before being classified.
In an exemplary embodiment, the classification result may include an overall classification and a detail classification, and each detail classification has an association relationship with one of the overall classifications. Specifically, the overall classification and the detail classification may be hierarchical categories of the video, from coarse to fine. For example, the overall classification is a large class, and the detail classification is a finer class obtained by further subdividing that large class. It should be noted that, in some embodiments, the detail classification may be subdivided further still, and the present disclosure does not specifically limit the classification hierarchy.
When the classification result includes an overall classification and a detail classification, since features other than the fusion features have limited expressive power on their own, and detail classifications associated with the same overall classification are generally somewhat similar, it is difficult to perform detail classification using features other than the fusion features. For this reason, in an exemplary embodiment, voting classification may be performed by determining a first classification probability of the overall classification from the classification of the non-time dimension features, and determining a second classification probability of the detail classification from the classification of the fusion features. During voting classification, the cumulative classification probability of each overall classification can be determined based on its first classification probability and the second classification probabilities of the detail classifications associated with it. After the cumulative probability is determined, a target overall classification can be determined among the overall classifications according to the cumulative classification probability, and a target detail classification can then be determined, according to the second classification probability, among the detail classifications associated with the target overall classification.
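The two-level vote described above might look like the following sketch; the mapping from detail classifications to their associated overall classification is passed in explicitly as an assumption, and the screening of low probabilities mentioned below is omitted for brevity.

```python
import numpy as np

def hierarchical_vote(first_probs, second_probs, detail_to_overall):
    """Two-level voting over overall classifications and detail classifications.

    first_probs:       (G,) first classification probabilities of the overall classifications.
    second_probs:      (D,) second classification probabilities of the detail classifications.
    detail_to_overall: length-D sequence mapping each detail classification to the index of
                       its associated overall classification.
    """
    cumulative = np.array(first_probs, dtype=float)
    for d, g in enumerate(detail_to_overall):
        cumulative[g] += second_probs[d]        # add the associated detail probabilities

    target_overall = int(np.argmax(cumulative))
    # Restrict the detail decision to the chosen overall classification.
    candidates = [d for d, g in enumerate(detail_to_overall) if g == target_overall]
    target_detail = max(candidates, key=lambda d: second_probs[d])
    return target_overall, target_detail
```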
Furthermore, when the time dimension features include non-image modality features, the non-image modality features may also be used for classification. Likewise, owing to the limited expressive power of the non-image modality features on their own, the first classification probability of the overall classification may also be determined based on the classification of the non-image modality features. The first classification probability determined for the non-image modality features can then also be used in the voting classification process described above.
It should be noted that, when the overall classifications and detail classifications are large in number, the first classification probabilities and the second classification probabilities may be screened according to the classification results, and only the first and second classification probabilities that satisfy a certain condition are retained. For example, after the first classification probabilities and the second classification probabilities are obtained, they may be sorted by magnitude, only the few largest probability values may be retained, and voting classification may be performed according to the retained first and second classification probabilities.
In step S32, parameters in the video classification model are adjusted using the model output result until the accuracy of the video classification model determined based on the verification video reaches a preset threshold.
In an exemplary embodiment, when the parameters of the video classification model are adjusted using the model output result, learning may be performed in a supervised or semi-supervised manner. For example, each sample video can carry a manually labeled classification result, and the learning process can then be implemented with a conventional deep learning framework such as PyTorch or TensorFlow. After training, the verification videos can be input into the trained video classification model, and the classification accuracy over all verification videos can be calculated through the process of determining the classification result corresponding to the sample video by combining the non-time dimension features based on the corresponding fusion features. When the accuracy reaches a preset threshold, the training of the video classification model is determined to be finished; otherwise, if the accuracy does not reach the preset threshold, the video classification model needs to be trained further by adjusting the training strategy until the accuracy reaches the preset threshold.
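A PyTorch-style sketch of this supervised loop is shown below; the loss function, optimizer, learning rate, epoch cap and threshold value are all example assumptions rather than requirements of the method.

```python
import torch
import torch.nn as nn

def train_until_threshold(model, train_loader, val_loader, threshold=0.9, max_epochs=50):
    """Adjust the model parameters with labeled sample videos until the accuracy on the
    verification videos reaches the preset threshold (or max_epochs is exhausted)."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    for epoch in range(max_epochs):
        model.train()
        for inputs, labels in train_loader:        # inputs: preprocessed sample videos
            optimizer.zero_grad()
            loss = criterion(model(inputs), labels)
            loss.backward()
            optimizer.step()

        # Accuracy on the verification videos.
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for inputs, labels in val_loader:
                preds = model(inputs).argmax(dim=1)
                correct += (preds == labels).sum().item()
                total += labels.numel()
        if correct / total >= threshold:           # preset threshold reached: training done
            break
    return model
```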
It should be noted that, in an exemplary embodiment, the process of determining the classification result corresponding to the sample video based on the fusion features in combination with the non-time dimension features need not be involved in the model training process; it is only necessary to adjust the parameters in the video classification model according to the manually labeled classification results. In addition, before model training, the weight parameters of each network included in the video classification model may be initialized with weights pre-trained on other tasks, while the parameters of the network used for determining the classification result corresponding to the sample video (by combining the non-time dimension features based on the fusion features) may be initialized randomly. During training, different optimization strategies and parameters can be tried using the training data and the training framework, so as to train a model with higher accuracy.
In an exemplary embodiment, when the time dimension data includes video frame data and audio data, and the non-time dimension data is header text data, the video classification model architecture of the embodiment of the disclosure may be as shown in fig. 5, and a training process of the video classification model may be as shown in fig. 6.
The video classification model may include the following components: a feature extraction network, a sampling recombination network, a classification network, and a voting classification network. When the time dimension data includes video frame data and audio data, and the non-time dimension data is header text data, as shown in fig. 5, the video classification network may include the following parts:
the prepositive feature extraction network can comprise a text feature extraction network, an image feature extraction network and an audio spectrogram feature extraction network, and image features, audio features and text features of each video frame are obtained;
the time dimension features (the image features and audio features of the video frames) are sampled through the sampling recombination network, and the sampling results are fused with the non-time dimension features to obtain continuous feature segments and non-continuous feature segments covering the three modalities;
when the classification result includes an overall classification and a detail classification, the non-time dimension features and the non-image modality features among the time dimension features are classified based on the overall classification network to obtain the first classification probability of the overall classification, and the continuous feature segments and the non-continuous feature segments are classified based on the detail classification network to obtain the second classification probability;
and processing the first classification probability and the second classification probability based on the voting classification network to obtain a classification result of the video.
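The listed parts might be wired together roughly as in the following skeleton; this is an illustrative assumption in which the two linear layers stand in for the real classification networks, simple evenly spaced segment placement replaces the slice and uniform sampling described earlier, and the voting step is left outside the module.

```python
import torch
import torch.nn as nn

class VideoClassificationModel(nn.Module):
    """Skeleton wiring of the feature extraction, sampling recombination and
    classification parts (dimensions and segment counts are example choices)."""

    def __init__(self, dv=2048, da=512, dt=768, n_overall=10, n_detail=50, m=4, k=8):
        super().__init__()
        self.m, self.k = m, k
        self.overall_head = nn.Linear(da + dt, n_overall)       # overall classification network
        self.detail_head = nn.Linear(dv + da + dt, n_detail)    # detail classification network

    def forward(self, frame_feats, audio_feats, text_feat):
        # frame_feats: (T, dv), audio_feats: (T, da), text_feat: (dt,)
        # 1) Sampling and recombination: m segments of length k, evenly placed here.
        T = frame_feats.shape[0]
        starts = torch.linspace(0, max(T - self.k, 0), self.m).long().tolist()

        detail_logits = []
        for s in starts:
            fv = frame_feats[s:s + self.k].mean(dim=0)           # pooled frame segment
            av = audio_feats[s:s + self.k].mean(dim=0)           # time-aligned audio segment
            fused = torch.cat([fv, av, text_feat])               # fusion feature
            detail_logits.append(self.detail_head(fused))

        # 2) Overall classification from the non-time-dimension (text) and
        #    non-image-modality (audio) features.
        overall_logits = self.overall_head(torch.cat([audio_feats.mean(dim=0), text_feat]))

        # 3) Voting over the resulting probabilities is performed outside this module
        #    (see the voting sketches above).
        return overall_logits, torch.stack(detail_logits)
```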
In the training process, all videos can be divided into two parts: one part is used as the sample set and the other as the verification set. The sample videos in the sample set are input into the video classification model for training, the trained video classification model is then verified against the videos in the verification set, and the training of the video classification model is determined to be finished after the accuracy on the verification set reaches a preset value. In addition, since the degree of freedom of video content is high, a large number of videos need to be collected for training. For example, the total number of videos may be no less than 500,000, of which 90% form the training set and the rest the validation set.
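A trivial sketch of this split (the 90%/10% ratio and the fixed shuffle seed are example choices):

```python
import random

def split_dataset(video_paths, train_ratio=0.9, seed=0):
    """Randomly divide the collected videos into a sample (training) set
    and a verification set."""
    paths = list(video_paths)
    random.Random(seed).shuffle(paths)
    cut = int(len(paths) * train_ratio)
    return paths[:cut], paths[cut:]
```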
It should be noted that, when supervised or semi-supervised model training is performed, training may be performed directly based on the manually labeled labels, and only the parameters in the feature extraction network, the sampling recombination network and the classification network may be adjusted, without involving the voting classification network. During verification, the classification result is determined through voting by the voting classification network.
Further, the present exemplary embodiment also provides a video classification method, including the following steps: and inputting the video to be classified into the video classification model to obtain a classification result corresponding to the video to be classified.
The video classification model is obtained by training based on the video classification model training method.
In an exemplary embodiment, before the video to be classified is input into the video classification model, the video to be classified needs to be preprocessed to obtain data in the same format as the sample data, including time dimension data and non-time dimension data. The specific details of the preprocessing have already been described in detail in the above embodiments of the video classification model training method; for details that are not disclosed here, reference may be made to those embodiments, and they are not repeated here.
It is noted that the above-mentioned figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
Further, referring to fig. 7, in the present exemplary embodiment, a video classification model training apparatus 700 is further provided, which includes a data processing module 710 and a model training module 720. Wherein:
the data processing module 710 may be configured to input the sample video into the video classification model to perform a training process, so as to obtain a model output result; wherein the sample video includes time dimension data and non-time dimension data;
the training process comprises the following steps: respectively extracting the characteristics of the time dimension data and the non-time dimension data to obtain time dimension characteristics and non-time dimension characteristics; performing time-dimension slicing sampling on the time-dimension features to obtain at least one continuous feature segment, and performing feature fusion on each continuous feature segment in the at least one continuous feature segment and the non-time-dimension features to obtain at least one fusion feature; and determining a classification result corresponding to the sample video by combining the non-time dimension characteristic based on the fusion characteristic corresponding to the sample video.
The model training module 720 may be configured to adjust parameters in the video classification model using the model output result until the accuracy of the video classification model determined based on the verification video reaches a preset threshold.
In an exemplary embodiment, the data processing module 710 may be configured to obtain a preset duration set by using a video classification model; the preset duration set comprises at least one preset duration; and cutting continuous feature segments of preset duration in the time dimension features aiming at each preset duration.
In an exemplary embodiment, the data processing module 710 may be configured to obtain a preset interval set by using a video classification model; wherein the preset interval set comprises at least one time interval; and for each time interval, sampling in the time dimension characteristics by taking the time interval as a sampling interval to obtain a non-continuous characteristic segment corresponding to the time interval, so that the non-continuous characteristic segment and the non-time dimension characteristics are subjected to characteristic fusion to obtain fusion characteristics.
In an exemplary embodiment, the data processing module 710 may be configured to perform linear transformation on the non-time-dimension features by using a video classification model to obtain the non-time-dimension features with preset lengths.
In an exemplary embodiment, the data processing module 710 may be configured to classify the fusion features and the non-time dimension features corresponding to the sample video by using a video classification model to obtain classification probabilities of the classification results, and perform voting based on the classification probabilities to determine the classification results corresponding to the sample video.
In an exemplary embodiment, the data processing module 710 may be configured to classify the non-image modality features corresponding to the sample video by using a video classification model, and obtain a classification probability of the classification result.
In an exemplary embodiment, the data processing module 710 may be configured to classify the non-time dimension features using the video classification model to obtain a first classification probability of the overall classification; classify the fusion features to obtain a second classification probability of the detail classification; determine the cumulative classification probability of each overall classification based on its first classification probability and the second classification probabilities of the detail classifications associated with it; and determine a target overall classification among the overall classifications according to the cumulative classification probability, and determine a target detail classification, according to the second classification probability, among the detail classifications associated with the target overall classification.
In an exemplary embodiment, the data processing module 710 may be configured to classify the non-image modality feature using a video classification model to obtain a first classification probability corresponding to the overall classification.
Further, an embodiment of the present invention further provides a video classification apparatus, including a video classification module, where the video classification module may be configured to input a video to be classified into a video classification model to obtain a classification result corresponding to the video to be classified; the video classification model is obtained by training according to the video classification model training method.
The specific details of each module in the above apparatus have been described in detail in the method section, and details that are not disclosed may refer to the method section, and thus are not described again.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," module "or" system.
Exemplary embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, various aspects of the disclosure may also be implemented in the form of a program product including program code for causing a terminal device to perform the steps according to various exemplary embodiments of the disclosure described in the above-mentioned "exemplary methods" section of this specification, when the program product is run on the terminal device, for example, any one or more of the steps in fig. 3 to 4 may be performed.
It should be noted that the computer readable media shown in the present disclosure may be computer readable signal media or computer readable storage media or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Furthermore, program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" programming language. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (for example, through the internet using an internet service provider).
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the terms of the appended claims.

Claims (13)

1. A video classification model training method is characterized by comprising the following steps:
inputting a sample video into a video classification model to execute a training process so as to obtain a model output result; wherein the sample video includes time dimension data and non-time dimension data;
adjusting parameters in the video classification model by using the model output result until the accuracy of the video classification model determined based on the verification video reaches a preset threshold value;
the training process comprises:
performing feature extraction on the time dimension data and the non-time dimension data respectively to obtain time-dimension features and non-time-dimension features;
performing time-dimension slice sampling on the time-dimension features to obtain at least one continuous feature segment, and performing feature fusion on each continuous feature segment in the at least one continuous feature segment and the non-time-dimension features to obtain at least one fusion feature;
and determining a classification result corresponding to the sample video based on the fusion features corresponding to the sample video in combination with the non-time-dimension features.
2. The method of claim 1, wherein the performing time-dimension slice sampling on the time-dimension features to obtain at least one continuous feature segment comprises:
acquiring a preset duration set; wherein the preset duration set comprises at least one preset duration;
and for each preset duration, cutting a continuous feature segment of the preset duration from the time-dimension features.
3. The method of claim 1, further comprising:
acquiring a preset interval set; wherein the preset interval set comprises at least one time interval;
and for each time interval, sampling the time-dimension features with the time interval as the sampling interval to obtain a non-continuous feature segment corresponding to the time interval, and performing feature fusion on the non-continuous feature segment and the non-time-dimension features to obtain the fusion feature.
4. The method of claim 1, wherein the non-time-dimension features comprise textual features;
prior to performing feature fusion, the method further comprises:
and performing linear transformation processing on the non-time-dimension features to obtain non-time-dimension features of a preset length.
5. The method according to claim 1, wherein the determining a classification result corresponding to the sample video based on the fusion features corresponding to the sample video in combination with the non-time-dimension features comprises:
classifying the fusion features and the non-time-dimension features corresponding to the sample video respectively to obtain classification probabilities of the classification results, and voting based on the classification probabilities to determine the classification result corresponding to the sample video.
6. The method of claim 5, wherein the time-dimension features comprise non-image modality features;
before the voting based on the classification probability to determine the classification result corresponding to the sample video, the method further includes:
and classifying the non-image modality features corresponding to the sample video to obtain classification probabilities of the classification results.
7. The method according to claim 5, wherein the classification results comprise overall classifications and detail classifications, and each detail classification is associated with one of the overall classifications;
the classifying the fusion features and the non-time-dimension features corresponding to the sample video respectively to obtain classification probabilities of the classification results, and voting based on the classification probabilities to determine the classification result corresponding to the sample video comprises:
classifying the non-time dimension features to obtain a first classification probability of the overall classification;
classifying the fusion features to obtain a second classification probability of the detail classification;
determining the cumulative classification probability of the overall classification based on the first classification probability and the second classification probability of the detail classification which has an association relation with the overall classification;
and determining a target overall classification among the overall classifications according to the cumulative classification probability, and determining a target detail classification among the detail classifications associated with the target overall classification according to the second classification probability.
8. The method of claim 7, wherein the time-dimension features comprise non-image modality features;
before determining the cumulative classification probability of the overall classification based on the first classification probability and the second classification probability of the detail classification having an association relationship with the overall classification, the method further comprises:
and classifying the non-image modality features to obtain the first classification probability corresponding to the overall classification.
9. A method of video classification, comprising:
inputting a video to be classified into a video classification model to obtain a classification result corresponding to the video to be classified;
wherein the video classification model is trained according to the method of any one of claims 1 to 8.
10. A video classification model training device, comprising:
the data processing module is used for inputting the sample video into the video classification model to execute a training process so as to obtain a model output result; wherein the sample video includes time dimension data and non-time dimension data;
the model training module is used for adjusting parameters in the video classification model by using the model output result until the accuracy of the video classification model determined based on the verification video reaches a preset threshold value;
the training process comprises:
performing feature extraction on the time dimension data and the non-time dimension data respectively to obtain time-dimension features and non-time-dimension features;
performing time-dimension slice sampling on the time-dimension features to obtain at least one continuous feature segment, and performing feature fusion on each continuous feature segment in the at least one continuous feature segment and the non-time-dimension features to obtain at least one fusion feature;
and determining a classification result corresponding to the sample video based on the fusion features corresponding to the sample video in combination with the non-time-dimension features.
11. A video classification apparatus, comprising:
the video classification module is used for inputting the video to be classified into a video classification model so as to obtain a classification result corresponding to the video to be classified; wherein the video classification model is trained according to the method of any one of claims 1 to 8.
12. A computer-readable medium, on which a computer program is stored which, when executed by a processor, carries out the method according to any one of claims 1 to 9.
13. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any of claims 1 to 9 via execution of the executable instructions.
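To make the two sampling strategies recited in claims 2 and 3 more concrete, the sketch below draws continuous slices of assumed preset durations and non-continuous segments by strided indexing at assumed time intervals; the duration set, the interval set, the tensor shape (T, C), and the strided indexing are illustrative assumptions rather than the claimed implementation.

```python
import torch

def sample_segments(time_feats, durations=(8,), intervals=(2, 4)):
    """time_feats: (T, C) time-dimension features.
    durations: assumed preset duration set for continuous segments (claim 2).
    intervals: assumed preset interval set for non-continuous segments (claim 3)."""
    T = time_feats.size(0)
    segments = []
    for d in durations:                                    # continuous feature segments
        d = min(d, T)
        start = torch.randint(0, T - d + 1, (1,)).item()
        segments.append(time_feats[start:start + d])
    for step in intervals:                                 # non-continuous feature segments
        segments.append(time_feats[::step])                # every step-th time step
    return segments  # each segment can then be fused with the non-time-dimension features
```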
CN202011431606.2A 2020-12-07 2020-12-07 Video classification model training and video classification method, device, medium and equipment Pending CN112464857A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011431606.2A CN112464857A (en) 2020-12-07 2020-12-07 Video classification model training and video classification method, device, medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011431606.2A CN112464857A (en) 2020-12-07 2020-12-07 Video classification model training and video classification method, device, medium and equipment

Publications (1)

Publication Number Publication Date
CN112464857A true CN112464857A (en) 2021-03-09

Family

ID=74801122

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011431606.2A Pending CN112464857A (en) 2020-12-07 2020-12-07 Video classification model training and video classification method, device, medium and equipment

Country Status (1)

Country Link
CN (1) CN112464857A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110149541A (en) * 2019-04-23 2019-08-20 腾讯科技(深圳)有限公司 Video recommendation method, device, computer equipment and storage medium
CN110263220A (en) * 2019-06-28 2019-09-20 北京奇艺世纪科技有限公司 A kind of video highlight segment recognition methods and device
CN110287788A (en) * 2019-05-23 2019-09-27 厦门网宿有限公司 A kind of video classification methods and device

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113094549A (en) * 2021-06-10 2021-07-09 智者四海(北京)技术有限公司 Video classification method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US11138903B2 (en) Method, apparatus, device and system for sign language translation
CN111488489B (en) Video file classification method, device, medium and electronic equipment
CN111798879B (en) Method and apparatus for generating video
CN111462735A (en) Voice detection method and device, electronic equipment and storage medium
CN111866610B (en) Method and apparatus for generating information
CN109660865B (en) Method and device for automatically labeling videos, medium and electronic equipment
CN112188306B (en) Label generation method, device, equipment and storage medium
WO2023273628A1 (en) Video loop recognition method and apparatus, computer device, and storage medium
CN111199541A (en) Image quality evaluation method, image quality evaluation device, electronic device, and storage medium
JP2022088304A (en) Method for processing video, device, electronic device, medium, and computer program
CN113850162A (en) Video auditing method and device and electronic equipment
CN113159010A (en) Video classification method, device, equipment and storage medium
CN111540364A (en) Audio recognition method and device, electronic equipment and computer readable medium
CN113239204A (en) Text classification method and device, electronic equipment and computer-readable storage medium
CN112560506A (en) Text semantic parsing method and device, terminal equipment and storage medium
CN112464857A (en) Video classification model training and video classification method, device, medium and equipment
CN116564338B (en) Voice animation generation method, device, electronic equipment and medium
CN112633004A (en) Text punctuation deletion method and device, electronic equipment and storage medium
CN113128284A (en) Multi-mode emotion recognition method and device
CN115908933A (en) Semi-supervised classification model training and image classification method and device
CN115883878A (en) Video editing method and device, electronic equipment and storage medium
CN115905613A (en) Audio and video multitask learning and evaluation method, computer equipment and medium
CN114492579A (en) Emotion recognition method, camera device, emotion recognition device and storage device
CN113902838A (en) Animation generation method, animation generation device, storage medium and electronic equipment
CN114330239A (en) Text processing method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination