CN114090826A - Video classification method and related device - Google Patents

Video classification method and related device

Info

Publication number
CN114090826A
CN114090826A
Authority
CN
China
Prior art keywords
video
target
text
image
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111438578.1A
Other languages
Chinese (zh)
Inventor
赵娅琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Opper Communication Co ltd
Original Assignee
Beijing Opper Communication Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Opper Communication Co ltd filed Critical Beijing Opper Communication Co ltd
Priority to CN202111438578.1A
Publication of CN114090826A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval of video data
    • G06F16/75 Clustering; Classification
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval using metadata automatically derived from the content
    • G06F16/7867 Retrieval using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application disclose a video classification method and a related apparatus. The method includes: acquiring a target video and a title of the target video; determining image features of the target video; determining text features of the title; and obtaining a target classification result of the target video according to the image features, the text features and a pre-trained video classification model. The video classification method and apparatus help improve both the efficiency and the precision of video classification.

Description

Video classification method and related device
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a video classification method and a related apparatus.
Background
In scenarios such as video recommendation, there is generally a need to classify videos. At present, videos may be classified by manual review and labeling, in which workers watch each video and label its category; in addition, classification results are usually obtained only from the video images. This leads to low video classification efficiency and low video classification precision.
Disclosure of Invention
The embodiment of the application provides a video classification method and a related device, aiming at improving the video classification efficiency and the video classification precision.
In a first aspect, an embodiment of the present application provides a video classification method, including:
acquiring a target video and a title of the target video;
determining image characteristics of the target video;
determining a text feature of the title;
and obtaining a target classification result of the target video according to the image characteristics, the text characteristics and a pre-trained video classification model.
In a second aspect, an embodiment of the present application provides a video classification apparatus, including:
a first acquisition unit configured to acquire a target video and a title of the target video;
a first determining unit, configured to obtain target image feature data according to the target image data;
a second determining unit, configured to obtain target text feature data according to the target text data;
and a processing unit, configured to obtain a target classification result of the target video according to the image features, the text features and a pre-trained video classification model.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor, a memory, and a communication interface, where the processor, the memory, and the communication interface are connected to each other, where the communication interface is used to receive or transmit data, the memory is used to store application program codes for the electronic device to perform the above method, and the processor is configured to perform any one of the above methods.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program for electronic data exchange, where the computer program enables a computer to perform some or all of the steps described in any one of the methods of the first aspect of the embodiments of the present application.
In a fifth aspect, the present application provides a computer program product, including a computer program, which when executed by a processor implements some or all of the steps described in any of the methods of the first aspect of the embodiments of the present application. The computer program product may be a software installation package.
In the embodiment of the application, a target video and a title of the target video are obtained, then image features of the target video and text features of the title are determined, and finally a target classification result of the target video is obtained according to the image features, the text features and a pre-trained video classification model. Therefore, when the classification result of the target video is determined, the image characteristics of the target video and the text characteristics of the target video are combined, and a pre-trained video classification model is utilized, so that the efficiency of video classification and the precision of video classification are improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art may obtain other drawings from them without creative effort.
FIG. 1A is a schematic diagram of a system architecture according to an embodiment of the present application;
fig. 1B is a schematic structural diagram of an execution device according to an embodiment of the present application;
fig. 2A is a schematic flowchart of a video classification method according to an embodiment of the present application;
fig. 2B is a schematic data transmission diagram of a video classification model according to an embodiment of the present application;
fig. 3A is a block diagram illustrating functional units of a video classification apparatus according to an embodiment of the present disclosure;
fig. 3B is a block diagram illustrating functional units of another video classification apparatus according to an embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," and the like in the description and claims of the present application and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
"plurality" means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
The embodiments of the present application provide a video classification method, which can be applied to scenarios in which videos are recommended to users (for example, pushing video content of interest to a user). Specifically, each candidate video that can be recommended to the user is taken as a target video; the candidate video and its title are obtained through the video classification method, the image features of the candidate video and the text features of its title are then determined, and the classification result of the candidate video is obtained according to the image features, the text features and a pre-trained video classification model. Alternatively, the method can simply be applied in a video classification scenario: a video uploaded by a user is obtained, and the classification result, i.e. the type, of the uploaded video is then determined by the method. The scheme can be applied to various scenarios in which a video type needs to be determined, including but not limited to the application scenarios mentioned above.
The method provided by the application is described according to a model training side and a model application side as follows:
the video classification method provided by the embodiment of the application relates to processing of natural language and computer vision, and can be particularly applied to data processing methods such as data training, machine learning and deep learning, and intelligent information modeling, extraction, preprocessing, training and the like are performed on training data (training videos in a training data set and titles of the training videos) in a symbolized and formalized mode, so that a trained video classification model is finally obtained.
It should be noted that the training steps of the video classification model and the method for determining the image features of the target video, the text features of the title, and the video classification result in the video classification method described in the embodiments of the present application are inventions based on the same concept, and may also be understood as two parts in a system or two stages of an overall process: such as a model training phase and a model application phase.
The following description will first be made with respect to terms related to the present application.
An image feature extraction model: extracts, from video data, image features in the spatial dimension and image features in the temporal dimension, using related methods such as image processing, machine learning and computer graphics. The image feature extraction model may be a Vision Transformer (VIT) model, an algorithmic approach that applies the Transformer technique to single-image classification tasks.
A text feature extraction model: obtains the text features corresponding to the video title, using related methods such as natural language processing, machine learning and computer graphics. The text feature extraction model may be a Bidirectional Encoder Representations from Transformers (BERT) model, an algorithmic approach that applies the Transformer technique to natural language understanding.
Video classification model: obtains the video classification result of the video to be classified, using related methods such as image processing, natural language processing, machine learning and computer graphics.
Multi-head attention (multi-head attention) mechanism: selections of information from the input are computed in parallel using multiple queries, each query focusing on a different portion of the input information. A multi-head attention mechanism comprises a plurality of self-attention mechanisms. When data are processed by a self-attention mechanism, the data are first projected into a query Q, a key K and a value V; the attention weights are computed from Q and K (weight α = Q × K), V is then weighted by these weights, a linear transformation is applied, and the processing result is output. The calculation process of a single self-attention mechanism is as follows: the input data are first converted into vectors/tensors; the three target vectors/tensors Q, K and V are obtained from them; a score (i.e., a weight) α = Q × K is computed for each target vector/tensor; for gradient stability, the Transformer also normalizes the scores; a softmax activation function is applied to the scores; the value V is multiplied by the softmax scores to obtain the weighted input vectors/tensors; finally, the weighted values are summed to obtain the final output Z = ΣV. The text feature extraction model, the image feature extraction model and the video classification model can all perform their processing through this multi-head attention mechanism.
The system architecture according to the embodiments of the present application is described below.
Referring to fig. 1A, fig. 1A is a schematic diagram of a system architecture according to an embodiment of the present disclosure. As shown in fig. 1A, the system architecture 10 includes a training device 11 and an execution device 12. The training device 11 is configured to train the model according to training data to obtain a video classification model 121, where the training data may include a training video in a training data set and a title of the training video. The video classification model 121 is used for processing image features and text features to obtain a classification result of a video.
The video classification model 121 obtained by training according to the training device 11 may be applied to different systems or devices, for example, the video classification model is applied to the execution device 12 shown in fig. 1A, where the execution device 12 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, a vehicle-mounted terminal, or a server or a cloud. As shown in fig. 1B, fig. 1B is a schematic structural diagram of an execution device according to an embodiment of the present application. The execution device 12 comprises a processor 120, a memory 130, a communication interface 140, and one or more programs 131, wherein the one or more programs 131 are stored in the memory 130 and configured to be executed by the processor 120, and the one or more programs 131 comprise instructions for performing any of the steps of the method embodiments described below. In a specific implementation, the application processor 120 is configured to execute any one of the steps executed by the execution device in the method embodiments described below, and when performing data transmission such as acquisition or transmission, optionally invokes the communication interface 140 to complete the corresponding operation.
Referring to fig. 2A, fig. 2A is a schematic flowchart of a video classification method according to an embodiment of the present application, and as shown in fig. 2A, the video classification method includes the following steps:
s201, acquiring a target video and a title of the target video.
S202, determining the image characteristics of the target video.
In a specific implementation, step 202 may include: sampling the target video to obtain a sampling image; and inputting the sampling image into a pre-trained image feature extraction model to obtain the image features.
The structure of the image feature extraction model may be an encoding-decoding (Encode-Decode) structure based on the multi-head attention mechanism, and may be, for example, a VIT model. The image features may then be calculated from the sampled images using the multi-head attention mechanism.
Specifically, when image features are obtained according to the sampled images, M frames of sampled images may be taken as a batch, and the M frames of sampled images are input into a pre-trained image feature extraction model to obtain features of each frame of sampled image in the M frames of sampled images, where the features of each frame of sampled image form an image feature of the target video, and the image feature may be a feature tensor.
For example, taking M frames of sampled images as one group input into the image feature extraction model, the M frames of sampled images may be stored in a data tensor of size (1, M, C, W, H). The first dimension "1" of this tensor indicates that one group of sampled images is processed at a time, M is the number of sampled frames in the group, and C, W and H are the number of color channels, the width and the height of each sampled frame, respectively. C may be 3 when the sampled images are RGB images and 1 when they are grayscale images, and H and W are generally fixed values, such as 256. Taking C = 3 and H = W = 256 as an example, the image feature extraction model may convert the tensor of size (1, M, C, W, H) into a tensor of size (1×M, C, W, H), extract image features from that tensor to obtain a tensor of size (1×M, 768), and then convert the (1×M, 768) tensor into a tensor of size (1, M, 768), which is the image feature obtained from the M frames of sampled images.
That is to say, the image feature extraction model may perform feature extraction on the M frames of sampled images at the same time to obtain an image feature of size (1, M, 768). When extracting the features of each sampled frame, the frame may be divided into several image blocks of the same size, and the features of that frame are then computed from its image blocks using the multi-head attention mechanism, so that the image feature of size (1, M, 768) corresponding to the M sampled frames is obtained. Because the relations among the local features within each sampled frame are exploited when determining the features of that frame, the accuracy of single-frame feature extraction is improved, and therefore the accuracy of determining the image features of the target video from the M frames of sampled images is also improved.
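The tensor reshaping described above can be sketched as follows in Python/PyTorch. This is a minimal illustration only: the linear layer merely stands in for the pre-trained VIT-style image feature extraction model, and the frame size is reduced from the 256×256 example above to keep the sketch light.

```python
import torch
import torch.nn as nn

# Dimensions from the example above (H and W reduced here to keep the sketch light).
M, C, W, H = 8, 3, 64, 64

# Stand-in for the pre-trained image feature extraction model: a single linear
# layer mapping each flattened frame to a 768-dimensional feature vector.
vit_encoder = nn.Linear(C * W * H, 768)

def extract_image_features(frames: torch.Tensor) -> torch.Tensor:
    """frames: (1, M, C, W, H) -> image feature tensor of size (1, M, 768)."""
    b, m, c, w, h = frames.shape
    x = frames.reshape(b * m, c, w, h)         # (1*M, C, W, H): frames treated as a batch
    feats = vit_encoder(x.flatten(1))          # (1*M, 768): one feature per sampled frame
    return feats.reshape(b, m, 768)            # (1, M, 768): the image feature of the group

sampled = torch.rand(1, M, C, W, H)            # one group of M sampled RGB frames
image_feature = extract_image_features(sampled)
print(image_feature.shape)                     # torch.Size([1, 8, 768])
```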
Referring to the introduction of the multi-head attention mechanism in the related terms above, a multi-head attention mechanism includes a plurality of self-attention mechanisms, and splicing the plurality of self-attention mechanisms yields the multi-head attention mechanism. A self-attention mechanism is a process in which the input tensor itself is used to calculate the importance of each of its parts, and the input tensor is then weighted by that importance. Taking an input tensor X as an example, let Q = K = V = X; the result of a single self-attention mechanism is then:

Attention(Q, K, V) = softmax(Q·K^T / √d_k) · V

where d_k is the scaling factor, softmax(Q·K^T / √d_k) is the importance weight of the input tensor after scaling and normalization, and V is weighted by this importance. The multi-head attention mechanism fuses the calculation results of the several self-attention mechanisms and plays a role similar to voting.
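A minimal sketch of the self-attention calculation above follows. It is illustrative only: the sequence length, feature size and number of heads are chosen for the example, and the learned projection matrices and final linear transform of a full multi-head attention layer are omitted.

```python
import torch
import torch.nn.functional as F

def self_attention(q, k, v):
    """Scaled dot-product self-attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # importance of each part of the input
    return F.softmax(scores, dim=-1) @ v            # weight V by the normalized scores

x = torch.rand(1, 5, 768)     # an input tensor; as above, let Q = K = V = X
q = k = v = x

# Multi-head attention: split the feature dimension into several heads, run
# self-attention per head, and fuse (concatenate) the per-head results.
heads = 4
outputs = [self_attention(qi, ki, vi)
           for qi, ki, vi in zip(q.chunk(heads, -1), k.chunk(heads, -1), v.chunk(heads, -1))]
fused = torch.cat(outputs, dim=-1)
print(fused.shape)            # torch.Size([1, 5, 768])
```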
S203, determining the text characteristics of the title.
In a specific implementation, step 203 may specifically include: processing the text corresponding to the title according to a preset word embedding coding rule to obtain a word vector corresponding to the text; and inputting the word vector, a preset text vector and a preset position vector into a pre-trained text feature extraction model to obtain the text features of the target video, wherein the text vector is used for representing the global semantic features of the text corresponding to the title, and the position vector is used for distinguishing words at different positions in the text corresponding to the title.
Processing the text corresponding to the title according to the preset word-embedding encoding rule to obtain the word vector corresponding to the text may specifically be as follows: each word in the text is converted into a one-dimensional vector by querying a word vector table, and these vectors are used as the input of the text feature extraction model; the output of the text feature extraction model is the vector representation, fused with the full-text semantic information, corresponding to each input word, i.e., the text feature of the title. Besides the word vectors, the input of the text feature extraction model also includes a preset text vector, whose values are learned automatically during training of the text feature extraction model and which represents the global semantic information of the title text, and a position vector, which is added to words at different positions in the title text so as to distinguish words at different positions.
Taking the case where the text feature extraction model processes the title data of one video at a time as an example, the word vector corresponding to the text may be a word vector of size (1, L), where 1 indicates that one title is processed at a time and L may be the longest title length obtained through statistics, for example L = 56. After the (1, 56) word vector is obtained, it may be input into the text feature extraction model together with an all-zero text vector of size (1, 56) and an all-one position vector of size (1, 56), finally yielding a feature of size (1, 56). In order to subsequently splice the image features and the text features and input the spliced feature into the video classification model to obtain the classification result, after the (1, 56) feature is obtained, its dimension is also raised from (1, 56) to (1, 1, 56), giving the text feature corresponding to the title.
The structure of the text feature extraction model may be an encoding-decoding (Encode-Decode) structure based on the multi-head attention mechanism, and may be, for example, a BERT model. The text features are obtained with the multi-head attention mechanism from the word vectors, the preset text vector and the preset position vector. That is to say, the text feature extraction model may compute with the multi-head attention mechanism over the word vector, the preset text vector and the preset position vector, and finally output, for each word in the text, a vector representation fused with the full-text semantic information, i.e., the text feature of the title.
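The title-side processing above might be sketched as follows. The tiny word vector table, the length L = 56, and the linear layer standing in for the pre-trained BERT-style text feature extraction model are illustrative assumptions, not part of the original disclosure.

```python
import torch
import torch.nn as nn

L = 56                                              # longest title length from the example
word_table = {"funny": 1, "cat": 2, "video": 3}     # illustrative word vector table
bert_encoder = nn.Linear(L, L)                      # stand-in for the pre-trained BERT model

def extract_text_feature(title_words):
    """Title text -> text feature of size (1, 1, 56)."""
    ids = [word_table.get(w, 0) for w in title_words][:L]
    ids += [0] * (L - len(ids))                     # pad the word vector to length L
    word_vec = torch.tensor([ids], dtype=torch.float)   # (1, 56) word vector
    text_vec = torch.zeros(1, L)                        # (1, 56) all-zero text vector
    pos_vec = torch.ones(1, L)                          # (1, 56) all-one position vector
    feat = bert_encoder(word_vec + text_vec + pos_vec)  # (1, 56) title feature
    return feat.unsqueeze(1)                            # dimension raised to (1, 1, 56)

text_feature = extract_text_feature(["funny", "cat", "video"])
print(text_feature.shape)                           # torch.Size([1, 1, 56])
```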
And S204, obtaining a target classification result of the target video according to the image characteristics, the text characteristics and a pre-trained video classification model.
In a specific implementation, step 204 may specifically include: splicing the image features and the text features to obtain splicing features; and inputting the splicing features into the video classification model to obtain a target classification result of the target video. The structure of the video classification model may be an encoding-decoding (Encode-Decode) structure based on the multi-head attention mechanism. The input/output data flow between the models in step 204 may be as shown in fig. 2B.
In a specific implementation, when the image features and the text features are spliced, the splicing may be performed along a specific dimension of the two features, for example the dimension that represents the batch processed by the model and the amount of data in each batch. For instance, every M frames of images are taken as one batch to obtain image features of size (1, M, 768), and text features of size (1, 1, 56) are obtained from the title; the text features of size (1, 1, 56) are zero-padded to match the image features of size (1, M, 768), and the two are then spliced along the second dimension to obtain splicing features of size (1, M+1, 768). Taking the splicing feature of size (1, M+1, 768) as an example, after the splicing feature is processed with the multi-head attention mechanism, an output of size (1, n_class) is obtained, i.e., the classification result of one video (during training of the classification model, the obtained (1, n_class) output is input in turn to the cross-entropy loss function and the normalized softmax function to obtain the classification result of the video), where n_class is the vector representation of the video classification result. The video classification model thus fuses the image features of the target video with the text features of the corresponding title and classifies them to obtain the final classification result of the target video.
In addition, if B batches of data (i.e., B videos) are processed each time, each video corresponding to one batch of sampled images (M frames) and one title, then image features of size (B, M, 768) and text features of size (B, 1, 768) can be obtained. The two features are spliced along the second dimension to obtain a feature of size (B, M+1, 768), and the multi-head attention mechanism is used to compute an output of size (B, n_class), giving the classification results of the B videos, where n_class is the vector representation of the classification result of each video (when the classification model is trained, the obtained (B, n_class) output is input in turn to the cross-entropy loss function and the normalized softmax function to obtain the classification results of the B videos).
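A sketch of the zero-padding, splicing and classification steps described above follows. The mean-pooling plus linear head used here merely stands in for the multi-head-attention-based video classification model, and n_class and the training label are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, M, n_class = 1, 8, 10
head = nn.Linear(768, n_class)          # stand-in for the video classification model

def classify(image_feature, text_feature):
    """image_feature: (B, M, 768), text_feature: (B, 1, 56) -> logits of size (B, n_class)."""
    pad = torch.zeros(text_feature.size(0), 1, 768 - text_feature.size(-1))
    text_768 = torch.cat([text_feature, pad], dim=-1)      # zero-pad the text feature to 768
    spliced = torch.cat([image_feature, text_768], dim=1)  # splice along dim 1: (B, M+1, 768)
    return head(spliced.mean(dim=1))                       # (B, n_class) classification output

logits = classify(torch.rand(B, M, 768), torch.rand(B, 1, 56))
probs = F.softmax(logits, dim=-1)                          # normalized class scores
loss = F.cross_entropy(logits, torch.tensor([3]))          # training-time loss, label assumed
```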
It can be seen that, in the embodiment of the present application, a target video and a title of the target video are obtained, then, image features of the target video and text features of the title are determined, and finally, a target classification result of the target video is obtained according to the image features, the text features and a pre-trained video classification model. When the classification result of the target video is determined, the image characteristics of the target video and the text characteristics of the target video are combined, and a pre-trained video classification model is utilized, so that the efficiency of video classification and the precision of video classification are improved.
In one possible example, the image features include N image features, N being an integer no less than 3; obtaining a target classification result of the target video according to the image feature, the text feature and a pre-trained video classification model, wherein the target classification result comprises: obtaining N reference classification results of the target video according to the N image features, the text features and the video classification model, wherein the N reference classification results correspond to the N image features one to one; and determining a target classification result of the target video according to the N reference classification results.
In particular, N may be 3, which is beneficial to improving the classification efficiency while ensuring the video classification accuracy.
Obtaining N reference classification results of the target video according to the N image characteristics, the text characteristics and the video classification model, namely executing the following steps according to each image characteristic in the N image characteristics: and determining a reference classification result of the target video according to the current image characteristics, the text characteristics and the pre-trained video classification model.
Specifically, a reference classification result of the target video is determined according to the current image feature, the text feature and the pre-trained video classification model; this may be done by splicing the current image feature with the text feature to obtain a current splicing feature and inputting the current splicing feature into the pre-trained video classification model. After the N reference classification results are obtained, the target classification result of the target video may be determined from the N reference classification results, for example, according to a preset voting strategy, or by counting the numbers of the different reference classification results.
When the image features of the target video are N image features, each image feature may be obtained by inputting M frames of sampled images of the target video into the image feature extraction model; that is, N batches of images may be obtained by sampling the target video, each batch containing M sampled frames. Taking N = 3 as an example, 3M sampled frames are obtained by sampling the target video, one image feature is obtained from each group of M sampled frames, giving 3 image features in total; the 3 image features are then each spliced with the text features of the title of the target video to obtain 3 splicing features, and 3 reference classification results of the target video are obtained from the 3 splicing features.
It can be seen that, in this example, at least 3 reference classification results corresponding to at least 3 image features one to one are obtained according to at least 3 image features, text features and a video classification model, and then a target classification result of a target video is determined according to the at least 3 reference classification results, which is beneficial to improving the precision of video classification.
In one possible example, the determining the target classification result of the target video according to the N reference classification results includes: and determining the reference classification result with the largest occurrence frequency in the N reference classification results as the target classification result.
In specific implementation, after a plurality of reference classification results are determined, the reference classification results can be counted to determine the specific number of each reference classification result, and then the reference classification result with the largest number can be determined as the final target classification result of the target video, so that the processing logic is simple and the efficiency is improved.
Therefore, in this example, the reference classification result with the largest occurrence frequency among the N reference classification results is determined as the target classification result, which is beneficial to improving the video classification precision and ensuring the classification processing efficiency.
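A small sketch of this majority-vote step, assuming the N reference classification results are already available as class labels:

```python
from collections import Counter

def target_classification(reference_results):
    """Return the reference classification result that occurs most often."""
    return Counter(reference_results).most_common(1)[0][0]

# e.g. N = 3 reference results obtained from 3 image features and the text feature:
print(target_classification(["sports", "music", "sports"]))   # -> "sports"
```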
In one possible example, each of the image features is obtained from M frames of sampled images, the M frames of sampled images are obtained by performing time-series sampling on the target video, the M frames of sampled images are images subjected to the same data enhancement processing, and M is an integer greater than 1.
Specifically, the M frames of sampled images may be obtained as follows: the target video is sampled continuously or at intervals in time order to obtain M video frames, and the same data enhancement processing is applied to these M frames to obtain the M frames of sampled images. Because the M video frames are temporally correlated, one image feature contains the features of M temporally correlated sampled frames of the target video. Therefore, when the splicing feature is obtained from the image feature and the text feature, and the video classification model processes the splicing feature with the multi-head attention mechanism to obtain the classification result of the target video, the processing of the video classification model reflects the respective importance of the M temporally correlated frames and of the video title; the temporal relation among the M video frames is exploited, and the precision of the video classification result is improved.
The data enhancement processing may specifically include geometric transformation enhancement and color transformation enhancement. The geometric transformation enhancement may include flipping, cropping and scaling, and the color transformation enhancement may include color jittering and image blurring. The M video frames obtained by time-ordered sampling are subjected to the same data enhancement processing and have the same random noise added; compared with applying random, independent data enhancement to each frame of the target video, this improves the temporal correlation among the sampled images.
As can be seen, in this example, one image feature is determined according to M frames of sampled images obtained by performing time-series sampling on a target video, and the M frames of sampled images are images subjected to the same data enhancement processing, so that the correlation between the time series of the sampled images is improved, and the improvement of the precision of video classification is facilitated.
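One way to apply the same data enhancement to every frame in a group is to draw the random enhancement parameters once and reuse them for all M frames, as in the following sketch; the specific flip, brightness-jitter and noise operations are assumed examples of the geometric and color enhancements mentioned above.

```python
import torch

def augment_group(frames: torch.Tensor) -> torch.Tensor:
    """frames: (M, C, H, W); apply the SAME random enhancement to every frame."""
    # Draw the random parameters once per group, not once per frame.
    do_flip = torch.rand(1).item() < 0.5            # geometric: horizontal flip
    brightness = 0.8 + 0.4 * torch.rand(1).item()   # color: brightness jitter factor
    noise = 0.01 * torch.randn_like(frames[0])      # the same random noise for all frames

    out = []
    for frame in frames:                            # identical treatment preserves the
        f = torch.flip(frame, dims=[-1]) if do_flip else frame  # time-series correlation
        f = (f * brightness + noise).clamp(0.0, 1.0)
        out.append(f)
    return torch.stack(out)

M = 8
sampled_frames = torch.rand(M, 3, 64, 64)           # M frames sampled in time order
augmented = augment_group(sampled_frames)           # (M, 3, 64, 64), consistently enhanced
```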
It should be noted that the training method of the video classification model is similar to the video classification method provided in the embodiments of the present application and is not repeated here. However, when the video classification model is trained with the training videos and their titles, only one image feature may be determined for each training video; that is, each training video may be time-sampled only once to obtain M frames of sampled images, and after the same data enhancement processing is uniformly applied to these M frames, they are input into the image feature extraction model to extract the training image features corresponding to the training video.
Referring to fig. 3A, fig. 3A is a block diagram illustrating functional units of a video classification apparatus according to an embodiment of the present application, where the apparatus 30 includes:
a first acquisition unit 301 configured to acquire a target video and a title of the target video;
a first determining unit 302, configured to obtain target image feature data according to the target image data;
a second determining unit 303, configured to obtain target text feature data according to the target text data;
and the processing unit 304 is configured to obtain a target classification result of the target video according to the image feature, the text feature and a pre-trained video classification model.
In one possible example, the first determining unit 302 is specifically configured to: sampling the target video to obtain a sampling image; and inputting the sampling image into a pre-trained image feature extraction model to obtain the image features.
In a possible example, the second determining unit 303 is specifically configured to: processing the text corresponding to the title according to a preset word embedding coding rule to obtain a word vector corresponding to the text; and inputting the word vector, a preset text vector and a preset position vector into a pre-trained text feature extraction model to obtain the text features of the target video, wherein the text vector is used for representing the global semantic features of the text corresponding to the title, and the position vector is used for distinguishing words at different positions in the text corresponding to the title.
In one possible example, the processing unit 304 is specifically configured to: splicing the image features and the text features to obtain splicing features; and inputting the splicing characteristics into the video classification model to obtain a target classification result of the target video.
In one possible example, the image features include N image features, N being an integer no less than 3; the processing unit 304 is specifically configured to: obtaining N reference classification results of the target video according to the N image features, the text features and the video classification model, wherein the N reference classification results correspond to the N image features one to one; and determining a target classification result of the target video according to the N reference classification results.
In one possible example, in the aspect of determining the target classification result of the target video according to the N reference classification results, the processing unit 304 is specifically configured to: and determining the reference classification result with the largest occurrence frequency in the N reference classification results as the target classification result.
In one possible example, each of the image features is obtained from M frames of sampled images, the M frames of sampled images are obtained by performing time-series sampling on the target video, the M frames of sampled images are images subjected to the same data enhancement processing, and M is an integer greater than 1.
In the case of using an integrated unit, a block diagram of functional units of another video classification apparatus provided in the embodiment of the present application is shown in fig. 3B. In fig. 3B, the video classification apparatus includes: a processing module 310 and a communication module 311. The processing module 310 is used for controlling and managing actions of the video classification apparatus, such as steps performed by the first obtaining unit 301, the first determining unit 302, the second determining unit 303, the processing unit 304, and/or other processes for performing the techniques described herein. The communication module 311 is used to support interaction between the video classification apparatus and other devices. As shown in fig. 3B, the video classification apparatus may further include a storage module 312, and the storage module 312 is used for storing program codes and data of the video classification apparatus.
The Processing module 310 may be a Processor or a controller, and may be, for example, a Central Processing Unit (CPU), a general-purpose Processor, a Digital Signal Processor (DSP), an ASIC, an FPGA or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor may also be a combination of computing functions, e.g., comprising one or more microprocessors, DSPs, and microprocessors, among others. The communication module 311 may be a transceiver, an RF circuit or a communication interface, etc. The storage module 312 may be a memory.
All relevant contents of each scene related to the method embodiment may be referred to the functional description of the corresponding functional module, and are not described herein again. The video classification apparatus can perform the steps of the video classification method shown in fig. 2A.
Embodiments of the present application also provide a computer storage medium, wherein the computer storage medium stores a computer program for electronic data exchange, and the computer program enables a computer to execute part or all of the steps of any one of the methods as described in the above method embodiments.
Embodiments of the present application further provide a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements part or all of the steps of any one of the methods described in the above method embodiments. The computer program product may be a software installation package.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and elements referred to are not necessarily required in this application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the above-described division of the units is only one type of division of logical functions, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices or units, and may be an electric or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit may be stored in a computer-readable memory if it is implemented in the form of a software functional unit and sold or used as a stand-alone product. Based on such understanding, the technical solution of the present application, in essence, or the part thereof contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory and including several instructions for causing a computer device (which may be a personal computer, a server or a network device) to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned memory includes: a USB flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk or an optical disk, and other media capable of storing program code.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable memory, which may include: flash Memory disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application, and the above description of the embodiments is only provided to help understand the method and the core concept of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (10)

1. A method of video classification, comprising:
acquiring a target video and a title of the target video;
determining image characteristics of the target video;
determining a text feature of the title;
and obtaining a target classification result of the target video according to the image characteristics, the text characteristics and a pre-trained video classification model.
2. The method of claim 1, wherein determining the image characteristics of the target video comprises:
sampling the target video to obtain a sampling image;
and inputting the sampling image into a pre-trained image feature extraction model to obtain the image features.
3. The method of claim 1, wherein the determining the text feature of the title comprises:
processing the text corresponding to the title according to a preset word embedding coding rule to obtain a word vector corresponding to the text; and inputting the word vector, a preset text vector and a preset position vector into a pre-trained text feature extraction model to obtain the text features of the target video, wherein the text vector is used for representing the global semantic features of the text corresponding to the title, and the position vector is used for distinguishing words at different positions in the text corresponding to the title.
4. The method according to claim 1, wherein obtaining the target classification result of the target video according to the image feature, the text feature and a pre-trained video classification model comprises:
splicing the image features and the text features to obtain splicing features;
and inputting the splicing characteristics into the video classification model to obtain a target classification result of the target video.
5. The method according to any one of claims 1 to 3, wherein the image features include N image features, N being an integer not less than 3;
obtaining a target classification result of the target video according to the image feature, the text feature and a pre-trained video classification model, wherein the target classification result comprises:
obtaining N reference classification results of the target video according to the N image features, the text features and the video classification model, wherein the N reference classification results correspond to the N image features one to one;
and determining a target classification result of the target video according to the N reference classification results.
6. The method of claim 5, wherein determining the target classification result of the target video according to the N reference classification results comprises:
and determining the reference classification result with the largest occurrence frequency in the N reference classification results as the target classification result.
7. The method according to any one of claims 1 to 6, wherein each of the image features is obtained from M frames of sampled images, the M frames of sampled images are obtained by time-sequentially sampling the target video, the M frames of sampled images are images subjected to the same data enhancement processing, and M is an integer greater than 1.
8. A video classification apparatus, comprising:
a first acquisition unit configured to acquire a target video and a title of the target video;
the first determining unit is used for obtaining target image characteristic data according to the target image data;
the second determining unit is used for obtaining target text characteristic data according to the target text data;
and the processing unit is used for obtaining a target classification result of the target video according to the image characteristics, the text characteristics and a pre-trained video classification model.
9. An electronic device comprising a processor, a memory, a communication interface, and one or more programs stored in the memory and configured to be executed by the processor, the programs comprising instructions for performing the steps in the method of any of claims 1-7.
10. A computer-readable storage medium, characterized in that a computer program for electronic data exchange is stored, wherein the computer program causes a computer to perform the steps in the method according to any of claims 1-7.
CN202111438578.1A 2021-11-29 2021-11-29 Video classification method and related device Pending CN114090826A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111438578.1A CN114090826A (en) 2021-11-29 2021-11-29 Video classification method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111438578.1A CN114090826A (en) 2021-11-29 2021-11-29 Video classification method and related device

Publications (1)

Publication Number Publication Date
CN114090826A 2022-02-25

Family

ID=80305644

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111438578.1A Pending CN114090826A (en) 2021-11-29 2021-11-29 Video classification method and related device

Country Status (1)

Country Link
CN (1) CN114090826A (en)

Similar Documents

Publication Publication Date Title
CN110472090B (en) Image retrieval method based on semantic tags, related device and storage medium
CN110837579A (en) Video classification method, device, computer and readable storage medium
CN110377740B (en) Emotion polarity analysis method and device, electronic equipment and storage medium
CN109117781B (en) Multi-attribute identification model establishing method and device and multi-attribute identification method
CN110083729B (en) Image searching method and system
CN112364204B (en) Video searching method, device, computer equipment and storage medium
CN112559800B (en) Method, apparatus, electronic device, medium and product for processing video
CN108984555B (en) User state mining and information recommendation method, device and equipment
CN115443490A (en) Image auditing method and device, equipment and storage medium
CN111783712A (en) Video processing method, device, equipment and medium
CN111400551B (en) Video classification method, electronic equipment and storage medium
CN111382620A (en) Video tag adding method, computer storage medium and electronic device
CN112766284A (en) Image recognition method and device, storage medium and electronic equipment
CN106997381B (en) Method and device for recommending movies to target user
CN112800177B (en) FAQ knowledge base automatic generation method and device based on complex data types
CN113435499A (en) Label classification method and device, electronic equipment and storage medium
CN114036903A (en) Grading card model implementation method, device, equipment and readable storage medium
US11403339B2 (en) Techniques for identifying color profiles for textual queries
CN115909176A (en) Video semantic segmentation method and device, electronic equipment and storage medium
CN114090826A (en) Video classification method and related device
CN114064972A (en) Video type determination method and related device
CN117009577A (en) Video data processing method, device, equipment and readable storage medium
CN114782720A (en) Method, device, electronic device, medium, and program product for determining matching of document
CN115952317A (en) Video processing method, device, equipment, medium and program product
CN113742525A (en) Self-supervision video hash learning method, system, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination