CN113240004B - Video information determining method, device, electronic equipment and storage medium - Google Patents

Video information determining method, device, electronic equipment and storage medium Download PDF

Info

Publication number
CN113240004B
Authority
CN
China
Prior art keywords
feature
features
video
video information
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110512985.6A
Other languages
Chinese (zh)
Other versions
CN113240004A (en)
Inventor
曹萌
刘旭东
梅晓茸
李炬盼
陶斐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110512985.6A priority Critical patent/CN113240004B/en
Publication of CN113240004A publication Critical patent/CN113240004A/en
Application granted granted Critical
Publication of CN113240004B publication Critical patent/CN113240004B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The present disclosure relates to a method, an apparatus, an electronic device, and a storage medium for determining video information, and belongs to the technical field of multimedia. A multi-modal feature fusion approach is adopted: the image features and audio features of a target video are spliced to obtain a first feature that reflects multi-modal information about the target video, making the subsequent determination of the video information more accurate. The first feature is input into a video classification model to classify the target video and preliminarily determine its classification result, and the classification result of the target video is then spliced with the multi-modal feature to obtain a second feature that describes the target video as a whole.

Description

Video information determining method, device, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of multimedia, and in particular relates to a video information determining method, a video information determining device, electronic equipment and a storage medium.
Background
With the rapid development of computer technology and the mobile internet, short videos are gaining favor with more and more users. Short videos can be used for advertisement promotion, and their quality has a great influence on the final delivery effect on the platform. In the related art, video quality is generally determined by video subjective quality assessment (Subjective Quality Assessment, SQA). In subjective quality assessment, a group of subjects is selected to view a series of short videos for about 10 to 30 minutes in a particular environment; the subjects then score the quality of the videos using different methods, and finally an average score (Mean Opinion Score, MOS) is taken and the resulting data are analyzed.
In the above technology, the SQA method relies on the subjective impressions of the test subjects and therefore cannot evaluate short videos objectively, so the determined quality of a short video has poor accuracy and limited reference value.
Disclosure of Invention
The present disclosure provides a method, an apparatus, an electronic device, and a storage medium for determining video information, which can objectively and accurately determine the video information. The technical scheme of the present disclosure is as follows:
according to a first aspect of embodiments of the present disclosure, there is provided a video information determining method, the method including:
Acquiring a first feature based on image features and audio features of a target video;
Processing the first feature to obtain a prediction classification result of the target video;
splicing the first feature with the difference feature of the target video to obtain a second feature, wherein the difference feature is used for representing the difference between the prediction classification result and the real classification result of the target video;
and processing the second characteristic to obtain video information of the target video.
In some embodiments, the acquiring the first feature based on the image feature and the audio feature of the target video includes any one of:
Splicing the image features of the target video and the audio features to obtain the first features;
And splicing the image characteristics, the audio characteristics and the feedback characteristics of the target video to obtain the first characteristics, wherein the feedback characteristics are used for representing feedback acquired by the target video after being put on a platform.
In some embodiments, the stitching the image feature and the audio feature of the target video to obtain the first feature includes:
In response to detecting that the playing time period corresponding to the image feature is the same as the playing time period corresponding to the plurality of audio features, acquiring the image features with the same number as the plurality of audio features;
And splicing each image feature with a corresponding audio feature to obtain the first feature.
In some embodiments, the stitching the image feature and the audio feature of the target video to obtain the first feature further includes:
In response to detecting that the playing time period corresponding to the audio feature is the same as the playing time period corresponding to the plurality of image features, acquiring the audio features with the same number as the plurality of image features;
And splicing each image feature with a corresponding audio feature to obtain the first feature.
In some embodiments, the stitching the image feature, the audio feature, and the feedback feature of the target video to obtain the first feature includes:
splicing the image features and the audio features to obtain spliced features;
and in response to a plurality of the spliced features corresponding to one feedback feature, acquiring feedback features equal in number to the spliced features, and splicing each spliced feature with a corresponding feedback feature to obtain the first feature.
In some embodiments, the stitching the first feature with the difference feature of the target video to obtain a second feature includes:
And responding to the fact that a plurality of first features correspond to one difference feature, obtaining difference features with the same number as the first features, and splicing each first feature with one corresponding difference feature to obtain the second feature.
In some embodiments, the processing the first feature to obtain a prediction classification result of the target video includes:
And processing the first feature through a video classification model to obtain a prediction classification result of the target video, wherein the video classification model is obtained based on sample video and corresponding classification label training.
In some embodiments, the processing the second feature to obtain video information of the target video includes:
And processing the second feature through a video information determining model to obtain video information of the target video, wherein the video information determining model is obtained based on sample video and corresponding sample video information training.
According to a second aspect of the embodiments of the present disclosure, there is provided a video information determining apparatus including:
An acquisition unit configured to acquire a first feature based on image features and audio features of a target video;
The processing unit is configured to execute processing on the first feature to obtain a prediction classification result of the target video;
The splicing unit is configured to splice the first feature and the difference feature of the target video to obtain a second feature, wherein the difference feature is used for representing the difference between the predicted classification result and the real classification result of the target video;
The processing unit is configured to perform processing on the second feature to obtain video information of the target video.
In some embodiments, the stitching unit is configured to perform any one of:
Splicing the image features of the target video and the audio features to obtain the first features;
And splicing the image characteristics, the audio characteristics and the feedback characteristics of the target video to obtain the first characteristics.
In some embodiments, the stitching unit is configured to perform stitching each image feature with a corresponding audio feature to obtain the first feature in response to detecting that a playing time period corresponding to the image feature is the same as a playing time period corresponding to a plurality of audio features.
In some embodiments, the stitching unit is configured to perform stitching each image feature with a corresponding audio feature to obtain the first feature in response to detecting that a playing time period corresponding to the audio feature is the same as a playing time period corresponding to a plurality of image features.
In some embodiments, the stitching unit is configured to perform stitching the image feature and the audio feature, obtain stitched features, and in response to a plurality of stitched features corresponding to one feedback feature, obtain feedback features equal to the number of stitched features, stitch each stitched feature with one corresponding feedback feature, and obtain the first feature.
In some embodiments, the stitching unit is configured to perform stitching each first feature with a corresponding difference feature to obtain the second feature in response to a plurality of the first features corresponding to a difference feature, obtaining the difference features equal to the number of the first features.
In some embodiments, the processing unit is configured to perform processing of the first feature by a video classification model, resulting in a predicted classification result of the target video, the video classification model being trained based on the sample video and a corresponding classification label.
In some embodiments, the processing unit is configured to perform processing of the second feature by a video information determination model, resulting in video information of the target video, the video information determination model being trained based on the sample video and the corresponding sample video information.
According to a third aspect of embodiments of the present disclosure, there is provided an electronic device comprising:
one or more processors;
A memory for storing the processor-executable program code;
wherein the processor is configured to execute the program code to implement the video information determination method described above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium comprising: the program code in the computer readable storage medium, when executed by a processor of an electronic device, enables the electronic device to perform the video information determination method described above.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the video information determination method described above.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
FIG. 1 is a schematic diagram of an implementation environment of a video information determination method according to an exemplary embodiment;
FIG. 2 is a block diagram of a video information determination system, according to an exemplary embodiment;
FIG. 3 is a flowchart illustrating a method of video information determination, according to an exemplary embodiment;
FIG. 4 is a flowchart illustrating a method of video information determination, according to an exemplary embodiment;
FIG. 5 is a model block diagram illustrating a method of video information determination according to an exemplary embodiment;
Fig. 6 is a block diagram of a video information determining apparatus according to an exemplary embodiment;
fig. 7 is a block diagram of a server, according to an example embodiment.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
The data referred to in this disclosure may be data authorized by the user or sufficiently authorized by the parties.
Fig. 1 is a schematic view of an implementation environment of a video information determining method according to an embodiment of the present disclosure, referring to fig. 1, the implementation environment includes: a terminal 101 and a server 102.
The terminal 101 may be at least one of a smart phone, a desktop computer, a portable computer, a virtual reality terminal, an augmented reality terminal, a wireless terminal, a laptop computer, etc. The terminal 101 has a communication function and may access the internet. The terminal 101 may refer to one of a plurality of terminals, and this embodiment is only exemplified by the terminal 101; those skilled in the art will recognize that the number of terminals may be greater or lesser. Various types of applications, such as video applications and music applications, may run on the terminal 101. A user generates corresponding content or feedback data by operating within an application running on the terminal, and the generated content or feedback data is transmitted to the server 102. For example, taking a video application as an example, a user uploads a homemade short video, comments on a short video, or makes a purchase based on a watched short video, thereby generating feedback data.
The server 102 may be an independent physical server, a server cluster or a distributed file system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content distribution network (Content Delivery Network, CDN), basic cloud computing services such as big data and an artificial intelligence platform. The server 102 may be associated with a base database for storing base data and base feature data of short videos, etc. The basic data comprises, but is not limited to, image data of the video, audio data and feedback data of the platform, and the basic characteristic data comprises, but is not limited to, image characteristics, voice characteristics, text characteristics of the video and feedback characteristics obtained based on the feedback data of the platform. The server 102 and the terminal 101 may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiment of the present application. Alternatively, the number of servers 102 may be greater or lesser, which is not limited by the embodiments of the present application. Of course, the server 102 may also include other functional servers to provide more comprehensive and diverse services.
Based on the above implementation environment, fig. 2 is a block diagram illustrating a video information determining system according to an exemplary embodiment, and as shown in fig. 2, the system includes the following three parts:
(1) Base data layer
The base data layer is used for storing relevant data of the target video, which is fully authorized by each party and comprises a base data stream and base characteristic data.
Wherein the base data stream includes, but is not limited to: image data, audio data of the video, and feedback data of the platform, the feedback data referring to specific behavior of the user with respect to the target video, for example, whether the user clicks on the target video.
The basic feature data includes, but is not limited to, image features, speech features, and text features of the video, as well as feedback features derived from the platform's feedback data; of course, the basic feature data may include other features as well. The features may be represented by vectors. The image features are obtained by processing the corresponding image data with a convolutional neural network; the speech features are obtained based on the speech data, and the text features are obtained by converting the speech data to text and then processing the text with a Bidirectional Encoder Representations from Transformers (BERT) model. Of course, the above three kinds of features can also be obtained by other methods, which is not limited by the embodiments of the present disclosure.
In some embodiments, the feedback features represent at least one of a delivery status, a delivery effect, and a video system admission degree. The delivery status refers to state information of the video within the video system, such as which scenes the video is applied to and whether it has a cover. The delivery effect refers to information such as the exposure rate, conversion rate, and click rate of the video. The video system admission degree refers to the win rate of the video within the video system; the win rate indicates how strongly users prefer the video. For example, if a first video and a second video are different versions of the same advertisement delivered by the same advertiser, and the first video is favored by audiences more often, the first video has the higher win rate. In some embodiments, other features include the category label of the video and interest orientation information representing the audience of the video.
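As an illustrative, non-limiting sketch, the feedback feature can be thought of as a numeric vector assembled from the quantities above. The patent does not specify an encoding, so the function name, field choices, one-hot scene encoding, and scaling below are assumptions for illustration only.

```python
# Hypothetical sketch of assembling a feedback feature vector from delivery
# status, delivery effect, and video system admission degree (all assumptions).
import numpy as np

def build_feedback_feature(has_cover: bool, scene_id: int, num_scenes: int,
                           exposure_rate: float, conversion_rate: float,
                           click_rate: float, win_rate: float) -> np.ndarray:
    scene_one_hot = np.zeros(num_scenes, dtype=np.float32)
    scene_one_hot[scene_id] = 1.0                       # delivery status: applied scene
    status = np.concatenate([[float(has_cover)], scene_one_hot])
    effect = np.array([exposure_rate, conversion_rate, click_rate], dtype=np.float32)
    admission = np.array([win_rate], dtype=np.float32)  # video system admission degree
    return np.concatenate([status, effect, admission])
```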
(2) Model layer
The model layer provides the video information determination function and, optionally, also stores the determined video information offline. In some embodiments, the video information refers to video quality. The model layer is provided with a video classification model and a video information determination model capable of multi-modal fusion. Multi-modal features are extracted from the data provided by the base data layer; the multi-modal features are input into the video classification model to obtain a video classification result, and the classification result and the multi-modal features are then input into the video information determination model to determine the video information. Optionally, the model layer performs the above video information determination process on the server's GPU.
Optionally, for an urgent video information determination task, the server feeds the resulting video information back to the policy service layer in real time; for a non-urgent task, the server stores the obtained video information offline.
(3) Policy service layer
The policy service layer adjusts the video information obtained from the model layer according to a score adjustment strategy. In some embodiments, the score adjustment strategy includes: determining reference information that limits the value range of the finally output video information, and, in response to the obtained video information exceeding the range limited by the reference information, adjusting the video information to avoid values that are too high or too low; and adjusting the video information with different strategies according to different scene requirements, for example, improving the data precision of the video information for a business scene. The embodiments of the present disclosure do not limit which strategy is adopted.
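A minimal sketch of the range-limiting part of this strategy is shown below; the reference range, rounding precision, and function name are assumptions, not part of the patent.

```python
# Hypothetical sketch: clamp the video information to the range defined by the
# reference information and round to the precision required by the scene.
def adjust_score(score: float, low: float = 0.0, high: float = 1.0,
                 precision: int = 4) -> float:
    clamped = min(max(score, low), high)   # avoid values that are too high or too low
    return round(clamped, precision)       # scene-dependent precision (assumed)
```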
The scheme provided in this embodiment may determine video information of an advertisement video or any other video, which is not limited in this embodiment.
Fig. 3 is a flowchart illustrating a video information determination method that can be used in a server, according to an exemplary embodiment, the method comprising:
in step 301, a first feature is acquired based on image features and audio features of a target video.
In step 302, the first feature is processed to obtain a prediction classification result of the target video.
In step 303, the first feature is spliced with a difference feature of the target video to obtain a second feature, where the difference feature is used to represent a difference between a predicted classification result and a true classification result of the target video.
In step 304, the second feature is processed to obtain video information of the target video.
According to the technical solution provided by the embodiments of the present disclosure, a multi-modal feature fusion approach is adopted: the image features and audio features of the target video are spliced to obtain a first feature that reflects multi-modal information about the target video, making the subsequent determination of the target video's information more accurate. The first feature is input into a video classification model to classify the target video and preliminarily determine its classification result, and the classification result of the target video is then spliced with the multi-modal feature to obtain a second feature that describes the target video as a whole.
Fig. 4 is a flowchart illustrating a video information determination method for use in a server according to an exemplary embodiment. As shown in fig. 4, the method comprises the following steps:
In step 401, the server acquires image data and audio data of a target video.
The target video is a video whose video information has not yet been determined; it is either a video newly published to the platform by a user or any one of a plurality of videos already published on the platform. In some embodiments, the server obtains the image data and audio data of the target video uploaded by the user.
In some embodiments, the server obtains the image data and audio data of the target video from the base data stream in the base data layer. Optionally, in response to receiving an information determination request for the target video, the server obtains the image data and audio data corresponding to the video identifier carried in the request from the base data stream. Optionally, the server periodically triggers the video information determination process for videos published on the platform; that is, the above information determination request is periodically triggered by the server so that the method is performed on the target video. If video information needs to be determined for a plurality of videos, it is determined for each video in parallel or serially.
In step 402, the server performs feature extraction on the image data to obtain image features.
In some embodiments, the server invokes a first fully-connected neural network, takes image data of the target video as input data of the first fully-connected neural network, and takes output of the first fully-connected neural network as the image feature.
Fig. 5 is a model structure diagram of a video information determining method according to an exemplary embodiment. As shown in fig. 5, the model includes four parts: the first fully connected neural network 501; the second fully connected neural network 502A, the LSTM neural network 502B, and the self-attention module 502C; the video classification model 503; and the video information determination model 504. It should be understood that the squares with different gray scales in fig. 5 represent features derived from different data.
Optionally, the first fully connected neural network 501 includes a hidden layer containing M nodes, where M is an integer greater than 0, for example, M=128. The other parts of the model are described in detail below.
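For illustration, a minimal PyTorch sketch of the first fully connected network 501 follows. The patent specifies no framework; the class name, the input dimension (e.g. frame embeddings from a CNN backbone), and the ReLU activation are assumptions — only the 128-node hidden layer follows the example above.

```python
# Hypothetical sketch of the first fully connected network 501 (image features).
import torch
import torch.nn as nn

class ImageFeatureNet(nn.Module):
    def __init__(self, in_dim: int = 2048, hidden_dim: int = 128):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (num_frames, in_dim) -> image features: (num_frames, 128)
        return self.hidden(frames)
```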
In step 403, the server performs feature extraction on the audio data to obtain audio features.
In some embodiments, the server invokes a second fully connected neural network and a Long Short-Term Memory (LSTM) neural network, takes the audio data of the target video as input to the second fully connected neural network, feeds the output of the second fully connected neural network into the LSTM neural network, and takes the output of the LSTM neural network as the feature derived from the audio data. Optionally, referring to part 502 in fig. 5, the second fully connected neural network 502A includes a hidden layer containing N nodes, and the LSTM neural network 502B contains K nodes, where N and K are integers greater than 0, for example, N=50 and K=30.
On the basis of the above structure, optionally, referring to part 502 in fig. 5, the model structure adopted when extracting features from the audio data includes the second fully connected neural network 502A, the LSTM neural network 502B, and the self-attention module 502C; the self-attention module maps the output of the LSTM neural network based on a self-attention mechanism to obtain the audio features.
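A minimal PyTorch sketch of part 502 follows, using the example sizes N=50 and K=30. The audio input dimension and the exact form of the attention (here a simple single-head attention pooling over the LSTM outputs) are assumptions; the patent only states that a self-attention mechanism maps the LSTM output.

```python
# Hypothetical sketch of part 502: FC network 502A (N=50), LSTM 502B (K=30),
# and an attention module 502C pooling the LSTM outputs into one audio feature.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioFeatureNet(nn.Module):
    def __init__(self, in_dim: int = 128, fc_dim: int = 50, lstm_dim: int = 30):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(in_dim, fc_dim), nn.ReLU())  # 502A
        self.lstm = nn.LSTM(fc_dim, lstm_dim, batch_first=True)        # 502B
        self.attn = nn.Linear(lstm_dim, 1)                             # 502C scoring (assumed form)

    def forward(self, audio: torch.Tensor) -> torch.Tensor:
        # audio: (batch, time, in_dim)
        h, _ = self.lstm(self.fc(audio))           # (batch, time, lstm_dim)
        weights = F.softmax(self.attn(h), dim=1)   # attention weights over time
        return (weights * h).sum(dim=1)            # audio feature: (batch, lstm_dim)
```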
Steps 401 to 403 above are described using the example in which feature extraction is performed as part of the determination process. In some embodiments, the server instead obtains the image features and audio features of the target video from the basic feature data of the base data layer, which greatly reduces the computational load on the server.
In step 404, the server concatenates the image feature and the audio feature to obtain a first feature.
In some embodiments, in response to the image features and audio features having different dimensions, either of them is processed to unify the dimensions. In one embodiment, in response to detecting that the playing time period corresponding to a certain image feature is the same as the playing time period corresponding to a plurality of audio features, image features equal in number to the plurality of audio features are acquired, and each image feature is spliced with a corresponding audio feature to obtain the first feature. In some embodiments, the image features equal in number to the plurality of audio features are obtained by copying the image feature multiple times.
In one embodiment, in response to detecting that the playing time period corresponding to a certain audio feature is the same as the playing time period corresponding to a plurality of image features, audio features equal in number to the plurality of image features are acquired, and each image feature is spliced with a corresponding audio feature to obtain the first feature. In some embodiments, the audio features equal in number to the plurality of image features are obtained by copying the audio feature multiple times.
Through this splicing, the obtained first feature contains both the information representing the image features and the information representing the audio features, so it embodies the multi-modal characteristics of the target video and makes the subsequent determination of the target video's information more accurate.
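A minimal sketch of this copy-to-align-then-concatenate step is shown below. The tensor shapes, function names, and use of PyTorch are assumptions; the same alignment by copying is reused when feedback features are spliced in, as described next.

```python
# Hypothetical sketch of step 404: copy the modality with fewer feature vectors
# until the counts match, then concatenate corresponding vectors.
import torch

def align_by_copying(feats: torch.Tensor, target_count: int) -> torch.Tensor:
    """Copy feature vectors (rows) until there are target_count of them."""
    if feats.size(0) == target_count:
        return feats
    reps = -(-target_count // feats.size(0))      # ceiling division
    return feats.repeat(reps, 1)[:target_count]

def splice_first_feature(image_feats: torch.Tensor, audio_feats: torch.Tensor) -> torch.Tensor:
    # image_feats: (n_img, d_img), audio_feats: (n_aud, d_aud) for the same playing time period
    n = max(image_feats.size(0), audio_feats.size(0))
    image_feats = align_by_copying(image_feats, n)
    audio_feats = align_by_copying(audio_feats, n)
    return torch.cat([image_feats, audio_feats], dim=-1)   # first feature: (n, d_img + d_aud)
```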
With regard to steps 401 to 404, in some embodiments the server acquires the image features, audio features, and feedback features of the target video from the base data layer and splices them to obtain the first feature. In some embodiments, the image features are spliced with the audio features based on the splicing method described in step 404 to obtain spliced features; in response to a plurality of the spliced features corresponding to one feedback feature, feedback features equal in number to the spliced features are acquired, and each spliced feature is spliced with a corresponding feedback feature to obtain the first feature. In some embodiments, the feedback features equal in number to the spliced features are obtained by copying the feedback feature multiple times. In this embodiment, the feedback features obtained after the target video is delivered to the platform are added; this information reflects the delivery effect, such as how popular the video is in the actual scene, so it improves the reference value of the first feature and makes the subsequent determination of the target video's information more accurate. Accordingly, when processing is combined with the feedback features, the training of each model involved also uses the feedback features of the videos.
In step 405, the server inputs the first feature into a video classification model, which is trained based on the sample video and classification tags, and outputs a predicted classification result for the target video.
In some embodiments, the video classification model employs a convolutional neural network structure, as shown in part 503 in fig. 5, comprising: an input layer, a first convolution layer, a second convolution layer, a max pooling layer, a first fully connected layer, a second fully connected layer, and an output layer. Optionally, the first convolution layer contains S convolution kernels, the second convolution layer contains T convolution kernels, the first fully connected layer comprises X nodes, and the second fully connected layer comprises Y nodes, where S, T, X, and Y are integers greater than 0, e.g., S=100, T=100, X=128, Y=19.
The convolution layers of the video classification model deeply fuse the multiple types of features contained in the first feature; the max pooling layer reduces the size of the input data, accelerating network computation while preventing over-fitting; and the fully connected layers map the feature data produced by the preceding convolution and pooling layers to the sample label space, acting as a classifier and finally achieving classification.
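For illustration, a minimal PyTorch sketch of model 503 with the example sizes S=100, T=100, X=128, Y=19 follows. Treating the spliced first features as a 1-D sequence, the kernel sizes, and the class name are assumptions.

```python
# Hypothetical sketch of the video classification model 503.
import torch
import torch.nn as nn

class VideoClassifier(nn.Module):
    def __init__(self, in_dim: int, num_classes: int = 19):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_dim, 100, kernel_size=3, padding=1), nn.ReLU(),  # first conv layer, S=100
            nn.Conv1d(100, 100, kernel_size=3, padding=1), nn.ReLU(),     # second conv layer, T=100
            nn.AdaptiveMaxPool1d(1),                                      # max pooling layer
        )
        self.fc = nn.Sequential(
            nn.Linear(100, 128), nn.ReLU(),       # first fully connected layer, X=128
            nn.Linear(128, num_classes),          # second fully connected layer, Y=19
        )

    def forward(self, first_feature: torch.Tensor) -> torch.Tensor:
        # first_feature: (batch, seq_len, in_dim) -> class logits: (batch, num_classes)
        x = self.conv(first_feature.transpose(1, 2)).squeeze(-1)
        return self.fc(x)
```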
When training the video classification model, the server obtains training data from the base data stream, the training data including sample videos and corresponding classification labels. Training is carried out through multiple iterations. In any iteration, feature extraction is performed on a sample video to obtain corresponding sample image features and sample speech features, and the two sample features are spliced to obtain a first sample feature. The first sample feature is input into the model to be trained, and whether a training end condition is reached is determined based on the output sample prediction classification result and the classification label. If the training end condition is reached, the model corresponding to this iteration is determined to be the video classification model; otherwise, the model parameters are adjusted and the next iteration is executed based on the adjusted model. Optionally, the training end condition is: the accuracy of the sample prediction classification result is greater than 0.95, or the number of iterations reaches a threshold, at which point training ends.
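A minimal sketch of this iterative training loop is shown below. The optimizer, learning rate, cross-entropy objective, batch-level accuracy check, and data-loader interface are assumptions added for illustration; only the stop conditions (accuracy above 0.95 or an iteration threshold) follow the description above.

```python
# Hypothetical sketch of the iterative training described above.
import torch
import torch.nn as nn

def train_classifier(model, loader, max_iters: int = 10000, target_acc: float = 0.95):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for step, (first_feats, labels) in enumerate(loader):
        logits = model(first_feats)
        loss = loss_fn(logits, labels)
        opt.zero_grad()
        loss.backward()
        opt.step()                       # adjust model parameters for the next iteration
        acc = (logits.argmax(dim=-1) == labels).float().mean().item()
        if acc > target_acc or step + 1 >= max_iters:   # training end condition
            break
    return model
```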
In step 406, the server obtains a difference feature representing a difference between the predicted classification result and the true classification result of the target video.
In some embodiments, the difference feature is obtained by calculating the difference between the predicted classification result and the real classification result of the target video, and it may be represented by a vector. The difference feature represents the gap between the classification result obtained through model prediction and the real classification result. By introducing this difference into the video information determination process, the uncertainty of the prediction is taken into account, which better guides the subsequent determination of the video information.
In step 407, the server splices the first feature and the difference feature to obtain a second feature.
In some embodiments, in response to a plurality of first features corresponding to one difference feature, difference features equal in number to the first features are acquired, and each first feature is spliced with a corresponding difference feature to obtain the second feature. In some embodiments, the difference features equal in number to the first features are obtained by copying the difference feature multiple times. The second feature obtained through this splicing contains not only the multi-modal feature information of the target video but also its classification result information, which better guides the subsequent determination of the video information.
Step 407 above directly splices the already-spliced first feature with the difference feature to obtain the second feature; in some embodiments, the image features, the audio features, and the difference feature are spliced to obtain the second feature. It should be noted that splicing based on these three kinds of features also involves aligning the numbers of the different features, so as to ensure that each image feature can be spliced with one audio feature and one difference feature. In some embodiments, the numbers of the different features are aligned by copying the features that are fewer in number.
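A minimal sketch of steps 406 and 407 is shown below: the difference feature is computed from the predicted and real classification results, copied to match the number of first features, and spliced onto each first feature. Representing both classification results as class-probability (or one-hot) vectors and the function name are assumptions.

```python
# Hypothetical sketch of building the second feature from the first features
# and the difference feature.
import torch

def build_second_feature(first_feats: torch.Tensor,
                         predicted: torch.Tensor,
                         real: torch.Tensor) -> torch.Tensor:
    # first_feats: (n, d); predicted, real: (num_classes,)
    diff = (predicted - real).unsqueeze(0)         # difference feature: (1, num_classes)
    diff = diff.repeat(first_feats.size(0), 1)     # copy to match the n first features
    return torch.cat([first_feats, diff], dim=-1)  # second feature: (n, d + num_classes)
```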
In step 408, the server inputs the second feature into a video information determination model, which is trained based on the sample video and the corresponding sample video information, to obtain video information of the target video.
In some embodiments, the video information determination model employs a convolutional neural network structure, as shown in part 504 in fig. 5, comprising: an input layer, a third convolution layer, a fourth convolution layer, a max pooling layer, a third fully connected layer, a fourth fully connected layer, and an output layer. Optionally, the third convolution layer contains Q convolution kernels, the fourth convolution layer contains R convolution kernels, the third fully connected layer comprises I nodes, and the fourth fully connected layer comprises J nodes, where Q, R, I, and J are integers greater than 0, e.g., Q=100, R=100, I=128, J=1.
The convolution layers in the video information determination model learn the second feature and suppress data that contributes little to classification, thereby optimizing the video information determination model. The max pooling layer reduces the size of the input data, accelerating network computation while preventing over-fitting. The fully connected layers produce the video information of the target video based on the feature data from the preceding convolution and pooling layers.
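For illustration, a minimal PyTorch sketch of model 504 with the example sizes Q=100, R=100, I=128, J=1 follows. As with model 503, the kernel sizes, the treatment of the second feature as a 1-D sequence, and the class name are assumptions.

```python
# Hypothetical sketch of the video information determination model 504.
import torch
import torch.nn as nn

class VideoInfoModel(nn.Module):
    def __init__(self, in_dim: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_dim, 100, kernel_size=3, padding=1), nn.ReLU(),  # third conv layer, Q=100
            nn.Conv1d(100, 100, kernel_size=3, padding=1), nn.ReLU(),     # fourth conv layer, R=100
            nn.AdaptiveMaxPool1d(1),                                      # max pooling layer
        )
        self.fc = nn.Sequential(
            nn.Linear(100, 128), nn.ReLU(),   # third fully connected layer, I=128
            nn.Linear(128, 1),                # fourth fully connected layer, J=1 (video information)
        )

    def forward(self, second_feature: torch.Tensor) -> torch.Tensor:
        # second_feature: (batch, seq_len, in_dim) -> video information: (batch, 1)
        x = self.conv(second_feature.transpose(1, 2)).squeeze(-1)
        return self.fc(x)
```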
When training the video information determination model, the server obtains training data from the base data stream, the training data including sample videos and corresponding sample video information. Optionally, the training data is obtained from the basic feature data. Training is carried out through multiple iterations. In any iteration, a first sample feature and a sample difference feature of the sample video are obtained. The sample difference feature is obtained by calculating the difference between a sample prediction classification result and the classification label, where the sample prediction classification result is obtained by inputting the sample video into the video classification model. The first sample feature and the sample difference feature are spliced to obtain a second sample feature. The second sample feature is input into the model to be trained, and whether a training end condition is reached is determined based on the output sample video prediction information and the sample video information. If the training end condition is reached, the model corresponding to this iteration is determined to be the video information determination model; otherwise, the model parameters are adjusted and the next iteration is executed based on the adjusted model until the training end condition is reached. Optionally, the training end condition is: the bucketing accuracy of the sample video prediction information reaches 0.95 or more, or the number of iterations reaches a threshold, at which point training stops.
In some embodiments, the video information refers to video quality. The above technical solution realizes objective video evaluation; experiments show that the bucketing accuracy of the video information obtained by this solution can reach 0.8 or more, which demonstrates good accuracy and improves the reference value of the video information.
According to the technical solution provided by the embodiments of the present disclosure, a multi-modal feature fusion approach is adopted: the image features and audio features of the target video are spliced to obtain a first feature that reflects multi-modal information about the target video, making the subsequent determination of the target video's information more accurate. The first feature is input into a video classification model to classify the target video and preliminarily determine its classification result, and the classification result of the target video is then spliced with the multi-modal feature to obtain a second feature that describes the target video as a whole.
Fig. 6 is a block diagram illustrating a video information determining apparatus according to an exemplary embodiment. Referring to fig. 6, the apparatus includes an acquisition unit 601, a processing unit 602, and a splicing unit 603.
An acquisition unit 601 configured to acquire a first feature based on image features and audio features of a target video;
a processing unit 602, configured to perform processing on the first feature to obtain a prediction classification result of the target video;
A stitching unit 603 configured to perform stitching the first feature and a difference feature of the target video, to obtain a second feature, where the difference feature is used to represent a difference between a predicted classification result and a real classification result of the target video;
The processing unit 602 is configured to perform processing on the second feature to obtain video information of the target video.
In some embodiments, the stitching unit 603 is configured to perform any of the following:
Splicing the image features of the target video and the audio features to obtain the first features;
And splicing the image characteristics, the audio characteristics and the feedback characteristics of the target video to obtain the first characteristics.
In some embodiments, the stitching unit 603 is configured to obtain the same number of image features as the plurality of audio features in response to detecting that the playing time period corresponding to the image feature is the same as the playing time period corresponding to the plurality of audio features, and stitch each image feature with a corresponding audio feature to obtain the first feature.
In some embodiments, the stitching unit 603 is configured to obtain the same number of audio features as the plurality of image features in response to detecting that the playing time period corresponding to the audio feature is the same as the playing time period corresponding to the plurality of image features, and stitch each image feature with a corresponding audio feature to obtain the first feature.
In some embodiments, the stitching unit 603 is configured to perform stitching the image feature and the audio feature, obtain stitched features, and in response to a plurality of stitched features corresponding to one feedback feature, obtain feedback features equal to the number of stitched features, stitch each stitched feature with one corresponding feedback feature, and obtain the first feature.
In some embodiments, the stitching unit 603 is configured to perform stitching each first feature with a corresponding difference feature to obtain the second feature in response to a plurality of the first features corresponding to a difference feature, obtaining a difference feature equal to the number of the first features.
In some embodiments, the processing unit 602 is configured to perform processing of the first feature by a video classification model, resulting in a predicted classification result of the target video, the video classification model being trained based on the sample video and the corresponding classification labels.
In some embodiments, the processing unit 602 is configured to perform processing of the second feature by a video information determination model, resulting in video information of the target video, the video information determination model being trained based on the sample video and the corresponding sample video information.
It should be noted that: in the video information determining apparatus provided in the above embodiment, only the division of the above functional modules is used for illustration in determining video information, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to perform all or part of the functions described above. In addition, the video information determining apparatus provided in the above embodiment and the video information determining method embodiment belong to the same concept, and the specific implementation process is detailed in the method embodiment, which is not repeated here.
In some embodiments, fig. 7 is a block diagram of a server according to an exemplary embodiment. The server 700 may vary considerably in configuration or performance and may include one or more processors (Central Processing Units, CPU) 701 and one or more memories 702, where at least one piece of program code is stored in the one or more memories 702 and is loaded and executed by the one or more processors 701 to implement the procedure performed by the server in the video information determining method provided in the above embodiments. Of course, the server 700 may also have a wired or wireless network interface, a keyboard, an input/output interface, and other components for implementing the functions of the device, which are not described here.
In an exemplary embodiment, a computer readable storage medium is also provided, e.g. a memory 702 comprising program code, which is executable by the processor 701 of the server 700 to perform the above-described video information determination method. Alternatively, the computer readable storage medium may be a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a Compact-Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product is also provided, comprising a computer program which, when executed by a processor, implements the above-described video information determination method.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following its general principles and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (16)

1. A method for determining video information, the method comprising:
Acquiring a first feature based on image features and audio features of a target video;
Processing the first characteristic to obtain a prediction classification result of the target video;
Responding to the fact that a plurality of first features correspond to one difference feature, obtaining difference features with the same number as the first features, and splicing each first feature with one corresponding difference feature to obtain a second feature, wherein the difference features are used for representing differences between a prediction classification result and a real classification result of the target video;
and processing the second feature to obtain video information of the target video.
2. The video information determination method according to claim 1, wherein the acquiring the first feature based on the image feature and the audio feature of the target video includes any one of:
splicing the image features and the audio features of the target video to obtain the first features;
And splicing the image features, the audio features and the feedback features of the target video to obtain the first features, wherein the feedback features are used for representing feedback acquired by the target video after being put on a platform.
3. The method for determining video information according to claim 2, wherein the stitching the image feature and the audio feature of the target video to obtain the first feature includes:
in response to detecting that the playing time period corresponding to the image feature is the same as the playing time period corresponding to the plurality of audio features, acquiring the image features with the same number as the plurality of audio features;
And splicing each image feature with a corresponding audio feature to obtain the first feature.
4. The method for determining video information according to claim 2, wherein the stitching the image feature and the audio feature of the target video to obtain the first feature includes:
in response to detecting that the playing time period corresponding to the audio feature is the same as the playing time period corresponding to the plurality of image features, acquiring the audio features with the same number as the plurality of image features;
And splicing each image feature with a corresponding audio feature to obtain the first feature.
5. The method for determining video information according to claim 2, wherein the stitching the image feature, the audio feature, and the feedback feature of the target video to obtain the first feature includes:
splicing the image features and the audio features to obtain spliced features;
And responding to the fact that the plurality of spliced features correspond to one feedback feature, acquiring feedback features with the same number as the spliced features, and splicing each spliced feature with one corresponding feedback feature to obtain the first feature.
6. The method for determining video information according to claim 1, wherein the processing the first feature to obtain the prediction classification result of the target video includes:
And processing the first characteristic through a video classification model to obtain a prediction classification result of the target video, wherein the video classification model is obtained based on sample video and corresponding classification label training.
7. The method for determining video information according to claim 1, wherein said processing the second feature to obtain the video information of the target video includes:
And processing the second feature through a video information determining model to obtain video information of the target video, wherein the video information determining model is obtained based on sample video and corresponding sample video information training.
8. A video information determining apparatus, characterized in that the apparatus comprises:
An acquisition unit configured to acquire a first feature based on image features and audio features of a target video;
The processing unit is configured to execute processing on the first feature to obtain a prediction classification result of the target video;
The splicing unit is configured to perform the steps of responding to the fact that a plurality of first features correspond to one difference feature, acquiring difference features which are equal to the first features in number, splicing each first feature with one corresponding difference feature to obtain a second feature, wherein the difference features are used for representing differences between a prediction classification result and a real classification result of the target video;
the processing unit is configured to perform processing on the second feature to obtain video information of the target video.
9. The video information determination apparatus according to claim 8, wherein the splicing unit is configured to perform any one of:
splicing the image features and the audio features of the target video to obtain the first features;
and splicing the image features, the audio features and the feedback features of the target video to obtain the first features.
10. The video information determination apparatus according to claim 9, wherein the stitching unit is configured to perform stitching each image feature with a corresponding one of the audio features to obtain the first feature in response to detecting that a playback time period corresponding to the image feature is the same as a playback time period corresponding to a plurality of audio features.
11. The video information determination apparatus according to claim 9, wherein the stitching unit is configured to perform stitching each image feature with a corresponding audio feature to obtain the first feature in response to detecting that a playback time period corresponding to the audio feature is the same as a playback time period corresponding to a plurality of image features, and to acquire the same number of audio features as the plurality of image features.
12. The apparatus according to claim 9, wherein the stitching unit is configured to perform stitching the image feature with the audio feature, obtain stitched features, obtain feedback features equal to the number of stitched features in response to a plurality of stitched features corresponding to one feedback feature, and stitch each stitched feature with one corresponding feedback feature to obtain the first feature.
13. The video information determining apparatus according to claim 8, wherein the processing unit is configured to perform processing of the first feature by a video classification model, which is trained based on a sample video and a corresponding classification label, to obtain a prediction classification result of the target video.
14. The video information determining apparatus according to claim 8, wherein the processing unit is configured to perform processing of the second feature by a video information determination model, which is trained based on a sample video and corresponding sample video information, to obtain video information of the target video.
15. An electronic device, the electronic device comprising:
one or more processors;
a memory for storing the processor-executable program code;
Wherein the processor is configured to execute the program code to implement the video information determination method of any one of claims 1 to 7.
16. A computer readable storage medium, characterized in that program code in the computer readable storage medium, when executed by a processor of an electronic device, enables the electronic device to perform the video information determination method of any one of claims 1 to 7.
CN202110512985.6A 2021-05-11 2021-05-11 Video information determining method, device, electronic equipment and storage medium Active CN113240004B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110512985.6A CN113240004B (en) 2021-05-11 2021-05-11 Video information determining method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110512985.6A CN113240004B (en) 2021-05-11 2021-05-11 Video information determining method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113240004A (en) 2021-08-10
CN113240004B (en) 2024-04-30

Family

ID=77133541

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110512985.6A Active CN113240004B (en) 2021-05-11 2021-05-11 Video information determining method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113240004B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109359636A (en) * 2018-12-14 2019-02-19 腾讯科技(深圳)有限公司 Video classification methods, device and server
CN110688528A (en) * 2019-09-26 2020-01-14 北京字节跳动网络技术有限公司 Method, apparatus, electronic device, and medium for generating classification information of video
CN111460223A (en) * 2020-02-25 2020-07-28 天津大学 Short video single-label classification method based on multi-mode feature fusion of deep network
CN111563462A (en) * 2020-05-11 2020-08-21 广东博智林机器人有限公司 Image element detection method and device
CN112328830A (en) * 2019-08-05 2021-02-05 Tcl集团股份有限公司 Information positioning method based on deep learning and related equipment
CN112364810A (en) * 2020-11-25 2021-02-12 深圳市欢太科技有限公司 Video classification method and device, computer readable storage medium and electronic equipment
CN112749300A (en) * 2021-01-22 2021-05-04 北京百度网讯科技有限公司 Method, apparatus, device, storage medium and program product for video classification

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109359636A (en) * 2018-12-14 2019-02-19 腾讯科技(深圳)有限公司 Video classification methods, device and server
CN111428088A (en) * 2018-12-14 2020-07-17 腾讯科技(深圳)有限公司 Video classification method and device and server
CN112328830A (en) * 2019-08-05 2021-02-05 Tcl集团股份有限公司 Information positioning method based on deep learning and related equipment
CN110688528A (en) * 2019-09-26 2020-01-14 北京字节跳动网络技术有限公司 Method, apparatus, electronic device, and medium for generating classification information of video
CN111460223A (en) * 2020-02-25 2020-07-28 天津大学 Short video single-label classification method based on multi-mode feature fusion of deep network
CN111563462A (en) * 2020-05-11 2020-08-21 广东博智林机器人有限公司 Image element detection method and device
CN112364810A (en) * 2020-11-25 2021-02-12 深圳市欢太科技有限公司 Video classification method and device, computer readable storage medium and electronic equipment
CN112749300A (en) * 2021-01-22 2021-05-04 北京百度网讯科技有限公司 Method, apparatus, device, storage medium and program product for video classification

Also Published As

Publication number Publication date
CN113240004A (en) 2021-08-10

Similar Documents

Publication Publication Date Title
CN110781347B (en) Video processing method, device and equipment and readable storage medium
US20230012732A1 (en) Video data processing method and apparatus, device, and medium
CN110582025B (en) Method and apparatus for processing video
WO2022022152A1 (en) Video clip positioning method and apparatus, and computer device and storage medium
CN108121800B (en) Information generation method and device based on artificial intelligence
CN111798879B (en) Method and apparatus for generating video
CN111626049B (en) Title correction method and device for multimedia information, electronic equipment and storage medium
CN112533051A (en) Bullet screen information display method and device, computer equipment and storage medium
CN113377971A (en) Multimedia resource generation method and device, electronic equipment and storage medium
CN110009059B (en) Method and apparatus for generating a model
CN114282047A (en) Small sample action recognition model training method and device, electronic equipment and storage medium
CN113766299A (en) Video data playing method, device, equipment and medium
CN111897950A (en) Method and apparatus for generating information
CN112259078A (en) Method and device for training audio recognition model and recognizing abnormal audio
CN113962965B (en) Image quality evaluation method, device, equipment and storage medium
CN114973086A (en) Video processing method and device, electronic equipment and storage medium
CN113641835B (en) Multimedia resource recommendation method and device, electronic equipment and medium
CN111259245A (en) Work pushing method and device and storage medium
CN112995690B (en) Live content category identification method, device, electronic equipment and readable storage medium
CN111597361B (en) Multimedia data processing method, device, storage medium and equipment
CN113240004B (en) Video information determining method, device, electronic equipment and storage medium
CN116261009B (en) Video detection method, device, equipment and medium for intelligently converting video audience
CN116980665A (en) Video processing method, device, computer equipment, medium and product
CN112749553B (en) Text information processing method and device for video file and server
CN113824950A (en) Service processing method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant