CN115131698A - Video attribute determination method, device, equipment and storage medium - Google Patents

Video attribute determination method, device, equipment and storage medium

Info

Publication number
CN115131698A
Authority
CN
China
Prior art keywords
target
video
features
feature
attribute
Prior art date
Legal status
Granted
Application number
CN202210578444.8A
Other languages
Chinese (zh)
Other versions
CN115131698B (en)
Inventor
胡益珲
岑杰鹏
杨伟东
祁雷
马锴
陈宇
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210578444.8A priority Critical patent/CN115131698B/en
Publication of CN115131698A publication Critical patent/CN115131698A/en
Application granted granted Critical
Publication of CN115131698B publication Critical patent/CN115131698B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/65 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a video attribute determination method, apparatus, device, and storage medium, which can be applied to various scenarios such as cloud technology, artificial intelligence, intelligent transportation, and the Internet of Vehicles. The method includes: acquiring a target video frame of a target video; performing object feature extraction on the target video frame to obtain target image attribute features of a target object in the target video; performing scene feature extraction on the target video frame to obtain target scene features of the target video; fusing the target image attribute features and the target scene features to obtain target fusion features; and determining target attribute information of the target video according to the target fusion features. The method and the device improve the accuracy of the determined attribute information.

Description

Video attribute determination method, device, equipment and storage medium
Technical Field
The present application relates to the field of internet technologies, and in particular, to a method, an apparatus, a device, and a storage medium for determining video attributes.
Background
In the related art, an image model pre-trained on a public dataset is used directly for feature extraction, which yields visual features tied to coarse-grained classification tasks rather than to the fine-grained business data. Moreover, multi-modal information is not comprehensively modeled and exploited: recognition usually relies on visual information alone, a sufficiently discriminative feature extractor is lacking, so fine-grained business data labels are difficult to distinguish, and no information about the scene context is extracted. As a result, accuracy and recall are difficult to guarantee.
Disclosure of Invention
The application provides a method, a device, equipment and a storage medium for determining video attributes, which can improve the accuracy of determined attribute information.
In one aspect, the present application provides a method for determining video attributes, where the method includes:
acquiring a target video frame of a target video;
performing object feature extraction on the target video frame to obtain target image attribute features of a target object in the target video;
carrying out scene feature extraction on the target video frame to obtain target scene features of the target video;
performing fusion processing on the target image attribute features and the target scene features to obtain target fusion features;
and determining target attribute information of the target video according to the target fusion characteristics.
Another aspect provides a video attribute determination apparatus, the apparatus comprising:
the target video frame acquisition module is used for acquiring a target video frame of a target video;
the target image attribute feature determining module is used for extracting object features of the target video frames to obtain target image attribute features of target objects in the target video;
the target scene characteristic determining module is used for extracting scene characteristics of the target video frame to obtain target scene characteristics of the target video;
the target fusion characteristic determining module is used for performing fusion processing on the target image attribute characteristics and the target scene characteristics to obtain target fusion characteristics;
and the target attribute information determining module is used for determining the target attribute information of the target video according to the target fusion characteristics.
Another aspect provides a video attribute determining apparatus, which includes a processor and a memory, where at least one instruction or at least one program is stored in the memory, and the at least one instruction or the at least one program is loaded by the processor and executed to implement the video attribute determining method as described above.
Another aspect provides a computer storage medium storing at least one instruction or at least one program, the at least one instruction or the at least one program being loaded and executed by a processor to implement the video attribute determination method as described above.
Another aspect provides a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the video attribute determination method described above.
The video attribute determining method, device, equipment and storage medium provided by the application have the following technical effects:
the method comprises the steps of obtaining a target video frame of a target video; performing object feature extraction on the target video frame to obtain target image attribute features of a target object in the target video; performing scene feature extraction on the target video frame to obtain target scene features of the target video; performing fusion processing on the target image attribute features and the target scene features to obtain target fusion features; and determining target attribute information of the target video according to the target fusion characteristics. According to the method and the device, in the process of determining the video attribute information, the scene characteristics of the video are fused, the fusion characteristics are determined through the target image attribute characteristics and the target scene characteristics, and the accuracy of the determined attribute information is improved.
Drawings
In order to more clearly illustrate the technical solutions and advantages of the embodiments or the prior art of the present application, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the description below are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a schematic diagram of a video attribute determining system according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a video attribute determining method according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a method for extracting object features of the target video frame to obtain target image attribute features of a target object in the target video according to an embodiment of the present application;
fig. 4 is a schematic flowchart of a method for extracting object features of the second target object to obtain second target image attribute features according to an embodiment of the present application;
fig. 5 is a schematic flowchart of a method for extracting scene features of the target video frame to obtain target scene features of the target video according to the embodiment of the present application;
fig. 6 is a schematic diagram of a directed acyclic graph constructed based on a target video frame according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a transformer model provided in an embodiment of the present application;
fig. 8 is a schematic structural diagram of an MLP network according to an embodiment of the present application;
fig. 9 is a flowchart illustrating a method for determining attribute information of a gourmet video according to an embodiment of the present disclosure;
fig. 10 is a page where a terminal displays a target video and target attribute information according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a video attribute determining apparatus according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making creative efforts shall fall within the protection scope of the present application.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence base technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
Computer Vision technology (CV) is a science that studies how to make machines "see": cameras and computers are used in place of human eyes to recognize and measure targets, and the images are further processed so that they become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also include common biometric technologies such as face recognition and fingerprint recognition.
Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how computers can simulate or realize human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence; it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formula learning.
The scheme provided by the embodiment of the application relates to the computer vision technology of artificial intelligence, the machine learning technology and the like, and is specifically explained by the following embodiment.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the accompanying drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be implemented in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Referring to fig. 1, fig. 1 is a schematic diagram of a video attribute determining system according to an embodiment of the present disclosure, and as shown in fig. 1, the video attribute determining system may include at least a server 01 and a client 02.
Specifically, in this embodiment, the server 01 may include a server that operates independently, or a distributed server, or a server cluster including a plurality of servers, and may also be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), and a big data and artificial intelligence platform. The server 01 may comprise a network communication unit, a processor, a memory, etc. Specifically, the server 01 may be configured to obtain a target video frame of a target video; extracting object features of the target video frame to obtain target image attribute features of a target object in the target video; performing scene feature extraction on the target video frame to obtain target scene features of the target video; fusing the target image attribute features and the target scene features to obtain target fusion features; and determining target attribute information of the target video according to the target fusion characteristics.
Specifically, in this embodiment, the client 02 may include a smart phone, a desktop computer, a tablet computer, a notebook computer, a digital assistant, a smart wearable device, a smart speaker, a vehicle-mounted terminal, a smart television, and other types of physical devices, or may include software running in the physical devices, such as a web page provided by some service providers to a user, or may be an application provided by the service providers to the user. Specifically, the client 02 may be configured to display target attribute information of a target video.
A video attribute determination method of the present application is described below, and fig. 2 is a schematic flow chart of a video attribute determination method provided in an embodiment of the present application, and the present specification provides method operation steps as described in the embodiment or the flow chart, but more or less operation steps may be included based on conventional or non-inventive labor. The order of steps recited in the embodiments is merely one manner of performing the steps in a multitude of orders and does not represent the only order of execution. In practice, the system or server product may be implemented in a sequential or parallel manner (e.g., parallel processor or multi-threaded environment) according to the embodiments or methods shown in the figures. Specifically, as shown in fig. 2, the method may include:
s201: and acquiring a target video frame of the target video.
In the embodiment of the application, video frames may be extracted from the target video to obtain the target video frames, and the number of target video frames may be at least two. The target video may be a video showing how a target object is made, and the target video frames contain the target object: for a gourmet (food) video, the target object may include the raw ingredients and the finished dish; for a handicraft video, the target object may include the raw materials and the finished work. The target video frames may be extracted in two modes: uniform frame sampling and interval frame sampling. Uniform frame sampling refers to segmented sparse sampling, in which N frames are sampled uniformly from the target video to form a video frame set; interval frame sampling refers to acquiring one frame at a fixed time interval, for example every 1 second. The target video frame may be a digital image or a frequency-domain image.
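As an illustration of the two sampling modes, a minimal sketch follows. It assumes OpenCV is available and that the video can be read frame by frame; the function names and the default parameters are hypothetical and not part of the patent.

```python
# Illustrative sketch of uniform frame sampling and interval frame sampling.
import cv2
import numpy as np

def uniform_sample_frames(video_path, num_frames=8):
    """Segmented sparse sampling: pick num_frames indices spread evenly over the video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

def interval_sample_frames(video_path, interval_sec=1.0):
    """Interval sampling: keep one frame every interval_sec seconds."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(int(round(fps * interval_sec)), 1)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```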
S203: and extracting object features of the target video frame to obtain target image attribute features of a target object in the target video.
In the embodiment of the application, object feature extraction may be performed on the target video frame to obtain the target image attribute features of the target object. The target image attribute features may include the RGB information and timing information of the target video frame. RGB refers to the three primary colors of light: R for red, G for green, and B for blue. Any color visible to the naked eye in nature can be produced by mixing and superimposing these three colors, which is why RGB is also called an additive color model.
In this embodiment of the present application, the performing object feature extraction on the target video frame to obtain a target image attribute feature of a target object in the target video includes:
and extracting object features of the target video frame to obtain target raw material features, target naming features and target category features in the target video.
In the embodiment of the present application, for a target video for creating a target object, the target image attribute feature may include a target raw material feature, a target naming feature, a target category feature, and the like.
In this embodiment of the present application, as shown in fig. 3, the performing object feature extraction on the target video frame to obtain a target image attribute feature of a target object in the target video includes:
s2031: determining the structural integrity and definition of at least two target objects in the target video;
in the embodiment of the application, a plurality of target objects can be included in a target video frame, and the completeness and definition of each target object in the video frame are different.
S2033: determining an object with structural integrity greater than a first threshold and definition greater than a second threshold as a first target object, and determining an object except the first target object in the at least two target objects as a second target object;
in the embodiment of the application, the target objects can be classified according to the integrity and the definition of each target object in the target video frame; and feature extraction is performed respectively. For a first target object, object features of the first target object can be extracted through a supervised training model; for the second target object, the object features thereof can be extracted through an unsupervised training model.
S2035: performing object feature extraction on the first target object to obtain a first target image attribute feature;
in this embodiment of the application, the performing object feature extraction on the first target object to obtain a first target image attribute feature may include:
and performing object feature extraction on the first target object based on a supervised feature extraction model to obtain a first target image attribute feature.
In some embodiments, the method of training the supervised feature extraction model comprises:
acquiring a training video frame of a training video; the training video frame is marked with a training image attribute feature label of a training object;
in the embodiment of the application, the training video and the target video can be the same type of video, and the extraction mode of the training video frame is the same as that of the target video frame.
Performing image attribute feature extraction training on a preset machine learning model based on the training video frame so as to adjust model parameters of the preset machine learning model until a training image attribute feature label output by the preset machine learning model is matched with a labeled training image attribute feature label;
and taking a preset machine learning model corresponding to the model parameters when the output training image attribute feature labels are matched with the labeled training image attribute feature labels as the supervised feature extraction model.
In this embodiment of the application, the preset machine learning model may be a Video Swin Transformer. The overall architecture of its backbone network is largely the same as that of the Swin Transformer, with one additional temporal dimension, so the patch partition also uses a patch size along the temporal dimension. The Video Swin Transformer comprises three parts: video to token, model stages, and head.
Video to token: whereas image to token (converting an image into tokens) groups 4 × 4 image blocks, video to token (converting a video into tokens) groups 2 × 4 × 4 video blocks, which are then passed through linear embedding and position embedding.
Model stages: the model consists of multiple repeated stages, each of which includes a Video Swin Transformer Block and a Patch Merging layer.
1) The Video Swin Transformer Block can be divided into two parts: Video W-MSA and Video SW-MSA. This is equivalent to extending the Swin Transformer Block computation from two dimensions to three.
2) Patch merging merges the features of adjacent tokens (within a 2 × 2 window) and then reduces the dimension with a linear layer. This reduces the number of tokens by a factor of 4, while the feature dimension, rather than staying fixed, is doubled after each patch merging, similar to how the number of channels grows as feature maps shrink in convolutional neural networks (CNNs). The number of video frames is unchanged by each patch merging.
Head: after the model stages, high-dimensional features of the multi-frame data are obtained; if these features are used for video classification, a simple frame fusion (averaging) is applied and a head is used for the final prediction.
In the embodiment of the application, the target image attribute characteristics of the target object in the target video frame can be rapidly and accurately extracted through the supervised characteristic extraction model.
S2037: performing object feature extraction on the second target object to obtain second target image attribute features;
in this embodiment of the application, as shown in fig. 4, the performing object feature extraction on the second target object to obtain a second target image attribute feature includes:
S20371: performing self-reconstruction feature extraction on the second target object to obtain a target self-reconstruction feature;
in this embodiment of the present application, the performing self-reconstruction feature extraction on the second target object to obtain a target self-reconstruction feature includes:
performing self-reconstruction feature extraction on the second target object based on a self-reconstruction feature extraction model to obtain the target self-reconstruction feature.
In an embodiment of the present application, the training method for the self-reconstruction feature extraction model includes:
dividing a sample video frame into at least two grid images;
in embodiments of the present application, the sample video frames are determined based on a sample video, which may be the same or different from the training video.
Performing image processing on at least one grid image to obtain a processed video frame; the image processing comprises at least one of swapping the positions of grid images and occluding part of a grid image;
in the embodiment of the application, one grid image may be occluded while the positions of two other grid images are swapped at the same time, thereby obtaining the processed video frame.
In this embodiment of the present application, the performing image processing on at least one grid image to obtain a processed video frame includes:
performing first image processing on at least one grid image during the first round of model training; the first image processing includes at least one of swapping the positions of a first number of grid images and occluding a first percentage of the area of any grid image;
performing Nth image processing on at least one grid image during the Nth round of model training; the Nth image processing includes at least one of swapping the positions of an Nth number of grid images and occluding an Nth percentage of the area of any grid image; the Nth number is greater than the (N-1)th number, and the Nth percentage is greater than the (N-1)th percentage, where N = 2, 3, ..., and N is a positive integer.
Based on the processed video frame, performing self-reconstruction feature extraction training on a first preset model to obtain self-reconstruction features;
in an embodiment of the present application, the first preset model may be an encoder (encoder) model.
Based on the self-reconstruction characteristics, performing image reconstruction training on a second preset model to obtain a reconstructed video frame;
in an embodiment of the present application, the second predetermined model may be a decoder (decoder) model.
In the training process, continuously adjusting first model parameters of a first preset model and second model parameters of a second preset model until a reconstructed video frame output by the second preset model is matched with the sample video frame;
and taking the first preset model corresponding to the current first model parameters as the self-reconstruction feature extraction model, where the current first model parameters are the model parameters of the first preset model at the moment when the reconstructed video frame output by the second preset model matches the sample video frame.
In the embodiment of the application, in the model training process, the sample video frame can be used as a supervisory signal for training, and the interchange and occlusion degree is gradually increased along with the progress of the training process. For example, the occlusion area for a grid image is 10% during the first training and 20% during the second training; and so on; the interchange of the grid images can be increased from the interchange of two grid images to the interchange of four grid images, so that the accuracy of the model is improved.
In the embodiment of the application, the target self-reconstruction characteristics in the target video frame can be extracted through the self-reconstruction characteristic extraction model, so that the video attribute information is determined by combining the characteristics, and the accuracy of the video attribute information is improved.
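A hedged sketch of this self-reconstruction pretext task follows: grid cells of a sample frame are swapped and partially occluded, an encoder-decoder pair is trained to restore the original frame, and the sample frame itself serves as the supervisory signal. The grid size, curriculum schedule, and the encoder/decoder modules are illustrative assumptions.

```python
# Sketch of grid shuffle + occlusion corruption and the reconstruction training step.
import torch
import torch.nn.functional as F

def corrupt_frame(frame, grid=4, num_swaps=1, mask_ratio=0.1):
    """frame: (C, H, W). Split into grid x grid cells, swap some cells, occlude part of one cell."""
    c, h, w = frame.shape
    gh, gw = h // grid, w // grid
    cells = frame.unfold(1, gh, gh).unfold(2, gw, gw)            # (C, grid, grid, gh, gw)
    cells = cells.permute(1, 2, 0, 3, 4).reshape(grid * grid, c, gh, gw).clone()
    for _ in range(num_swaps):                                   # swap two random cells
        i, j = torch.randint(0, grid * grid, (2,)).tolist()
        cells[[i, j]] = cells[[j, i]]
    k = torch.randint(0, grid * grid, (1,)).item()               # occlude a fraction of one cell
    cells[k, :, : int(gh * mask_ratio), :] = 0.0
    rows = [torch.cat(list(cells[r * grid:(r + 1) * grid]), dim=2) for r in range(grid)]
    return torch.cat(rows, dim=1)                                # (C, H, W)

def reconstruction_step(encoder, decoder, frames, epoch):
    """frames: (B, C, H, W) sample frames; corruption strength grows with the training round."""
    corrupted = torch.stack([corrupt_frame(f, num_swaps=1 + epoch, mask_ratio=0.1 * (epoch + 1))
                             for f in frames])
    recon = decoder(encoder(corrupted))
    return F.mse_loss(recon, frames)                             # the sample frame is the target
```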
S20373: extracting attribute description features of the second target object to obtain target description features;
In the embodiment of the application, attribute description feature extraction may be performed on the second target object through an attribute description feature extraction model to obtain the target description feature; the attribute description feature may be used to describe an attribute of the second target object. The attribute description feature extraction model may be obtained by training a CLIP model on image-text matching. CLIP (Contrastive Language-Image Pre-training) is a neural network trained on a variety of (image, text) pairs; given natural-language instructions, it can predict the most relevant text snippet for a given image without being directly optimized for that task.
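A minimal sketch of such a description-feature branch is shown below, assuming a pretrained CLIP image encoder from Hugging Face; the checkpoint name and function name are assumptions, not part of the patent.

```python
# Sketch: CLIP image embeddings used as attribute description features for object crops.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def attribute_description_features(object_crops):
    """object_crops: list of PIL images containing the second target object."""
    inputs = processor(images=object_crops, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)       # (N, 512) image embeddings
    return feats / feats.norm(dim=-1, keepdim=True)      # L2-normalise, CLIP-style
```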
S20375: and taking the target self-reconstruction characteristic and the target description characteristic as the second target image attribute characteristic.
In the embodiment of the present application, the object self-reconstruction feature and the object description feature may be the same type of feature or different types of features.
S2039: and taking the first target image attribute feature and the second target image attribute feature as the target image attribute features.
In the embodiment of the present application, a first target image attribute feature of a first target object and a second target image attribute feature of a second target object may be used as the target image attribute features.
S205: and extracting scene characteristics of the target video frame to obtain target scene characteristics of the target video.
In the embodiment of the application, the scene characteristics are characteristics corresponding to scenes in the target video; in a gourmet video, scene features may include, but are not limited to, corresponding features of a table, a pan, a knife, and so on.
In this embodiment of the present application, as shown in fig. 5, the performing scene feature extraction on the target video frame to obtain a target scene feature of the target video includes:
s2051: determining at least two target associated objects of the target objects based on the target video frame;
in the embodiment of the application, the target associated object is used for representing a scene of a target video, for example, in a food video, the target object is a raw material, a finished product, and the like of food; the target associated object comprises a dining table, a pot, a knife and the like.
S2053: constructing a directed acyclic graph by taking the associated object characteristics corresponding to the at least two target associated objects as nodes; the edges in the directed acyclic graph represent the similarity between two associated object features corresponding to the edges;
in the embodiment of the application, a directed acyclic graph can be constructed through a plurality of target video frames; nodes in the directed acyclic graph represent associated object features, and edges represent the similarity between two associated object features corresponding to the edges.
S2055: and extracting scene features based on the directed acyclic graph to obtain the target scene features.
In the embodiment of the present application, the scene features in the video may be extracted through a directed acyclic Graph, and specifically, the scene features in the directed acyclic Graph may be extracted through a Graph Convolutional neural network (GCN) to obtain the target scene features.
In a specific embodiment, as shown in fig. 6, fig. 6 is a schematic diagram of a directed acyclic graph constructed based on target video frames, where the graph corresponds to the video frames between times t-k and t-M. Scene feature extraction is performed on the directed acyclic graph through the GCN to obtain the target scene features; finally, after the target scene features and the target image attribute features are fused, the result is input into an attribute information prediction model to obtain the target attribute information of the target video. The attribute information prediction model may be obtained by training a multilayer perceptron (MLP) network.
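The following is a hedged sketch of this scene branch: associated-object features are the nodes, the adjacency is built from pairwise similarities, and a small GCN layer pools them into a scene feature. The layer size, similarity threshold, and the way the edge directions are obtained are assumptions for illustration.

```python
# Sketch of a single GCN layer over a similarity graph of associated-object features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, node_feats, adj):                  # node_feats: (N, D), adj: (N, N)
        a_hat = adj + torch.eye(adj.size(0))             # add self-loops
        deg = a_hat.sum(dim=1)
        d_inv_sqrt = torch.diag(deg.clamp(min=1e-6).pow(-0.5))
        norm_adj = d_inv_sqrt @ a_hat @ d_inv_sqrt       # symmetric normalisation
        return F.relu(norm_adj @ self.weight(node_feats))

def scene_feature(assoc_obj_feats, sim_threshold=0.5):
    """assoc_obj_feats: (N, D) features of the target associated objects across frames."""
    sim = F.cosine_similarity(assoc_obj_feats.unsqueeze(1), assoc_obj_feats.unsqueeze(0), dim=-1)
    adj = torch.triu(sim * (sim > sim_threshold), diagonal=1)   # keep only forward edges
    gcn = SimpleGCNLayer(assoc_obj_feats.size(1), 256)
    return gcn(assoc_obj_feats, adj).mean(dim=0)         # pooled target scene feature
```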
In an embodiment of the present application, the method further includes:
acquiring a target audio and a target text corresponding to the target video;
in the embodiment of the application, the target video can be analyzed, and the corresponding audio signal is extracted to obtain the target audio; extracting text information corresponding to the target video through OCR (Optical Character Recognition) to obtain a target text.
Extracting object features of the target audio to obtain target audio attribute features of the target object;
In the embodiment of the application, object feature extraction may be performed on the target audio through an audio attribute feature extraction model to obtain the target audio attribute feature of the target object. The audio attribute feature extraction model may be obtained by training a VGGish network and finally yields a target audio attribute feature of fixed dimensionality: training on a large video dataset produces a VGG-like model that generates a 128-dimensional embedding.
The VGG model based on TensorFlow (a symbolic mathematics system based on dataflow programming) is called VGGish. VGGish supports extracting a 128-dimensional embedding feature vector with semantics from an audio waveform. "VGG" stands for the Oxford Visual Geometry Group at Oxford University, a group whose research ranges from machine learning to mobile robotics.
The VGG model has the following characteristics:
(1) small convolution kernels (3 × 3 convolutions);
(2) small pooling kernels (2 × 2);
(3) more layers and wider feature maps. Building on the first two points, the convolutions focus on expanding the number of channels while the pooling focuses on shrinking the width and height, so the architecture becomes deeper and wider while the growth in computation is slowed;
(4) fully connected layers converted into convolutions. In the network testing stage, the three fully connected layers used in training are replaced by three convolutional layers that reuse the training parameters; the resulting fully convolutional network is no longer constrained by fully connected layers and can therefore accept inputs of arbitrary width or height.
In the embodiment of the application, the target audio attribute feature of the target object can be rapidly and accurately extracted through the audio attribute feature extraction model.
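A hedged sketch of a VGGish-style audio branch is given below: a small VGG-like CNN over log-mel spectrogram patches ending in a 128-dimensional embedding. The layer configuration is illustrative and not the published VGGish weights or the patent's exact model.

```python
# Sketch of a VGG-like audio embedding network with a fixed 128-d output.
import torch.nn as nn

class VGGishLike(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        def block(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
                                 nn.MaxPool2d(2))                     # 3x3 convs, 2x2 pooling
        self.features = nn.Sequential(block(1, 64), block(64, 128),
                                      block(128, 256), block(256, 512))
        self.embedding = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                       nn.Linear(512, embed_dim))     # fixed 128-d embedding

    def forward(self, log_mel):                  # log_mel: (B, 1, 96, 64) spectrogram patches
        return self.embedding(self.features(log_mel))
```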
And extracting object features of the target text to obtain the attribute features of the target text of the target object.
In the embodiment of the application, the text information of a video mainly comes from its title and from the OCR recognition results in the video; a BERT model is used to extract the corresponding text attribute features from these two kinds of text. BERT is a pre-trained model proposed by the Google AI institute in October 2018; its full name is Bidirectional Encoder Representations from Transformers. The target text attribute features of the target object can be extracted quickly through the BERT model.
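A minimal sketch of this text branch follows, assuming a pretrained Chinese BERT checkpoint from Hugging Face and using the [CLS] representation as the text attribute feature; the checkpoint name and pooling choice are assumptions.

```python
# Sketch: encoding the video title and OCR text with BERT.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

def text_attribute_features(title, ocr_text):
    inputs = tokenizer(title, ocr_text, truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state[:, 0]       # (1, 768) [CLS] feature
```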
S207: and performing fusion processing on the target image attribute characteristics and the target scene characteristics to obtain target fusion characteristics.
In this embodiment of the present application, the performing fusion processing on the target image attribute feature and the target scene feature to obtain a target fusion feature includes:
and fusing the target image attribute feature, the target scene feature, the target audio attribute feature and the target text attribute feature to obtain the target fusion feature.
In this embodiment of the present application, the performing fusion processing on the target image attribute feature, the target scene feature, the target audio attribute feature, and the target text attribute feature to obtain the target fusion feature includes:
splicing the target image attribute feature, the target scene feature, the target audio attribute feature and the target text attribute feature to obtain a target splicing feature;
and performing feature fusion processing on the target splicing features based on a feature fusion model to obtain the target fusion features.
In an embodiment of the present application, the training method of the feature fusion model includes:
acquiring a sample splicing characteristic marked with a sample fusion characteristic;
in the embodiment of the application, the construction method of the sample splicing feature is the same as that of the target splicing feature.
Training a preset multi-mode model based on the sample splicing characteristics to adjust model parameters of the preset multi-mode model until sample fusion characteristics output by the preset multi-mode model are matched with the marked sample fusion characteristics;
In this embodiment, the preset multi-modal model may be a transformer network model, and the transformer model employs a memory bank (information storage) mechanism.
And taking the preset multi-modal model when the output sample fusion characteristics are matched with the labeled sample fusion characteristics as the characteristic fusion model.
In the embodiment of the application, the different features are features of different dimensions extracted under different supervisory signals and by different models. Their latent-space distributions therefore differ, and fusing them directly would lead to inconsistent feature expression and ultimately harm the recognition performance of the model. The transformer and memory bank mechanisms are used to align the distribution differences between modalities. First, the features are concatenated in sequence, a separator symbol is added at each junction, and the result is input into the transformer model. As shown in fig. 7, which is a schematic structural diagram of the transformer model, each square is a concatenated feature. The transformer model applies self-attention learning to the concatenated multi-modal features through a multi-head attention mechanism and computes similarities with the features in the memory bank to adjust the relative weight of the different modalities. Finally, the fused features are stored in the memory bank and output to the next layer of the network.
The transformer model may use a scaled dot-product attention mechanism with two sequences X and Y: sequence X provides the query information Q, and sequence Y provides the key information K and the value information V. Q is the query vector of a token, K is the vector that is "looked up", and V is the content vector. Q is best suited to searching for a target, K is best suited to being searched, and V carries the content; the three are not necessarily identical, so the network sets up three separate vectors and learns the most suitable Q, K, and V to strengthen its capability.
The model comprises matrix multiplication layers (MatMul), a scale layer, a mask layer, and a normalized exponential function (Softmax) layer, where the matrix multiplication layers include a first matrix multiplication layer and a second matrix multiplication layer. Q, K, and V of the current concatenated features are obtained; Q and K of the current concatenated features are input into the first matrix multiplication layer, and V of the current concatenated features is input into the second matrix multiplication layer. The output of the first matrix multiplication layer is connected to the input of the scale layer, the output of the scale layer is connected to the input of the mask layer, the output of the mask layer is connected to the input of the Softmax layer, and the output of the Softmax layer is connected to the input of the second matrix multiplication layer. The output of the second matrix multiplication layer is then used as the Q of the features stored in the memory bank, while K and V of the stored features are obtained and input into the first and second matrix multiplication layers respectively for model training.
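The following is a hedged sketch of this fusion step: the concatenated multi-modal tokens attend over a memory bank of previously fused features via scaled dot-product attention, and the fused output is written back to the bank. The feature dimension, bank size, and update rule are illustrative assumptions, and the mask layer is omitted for brevity.

```python
# Sketch of scaled dot-product attention against a memory bank of fused features.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryBankFusion(nn.Module):
    def __init__(self, dim=512, bank_size=1024):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)        # Q from the current concatenated features
        self.k_proj = nn.Linear(dim, dim)        # K, V from the memory bank
        self.v_proj = nn.Linear(dim, dim)
        self.register_buffer("bank", torch.randn(bank_size, dim))
        self.ptr = 0

    def forward(self, concat_feats):             # concat_feats: (num_tokens, dim)
        q = self.q_proj(concat_feats)
        k = self.k_proj(self.bank)
        v = self.v_proj(self.bank)
        attn = torch.matmul(q, k.t()) / math.sqrt(q.size(-1))   # MatMul + scale
        weights = F.softmax(attn, dim=-1)                        # softmax over bank entries
        fused = torch.matmul(weights, v)                         # second MatMul
        self._update_bank(fused.detach())                        # store fused features back
        return fused

    def _update_bank(self, fused):
        n = fused.size(0)
        idx = torch.arange(self.ptr, self.ptr + n) % self.bank.size(0)
        self.bank[idx] = fused
        self.ptr = (self.ptr + n) % self.bank.size(0)
```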
In some embodiments, feature fusion may also be performed using a channel-based attention mechanism to calculate the similarity of each feature on a channel.
In this embodiment of the present application, the performing fusion processing on the target image attribute feature, the target scene feature, the target audio attribute feature, and the target text attribute feature to obtain the target fusion feature includes:
determining at least two to-be-fused features based on the target image attribute feature, the target scene feature, the target audio attribute feature and the target text attribute feature;
in the embodiment of the application, at least two features to be fused can be determined through a Gradient Boosting Decision Tree (GBDT) network; specifically, attribute information labels can be labeled for training image attribute features, training scene features, training audio attribute features and training text attribute features of the same training video; training the GBDT network according to the training image attribute features, the training scene features, the training audio attribute features and the training text attribute features, and obtaining weights corresponding to the training image attribute features, the training scene features, the training audio attribute features and the training text attribute features when the model converges; and determining the target characteristics corresponding to the training characteristics with the weight larger than the preset weight threshold value as the characteristics to be fused.
And fusing the at least two to-be-fused features to obtain the target fusion feature.
In the embodiment of the application, at least two characteristics of the target image attribute characteristic, the target scene characteristic, the target audio attribute characteristic and the target text attribute characteristic can be selected for fusion processing to obtain the target fusion characteristic, so that the accuracy of the determined attribute information is improved.
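As an illustration of the modality-selection step above, a sketch follows in which a gradient boosting model is fit on concatenated training features and only modalities whose aggregate importance exceeds a threshold are kept for fusion. The threshold, the per-modality column slices, and the use of scikit-learn are assumptions.

```python
# Sketch: GBDT-based weighting of modalities, keeping those above a weight threshold.
from sklearn.ensemble import GradientBoostingClassifier

def select_modalities(train_feats, labels, slices, weight_threshold=0.1):
    """train_feats: (N, D) concatenated features; slices: dict modality -> column slice."""
    gbdt = GradientBoostingClassifier(n_estimators=100)
    gbdt.fit(train_feats, labels)
    importance = gbdt.feature_importances_                       # per-dimension weights
    kept = [name for name, sl in slices.items()
            if importance[sl].sum() > weight_threshold]          # modality-level weight
    return kept

# e.g. slices = {"image": slice(0, 768), "scene": slice(768, 1024),
#                "audio": slice(1024, 1152), "text": slice(1152, 1920)}
```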
In this embodiment of the present application, the performing fusion processing on the target image attribute feature and the target scene feature to obtain a target fusion feature includes:
and carrying out fusion processing on the target raw material characteristics, the target naming characteristics, the target category characteristics and the target scene characteristics to obtain the target fusion characteristics.
S209: and determining target attribute information of the target video according to the target fusion characteristics.
In this embodiment of the present application, the determining target attribute information of the target video according to the target fusion feature includes:
and determining target raw material information, target naming information and target category information of the target object according to the target fusion characteristics.
In this embodiment of the application, the target image attribute features may include a target raw material feature, the target naming feature, and the target category feature, and these features are fused with the target scene feature, so as to obtain a target fusion feature.
In the embodiment of the application, target raw material information, target naming information, and target category information can be treated as three prediction tasks, and attribute information prediction can be performed on the target fusion features through an attribute information prediction model to obtain the target attribute information of the target video. The attribute information prediction model may be obtained by training a three-layer MLP. As shown in fig. 8, which is a schematic structural diagram of the MLP network, the network contains three task prediction branches; each branch is responsible for predicting one task and has its own loss function, and the final loss function is a weighted average of the individual task losses.
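A minimal sketch of such a three-task head follows, assuming PyTorch; the class counts, hidden size, loss choice, and task weights are assumptions made only for illustration.

```python
# Sketch of a multi-task MLP head with a weighted average of per-task losses.
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    def __init__(self, fused_dim=512, num_ingredients=500, num_names=2000, num_categories=50):
        super().__init__()
        def branch(num_classes):
            return nn.Sequential(nn.Linear(fused_dim, 256), nn.ReLU(), nn.Linear(256, num_classes))
        self.ingredient_head = branch(num_ingredients)   # target raw material information
        self.name_head = branch(num_names)               # target naming information
        self.category_head = branch(num_categories)      # target category information
        self.loss_fn = nn.CrossEntropyLoss()
        self.task_weights = (1.0, 1.0, 1.0)

    def forward(self, fused, targets=None):              # fused: (B, fused_dim)
        logits = (self.ingredient_head(fused), self.name_head(fused), self.category_head(fused))
        if targets is None:
            return logits
        losses = [self.loss_fn(l, t) for l, t in zip(logits, targets)]
        total = sum(w * l for w, l in zip(self.task_weights, losses)) / len(losses)
        return logits, total                              # weighted average of the task losses
```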
During training, adversarial training can be performed in combination with the Fast Gradient Method (FGM): noise is added along the gradient during training to enhance the generalization performance of the model.
For example, the original loss function of task 1 is:
L(θ, x, y) = -log p(y | x, θ), minimized over θ;
where L denotes the loss function, i.e. the discrepancy between the prediction for the sample attribute feature x and the ground-truth value y, and θ denotes the model parameters.
A perturbation is added along the gradient:
r_adv = ε · g / ||g||_2, where g = ∇_x L(θ, x, y),
and the adversarial loss is L_adv(θ, x, y) = L(θ, x + r_adv, y).
The final loss function is L(θ, x, y) + L_adv(θ, x, y).
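A hedged sketch of one FGM training step as formulated above is given below: the input (or embedding) features are perturbed along the gradient direction, the adversarial loss is computed, and the sum of the two losses is optimized. The epsilon value and function signature are assumptions.

```python
# Sketch of a single FGM adversarial training step.
import torch

def fgm_training_step(model, loss_fn, x, y, optimizer, epsilon=1.0):
    x = x.clone().requires_grad_(True)
    loss = loss_fn(model(x), y)                          # L(theta, x, y)
    grad_x, = torch.autograd.grad(loss, x, retain_graph=True)
    r_adv = epsilon * grad_x / (grad_x.norm() + 1e-12)   # r_adv = eps * g / ||g||_2
    adv_loss = loss_fn(model(x + r_adv.detach()), y)     # L_adv(theta, x, y)
    total = loss + adv_loss                              # final loss: L + L_adv
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
```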
In the embodiment of the application, the attribute information prediction model can also be obtained through model training of other multitask structures, for example, multitask learning can be performed by using the structure of a teacher and student network. For samples of different dimensions, knowledge learned by the network can be transferred by distillation.
In a specific embodiment, the accuracy and recall of video attribute information prediction are evaluated for Example 1 and Example 2, with the results shown in Table 1. Example 1: the attribute information of the video is predicted using the target image attribute features only. Example 2: the target image attribute features, the target scene features, the target audio attribute features, and the target text attribute features are fused to obtain the target fusion features, and the attribute information of the video is predicted from the target fusion features.
TABLE 1
            Accuracy    Recall
Example 1   79.6%       48.1%
Example 2   81.7%       68.0%
In a specific embodiment, the accuracy and recall of video attribute information prediction are evaluated for Example 3 and Example 4, with the results shown in Table 2. Example 3: object feature extraction is performed on the target video frame with the supervised feature extraction model to obtain the target image attribute features, from which the attribute information of the video is predicted. Example 4: a CBAM (Convolutional Block Attention Module) and a patch-based random Mask + Shuffle reconstruction task are introduced, and the attribute information of the video is predicted from the target image attribute features obtained by the supervised feature extraction model combined with the self-reconstruction features.
TABLE 2
            Accuracy    Recall
Example 3   55.5%       68.5%
Example 4   51.3%       66.2%
In a specific embodiment, as shown in fig. 9, fig. 9 is a flowchart illustrating a method for determining attribute information of a gourmet video, where the method includes:
S901: parsing the video into a video frame set, an audio signal, and text information;
S903: extracting audio features corresponding to the audio signal based on the audio information extraction model, extracting text features corresponding to the text information based on the text information extraction model, and extracting video frame features corresponding to the video frame set based on the video frame information extraction model;
S905: performing attribute description feature extraction on the video frame set based on the image-text matching model to obtain attribute description features, and performing reconstruction feature extraction on the video frame set based on the feature reconstruction model to obtain self-reconstruction features;
S907: fusing the audio features, the text features, the video frame features, the attribute description features, and the self-reconstruction features through the transformer network model and the memory bank (information storage) mechanism to obtain the target fusion features;
S909: performing prediction on the target fusion features based on the multi-task learning model to obtain the recipe category label, the dish name label, and the ingredient label of the video.
In an embodiment of the present application, the method further includes:
sending the target video and the target attribute information to a terminal; and enabling the terminal to display the target video and the target attribute information.
And constructing a video attribute mapping relation according to the corresponding relation between the target video and the target attribute information.
In some embodiments, the method further comprises:
determining associated attribute information according to the target attribute information;
determining a first label of a target video according to the target attribute information;
and determining a second label of the target video according to the associated attribute information.
In this embodiment of the application, the associated attribute information may be information whose similarity to the target attribute information is greater than a preset value, the first tag may be a high confidence tag, and the second tag may be a low confidence tag.
In some embodiments, the method further comprises:
determining the category information of the target video according to the target attribute information;
and sending the target video, the target attribute information and the category information of the target video to a terminal.
In the embodiment of the present application, as shown in fig. 10, fig. 10 is a page where a terminal displays a target video and target attribute information; the page shows a first-level classification and a second-level classification label corresponding to the video category; attribute labels corresponding to the video are also shown, including a high confidence label and a low confidence label. In addition, a label tree can be constructed according to the attribute labels, and label dimension information can be displayed.
In some embodiments, the method further comprises:
receiving a video acquisition request sent by a terminal in response to a video acquisition instruction; the video acquisition request carries the associated information of the video to be acquired;
determining target attribute information matched with the associated information of the video to be acquired to obtain matched attribute information;
searching a target video corresponding to the matching attribute information from the video attribute mapping relation to serve as the video to be acquired;
and sending the video to be acquired to the terminal.
In the embodiment of the present application, as shown in fig. 10, when a user inputs "how to fry potatoes" in a search box, the corresponding video can be displayed together with the attribute tags of that video.
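Using the mapping structure sketched earlier, the retrieval flow (request, matched attribute, video) might look as follows; the simple substring matching used here is only an illustrative assumption, not the claimed matching logic.

def handle_video_request(query_text, mapping):
    """Resolve a video acquisition request such as 'how to fry potatoes'.

    mapping: a VideoAttributeMapping built from previously analysed videos.
    """
    matched_videos = set()
    for attr, video_ids in mapping.attr_to_videos.items():
        # naive matching: the attribute string appears in the query, or vice versa
        if attr in query_text or query_text in attr:
            matched_videos |= video_ids
    # return the candidate videos together with their attribute tags for display
    return {vid: mapping.video_to_attrs[vid] for vid in matched_videos}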
In some embodiments, the method further comprises:
determining a target account corresponding to the target video;
and determining account attribute information of the target account according to the target attribute information.
In some embodiments, there are at least two target videos, and the method further comprises:
and classifying the at least two target videos according to the target attribute information corresponding to the at least two target videos.
In some embodiments, the method further comprises:
determining a target video set of a target category;
acquiring service data of each target video in the target video set;
in the embodiment of the present application, the service data of the target video may include, but is not limited to, the click rate, exposure amount and other metrics of the target video.
Sorting the target videos in the target video set according to the service data of each target video in the target video set;
sending the target video set and the sorting result of the target videos in the target video set to a terminal;
and the terminal displays the target videos in the target video set according to the sorting result.
In the embodiment of the application, the plurality of target videos can be sorted according to their service data, with videos having a higher service index ranked first and videos having a lower service index ranked later. Because the service data reflects the level of the service index, and hence the quality of the video, sorting the videos by service index and displaying them in that order can improve the user experience and increase the click rate of the target videos.
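Ranking the videos of a category by their service data reduces to an ordinary sort on a score derived from, for example, click rate and exposure; the weighting below is purely illustrative.

def rank_videos(video_ids, service_data, w_click=0.7, w_exposure=0.3):
    """Order videos of a target category by a weighted service score (descending).

    service_data: dict mapping video_id -> {"click_rate": float, "exposure": float}
    """
    def score(video_id):
        stats = service_data[video_id]
        return w_click * stats["click_rate"] + w_exposure * stats["exposure"]

    return sorted(video_ids, key=score, reverse=True)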
According to the technical solution provided by the embodiments of the application, a target video frame of a target video is acquired; object feature extraction is performed on the target video frame to obtain target image attribute features of a target object in the target video; scene feature extraction is performed on the target video frame to obtain target scene features of the target video; the target image attribute features and the target scene features are fused to obtain target fusion features; and target attribute information of the target video is determined according to the target fusion features. In the process of determining the video attribute information, the scene features of the video are thus incorporated, the fusion features are determined from both the target image attribute features and the target scene features, and the accuracy of the determined attribute information is improved.
An embodiment of the present application further provides a video attribute determining apparatus, as shown in fig. 11, the apparatus includes:
a target video frame obtaining module 1110, configured to obtain a target video frame of a target video;
a target image attribute feature determining module 1120, configured to perform object feature extraction on the target video frame to obtain a target image attribute feature of a target object in the target video;
a target scene feature determining module 1130, configured to perform scene feature extraction on the target video frame to obtain a target scene feature of the target video;
a target fusion feature determining module 1140, configured to perform fusion processing on the target image attribute features and the target scene features to obtain target fusion features;
a target attribute information determining module 1150, configured to determine target attribute information of the target video according to the target fusion feature.
In some embodiments, the target image attribute feature determination module may include:
the information determining unit is used for determining the structural integrity and the definition of at least two target objects in the target video;
the target object determining unit is used for determining an object with the structural integrity greater than a first threshold and the definition greater than a second threshold as a first target object and determining an object except the first target object in the at least two target objects as a second target object;
the first target image attribute feature extraction unit is used for extracting object features of the first target object to obtain first target image attribute features;
the second target image attribute feature extraction unit is used for extracting object features of the second target object to obtain second target image attribute features;
a target image attribute feature determination unit configured to use the first target image attribute feature and the second target image attribute feature as the target image attribute feature.
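The split between first and second target objects performed by these units can be sketched as a simple thresholding step; the two quality measures are assumed here to be normalised scores produced by an upstream detector.

def split_target_objects(objects, integrity_threshold, clarity_threshold):
    """Partition detected objects into first (complete, clear) and second target objects.

    objects: list of dicts with 'integrity' and 'clarity' scores in [0, 1].
    """
    first, second = [], []
    for obj in objects:
        if obj["integrity"] > integrity_threshold and obj["clarity"] > clarity_threshold:
            first.append(obj)   # ordinary object feature extraction applies to these
        else:
            second.append(obj)  # self-reconstruction + description features apply to these
    return first, second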
In some embodiments, the second target image attribute feature extraction unit includes:
the target self-reconstruction feature determining subunit is used for performing self-reconstruction feature extraction on the second target object to obtain a target self-reconstruction feature;
the target description feature determining subunit is used for performing attribute description feature extraction on the second target object to obtain a target description feature;
and the second target image attribute feature determining subunit is used for taking the target self-reconstruction feature and the target description feature as the second target image attribute feature.
In some embodiments, the target self-reconstruction feature determining subunit may include:
and the target self-reconstruction feature extraction subunit is used for extracting self-reconstruction features of the second target object based on the self-reconstruction feature extraction model to obtain the target self-reconstruction features.
In some embodiments, the apparatus may further comprise:
the grid image dividing module is used for dividing the sample video frame into at least two grid images;
the image processing module is used for performing image processing on at least one grid image to obtain a processed video frame; the image processing comprises at least one of swapping the positions of grid images and masking a part of the image in a grid image;
the self-reconstruction feature determination module is used for performing self-reconstruction feature extraction training on the first preset model based on the processed video frame to obtain self-reconstruction features;
the reconstructed video frame determining module is used for performing image reconstruction training on the second preset model based on the self-reconstruction features to obtain a reconstructed video frame;
the training module is used for continuously adjusting first model parameters of the first preset model and second model parameters of the second preset model in the training process until the reconstructed video frame output by the second preset model matches the sample video frame;
the model determining module is used for taking the first preset model corresponding to the current first model parameters as the self-reconstruction feature extraction model; the current first model parameters are the model parameters of the first preset model when the reconstructed video frame output by the second preset model matches the sample video frame.
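Below is a minimal PyTorch sketch of the Mask + Shuffle self-reconstruction pre-training described by these modules. It assumes the first preset model is an image encoder and the second preset model is a decoder that outputs a frame of the same shape; the 4x4 grid, the shuffle/mask ratios, the MSE loss and the plain training step are illustrative choices rather than the patented training procedure.

import torch
import torch.nn as nn

def mask_and_shuffle(frame, grid=4, mask_ratio=0.3, shuffle_ratio=0.3):
    """Corrupt a frame by shuffling some grid cells and masking others.

    frame: tensor of shape (C, H, W), with H and W divisible by `grid`.
    """
    c, h, w = frame.shape
    ph, pw = h // grid, w // grid
    # split the frame into grid*grid patches
    patches = frame.unfold(1, ph, ph).unfold(2, pw, pw)         # (C, grid, grid, ph, pw)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(grid * grid, c, ph, pw)

    n = patches.shape[0]
    shuffle_idx = torch.randperm(n)[: int(n * shuffle_ratio)]
    patches[shuffle_idx] = patches[shuffle_idx[torch.randperm(len(shuffle_idx))]]
    mask_idx = torch.randperm(n)[: int(n * mask_ratio)]
    patches[mask_idx] = 0.0                                      # occlude masked cells

    # stitch the patches back into an image
    patches = patches.reshape(grid, grid, c, ph, pw).permute(2, 0, 3, 1, 4)
    return patches.reshape(c, h, w)

def train_step(encoder, decoder, optimizer, sample_frame):
    """One training step: reconstruct the original frame from its corrupted version."""
    corrupted = mask_and_shuffle(sample_frame)
    features = encoder(corrupted.unsqueeze(0))       # first preset model
    reconstruction = decoder(features)               # second preset model
    loss = nn.functional.mse_loss(reconstruction, sample_frame.unsqueeze(0))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()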
In some embodiments, the apparatus may further include:
the text acquisition module is used for acquiring a target audio and a target text corresponding to the target video;
the target audio attribute feature extraction module is used for extracting object features of the target audio to obtain target audio attribute features of the target object;
and the target text attribute feature extraction module is used for extracting the object features of the target text to obtain the target text attribute features of the target object.
In some embodiments, the target fusion feature determination module may include:
and the feature fusion unit is used for performing fusion processing on the target image attribute feature, the target scene feature, the target audio attribute feature and the target text attribute feature to obtain the target fusion feature.
In some embodiments, the feature fusion unit may include:
a feature-to-be-fused determining subunit, configured to determine at least two features to be fused based on the target image attribute feature, the target scene feature, the target audio attribute feature, and the target text attribute feature;
and the feature fusion subunit is used for fusing the at least two features to be fused to obtain the target fusion feature.
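One simple way to realise this fusion is concatenation of the modality features followed by a learned projection; the module below is only an illustrative stand-in for whatever fusion network is actually used, and the feature dimensions are placeholders.

import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Fuse several modality features into a single target fusion feature."""

    def __init__(self, input_dims, fused_dim=512):
        super().__init__()
        self.proj = nn.Linear(sum(input_dims), fused_dim)

    def forward(self, features):
        # features: list of tensors, each of shape (batch, dim_i)
        return torch.relu(self.proj(torch.cat(features, dim=-1)))

# e.g. image-attribute (512), scene (256), audio (128) and text (128) features
fusion = ConcatFusion([512, 256, 128, 128])
fused = fusion([torch.randn(2, 512), torch.randn(2, 256),
                torch.randn(2, 128), torch.randn(2, 128)])    # -> shape (2, 512)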
In some embodiments, the target image attribute feature determination module may include:
the object feature extraction unit is used for extracting object features of the target video frame to obtain target raw material features, target naming features and target category features of a target object in the target video;
in some embodiments, the target fusion feature determination module comprises:
a target fusion characteristic determining unit, configured to perform fusion processing on the target raw material characteristic, the target naming characteristic, the target category characteristic, and the target scene characteristic to obtain the target fusion characteristic;
the target attribute information determination module includes:
and the target attribute information determining unit is used for determining target raw material information, target naming information and target category information of the target object according to the target fusion characteristics.
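The three kinds of target information can be read off the fused feature with separate heads in a multi-task fashion; the class counts and activation choices below are assumptions for illustration only.

import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    """Predict raw-material, naming and category information from the fused feature."""

    def __init__(self, fused_dim=512, n_materials=200, n_names=1000, n_categories=30):
        super().__init__()
        self.material_head = nn.Linear(fused_dim, n_materials)   # multi-label raw materials
        self.name_head = nn.Linear(fused_dim, n_names)            # object / dish name
        self.category_head = nn.Linear(fused_dim, n_categories)   # e.g. cuisine category

    def forward(self, fused):
        return (torch.sigmoid(self.material_head(fused)),  # independent per-material probabilities
                self.name_head(fused),                      # logits for a softmax name classifier
                self.category_head(fused))                  # logits for a softmax category classifier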
In some embodiments, the target scene feature determination module may include:
a target associated object determining unit for determining at least two target associated objects of the target objects based on the target video frame;
the directed acyclic graph building unit is used for building a directed acyclic graph by taking the associated object characteristics corresponding to the at least two target associated objects as nodes; the edges in the directed acyclic graph represent the similarity between two associated object features corresponding to the edges;
and the target scene characteristic determining unit is used for extracting scene characteristics based on the directed acyclic graph to obtain the target scene characteristics.
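As a sketch of the graph-based scene feature: associated-object features become nodes, pairwise cosine similarities become edge weights, and the scene feature is obtained by aggregating neighbours over the resulting graph. Keeping only edges from earlier to later nodes (so the graph stays acyclic) and using a single mean-pooled aggregation step are simplifying assumptions, not the claimed construction.

import numpy as np

def scene_feature_from_objects(object_features, sim_threshold=0.5):
    """Build a directed acyclic graph over associated-object features and
    aggregate it into a single scene feature.

    object_features: array of shape (n_objects, dim).
    """
    feats = np.asarray(object_features, dtype=np.float32)
    norms = np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8
    unit = feats / norms
    sim = unit @ unit.T                       # cosine similarity between node features

    # keep only edges i -> j with i < j so the graph stays acyclic
    adj = np.triu(sim, k=1)
    adj[adj < sim_threshold] = 0.0

    # one round of similarity-weighted neighbour aggregation, then mean-pool the nodes
    agg = feats + adj.T @ feats               # each node gathers its in-neighbours
    return agg.mean(axis=0)                   # scene feature of shape (dim,)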
The apparatus embodiments described above and the method embodiments are based on the same inventive concept.
The embodiment of the present application provides a video attribute determining apparatus, which includes a processor and a memory, where the memory stores at least one instruction or at least one program, and the at least one instruction or at least one program is loaded and executed by the processor to implement the video attribute determining method provided by the above method embodiment.
Embodiments of the present application further provide a computer storage medium, where the storage medium may be disposed in a terminal to store at least one instruction or at least one program for implementing a video attribute determination method in the method embodiments, and the at least one instruction or the at least one program is loaded and executed by the processor to implement the video attribute determination method provided in the method embodiments.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device implements the video attribute determination method provided by the above method embodiments.
Optionally, in this application embodiment, the storage medium may be located on at least one network server of a plurality of network servers of a computer network. Optionally, in this embodiment, the storage medium may include, but is not limited to: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The memory according to the embodiments of the present application may be used to store software programs and modules, and the processor executes various functional applications and data processing by running the software programs and modules stored in the memory. The memory may mainly comprise a program storage area and a data storage area, wherein the program storage area may store an operating system, application programs needed for functions, and the like; the data storage area may store data created according to the use of the apparatus, and the like. Further, the memory may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory may also include a memory controller to provide the processor with access to the memory.
The video attribute determination method provided by the embodiments of the application can be executed in a mobile terminal, a computer terminal, a server or a similar computing device. Taking execution on a server as an example, fig. 12 is a block diagram of a hardware structure of a server for the video attribute determination method provided by the embodiments of the present application. As shown in fig. 12, the server 1200 may vary considerably depending on its configuration or performance, and may include one or more Central Processing Units (CPUs) 1210 (the CPU 1210 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 1230 for storing data, and one or more storage media 1220 (e.g., one or more mass storage devices) for storing application programs 1223 or data 1222. The memory 1230 and the storage medium 1220 may be transient storage or persistent storage. The program stored in the storage medium 1220 may include one or more modules, each of which may include a series of instruction operations for the server. Further, the central processing unit 1210 may be configured to communicate with the storage medium 1220 and execute the series of instruction operations in the storage medium 1220 on the server 1200. The server 1200 may also include one or more power supplies 1260, one or more wired or wireless network interfaces 1250, one or more input/output interfaces 1240, and/or one or more operating systems 1221, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so forth.
The input/output interface 1240 may be used to receive or transmit data via a network. Specific examples of the network may include a wireless network provided by a communication provider of the server 1200. In one example, the input/output interface 1240 includes a Network Interface Controller (NIC) that may be coupled to other network devices via a base station to communicate with the Internet. In another example, the input/output interface 1240 may be a Radio Frequency (RF) module used to communicate with the Internet in a wireless manner.
It will be understood by those skilled in the art that the structure shown in fig. 12 is only an illustration and is not intended to limit the structure of the electronic device. For example, the server 1200 may also include more or fewer components than shown in FIG. 12, or have a different configuration than shown in FIG. 12.
As can be seen from the above embodiments of the method, apparatus, device or storage medium for determining video attributes provided by the present application, a target video frame of a target video is obtained; object feature extraction is performed on the target video frame to obtain target image attribute features of a target object in the target video; scene feature extraction is performed on the target video frame to obtain target scene features of the target video; the target image attribute features and the target scene features are fused to obtain target fusion features; and target attribute information of the target video is determined according to the target fusion features. In the process of determining the video attribute information, the scene features of the video are incorporated, the fusion features are determined from both the target image attribute features and the target scene features, and the accuracy of the determined attribute information is improved.
It should be noted that the order of the embodiments of the present application is for description only and does not imply that any embodiment is better than another. Specific embodiments have been described above; other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or advantageous.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatus, device, and storage medium embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference may be made to some descriptions of the method embodiments for relevant points.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (14)

1. A method for video attribute determination, the method comprising:
acquiring a target video frame of a target video;
extracting object features of the target video frame to obtain target image attribute features of a target object in the target video;
carrying out scene feature extraction on the target video frame to obtain target scene features of the target video;
fusing the target image attribute features and the target scene features to obtain target fusion features;
and determining target attribute information of the target video according to the target fusion characteristics.
2. The method according to claim 1, wherein said performing object feature extraction on the target video frame to obtain a target image attribute feature of a target object in the target video comprises:
determining the structural integrity and definition of at least two target objects in the target video;
determining an object with structural integrity greater than a first threshold and definition greater than a second threshold as a first target object, and determining an object except the first target object in the at least two target objects as a second target object;
performing object feature extraction on the first target object to obtain a first target image attribute feature;
performing object feature extraction on the second target object to obtain second target image attribute features;
and taking the first target image attribute feature and the second target image attribute feature as the target image attribute features.
3. The method according to claim 2, wherein the performing object feature extraction on the second target object to obtain a second target image attribute feature comprises:
performing self-reconstruction feature extraction on the second target object to obtain a target self-reconstruction feature;
performing attribute description feature extraction on the second target object to obtain a target description feature;
and taking the target self-reconstruction feature and the target description feature as the second target image attribute feature.
4. The method of claim 3, wherein the performing self-reconstruction feature extraction on the second target object to obtain the target self-reconstruction feature comprises:
and performing self-reconstruction feature extraction on the second target object based on a self-reconstruction feature extraction model to obtain the target self-reconstruction feature.
5. The method of claim 4, wherein the training method of the self-reconstruction feature extraction model comprises:
dividing a sample video frame into at least two grid images;
performing image processing on at least one grid image to obtain a processed video frame; the image processing comprises at least one of swapping the positions of grid images and masking a part of the image in a grid image;
based on the processed video frame, performing self-reconstruction feature extraction training on a first preset model to obtain self-reconstruction features;
based on the self-reconstruction features, performing image reconstruction training on a second preset model to obtain a reconstructed video frame;
continuously adjusting first model parameters of the first preset model and second model parameters of the second preset model in the training process until the reconstructed video frame output by the second preset model matches the sample video frame;
and taking the first preset model corresponding to the current first model parameters as the self-reconstruction feature extraction model; the current first model parameters are the model parameters of the first preset model when the reconstructed video frame output by the second preset model matches the sample video frame.
6. The method of claim 1, further comprising:
acquiring a target audio and a target text corresponding to the target video;
performing object feature extraction on the target audio to obtain target audio attribute features of the target object;
and extracting object features of the target text to obtain the attribute features of the target text of the target object.
7. The method according to claim 6, wherein the fusing the target image attribute feature and the target scene feature to obtain a target fusion feature comprises:
and performing fusion processing on the target image attribute feature, the target scene feature, the target audio attribute feature and the target text attribute feature to obtain the target fusion feature.
8. The method according to claim 7, wherein the fusing the target image attribute feature, the target scene feature, the target audio attribute feature and the target text attribute feature to obtain the target fusion feature comprises:
determining at least two features to be fused based on the target image attribute feature, the target scene feature, the target audio attribute feature and the target text attribute feature;
and fusing the at least two features to be fused to obtain the target fusion feature.
9. The method according to claim 1, wherein said performing object feature extraction on the target video frame to obtain a target image attribute feature of a target object in the target video comprises:
extracting object features of the target video frame to obtain target raw material features, target naming features and target category features of a target object in the target video;
the fusion processing of the target image attribute features and the target scene features to obtain target fusion features includes:
fusing the target raw material characteristics, the target naming characteristics, the target category characteristics and the target scene characteristics to obtain target fusion characteristics;
the determining the target attribute information of the target video according to the target fusion feature includes:
and determining target raw material information, target naming information and target category information of the target object according to the target fusion characteristics.
10. The method according to claim 1, wherein the performing scene feature extraction on the target video frame to obtain a target scene feature of the target video comprises:
determining at least two target associated objects of the target objects based on the target video frames;
constructing a directed acyclic graph by taking the associated object characteristics corresponding to the at least two target associated objects as nodes; the edges in the directed acyclic graph represent the similarity between two associated object features corresponding to the edges;
and extracting scene features based on the directed acyclic graph to obtain the target scene features.
11. A video attribute determination apparatus, the apparatus comprising:
the target video frame acquisition module is used for acquiring a target video frame of a target video;
the target image attribute characteristic determining module is used for extracting the object characteristics of the target video frame to obtain the target image attribute characteristics of the target object in the target video;
the target scene characteristic determining module is used for extracting scene characteristics of the target video frame to obtain target scene characteristics of the target video;
the target fusion characteristic determining module is used for performing fusion processing on the target image attribute characteristics and the target scene characteristics to obtain target fusion characteristics;
and the target attribute information determining module is used for determining the target attribute information of the target video according to the target fusion characteristics.
12. A video attribute determination device, the device comprising: a processor and a memory, the memory having stored therein at least one instruction or at least one program, the at least one instruction or at least one program being loaded and executed by the processor to implement the video property determination method of any of claims 1-10.
13. A computer storage medium, characterized in that the computer storage medium stores at least one instruction or at least one program, which is loaded and executed by a processor to implement the video property determination method according to any one of claims 1 to 10.
14. A computer program product comprising computer instructions, wherein said computer instructions, when executed by a processor, implement the video property determination method of any of claims 1-10.
CN202210578444.8A 2022-05-25 2022-05-25 Video attribute determining method, device, equipment and storage medium Active CN115131698B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210578444.8A CN115131698B (en) 2022-05-25 2022-05-25 Video attribute determining method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210578444.8A CN115131698B (en) 2022-05-25 2022-05-25 Video attribute determining method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115131698A true CN115131698A (en) 2022-09-30
CN115131698B CN115131698B (en) 2024-04-12

Family

ID=83376926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210578444.8A Active CN115131698B (en) 2022-05-25 2022-05-25 Video attribute determining method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115131698B (en)



Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102279929A (en) * 2010-06-13 2011-12-14 中国科学院电子学研究所 Remote-sensing artificial ground object identifying method based on semantic tree model of object
CN102207966A (en) * 2011-06-01 2011-10-05 华南理工大学 Video content quick retrieving method based on object tag
US20180068188A1 (en) * 2016-09-07 2018-03-08 Compal Electronics, Inc. Video analyzing method and video processing apparatus thereof
CN108829881A (en) * 2018-06-27 2018-11-16 深圳市腾讯网络信息技术有限公司 video title generation method and device
CN110263214A (en) * 2019-06-21 2019-09-20 北京百度网讯科技有限公司 Generation method, device, server and the storage medium of video title
WO2021139415A1 (en) * 2020-01-07 2021-07-15 腾讯科技(深圳)有限公司 Data processing method and apparatus, computer readable storage medium, and electronic device
CN111401463A (en) * 2020-03-25 2020-07-10 维沃移动通信有限公司 Method for outputting detection result, electronic device, and medium
CN111581437A (en) * 2020-05-07 2020-08-25 腾讯科技(深圳)有限公司 Video retrieval method and device
CN113590876A (en) * 2021-01-22 2021-11-02 腾讯科技(深圳)有限公司 Video label setting method and device, computer equipment and storage medium
CN113225451A (en) * 2021-04-28 2021-08-06 维沃移动通信(杭州)有限公司 Image processing method and device and electronic equipment
CN113569088A (en) * 2021-09-27 2021-10-29 腾讯科技(深圳)有限公司 Music recommendation method and device and readable storage medium
CN114067196A (en) * 2021-11-08 2022-02-18 京东科技信息技术有限公司 Method and device for generating image scene information
CN114238690A (en) * 2021-12-08 2022-03-25 腾讯科技(深圳)有限公司 Video classification method, device and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YUANJUN HUANG 等: "Long-Short-Term Features for Dynamic Scene Classification", 《IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY》, vol. 29, no. 4, pages 1038 - 1047, XP011718094, DOI: 10.1109/TCSVT.2018.2823360 *
宋伟先: "基于多特征融合的深度视频自然语言描述方法", 《中国博士学位论文全文数据库 农业科技辑》, vol. 2020, no. 1, pages 050 - 46 *
梁锐 等: "基于多特征融合的深度视频自然语言描述方法", 《计算机应用》, vol. 37, no. 4, pages 1179 - 1184 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230017072A1 (en) * 2021-07-08 2023-01-19 Google Llc Systems And Methods For Improved Video Understanding
CN115984842A (en) * 2023-02-13 2023-04-18 广州数说故事信息科技有限公司 Multi-mode-based video open tag extraction method
CN116129321A (en) * 2023-02-17 2023-05-16 哈尔滨工业大学 Video description generation method based on long-order dynamic host-object visual relationship modeling
CN116129321B (en) * 2023-02-17 2023-10-27 哈尔滨工业大学 Video description generation method based on long-order dynamic host-object visual relationship modeling

Also Published As

Publication number Publication date
CN115131698B (en) 2024-04-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant