CN115131698B - Video attribute determining method, device, equipment and storage medium - Google Patents

Video attribute determining method, device, equipment and storage medium

Info

Publication number
CN115131698B
Authority
CN
China
Prior art keywords
target
feature
attribute
features
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210578444.8A
Other languages
Chinese (zh)
Other versions
CN115131698A (en)
Inventor
胡益珲
岑杰鹏
杨伟东
祁雷
马锴
陈宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210578444.8A priority Critical patent/CN115131698B/en
Publication of CN115131698A publication Critical patent/CN115131698A/en
Application granted granted Critical
Publication of CN115131698B publication Critical patent/CN115131698B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/65Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a video attribute determining method, apparatus, device and storage medium, which can be applied to scenes such as cloud technology, artificial intelligence, intelligent transportation and internet of vehicles. The method includes: acquiring a target video frame of a target video; performing object feature extraction on the target video frame to obtain target image attribute features of a target object in the target video; performing scene feature extraction on the target video frame to obtain target scene features of the target video; fusing the target image attribute features and the target scene features to obtain target fusion features; and determining target attribute information of the target video according to the target fusion features. The method and apparatus improve the accuracy of the determined attribute information.

Description

Video attribute determining method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of internet technologies, and in particular, to a method, an apparatus, a device, and a storage medium for determining a video attribute.
Background
In the related art, an image model pre-trained on a public data set is used directly to extract features, which yields visual features that are only roughly suited to coarse-grained classification of service data. However, multi-modal information is not comprehensively modeled and exploited: recognition usually relies on visual information alone, a discriminative feature extractor is lacking, fine-grained service data labels are difficult to distinguish, and scene context information is not extracted. It is therefore difficult to ensure accuracy and recall.
Disclosure of Invention
The application provides a video attribute determining method, apparatus, device and storage medium, which can improve the accuracy of the determined attribute information.
In one aspect, the present application provides a method for determining a video attribute, the method including:
acquiring a target video frame of a target video;
extracting object characteristics of the target video frame to obtain target image attribute characteristics of a target object in the target video;
extracting scene characteristics of the target video frame to obtain target scene characteristics of the target video;
performing fusion processing on the target image attribute characteristics and the target scene characteristics to obtain target fusion characteristics;
and determining target attribute information of the target video according to the target fusion characteristics.
Another aspect provides a video attribute determining apparatus, the apparatus comprising:
the target video frame acquisition module is used for acquiring target video frames of the target video;
the target image attribute feature determining module is used for extracting object features of the target video frames to obtain target image attribute features of target objects in the target video;
the target scene feature determining module is used for extracting scene features of the target video frame to obtain target scene features of the target video;
The target fusion feature determining module is used for carrying out fusion processing on the target image attribute features and the target scene features to obtain target fusion features;
and the target attribute information determining module is used for determining target attribute information of the target video according to the target fusion characteristics.
Another aspect provides a video attribute determining apparatus comprising a processor and a memory having stored therein at least one instruction or at least one program loaded and executed by the processor to implement a video attribute determining method as described above.
Another aspect provides a computer storage medium storing at least one instruction or at least one program loaded and executed by a processor to implement a video attribute determination method as described above.
Another aspect provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions so that the computer device executes to implement the video attribute determination method as described above.
The video attribute determining method, the device, the equipment and the storage medium have the following technical effects:
the method comprises the steps of obtaining a target video frame of a target video; extracting object characteristics of the target video frame to obtain target image attribute characteristics of a target object in the target video; extracting scene characteristics of the target video frame to obtain target scene characteristics of the target video; performing fusion processing on the target image attribute characteristics and the target scene characteristics to obtain target fusion characteristics; and determining target attribute information of the target video according to the target fusion characteristics. In the process of determining the video attribute information, the method and the device integrate the scene characteristics of the video, determine the integration characteristics through the target image attribute characteristics and the target scene characteristics, and improve the accuracy of the determined attribute information.
Drawings
In order to more clearly illustrate the technical solutions and advantages of the embodiments of the present application or of the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description are only some embodiments of the present application; other drawings may be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a schematic diagram of a video attribute determination system provided in an embodiment of the present application;
fig. 2 is a flowchart of a video attribute determining method according to an embodiment of the present application;
fig. 3 is a flowchart of a method for extracting object features of the target video frame to obtain target image attribute features of a target object in the target video according to an embodiment of the present application;
fig. 4 is a flowchart of a method for extracting object features of the second target object to obtain attribute features of the second target image according to the embodiment of the present application;
fig. 5 is a flowchart of a method for extracting scene characteristics of the target video frame to obtain target scene characteristics of the target video according to the embodiment of the present application;
FIG. 6 is a schematic diagram of a directed acyclic graph constructed based on target video frames provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of a Transformer module according to an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of an MLP network according to an embodiment of the present application;
fig. 9 is a flowchart of a method for determining attribute information of a food video according to an embodiment of the present application;
Fig. 10 is a page of a terminal provided in an embodiment of the present application displaying a target video and target attribute information;
fig. 11 is a schematic structural diagram of a video attribute determining apparatus according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present application based on the embodiments herein.
Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, with both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV) is the science of studying how to make machines "see"; more specifically, it uses cameras and computers in place of human eyes to recognize and measure targets, and further performs graphics processing so that the resulting image is better suited for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specifically studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
The scheme provided by the embodiment of the application relates to the technologies of computer vision technology, machine learning and the like of artificial intelligence, and is specifically described by the following embodiment.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Referring to fig. 1, fig. 1 is a schematic diagram of a video attribute determining system according to an embodiment of the present application, and as shown in fig. 1, the video attribute determining system may at least include a server 01 and a client 02.
Specifically, in the embodiment of the present application, the server 01 may include a server that operates independently, or a distributed server, or a server cluster that is formed by a plurality of servers, and may also be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network ), and basic cloud computing services such as big data and artificial intelligence platforms. The server 01 may include a network communication unit, a processor, a memory, and the like. Specifically, the server 01 may be configured to obtain a target video frame of a target video; extracting object characteristics of the target video frame to obtain target image attribute characteristics of a target object in the target video; extracting scene characteristics of the target video frame to obtain target scene characteristics of the target video; performing fusion processing on the target image attribute characteristics and the target scene characteristics to obtain target fusion characteristics; and determining target attribute information of the target video according to the target fusion characteristics.
Specifically, in the embodiment of the present application, the client 02 may include smart phones, desktop computers, tablet computers, notebook computers, digital assistants, smart wearable devices, smart speakers, vehicle terminals, smart televisions, and other types of physical devices, or may include software running in the physical devices, for example, web pages provided by some service providers to users, or may also provide applications provided by the service providers to users. Specifically, the client 02 may be configured to display target attribute information of a target video.
In the following, a method for determining video attributes of the present application is described, and fig. 2 is a schematic flow chart of a method for determining video attributes provided in an embodiment of the present application, where the method includes steps as described in the examples or the flowcharts, but may include more or less steps based on conventional or non-inventive labor. The order of steps recited in the embodiments is merely one way of performing the order of steps and does not represent a unique order of execution. When implemented in a real system or server product, the methods illustrated in the embodiments or figures may be performed sequentially or in parallel (e.g., in a parallel processor or multithreaded environment). As shown in fig. 2, the method may include:
S201: and obtaining a target video frame of the target video.
In the embodiment of the present application, video frames may be extracted from the target video to obtain target video frames, and the number of target video frames may be at least two. The target video may be a video about making a target object, and the target video frames contain the target object; for a food video, the target object may include the raw materials and the finished dish, and for a handicraft video, the target object may include the raw materials and the finished work. The target video frames may be extracted in two modes, uniform frame sampling and interval frame sampling, as sketched below. Uniform frame sampling refers to segmented sparse sampling, in which N frames are sampled uniformly from the target video to form a video frame set; interval frame sampling refers to sampling frames at fixed time intervals, for example one frame per second. A target video frame may be a digital image or a frequency-domain image.
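As an illustration of the two sampling modes above, the following sketch (an assumption about one concrete realization, not part of the patent text) computes frame indices for segmented sparse sampling and for fixed-interval sampling; decoding the frames themselves is left to whatever video library is used.

```python
def uniform_sample_indices(total_frames: int, n: int) -> list:
    """Segmented sparse sampling: split the video into n segments, take the middle frame of each."""
    seg = total_frames / n
    return [int(seg * i + seg / 2) for i in range(n)]


def interval_sample_indices(total_frames: int, fps: float, interval_s: float = 1.0) -> list:
    """Fixed-interval sampling: one frame every `interval_s` seconds (e.g. one frame per second)."""
    step = max(1, round(fps * interval_s))
    return list(range(0, total_frames, step))


if __name__ == "__main__":
    print(uniform_sample_indices(total_frames=300, n=8))      # 8 evenly spread frame indices
    print(interval_sample_indices(total_frames=300, fps=25))  # one frame per second
```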
S203: and extracting object characteristics of the target video frame to obtain target image attribute characteristics of a target object in the target video.
In the embodiment of the application, object feature extraction may be performed on the target video frames to obtain the target image attribute features of the target object. The target image attribute features may include RGB information and temporal information of the target video frames. RGB refers to the three optical primary colors, R for Red, G for Green and B for Blue; any color visible to the naked eye in nature can be formed by mixing and superposing these three colors, which is why this is also called the additive color mode.
In this embodiment of the present application, the extracting object features of the target video frame to obtain target image attribute features of a target object in the target video includes:
and extracting object characteristics from the target video frame to obtain target raw material characteristics, target naming characteristics and target category characteristics in the target video.
In the embodiment of the application, for the target video for producing the target object, the target image attribute features may include target raw material features, target naming features, target category features and the like.
In this embodiment of the present application, as shown in fig. 3, the extracting object features of the target video frame to obtain target image attribute features of a target object in the target video includes:
s2031: determining the structural integrity and definition of each of at least two target objects in the target video;
in the embodiment of the application, the target video frame may include a plurality of target objects, and the integrity and definition of each target object in the video frame are different.
S2033: determining an object with structural integrity greater than a first threshold and definition greater than a second threshold as a first target object, and determining objects of the at least two target objects except the first target object as second target objects;
In the embodiment of the application, the target objects can be classified according to the integrity and definition of each target object in the target video frame; and extracting the characteristics respectively. For a first target object, extracting object characteristics of the first target object through a supervised training model; for the second target object, its object features may be extracted by an unsupervised training model.
S2035: extracting object features of the first target object to obtain first target image attribute features;
in this embodiment of the present application, extracting the object feature of the first target object to obtain the first target image attribute feature may include:
and extracting object features of the first target object based on the supervised feature extraction model to obtain first target image attribute features.
In some embodiments, the training method of the supervised feature extraction model includes:
acquiring a training video frame of a training video; the training video frames are marked with training image attribute feature labels of training objects;
in the embodiment of the present application, the training video and the target video may be the same type of video, and the extraction mode of the training video frame is the same as the extraction mode of the target video frame.
Performing image attribute feature extraction training on a preset machine learning model based on the training video frame to adjust model parameters of the preset machine learning model until training image attribute feature labels output by the preset machine learning model are matched with labeled training image attribute feature labels;
and taking a preset machine learning model corresponding to the model parameters when the output training image attribute feature labels are matched with the labeled training image attribute feature labels as the supervised feature extraction model.
In this embodiment, the preset machine learning model may be Video Swin Transformer. The overall architecture of the Video Swin Transformer backbone is the same as that of the Swin Transformer, except that it has an additional time dimension, so the patch size used during Patch Partition also has a temporal component. Video Swin Transformer comprises three parts: video to token, Model stages, and head.
Video to token: in image to token (converting an image into tokens), 4 x 4 image blocks are grouped, while in video to token (converting a video into tokens), 2 x 4 x 4 video blocks are grouped, followed by linear embedding and position embedding.
Model stages: the Model stages consist of multiple repeated stages, each stage including a Video Swin Transformer Block and Patch Merging.
1) The Video Swin Transformer Block is divided into two parts, Video W-MSA and Video SW-MSA, which corresponds to extending the Swin Transformer Block computation from two dimensions to three dimensions.
2) Patch Merging concatenates the token features in each adjacent 2 x 2 window, which is equivalent to quadrupling the token dimension, and then reduces the dimension through a linear layer; the feature dimension is not kept unchanged but is doubled after each Patch Merging, similar to the way the feature map shrinks and the channel number grows in a convolutional neural network (Convolutional Neural Network, CNN). The number of video frames remains unchanged after each Patch Merging.
head: after the Model stages, high-dimensional features of the multi-frame data are obtained; video classification then only requires a simple frame fusion (averaging), and a head can be used for the final encoding.
In the embodiment of the application, the target image attribute characteristics of the target object in the target video frame can be rapidly and accurately extracted through the supervised feature extraction model.
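The supervised feature extraction model can be trained as a standard classification problem on the labeled training video frames. The following PyTorch-style sketch is an assumption about one way to run a single training step; `backbone` stands in for any Video Swin Transformer implementation and `head` for the attribute classification layer.

```python
import torch
import torch.nn as nn


def train_step(backbone: nn.Module, head: nn.Module, clips: torch.Tensor,
               labels: torch.Tensor, optimizer: torch.optim.Optimizer) -> float:
    """One supervised step: clips (B, C, T, H, W) -> features -> attribute-label logits."""
    feats = backbone(clips)        # clip-level features, e.g. (B, D)
    logits = head(feats)           # (B, num_attribute_labels)
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```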
S2037: extracting object features of the second target object to obtain second target image attribute features;
In this embodiment of the present application, as shown in fig. 4, the extracting object features of the second target object to obtain second target image attribute features includes:
s20371: extracting the self-reconstruction feature of the second target object to obtain a target self-reconstruction feature;
in this embodiment of the present application, the extracting the self-reconstruction feature of the second target object to obtain the self-reconstruction feature of the target object includes:
and based on the self-reconstruction feature extraction model, carrying out self-reconstruction feature extraction on the second target object to obtain the target self-reconstruction feature.
In an embodiment of the present application, the training method for the self-reconstruction feature extraction model includes:
dividing a sample video frame into at least two grid images;
in embodiments of the present application, the sample video frames are determined based on the sample video, which may be the same or different from the training video.
Performing image processing on at least one grid image to obtain a processed video frame; the image processing comprises at least one of position replacement of the grid image and shielding processing of partial images in the grid image;
In the embodiment of the application, one grid image can be subjected to occlusion processing while two grid images are subjected to position replacement, so as to obtain the processed video frame.
In an embodiment of the present application, the performing image processing on at least one grid image to obtain a processed video frame includes:
performing first image processing on at least one grid image during first training of the model; the first image processing includes at least one of repositioning a first number of grid images, and blocking a first percentage of area of any grid image;
during the Nth training of the model, carrying out Nth image processing on at least one grid image; the Nth image processing comprises at least one of position replacement of the Nth number of grid images and N percent area shielding of any grid image; the Nth number is greater than the N-1 th number; the nth percentage is greater than the nth-1 percentage; wherein n=2, 3, … …, N is a positive integer.
Based on the processed video frames, performing self-reconstruction feature extraction training on the first preset model to obtain self-reconstruction features;
in an embodiment of the present application, the first preset model may be an encoder (encoder) model.
Based on the reconstruction characteristics, performing image reconstruction training on a second preset model to obtain a reconstructed video frame;
In an embodiment of the present application, the second preset model may be a decoder (decoder) model.
In the training process, continuously adjusting a first model parameter of a first preset model and a second model parameter of a second preset model until a reconstructed video frame output by the second preset model is matched with the sample video frame;
and taking a first preset model corresponding to the current first model parameter as the self-reconstruction feature extraction model. The current first model parameters are model parameters when the reconstructed video frame output by the second preset model is matched with the sample video frame.
In the embodiment of the application, in the model training process, the sample video frames can be used as supervision signals for training, and the exchange and shielding degrees gradually increase along with the progress of training. For example, the area of occlusion for a grid image is 10% during a first training and 20% during a second training; and so on; the exchange of the grid images can be increased from the exchange of two grid images to the exchange of four grid images, so that the accuracy of the model is improved.
In the embodiment of the application, the target reconstruction feature in the target video frame can be extracted through the reconstruction feature extraction model, so that the video attribute information is determined by combining the feature, and the accuracy of the video attribute information is improved.
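The grid-based corruption used to build the self-reconstruction training input can be sketched as follows (illustrative only; the grid size, swap count and occlusion ratio per training round are assumptions consistent with the description above).

```python
import numpy as np


def corrupt_frame(frame: np.ndarray, grid: int = 4, round_n: int = 1, rng=None) -> np.ndarray:
    """Swap some grid cells and occlude part of one cell; corruption grows with round_n."""
    rng = rng or np.random.default_rng(0)
    h, w = frame.shape[:2]
    ch, cw = h // grid, w // grid
    cells = [frame[i * ch:(i + 1) * ch, j * cw:(j + 1) * cw].copy()
             for i in range(grid) for j in range(grid)]
    # swap a growing number of randomly chosen cells
    n_swap = min(round_n + 1, len(cells))
    idx = rng.choice(len(cells), size=n_swap, replace=False)
    shuffled = rng.permutation(idx)
    swapped = list(cells)
    for a, b in zip(idx, shuffled):
        swapped[a] = cells[b]
    # occlude roughly (10 * round_n)% of the area of one random cell
    frac = min(0.1 * round_n, 1.0)
    target = swapped[int(rng.integers(len(swapped)))]
    target[: int(ch * frac), :] = 0
    # stitch the corrupted cells back into a frame
    out = frame.copy()
    for k, cell in enumerate(swapped):
        i, j = divmod(k, grid)
        out[i * ch:(i + 1) * ch, j * cw:(j + 1) * cw] = cell
    return out
```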
S20373: extracting attribute description features of the second target object to obtain target description features;
In the embodiment of the application, attribute description feature extraction can be performed on the second target object through an attribute description feature extraction model to obtain a target description feature; the attribute description feature may be used to describe an attribute of the second target object. The attribute description feature extraction model can be obtained by training a CLIP model on image-text matching; CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a large variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet for a given image, without being directly optimized for that task.
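A hedged sketch of how a CLIP-style image-text matching model could yield a target description feature: embed the second target object and a set of candidate attribute phrases, then keep a similarity-weighted text embedding. The encoder outputs are taken as given, and the weighting scheme is an assumption, not the patent's prescribed computation.

```python
import numpy as np


def attribute_description_feature(image_embed: np.ndarray, text_embeds: np.ndarray) -> np.ndarray:
    """image_embed: (D,) embedding of the object crop; text_embeds: (K, D) candidate attribute phrases."""
    img = image_embed / np.linalg.norm(image_embed)
    txt = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    sims = txt @ img                                  # cosine similarity to each phrase
    weights = np.exp(sims) / np.exp(sims).sum()       # softmax over the candidate phrases
    return weights @ txt                              # similarity-weighted description embedding
```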
S20375: and taking the target reconstruction feature and the target description feature as the second target image attribute feature.
In the embodiment of the application, the target reconstruction feature and the target description feature may be the same type of feature or different types of features.
S2039: and taking the first target image attribute characteristic and the second target image attribute characteristic as the target image attribute characteristic.
In the embodiment of the application, the first target image attribute feature of the first target object and the second target image attribute feature of the second target object can be used as the target image attribute features.
S205: and extracting scene characteristics of the target video frame to obtain target scene characteristics of the target video.
In the embodiment of the application, the scene features are features corresponding to the scenes in the target video; in a food video, scene features may include, but are not limited to, corresponding features of a table, pan, knife, etc.
In this embodiment of the present application, as shown in fig. 5, the extracting the scene feature of the target video frame to obtain the target scene feature of the target video includes:
s2051: determining at least two target associated objects of the target object based on the target video frame;
in the embodiment of the application, the target associated object is used for representing a scene of the target video, for example, in the food video, the target object is a raw material, a finished product and the like of food; the target associated objects include dining tables, pans, knives, and the like.
S2053: constructing a directed acyclic graph by taking the associated object characteristics corresponding to the at least two target associated objects as nodes; the edges in the directed acyclic graph represent the similarity between two associated object features corresponding to the edges;
in the embodiment of the application, the directed acyclic graph can be constructed through a plurality of target video frames; nodes in the directed acyclic graph represent associated object features, and edges represent the similarity between two associated object features corresponding to the edges.
S2055: and extracting scene features based on the directed acyclic graph to obtain the target scene features.
In the embodiment of the application, the scene features in the video can be extracted through the directed acyclic graph; specifically, the scene features can be extracted from the directed acyclic graph through a graph convolutional network (Graph Convolutional Network, GCN) to obtain the target scene features.
In a specific embodiment, as shown in fig. 6, fig. 6 is a schematic diagram of a directed acyclic graph constructed based on a target video frame; wherein the graph corresponds to video frames between t=k to t=m; extracting scene characteristics of the directed acyclic graph through GCN to obtain target scene characteristics; and finally, after fusing the target scene characteristics and the target image attribute characteristics, inputting an attribute information prediction model to obtain target attribute information of the target video. The attribute information prediction model may be obtained by training a multi-layer perceptron (MLP, multilayer Perceptron) network.
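A minimal single-layer GCN sketch of the scene feature extraction described above; the symmetric normalization and mean-pooling readout are common GCN choices and are assumptions here, not requirements of the patent.

```python
import torch


def gcn_scene_feature(node_feats: torch.Tensor, adj: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """node_feats: (N, D) associated-object features; adj: (N, N) similarity edges; weight: (D, D_out)."""
    a_hat = adj + torch.eye(adj.size(0))                   # add self-loops
    deg = a_hat.sum(dim=1)
    d_inv_sqrt = torch.diag(deg.clamp(min=1e-6).rsqrt())   # D^{-1/2}
    norm_adj = d_inv_sqrt @ a_hat @ d_inv_sqrt             # D^{-1/2} (A + I) D^{-1/2}
    h = torch.relu(norm_adj @ node_feats @ weight)         # one propagation step
    return h.mean(dim=0)                                   # mean readout -> target scene feature
```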
In an embodiment of the present application, the method further includes:
acquiring target audio corresponding to the target video and target text;
in the embodiment of the application, the target video can be analyzed, and the corresponding audio signal is extracted to obtain the target audio; and extracting text information corresponding to the target video through OCR (Optical Character Recognition ) to obtain a target text.
Extracting object characteristics of the target audio to obtain target audio attribute characteristics of the target object;
In the embodiment of the application, object feature extraction can be performed on the target audio through an audio attribute feature extraction model to obtain the target audio attribute features of the target object; the audio attribute feature extraction model can be obtained by training a VGGish network, finally yielding target audio attribute features of fixed dimension. A VGG-like model is trained on a large video dataset, and 128-dimensional embeddings are generated in this model.
This TensorFlow-based VGG model (TensorFlow being a dataflow-programmed symbolic mathematics system) is called VGGish. VGGish extracts semantically meaningful 128-dimensional embedding feature vectors from audio waveforms. "VGG" stands for the Visual Geometry Group at the University of Oxford, a research group whose work ranges from machine learning to mobile robotics.
The VGG model is characterized as follows:
(1) Small convolution kernels (3 x 3 convolutions);
(2) Small pooling kernels (2 x 2 pooling);
(3) Deeper layers and wider feature maps. Based on the first two points, the convolutions focus on increasing the number of channels while pooling focuses on reducing the width and height, making the architecture deeper and wider while the computation grows only slowly;
(4) Fully connected layers converted to convolutions. At test time the three fully connected layers of the training stage are replaced with three convolutions, reusing the trained parameters, so the resulting fully convolutional network can accept inputs of any width or height because it is no longer constrained by fully connected layers.
In the embodiment of the application, the target audio attribute characteristics of the target object can be rapidly and accurately extracted through the audio attribute characteristic extraction model.
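An illustrative sketch of the audio branch under the stated assumptions: compute a log-mel representation of the extracted audio track and pass it through any VGG-style network that emits a fixed 128-dimensional embedding (`vggish_like` is a placeholder, not an actual VGGish binding).

```python
import torch
import torchaudio


def audio_attribute_feature(waveform: torch.Tensor, sample_rate: int,
                            vggish_like: torch.nn.Module) -> torch.Tensor:
    """waveform: (channels, samples) audio track parsed from the target video."""
    mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_mels=64)(waveform)
    log_mel = torch.log(mel + 1e-6)              # (channels, n_mels, time) log-mel patches
    emb = vggish_like(log_mel.unsqueeze(0))      # placeholder VGG-style CNN -> (1, 128)
    return emb.squeeze(0)                        # fixed 128-dimensional audio attribute feature
```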
And extracting object characteristics of the target text to obtain target text attribute characteristics of the target object.
In the embodiment of the application, for the text information of the video, the data mainly come from the video title and the OCR recognition results in the video, and these two kinds of text are converted into corresponding text attribute features using a BERT model. BERT is a pre-trained model proposed by Google AI in October 2018; its full name is Bidirectional Encoder Representations from Transformers. The target text attribute features of the target object can be rapidly extracted through the BERT model.
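A minimal sketch of the text branch using HuggingFace `transformers` tooling (assumed tooling; the patent only specifies that a BERT model extracts the text attribute features from the title and OCR text).

```python
import torch
from transformers import AutoModel, AutoTokenizer


def text_attribute_feature(title: str, ocr_text: str,
                           model_name: str = "bert-base-chinese") -> torch.Tensor:
    """Encode the video title together with the OCR text and take the [CLS] embedding."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    inputs = tokenizer(title, ocr_text, truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0].squeeze(0)   # [CLS] vector as the text attribute feature
```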
S207: and carrying out fusion processing on the target image attribute characteristics and the target scene characteristics to obtain target fusion characteristics.
In this embodiment of the present application, the fusing the target image attribute feature and the target scene feature to obtain a target fusion feature includes:
and carrying out fusion processing on the target image attribute characteristics, the target scene characteristics, the target audio attribute characteristics and the target text attribute characteristics to obtain the target fusion characteristics.
In this embodiment of the present application, the fusing the target image attribute feature, the target scene feature, the target audio attribute feature, and the target text attribute feature to obtain the target fusion feature includes:
splicing the target image attribute characteristics, the target scene characteristics, the target audio attribute characteristics and the target text attribute characteristics to obtain target splicing characteristics;
and carrying out feature fusion processing on the target splicing features based on a feature fusion model to obtain the target fusion features.
In an embodiment of the present application, the training method of the feature fusion model includes:
Acquiring sample splicing characteristics marked with sample fusion characteristics;
in the embodiment of the application, the construction method of the sample splicing feature is the same as that of the target splicing feature.
Training a preset multi-modal model based on the sample splicing characteristics to adjust model parameters of the preset multi-modal model until sample fusion characteristics output by the preset multi-modal model are matched with marked sample fusion characteristics;
In the embodiment of the present application, the preset multi-modal model may be a Transformer model, where the Transformer model uses a memory bank mechanism.
And taking the preset multi-mode model when the output sample fusion characteristics are matched with the marked sample fusion characteristics as the characteristic fusion model.
In the embodiment of the application, the different features are extracted by different models under different supervision signals and have different dimensions; their hidden-space distributions also differ, so fusing them directly can lead to inconsistent feature expression and ultimately hurt the recognition performance of the model. The distribution differences between modalities can therefore be aligned through the Transformer and memory bank mechanisms. First, the various features are concatenated in sequence, separator symbols are added at the joints, and the result is input into the Transformer model. As shown in fig. 7, fig. 7 is a schematic structural diagram of the Transformer model, in which each block is a spliced feature. Through a multi-head attention mechanism (multi-head), the Transformer model performs self-attention learning on the input spliced multi-modal features and computes their similarity with the features in the memory bank, so as to adjust the weight of each modality. Finally, the fused features are stored in the memory bank and output to the next-layer network.
The Transformer model may use a scaled dot-product attention (Scaled Dot-Product Attention) mechanism operating on two sequences X and Y: sequence X provides the query information Q (query), and sequence Y provides the key and value information K (key) and V (value). Q is the query vector of a token, K is the vector being queried against, and V is the content vector. Q is what is most suitable for searching, K is what is most suitable for being searched, and V is the content; the three need not be identical, so the network sets up these three vectors and learns the most suitable Q, K and V, thereby strengthening the capability of the network. The model comprises matrix multiplication layers (MatMul), a scaling layer (Scale), a mask layer (Mask) and a normalized exponential function layer (Softmax), where the matrix multiplication layers include a first matrix multiplication layer and a second matrix multiplication layer. Q, K and V of the current spliced feature are obtained; Q and K of the current spliced feature are input into the first matrix multiplication layer, and V of the current spliced feature is input into the second matrix multiplication layer. The output of the first matrix multiplication layer is connected to the input of the scaling layer, the output of the scaling layer is connected to the input of the mask layer, the output of the mask layer is connected to the input of the normalized exponential function layer, and the output of the normalized exponential function layer is connected to the input of the second matrix multiplication layer. The output of the second matrix multiplication layer is taken as Q for the features stored in the memory bank, while K and V of the stored features in the memory bank are obtained and input into the first and second matrix multiplication layers respectively for model training.
In some embodiments, feature fusion may also be performed using a channel-based attention mechanism to calculate the similarity of individual features across channels.
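The scaled dot-product attention described above can be written compactly as follows; the way the spliced tokens query the memory bank here is an illustrative simplification of the wiring described around fig. 7, not its exact reproduction.

```python
import math

import torch


def scaled_dot_product_attention(q, k, v, mask=None):
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))    # MatMul + Scale
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))   # optional Mask layer
    weights = torch.softmax(scores, dim=-1)                     # Softmax
    return weights @ v                                          # second MatMul


# spliced multi-modal tokens attend over a memory bank of previously fused features
tokens = torch.randn(1, 6, 128)    # [image | scene | audio | text] tokens after splicing
memory = torch.randn(1, 32, 128)   # memory bank of stored fused features
fused = scaled_dot_product_attention(tokens, memory, memory)
```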
In this embodiment of the present application, the fusing the target image attribute feature, the target scene feature, the target audio attribute feature, and the target text attribute feature to obtain the target fusion feature includes:
determining at least two features to be fused based on the target image attribute feature, the target scene feature, the target audio attribute feature and the target text attribute feature;
In the embodiment of the present application, the at least two features to be fused may be determined through a gradient boosting decision tree (Gradient Boosting Decision Tree, GBDT) network. Specifically, attribute information labels can be annotated for the training image attribute features, training scene features, training audio attribute features and training text attribute features of the same training video; the GBDT network is trained on these features, and when the model converges, the weights corresponding to the training image attribute features, training scene features, training audio attribute features and training text attribute features are obtained; the target features corresponding to the training features whose weights are greater than a preset weight threshold are determined as the features to be fused.
And fusing the at least two features to be fused to obtain the target fusion feature.
In the embodiment of the application, at least two characteristics of the target image attribute characteristics, the target scene characteristics, the target audio attribute characteristics and the target text attribute characteristics can be selected for fusion processing to obtain target fusion characteristics, so that the accuracy of the determined attribute information is improved.
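A hedged sketch of the GBDT-based selection of features to be fused: fit a gradient boosted tree on the concatenated modality features, sum the feature importances within each modality block, and keep the modalities whose weight exceeds a threshold. The block boundaries and the threshold value are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier


def select_modalities(X: np.ndarray, y: np.ndarray, blocks: dict, threshold: float = 0.1) -> list:
    """X: concatenated per-modality features; y: attribute labels; blocks: modality -> column slice."""
    gbdt = GradientBoostingClassifier().fit(X, y)
    importances = gbdt.feature_importances_
    weights = {name: importances[sl].sum() for name, sl in blocks.items()}
    return [name for name, w in weights.items() if w > threshold]


# illustrative block boundaries for the four modalities
blocks = {"image": slice(0, 128), "scene": slice(128, 192),
          "audio": slice(192, 320), "text": slice(320, 448)}
```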
In this embodiment of the present application, the fusing the target image attribute feature and the target scene feature to obtain a target fusion feature includes:
and carrying out fusion processing on the target raw material characteristics, the target naming characteristics, the target category characteristics and the target scene characteristics to obtain the target fusion characteristics.
S209: and determining target attribute information of the target video according to the target fusion characteristics.
In an embodiment of the present application, the determining, according to the target fusion feature, target attribute information of the target video includes:
and determining target raw material information, target naming information and target category information of the target object according to the target fusion characteristics.
In the embodiment of the present application, the target image attribute features may include a target raw material feature, the target naming feature, and the target class feature, and these features are fused with the target scene feature, so as to obtain a target fusion feature.
In the embodiment of the application, the target raw material information, the target naming information and the target category information can be treated as three prediction tasks, and attribute information prediction can be performed on the target fusion features through an attribute information prediction model to obtain the target attribute information of the target video. The attribute information prediction model may be obtained by training a three-layer MLP; as shown in fig. 8, fig. 8 is a schematic structural diagram of the MLP network, which includes three task prediction networks. Each task prediction network is responsible for predicting one task and has an independent loss function, and the final loss function is the average of the weighted sum of the task losses.
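A sketch of the three-task prediction head (raw material, name, category), each task with its own loss and the final loss taken as the average of the weighted task losses, matching the description above; layer sizes and task weights are assumptions.

```python
import torch
import torch.nn as nn


class MultiTaskHead(nn.Module):
    """Shared layer plus three task-specific layers: raw material, name, category."""

    def __init__(self, in_dim: int, n_material: int, n_name: int, n_category: int):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU())
        self.material = nn.Linear(512, n_material)
        self.name = nn.Linear(512, n_name)
        self.category = nn.Linear(512, n_category)

    def forward(self, fused: torch.Tensor):
        h = self.shared(fused)
        return self.material(h), self.name(h), self.category(h)


def multitask_loss(logits, targets, weights=(1.0, 1.0, 1.0)):
    """Average of the weighted per-task cross-entropy losses."""
    losses = [nn.functional.cross_entropy(l, t) for l, t in zip(logits, targets)]
    return sum(w * loss for w, loss in zip(weights, losses)) / len(losses)
```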
During training, adversarial training can be performed in combination with the fast gradient method (Fast Gradient Method, FGM): noise is added along the gradient during training to enhance the generalization performance of the model.
For example, in task 1 the original loss function is:
L(θ, x, y) = −log P(y | x, θ), minimized over θ; wherein L represents the loss function and measures the gap between the prediction obtained from the fusion feature corresponding to the sample attribute feature x and the true value y, and θ represents the model parameters.
A perturbation is then added along the gradient of the input: r_adv = ε · g / ‖g‖₂ with g = ∇_x L(θ, x, y), and the adversarial loss is L_adv(θ, x, y) = L(θ, x + r_adv, y). The final loss function is obtained as L(θ, x, y) + L_adv(θ, x, y).
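A common public formulation of FGM adversarial training, given here as an assumption about how the perturbation step can be realized in PyTorch for an embedding parameter.

```python
import torch


def fgm_perturb(embedding: torch.nn.Parameter, epsilon: float = 1.0) -> torch.Tensor:
    """Add r_adv = eps * g / ||g||_2 to an embedding weight; return a backup for restoring it."""
    backup = embedding.data.clone()
    grad = embedding.grad
    if grad is not None:
        norm = torch.norm(grad)
        if norm > 0:
            embedding.data.add_(epsilon * grad / norm)
    return backup


# illustrative use inside one training step:
#   loss = criterion(model(x), y); loss.backward()       # gradients of L(theta, x, y)
#   backup = fgm_perturb(model.embed.weight)             # apply r_adv
#   criterion(model(x), y).backward()                    # accumulate gradients of L_adv
#   model.embed.weight.data.copy_(backup)                # restore weights
#   optimizer.step(); optimizer.zero_grad()
```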
In the embodiment of the application, the attribute information prediction model can also be obtained by training other multi-task structures; for example, multi-task learning can be performed using a teacher-student network structure, and the knowledge learned by the network can be transferred by distillation for samples of different dimensions.
In a specific embodiment, the prediction accuracy and recall of the video attribute information are measured for examples 1 and 2 respectively, and the results are shown in Table 1. Example 1: the attribute information of the video is predicted using only the target image attribute features. Example 2: the target image attribute features, target scene features, target audio attribute features and target text attribute features are fused to obtain target fusion features, and the attribute information of the video is predicted from the target fusion features.
TABLE 1
              Accuracy    Recall
Example 1     79.6%       48.1%
Example 2     81.7%       68.0%
In a specific embodiment, the prediction accuracy and recall of the video attribute information are measured for examples 3 and 4 respectively, and the results are shown in Table 2. Example 3: object features are extracted from the target video frames with the supervised feature extraction model to obtain target image attribute features, and the attribute information of the video is predicted. Example 4: CBAM, patch-based random Mask and Shuffle reconstruction tasks are introduced, and the attribute information of the video is predicted by combining the reconstruction features with the target image attribute features obtained by the supervised feature extraction model.
TABLE 2
              Accuracy    Recall
Example 3     55.5%       68.5%
Example 4     51.3%       66.2%
In a specific embodiment, as shown in fig. 9, fig. 9 is a flow chart of a method for determining attribute information of a food video, where the method includes:
s901: analyzing the video into a video frame set, an audio signal and text information;
s903: extracting audio features corresponding to the audio signals based on the audio information extraction model; extracting text features corresponding to the text information based on the text information extraction model; extracting video frame characteristics corresponding to a video frame set based on the video frame information extraction model;
s905: extracting attribute description features of the video frame set based on the image-text matching model to obtain the attribute description features; carrying out reconstruction feature extraction on the video frame set based on the feature reconstruction model to obtain reconstruction features;
s907: fusing the audio features, the text features, the video frame features, the attribute description features and the reconstruction features according to the converter network model and the information storage model to obtain target fusion features;
s909: and predicting the target fusion characteristics based on the multi-task learning model to obtain the menu label, the menu name label and the material label of the video.
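The flow of steps S901 to S909 can be summarized in one function; every extractor is passed in as a callable, and all names are placeholders for the models described above rather than APIs defined by the patent.

```python
def predict_food_video_attributes(video, extractors: dict, fuser, multitask_head):
    """Placeholder pipeline tying together steps S901-S909 described above."""
    frames, audio, text = extractors["parse"](video)              # S901: parse the video
    audio_f = extractors["audio"](audio)                          # S903: audio features
    text_f = extractors["text"](text)                             #        text features
    frame_f = extractors["frame"](frames)                         #        video frame features
    desc_f = extractors["clip"](frames)                           # S905: attribute description features
    recon_f = extractors["reconstruct"](frames)                   #        reconstruction features
    fused = fuser([frame_f, desc_f, recon_f, audio_f, text_f])    # S907: Transformer + memory bank fusion
    return multitask_head(fused)                                  # S909: dish category, name, materials
```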
In an embodiment of the present application, the method further includes:
sending the target video and the target attribute information to a terminal; so that the terminal displays the target video and the target attribute information.
And constructing a video attribute mapping relation according to the corresponding relation between the target video and the target attribute information.
In some embodiments, the method further comprises:
determining associated attribute information according to the target attribute information;
determining a first label of a target video according to the target attribute information;
and determining a second label of the target video according to the associated attribute information.
In this embodiment of the present application, the associated attribute information may be information having a similarity with the target attribute information greater than a preset value, and the first label may be a high-confidence label, and the second label may be a low-confidence label.
In some embodiments, the method further comprises:
determining category information of the target video according to the target attribute information;
and sending the target video, the target attribute information and the category information of the target video to a terminal.
In this embodiment of the present application, as shown in fig. 10, fig. 10 is a page of a terminal displaying a target video and target attribute information; the page displays a primary classification and a secondary classification label corresponding to the video category; attribute tags corresponding to the video are also shown, including high confidence tags and low confidence tags. In addition, a label tree can be constructed according to the attribute labels, and label dimension information can be displayed.
In some embodiments, the method further comprises:
receiving a video acquisition request sent by the terminal in response to a video acquisition instruction; the video acquisition request carries the associated information of the video to be acquired;
determining target attribute information matched with the associated information of the video to be acquired to obtain matched attribute information;
searching a target video corresponding to the matching attribute information from the video attribute mapping relation as the video to be acquired;
and sending the video to be acquired to the terminal.
In this embodiment of the present application, as shown in fig. 10, when a user inputs "how to fry potatoes" in the search box, the corresponding video and the attribute tags corresponding to the video may be displayed.
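A minimal sketch of the retrieval flow described above, assuming the video attribute mapping relation is held in an in-memory dictionary and matching is done by simple tag overlap; a real deployment would presumably use a search index or database, and all names below are hypothetical.

```python
# Hypothetical in-memory store; keys are video ids, values are attribute tag sets.
video_attribute_map = {}

def register_video(video_id, attributes):
    # Record the correspondence between a target video and its target attribute information.
    video_attribute_map[video_id] = set(attributes)

def search_videos(query_terms):
    """Return video ids whose attribute tags overlap the query terms,
    ranked by overlap size (an illustrative matching rule)."""
    query = set(query_terms)
    scored = [(len(query & attrs), vid) for vid, attrs in video_attribute_map.items()]
    return [vid for score, vid in sorted(scored, reverse=True) if score > 0]

register_video("v1", {"potato", "stir-fry", "home cooking"})
register_video("v2", {"beef", "stew"})
print(search_videos({"potato", "fry"}))   # -> ['v1'], since only v1 shares a tag
```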
In some embodiments, the method further comprises:
determining a target account corresponding to the target video;
and determining account attribute information of the target account according to the target attribute information.
In some embodiments, there are at least two target videos, and the method further comprises:
and classifying the at least two target videos according to the target attribute information corresponding to the at least two target videos.
In some embodiments, the method further comprises:
determining a target video set of a target category;
acquiring service data of each target video in the target video set;
In the embodiment of the present application, the service data of the target video may include, but is not limited to, data such as the click rate and exposure of the target video.
Sorting the target videos in the target video set according to the service data of each target video in the target video set;
sending the target video set and the sorting result of the target videos in the target video set to a terminal;
so that the terminal displays the target videos in the target video set according to the sorting result.
In the embodiment of the application, a plurality of target videos can be ranked according to the service data: target videos with higher service indexes are ranked earlier, and target videos with lower service indexes are ranked later. The service data reflects the level of the service index, which in turn reflects the quality of the video; sorting the videos by service index and displaying the plurality of target videos in that order can improve the user experience and the click rate of the target videos.
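As an illustration of ranking by service data, the sketch below combines click rate and exposure into a single score and sorts target videos in descending order; the weighting scheme, normalisation and field names are assumptions for demonstration only.

```python
videos = [
    {"id": "v1", "click_rate": 0.12, "exposure": 5000},
    {"id": "v2", "click_rate": 0.30, "exposure": 1200},
    {"id": "v3", "click_rate": 0.18, "exposure": 9000},
]

def business_score(v, w_click=0.7, w_exposure=0.3, max_exposure=10000):
    # Weighted combination of click rate and normalised exposure;
    # the weights and cap are placeholders, not values from the application.
    return w_click * v["click_rate"] + w_exposure * min(v["exposure"] / max_exposure, 1.0)

ranked = sorted(videos, key=business_score, reverse=True)
print([v["id"] for v in ranked])   # higher service index first
```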
As can be seen from the technical solutions provided in the embodiments of the present application, the embodiments of the present application acquire a target video frame of a target video; extract object features from the target video frame to obtain target image attribute features of a target object in the target video; extract scene features from the target video frame to obtain target scene features of the target video; fuse the target image attribute features and the target scene features to obtain a target fusion feature; and determine target attribute information of the target video according to the target fusion feature. In the process of determining the video attribute information, the method and the device incorporate the scene features of the video and determine the fusion feature from both the target image attribute features and the target scene features, improving the accuracy of the determined attribute information.
The embodiment of the application also provides a video attribute determining device, as shown in fig. 11, which comprises:
a target video frame acquisition module 1110, configured to acquire a target video frame of a target video;
the target image attribute feature determining module 1120 is configured to perform object feature extraction on the target video frame to obtain a target image attribute feature of a target object in the target video;
the target scene feature determining module 1130 is configured to perform scene feature extraction on the target video frame to obtain target scene features of the target video;
the target fusion feature determining module 1140 is configured to perform fusion processing on the target image attribute feature and the target scene feature to obtain a target fusion feature;
the target attribute information determining module 1150 is configured to determine target attribute information of the target video according to the target fusion feature.
In some embodiments, the target image attribute feature determination module may include:
the information determining unit is used for determining the structural integrity and sharpness of each of at least two target objects in the target video;
a target object determining unit, configured to determine an object whose structural integrity is greater than a first threshold and whose sharpness is greater than a second threshold as a first target object, and determine an object other than the first target object of the at least two target objects as a second target object;
The first target image attribute feature extraction unit is used for extracting object features of the first target object to obtain first target image attribute features;
the second target image attribute feature extraction unit is used for extracting object features of the second target object to obtain second target image attribute features;
and the target image attribute feature determining unit is used for taking the first target image attribute feature and the second target image attribute feature as the target image attribute feature.
In some embodiments, the second target image attribute feature extraction unit includes:
the target reconstruction feature determining subunit is used for extracting the reconstruction feature of the second target object to obtain a target reconstruction feature;
the target description feature determining subunit is used for extracting the attribute description feature of the second target object to obtain a target description feature;
and the second target image attribute characteristic determining subunit is used for taking the target self-reconstruction characteristic and the target description characteristic as the second target image attribute characteristic.
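A rough sketch of how objects might be routed to these two branches: a Laplacian-variance proxy stands in for sharpness, and the structural-integrity measure is left as a caller-supplied function, since the application does not specify how either quantity is computed. All thresholds, the dictionary layout and the function names are placeholders.

```python
import numpy as np

def sharpness(gray):
    # Variance of a simple 4-neighbour Laplacian response as a sharpness proxy (illustrative).
    lap = (np.roll(gray, 1, 0) + np.roll(gray, -1, 0) +
           np.roll(gray, 1, 1) + np.roll(gray, -1, 1) - 4 * gray)
    return float(lap.var())

def route_objects(objects, integrity_fn, t_integrity=0.8, t_sharpness=100.0):
    """Split detected objects into a 'first' group (complete and sharp, sent to the
    supervised extractor) and a 'second' group (sent to the self-reconstruction and
    attribute-description branch). Thresholds and integrity_fn are hypothetical."""
    first, second = [], []
    for obj in objects:                       # obj["crop"] is a grayscale float array
        if integrity_fn(obj) > t_integrity and sharpness(obj["crop"]) > t_sharpness:
            first.append(obj)
        else:
            second.append(obj)
    return first, second
```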
In some embodiments, the target reconstruction feature determination subunit may include:
And the target self-reconstruction feature extraction subunit is used for extracting the self-reconstruction feature of the second target object based on the self-reconstruction feature extraction model to obtain the target self-reconstruction feature.
In some embodiments, the apparatus may further comprise:
the grid image dividing module is used for dividing the sample video frame into at least two grid images;
the image processing module is used for carrying out image processing on at least one grid image to obtain a processed video frame; the image processing comprises at least one of position replacement of the grid image and shielding processing of partial images in the grid image;
the self-reconstruction feature determining module is used for carrying out self-reconstruction feature extraction training on the first preset model based on the processed video frame to obtain self-reconstruction features;
the reconstructed video frame determining module is used for carrying out image reconstruction training on the second preset model based on the reconstructed characteristics to obtain a reconstructed video frame;
the training module is used for continuously adjusting the first model parameter of the first preset model and the second model parameter of the second preset model in the training process until the reconstructed video frame output by the second preset model is matched with the sample video frame;
The model determining module is used for taking a first preset model corresponding to the current first model parameter as the self-reconstruction feature extraction model; the current first model parameters are model parameters when the reconstructed video frame output by the second preset model is matched with the sample video frame.
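The training procedure above resembles a grid-based shuffled and masked autoencoder. Below is a compact PyTorch sketch under that reading: a sample frame is split into grid patches, some patches are swapped in position and some are occluded, and an encoder-decoder pair is trained to reconstruct the original frame. The network sizes, grid size, mask ratio and number of steps are illustrative, not taken from the application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def corrupt(frame, grid=4, mask_ratio=0.25):
    """Split a (C, H, W) frame into grid x grid patches, shuffle their positions and
    zero out a fraction of them, mimicking the position-swap / occlusion step."""
    c, h, w = frame.shape
    ph, pw = h // grid, w // grid
    patches = frame.unfold(1, ph, ph).unfold(2, pw, pw).reshape(c, -1, ph, pw)
    n = patches.shape[1]
    patches = patches[:, torch.randperm(n)]          # swap patch positions
    drop = torch.rand(n) < mask_ratio
    patches[:, drop] = 0.0                           # occlude some patches
    return patches.reshape(c, grid, grid, ph, pw).permute(0, 1, 3, 2, 4).reshape(c, h, w)

# First preset model (encoder) and second preset model (decoder), both placeholders.
encoder = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                        nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
decoder = nn.Sequential(nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
                        nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

frames = torch.rand(8, 3, 64, 64)                    # placeholder sample video frames
for _ in range(10):                                  # a few illustrative training steps
    corrupted = torch.stack([corrupt(f) for f in frames])
    recon = decoder(encoder(corrupted))              # reconstruct the original frames
    loss = F.mse_loss(recon, frames)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

After training, the encoder alone would play the role of the self-reconstruction feature extraction model; the decoder is only needed to supervise reconstruction.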
In some embodiments, the apparatus may further comprise:
the text acquisition module is used for acquiring target audio and target text corresponding to the target video;
the target audio attribute feature extraction module is used for extracting object features of the target audio to obtain target audio attribute features of the target object;
and the target text attribute feature extraction module is used for extracting object features of the target text to obtain target text attribute features of the target object.
In some embodiments, the target fusion feature determination module may include:
and the feature fusion unit is used for carrying out fusion processing on the target image attribute feature, the target scene feature, the target audio attribute feature and the target text attribute feature to obtain the target fusion feature.
In some embodiments, the feature fusion unit may include:
The feature to be fused determining subunit is configured to determine at least two features to be fused based on the target image attribute feature, the target scene feature, the target audio attribute feature and the target text attribute feature;
and the feature fusion subunit is used for fusing the at least two features to be fused to obtain the target fusion feature.
In some embodiments, the target image attribute feature determination module may include:
the object feature extraction unit is used for extracting object features of the target video frame to obtain target raw material features, target naming features and target category features of the target object in the target video;
in some embodiments, the target fusion feature determination module comprises:
the target fusion feature determining unit is used for carrying out fusion processing on the target raw material feature, the target naming feature, the target category feature and the target scene feature to obtain the target fusion feature;
the target attribute information determining module includes:
and the target attribute information determining unit is used for determining target raw material information, target naming information and target category information of the target object according to the target fusion characteristics.
In some embodiments, the target scene feature determination module may include:
a target associated object determining unit, configured to determine at least two target associated objects of the target objects based on the target video frame;
the directed acyclic graph construction unit is used for constructing a directed acyclic graph by taking the associated object characteristics corresponding to each of the at least two target associated objects as nodes; the edges in the directed acyclic graph represent the similarity between two associated object features corresponding to the edges;
and the target scene feature determining unit is used for extracting scene features based on the directed acyclic graph to obtain the target scene features.
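A minimal NumPy sketch of this scene-feature step, assuming edges are directed from lower to higher node index (which keeps the graph acyclic) and weighted by cosine similarity, with one round of neighbour aggregation followed by mean pooling; the threshold and aggregation rule are assumptions, not the application's exact construction.

```python
import numpy as np

def scene_feature(assoc_feats, sim_threshold=0.5):
    """Build a DAG over associated-object features (edges point from lower to higher
    index, weighted by cosine similarity above a threshold) and aggregate one round
    of predecessor messages into a scene-level vector."""
    x = np.asarray(assoc_feats, dtype=np.float32)
    norm = np.linalg.norm(x, axis=1, keepdims=True) + 1e-8
    sim = (x / norm) @ (x / norm).T
    adj = np.triu(sim, k=1)                 # keep only lower->higher edges: acyclic
    adj[adj < sim_threshold] = 0.0
    messages = adj.T @ x                    # each node receives weighted predecessors
    deg = adj.sum(axis=0, keepdims=True).T + 1.0
    node_repr = (x + messages) / deg        # combine self feature with messages
    return node_repr.mean(axis=0)           # pool nodes into the scene feature

feats = np.random.rand(5, 128)              # five associated-object features
print(scene_feature(feats).shape)           # (128,)
```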
The device embodiments and the method embodiments described above are based on the same inventive concept.
The embodiment of the application provides video attribute determining equipment, which comprises a processor and a memory, wherein at least one instruction or at least one program is stored in the memory, and the at least one instruction or the at least one program is loaded and executed by the processor to implement the video attribute determining method provided by the above method embodiment.
The embodiment of the application also provides a computer storage medium, which may be provided in a terminal to store at least one instruction or at least one program related to the video attribute determining method of the method embodiment, where the at least one instruction or the at least one program is loaded and executed by a processor to implement the video attribute determining method provided by the method embodiment.
The embodiment of the application also provides a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device implements the video attribute determining method provided by the above method embodiment.
Alternatively, in embodiments of the present application, the storage medium may be located on at least one of a plurality of network servers of a computer network. Alternatively, in this embodiment, the storage medium may include, but is not limited to: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media capable of storing program code.
The memory according to the embodiments of the present application may be used to store software programs and modules, and the processor executes the software programs and modules stored in the memory to perform various functional applications and data processing. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, application programs required for functions, and the like, and the data storage area may store data created according to the use of the device, and the like. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory may also include a memory controller to provide the processor with access to the memory.
The video attribute determining method provided by the embodiment of the application can be executed on a mobile terminal, a computer terminal, a server or a similar computing device. Taking execution on a server as an example, fig. 12 is a hardware block diagram of a server for a video attribute determining method provided in the embodiment of the present application. As shown in fig. 12, the server 1200 may vary considerably in configuration or performance, and may include one or more central processing units (CPU) 1210 (the central processing unit 1210 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), a memory 1230 for storing data, and one or more storage media 1220 (e.g., one or more mass storage devices) for storing applications 1223 or data 1222. The memory 1230 and the storage medium 1220 may be transitory or persistent storage. The program stored on the storage medium 1220 may include one or more modules, each of which may include a series of instruction operations on the server. Still further, the central processing unit 1210 may be configured to communicate with the storage medium 1220 and execute the series of instruction operations in the storage medium 1220 on the server 1200. The server 1200 may also include one or more power supplies 1260, one or more wired or wireless network interfaces 1250, one or more input/output interfaces 1240, and/or one or more operating systems 1221, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The input/output interface 1240 may be used to receive or transmit data via a network. A specific example of the network may include a wireless network provided by a communication provider of the server 1200. In one example, the input/output interface 1240 includes a network interface controller (NIC) that can connect to other network devices through a base station to communicate with the internet. In another example, the input/output interface 1240 may be a radio frequency (RF) module for communicating with the internet wirelessly.
It will be appreciated by those of ordinary skill in the art that the configuration shown in fig. 12 is merely illustrative and is not intended to limit the configuration of the electronic device described above. For example, the server 1200 may also include more or fewer components than shown in fig. 12, or have a different configuration than shown in fig. 12.
As can be seen from the embodiments of the video attribute determining method, apparatus, device and storage medium provided in the present application, the present application acquires a target video frame of a target video; extracts object features from the target video frame to obtain target image attribute features of a target object in the target video; extracts scene features from the target video frame to obtain target scene features of the target video; fuses the target image attribute features and the target scene features to obtain a target fusion feature; and determines target attribute information of the target video according to the target fusion feature. In the process of determining the video attribute information, the method and the device incorporate the scene features of the video and determine the fusion feature from both the target image attribute features and the target scene features, improving the accuracy of the determined attribute information.
It should be noted that the order of the foregoing embodiments of the present application is for description only and does not imply that any embodiment is preferred over another. The foregoing describes specific embodiments of this specification; other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or a sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In this specification, the embodiments are described in a progressive manner; for identical or similar parts between the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, for the apparatus, device and storage medium embodiments, since they are substantially similar to the method embodiments, the description is relatively simple; for relevant details, reference may be made to the description of the method embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing describes only preferred embodiments of the present application and is not intended to limit the present application to these particular embodiments; the scope of protection of the present application is defined by the appended claims.

Claims (16)

1. A method of video attribute determination, the method comprising:
acquiring a target video frame of a target video;
determining the structural integrity and sharpness of each of at least two target objects in the target video;
determining an object with structural integrity greater than a first threshold and sharpness greater than a second threshold as a first target object, and determining objects of the at least two target objects except the first target object as second target objects;
extracting object features of the first target object through a supervised training model to obtain first target image attribute features;
extracting object features of the second target object through an unsupervised training model to obtain second target image attribute features; wherein extracting the object features of the second target object to obtain the second target image attribute features includes: performing self-reconstruction feature extraction on the second target object to obtain a target self-reconstruction feature; performing attribute description feature extraction on the second target object to obtain a target description feature; and taking the target self-reconstruction feature and the target description feature as the second target image attribute features;
Taking the first target image attribute feature and the second target image attribute feature as target image attribute features;
extracting scene characteristics of the target video frame to obtain target scene characteristics of the target video;
performing fusion processing on the target image attribute characteristics, the target scene characteristics, the target audio attribute characteristics and the target text attribute characteristics to obtain target fusion characteristics; the target audio attribute feature and the target text attribute feature are determined based on the target video;
and determining target attribute information of the target video according to the target fusion characteristics.
2. The method according to claim 1, wherein the performing the self-reconstruction feature extraction on the second target object to obtain a target self-reconstruction feature includes:
and based on the self-reconstruction feature extraction model, carrying out self-reconstruction feature extraction on the second target object to obtain the target self-reconstruction feature.
3. The method of claim 2, wherein the training method for reconstructing the feature extraction model comprises:
dividing a sample video frame into at least two grid images;
performing image processing on at least one grid image to obtain a processed video frame; the image processing comprises at least one of position replacement of the grid image and shielding processing of partial images in the grid image;
Based on the processed video frames, performing self-reconstruction feature extraction training on the first preset model to obtain self-reconstruction features;
based on the reconstruction characteristics, performing image reconstruction training on a second preset model to obtain a reconstructed video frame;
in the training process, continuously adjusting a first model parameter of a first preset model and a second model parameter of a second preset model until a reconstructed video frame output by the second preset model is matched with the sample video frame;
taking a first preset model corresponding to the current first model parameter as the self-reconstruction feature extraction model; the current first model parameters are model parameters when the reconstructed video frame output by the second preset model is matched with the sample video frame.
4. The method according to claim 1, wherein the method further comprises:
acquiring target audio and target text corresponding to the target video;
extracting object characteristics of the target audio to obtain target audio attribute characteristics of the target object;
and extracting object characteristics of the target text to obtain target text attribute characteristics of the target object.
5. The method of claim 1, wherein the fusing the target image attribute feature, the target scene feature, the target audio attribute feature, and the target text attribute feature to obtain the target fused feature comprises:
determining at least two features to be fused based on the target image attribute feature, the target scene feature, the target audio attribute feature and the target text attribute feature;
and fusing the at least two features to be fused to obtain the target fusion feature.
6. The method according to claim 1, wherein the method further comprises:
extracting object characteristics of the target video frame to obtain target raw material characteristics, target naming characteristics and target category characteristics of a target object in the target video;
performing fusion processing on the target raw material characteristics, the target naming characteristics, the target category characteristics and the target scene characteristics to obtain target fusion characteristics;
the determining the target attribute information of the target video according to the target fusion feature comprises the following steps:
and determining target raw material information, target naming information and target category information of the target object according to the target fusion characteristics.
7. The method according to claim 1, wherein the extracting the scene feature from the target video frame to obtain the target scene feature of the target video includes:
determining at least two target associated objects of the target object based on the target video frame;
constructing a directed acyclic graph by taking the associated object characteristics corresponding to the at least two target associated objects as nodes; the edges in the directed acyclic graph represent the similarity between two associated object features corresponding to the edges;
and extracting scene features based on the directed acyclic graph to obtain the target scene features.
8. A video attribute determining apparatus, the apparatus comprising:
the target video frame acquisition module is used for acquiring target video frames of the target video;
the target image attribute feature determining module is used for extracting object features of the target video frames to obtain target image attribute features of target objects in the target video;
the target scene feature determining module is used for extracting scene features of the target video frame to obtain target scene features of the target video;
the target fusion feature determining module is used for carrying out fusion processing on the target image attribute feature, the target scene feature, the target audio attribute feature and the target text attribute feature to obtain a target fusion feature; the target audio attribute feature and the target text attribute feature are determined based on the target video;
The target attribute information determining module is used for determining target attribute information of the target video according to the target fusion characteristics;
the target image attribute feature determination module includes:
the information determining unit is used for determining the structural integrity and sharpness of each of at least two target objects in the target video; a target object determining unit, configured to determine an object whose structural integrity is greater than a first threshold and whose sharpness is greater than a second threshold as a first target object, and determine an object other than the first target object of the at least two target objects as a second target object; the first target image attribute feature extraction unit is used for extracting object features of the first target object through the supervised training model to obtain first target image attribute features; the second target image attribute feature extraction unit is used for extracting object features of the second target object through an unsupervised training model to obtain second target image attribute features; a target image attribute feature determining unit configured to take the first target image attribute feature and the second target image attribute feature as the target image attribute features;
The second target image attribute feature extraction unit includes: the target reconstruction feature determining subunit is used for extracting the reconstruction feature of the second target object to obtain a target reconstruction feature; the target description feature determining subunit is used for extracting the attribute description feature of the second target object to obtain a target description feature; and the second target image attribute characteristic determining subunit is used for taking the target self-reconstruction characteristic and the target description characteristic as the second target image attribute characteristic.
9. The apparatus of claim 8, wherein the target reconstruction feature determination subunit comprises:
and the target self-reconstruction feature extraction subunit is used for extracting the self-reconstruction feature of the second target object based on the self-reconstruction feature extraction model to obtain the target self-reconstruction feature.
10. The apparatus of claim 9, wherein the apparatus further comprises:
the grid image dividing module is used for dividing the sample video frame into at least two grid images;
the image processing module is used for carrying out image processing on at least one grid image to obtain a processed video frame; the image processing comprises at least one of position replacement of the grid image and shielding processing of partial images in the grid image;
The self-reconstruction feature determining module is used for carrying out self-reconstruction feature extraction training on the first preset model based on the processed video frame to obtain self-reconstruction features;
the reconstructed video frame determining module is used for carrying out image reconstruction training on the second preset model based on the reconstructed characteristics to obtain a reconstructed video frame;
the training module is used for continuously adjusting the first model parameter of the first preset model and the second model parameter of the second preset model in the training process until the reconstructed video frame output by the second preset model is matched with the sample video frame;
the model determining module is used for taking a first preset model corresponding to the current first model parameter as the self-reconstruction feature extraction model; the current first model parameters are model parameters when the reconstructed video frame output by the second preset model is matched with the sample video frame.
11. The apparatus of claim 8, wherein the apparatus further comprises:
the text acquisition module is used for acquiring target audio and target text corresponding to the target video;
the target audio attribute feature extraction module is used for extracting object features of the target audio to obtain target audio attribute features of the target object;
And the target text attribute feature extraction module is used for extracting object features of the target text to obtain target text attribute features of the target object.
12. The apparatus of claim 8, wherein the feature fusion unit comprises:
the feature to be fused determining subunit is configured to determine at least two features to be fused based on the target image attribute feature, the target scene feature, the target audio attribute feature and the target text attribute feature;
and the feature fusion subunit is used for fusing the at least two features to be fused to obtain the target fusion feature.
13. The apparatus of claim 8, wherein the target image attribute feature determination module comprises:
the object feature extraction unit is used for extracting object features of the target video frame to obtain target raw material features, target naming features and target category features of the target object in the target video;
the target fusion feature determination module comprises:
the target fusion feature determining unit is used for carrying out fusion processing on the target raw material feature, the target naming feature, the target category feature and the target scene feature to obtain the target fusion feature;
The target attribute information determining module includes:
and the target attribute information determining unit is used for determining target raw material information, target naming information and target category information of the target object according to the target fusion characteristics.
14. The apparatus of claim 8, wherein the target scene feature determination module comprises:
a target associated object determining unit, configured to determine at least two target associated objects of the target objects based on the target video frame;
the directed acyclic graph construction unit is used for constructing a directed acyclic graph by taking the associated object characteristics corresponding to each of the at least two target associated objects as nodes; the edges in the directed acyclic graph represent the similarity between two associated object features corresponding to the edges;
and the target scene feature determining unit is used for extracting scene features based on the directed acyclic graph to obtain the target scene features.
15. A video attribute determining apparatus, the apparatus comprising: a processor and a memory having stored therein at least one instruction or at least one program loaded and executed by the processor to implement the video attribute determination method of any of claims 1-7.
16. A computer storage medium storing at least one instruction or at least one program loaded and executed by a processor to implement the video attribute determination method of any one of claims 1 to 7.


