CN117011745A - Data processing method, device, computer equipment and readable storage medium - Google Patents

Data processing method, device, computer equipment and readable storage medium

Info

Publication number
CN117011745A
Authority
CN
China
Prior art keywords
sample
feature
video
target
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211521144.2A
Other languages
Chinese (zh)
Inventor
刘刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202211521144.2A
Publication of CN117011745A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/771 Feature selection, e.g. selecting representative features from a multi-dimensional feature space
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application provides a data processing method, a device, computer equipment and a readable storage medium, which are applicable to scenes such as cloud technology, artificial intelligence, intelligent traffic, assisted driving and video. The method comprises the following steps: carrying out multi-mode feature extraction and fusion on a target video to generate multi-mode fusion features corresponding to the target video; carrying out feature recall processing on the multi-mode fusion features to obtain a recall feature vector, and carrying out S times of feature sorting processing on the multi-mode fusion features to obtain S sorting feature vectors; determining recall classification information corresponding to the target video according to the recall feature vector; and carrying out step-by-step fusion on the S sorting feature vectors through the recall feature vector to obtain S fused sorting feature vectors, and carrying out step-by-step sorting on the recall classification information according to the S fused sorting feature vectors to obtain S levels of sorting classification information corresponding to the target video. The method and the device can improve the efficiency of determining the sorting classification information corresponding to the target video.

Description

Data processing method, device, computer equipment and readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data processing method, a data processing device, a computer device, and a readable storage medium.
Background
With the development of multimedia technology, video has become a main carrier through which people acquire information and entertainment in daily life. Current video classification algorithms may train two models serially and independently: one is a recall model and the other is a sorting model. The target video is input into the recall model, and recall classification information corresponding to the target video is determined by the recall model; the target video and the recall classification information are then input into the sorting model, and the sorting classification information corresponding to the target video is determined from the recall classification information by the sorting model, that is, the sorting model performs secondary screening on the recall classification information. However, due to this serially cascaded structure, once the recall model is adjusted, the sorting model needs to be retrained, which increases the training time of the models and thereby the time for determining the sorting classification information corresponding to the target video. In addition, independently determining the classification information corresponding to the target video through the two models further reduces the efficiency of determining the sorting classification information corresponding to the target video.
Disclosure of Invention
The embodiment of the application provides a data processing method, a data processing device, computer equipment and a readable storage medium, which can improve the efficiency of determining sorting classification information corresponding to a target video.
In one aspect, an embodiment of the present application provides a data processing method, including:
carrying out multi-mode feature extraction and fusion on the target video to generate multi-mode fusion features corresponding to the target video;
feature recall processing is carried out on the multi-mode fusion features to obtain recall feature vectors corresponding to the target video, S times of feature sorting processing is carried out on the multi-mode fusion features to obtain S sorting feature vectors corresponding to the target video; s is a positive integer;
determining recall classification information corresponding to the target video according to the recall feature vector;
step-by-step fusion is carried out on the S sorting feature vectors through the recall feature vector to obtain S fused sorting feature vectors, and step-by-step sorting is carried out on the recall classification information according to the S fused sorting feature vectors to obtain S levels of sorting classification information corresponding to the target video; the S levels of sorting classification information belong to the recall classification information; the sorting classification information of the first level is obtained by screening and sorting from the recall classification information, the sorting classification information of the (i+1)-th level is obtained by screening and sorting from the sorting classification information of the i-th level, and i is a positive integer less than S.
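For ease of understanding only, the following Python sketch shows one way the four claimed steps could fit together at inference time; the callables extract_and_fuse, recall_head and sort_heads, the softmax normalization and the additive fusion are assumptions introduced purely for illustration, not limitations of the method.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def classify(video, extract_and_fuse, recall_head, sort_heads, top_k=(50, 20, 5)):
    """Hypothetical end-to-end flow: recall labels first, then S levels of screened labels."""
    fused = extract_and_fuse(video)                       # multi-mode fusion feature
    recall_vec = recall_head(fused)                       # recall feature vector
    sort_vecs = [head(fused) for head in sort_heads]      # S sorting feature vectors

    # Recall classification information: label ids with the highest normalized scores.
    recall_labels = np.argsort(-softmax(recall_vec))[: top_k[0]]

    levels, labels, prev = [], recall_labels, recall_vec
    for s, vec in enumerate(sort_vecs):
        prev = prev + vec                                 # step-by-step fusion (simple additive choice)
        scores = softmax(prev)[labels]                    # only the previous level's labels compete
        labels = labels[np.argsort(-scores)[: top_k[s + 1]]]
        levels.append(labels)                             # sorting classification info of level s + 1
    return recall_labels, levels
```

In this sketch each level screens and re-orders only the labels kept by the previous level, mirroring the nesting of the claim.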
In one aspect, an embodiment of the present application provides a data processing apparatus, including:
the first fusion module is used for carrying out multi-mode feature extraction fusion on the target video and generating multi-mode fusion features corresponding to the target video;
the first processing module is used for carrying out feature recall processing on the multi-mode fusion features to obtain recall feature vectors corresponding to the target video, and carrying out S times of feature sorting processing on the multi-mode fusion features to obtain S sorting feature vectors corresponding to the target video; s is a positive integer;
the recall classifying module is used for determining recall classifying information corresponding to the target video according to the recall characteristic vector;
the sorting classification module is used for merging the S sorting feature vectors step by step through the recall feature vectors to obtain S merged sorting feature vectors, and sorting the recall classification information step by step according to the S merged sorting feature vectors to obtain S-level sorting classification information corresponding to the target video; the S-level sorting classification information belongs to recall classification information; the sorting classification information of the first level is obtained by screening and sorting from recall classification information, the sorting classification information of the (i+1) th level is obtained by screening and sorting from the sorting classification information of the (i) th level, and i is a positive integer less than S.
Wherein the first processing module comprises:
the feature input unit is used for inputting the multi-mode fusion features into the task sub-model in the target classification model; the task sub-model comprises a recall full-connection layer and S sorting full-connection layers; the S sorting full-connection layers comprise a sorting full-connection layer H_i, and i is a positive integer less than or equal to S;
the first processing unit is used for carrying out full connection processing on the multi-mode fusion characteristics in the recall full connection layer to obtain candidate recall characteristic vectors corresponding to the target video, and carrying out full connection processing on the candidate recall characteristic vectors to obtain recall characteristic vectors output by the recall full connection layer;
a second processing unit, used for carrying out full connection processing on the multi-mode fusion features in the sorting full-connection layer H_i to obtain a candidate sorting feature vector corresponding to the target video, and carrying out full connection processing on the candidate sorting feature vector to obtain the sorting feature vector output by the sorting full-connection layer H_i; the S sorting feature vectors corresponding to the target video are the sorting feature vectors respectively output by the S sorting full-connection layers.
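As a non-authoritative illustration of the task sub-model structure described by these units (one recall full-connection branch plus S sorting full-connection branches, each applying two full-connection passes to the same multi-mode fusion feature), assuming PyTorch and illustrative dimensions:

```python
import torch
import torch.nn as nn

class TaskSubModel(nn.Module):
    """Sketch only: one recall branch and S sorting branches over the same fused feature."""
    def __init__(self, fused_dim=768, hidden_dim=512, num_labels=2000, s=2):
        super().__init__()
        # Recall full-connection layer: two stacked full-connection passes.
        self.recall_fc = nn.Sequential(
            nn.Linear(fused_dim, hidden_dim), nn.ReLU(),   # candidate recall feature vector
            nn.Linear(hidden_dim, num_labels))             # recall feature vector
        # S sorting full-connection layers H_1 ... H_S with the same two-pass structure.
        self.sort_fcs = nn.ModuleList([
            nn.Sequential(nn.Linear(fused_dim, hidden_dim), nn.ReLU(),
                          nn.Linear(hidden_dim, num_labels))
            for _ in range(s)])

    def forward(self, fused_feature):
        recall_vec = self.recall_fc(fused_feature)
        sort_vecs = [fc(fused_feature) for fc in self.sort_fcs]   # S sorting feature vectors
        return recall_vec, sort_vecs
```

For a fused feature of dimension 768, TaskSubModel()(torch.randn(768)) would return one recall feature vector and a list of S sorting feature vectors.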
The recall classifying module is specifically used for carrying out normalization processing on the recall feature vectors to obtain normalized recall vectors corresponding to the recall feature vectors; the normalized recall vector includes at least two recall vector parameters;
The recall classification module is specifically configured to sort at least two recall vector parameters to obtain sorted at least two recall vector parameters;
the recall classification module is specifically configured to obtain the top-ranked H_1 recall vector parameters from the at least two ranked recall vector parameters, and determine the label information corresponding to the top-ranked H_1 recall vector parameters as the recall classification information corresponding to the target video; H_1 is an integer greater than 1.
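A minimal sketch of the recall classification step just described (normalize the recall feature vector, rank its parameters, keep the label information of the top H_1); the use of softmax as the normalization and the existence of a label vocabulary list are assumptions:

```python
import numpy as np

def recall_classify(recall_vec, label_vocab, h1=50):
    """Return the H_1 labels whose normalized recall scores rank highest."""
    probs = np.exp(recall_vec - recall_vec.max())
    probs /= probs.sum()                       # normalized recall vector
    top = np.argsort(-probs)[:h1]              # indices of the top-ranked H_1 parameters
    return [label_vocab[i] for i in top]       # recall classification information
```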
Wherein, the sorting module comprises:
a first obtaining unit, configured to obtain a j-th sorting feature vector from the S sorting feature vectors; j is a positive integer less than or equal to S;
the first fusion unit is used for carrying out vector fusion on the recall feature vector and the first ordering feature vector if j is equal to 1, so as to obtain a fusion ordering feature vector corresponding to the first ordering feature vector;
and the second fusion unit is used for carrying out vector fusion on the j-th sorting feature vector and the fusion sorting feature vector corresponding to the j-1-th sorting feature vector if j is greater than 1, so as to obtain the fusion sorting feature vector corresponding to the j-th sorting feature vector.
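The step-by-step fusion rule of the two fusion units above (the first sorting feature vector is fused with the recall feature vector, and each subsequent sorting feature vector is fused with the previous fused result) can be sketched as follows; element-wise addition is only one possible choice of "vector fusion":

```python
import numpy as np

def progressive_fusion(recall_vec, sort_vecs, fuse=lambda a, b: a + b):
    """Return the S fused sorting feature vectors, built step by step."""
    fused_vecs, prev = [], recall_vec
    for vec in sort_vecs:
        # j = 1 fuses with the recall feature vector; j > 1 fuses with the
        # fused sorting feature vector of the previous step.
        prev = fuse(prev, vec)
        fused_vecs.append(prev)
    return fused_vecs

fused = progressive_fusion(np.zeros(4), [np.ones(4), 2 * np.ones(4)])  # [[1,1,1,1], [3,3,3,3]]
```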
Wherein the number of pieces of recall classification information is H_1, and H_1 is an integer greater than 1;
the sorting and classifying module comprises:
the second acquisition unit is used for acquiring a kth sorting characteristic vector from the S fusion sorting characteristic vectors; k is a positive integer less than or equal to S;
a first sorting unit, used for screening and sorting the H_1 pieces of recall classification information according to the first fused sorting feature vector if k is equal to 1, to obtain the sorting classification information of the first level corresponding to the target video;
and the second sorting unit is used for screening and sorting the sorting classification information of the kth-1 level corresponding to the target video according to the kth fusion sorting feature vector if k is greater than 1, so as to obtain the sorting classification information of the kth level corresponding to the target video.
The first sorting unit is specifically configured to normalize the first fused sorting feature vector to obtain a normalized sorting vector corresponding to the first fused sorting feature vector; the normalized sorting vector includes H_1 sorting vector parameters;
the first sorting unit is specifically configured to sort the H_1 sorting vector parameters to obtain H_1 sorted sorting vector parameters; the H_1 sorted sorting vector parameters are used to indicate the order of the H_1 pieces of recall classification information;
the first sorting unit is specifically configured to obtain H_2 pieces of label information from the H_1 pieces of recall classification information according to the H_1 sorted sorting vector parameters, and determine the obtained H_2 pieces of label information as the sorting classification information of the first level corresponding to the target video; H_2 is a positive integer less than or equal to H_1.
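The first-level screening described above (normalize the first fused sorting feature vector, rank the H_1 candidate labels by its parameters, keep the top H_2) could be outlined as below; the softmax normalization and the label_index mapping from labels to vector positions are assumptions:

```python
import numpy as np

def screen_level(fused_sort_vec, prev_labels, label_index, h_next):
    """Keep the top h_next of the previous level's labels, ordered by this level's scores."""
    scores = np.exp(fused_sort_vec - fused_sort_vec.max())
    scores /= scores.sum()                                   # normalized sorting vector
    ranked = sorted(prev_labels, key=lambda lab: -scores[label_index[lab]])
    return ranked[:h_next]                                   # sorting classification info of this level
```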
In one aspect, an embodiment of the present application provides a data processing method, including:
obtaining a pre-training classification model; the pre-training classification model comprises a multi-mode feature sub-model, a feature fusion sub-model and an initial task sub-model;
performing multi-mode feature extraction fusion on a target sample video belonging to the target field type through a multi-mode feature sub-model and a feature fusion sub-model to generate a sample multi-mode fusion feature corresponding to the target sample video;
carrying out feature recall processing on the sample multi-mode fusion features to obtain sample recall feature vectors corresponding to the target sample video, and carrying out S times of feature sorting processing on the sample multi-mode fusion features to obtain S sample sorting feature vectors corresponding to the target sample video; s is a positive integer;
step-by-step fusion is carried out on the S sample sequencing feature vectors through the sample recall feature vectors, so that S sample fusion sequencing feature vectors are obtained;
carrying out parameter adjustment on the initial task sub-model based on the S levels of sample sorting label information of the target sample video, the sample recall feature vector and the S sample fused sorting feature vectors, to obtain a task sub-model; the S levels of sample sorting label information belong to the sample recall label information; the multi-mode feature sub-model, the feature fusion sub-model and the task sub-model are used to form a target classification model; the target classification model is used for predicting recall classification information corresponding to a target video belonging to the target field type and S levels of sorting classification information corresponding to the target video.
In one aspect, an embodiment of the present application provides a data processing apparatus, including:
the model acquisition module is used for acquiring a pre-training classification model; the pre-training classification model comprises a multi-mode feature sub-model, a feature fusion sub-model and an initial task sub-model;
the second fusion module is used for carrying out multi-mode feature extraction fusion on the target sample video belonging to the target field type through the multi-mode feature sub-model and the feature fusion sub-model to generate a sample multi-mode fusion feature corresponding to the target sample video;
The second processing module is used for carrying out feature recall processing on the sample multi-mode fusion features to obtain sample recall feature vectors corresponding to the target sample video, and carrying out S times of feature sorting processing on the sample multi-mode fusion features to obtain S sample sorting feature vectors corresponding to the target sample video; s is a positive integer;
the step-by-step fusion module is used for carrying out step-by-step fusion on the S sample sequencing feature vectors through the sample recall feature vectors to obtain S sample fusion sequencing feature vectors;
the parameter adjustment module is used for carrying out parameter adjustment on the initial task sub-model based on S-level sample ordering label information of the target sample video, sample recall feature vectors and S sample fusion ordering feature vectors to obtain a task sub-model; sample ordering tag information of S layers belongs to sample recall tag information; the multi-mode feature sub-model, the feature fusion sub-model and the task sub-model are used for forming a target classification model; the target classification model is used for predicting recall classification information corresponding to target videos belonging to the target field type and S-level sorting classification information corresponding to the target videos.
The feature fusion sub-model is obtained by pre-training a sample video set; associating at least two video domain types with a sample video in the sample video set, wherein the at least two video domain types comprise target domain types; the sample video set comprises N sample videos, wherein N is a positive integer; the N sample videos include a sample video S_i; i is a positive integer less than or equal to N;
the first pre-training module is used for acquiring an initial classification model; the initial classification model comprises a multi-mode feature sub-model and an initial feature fusion sub-model;
a first pre-training module, used for obtaining, through the multi-modal feature sub-model, P sample modal vectors corresponding to the sample video S_i, obtaining a target sample modal vector from the P sample modal vectors, and determining the sample modal vectors other than the target sample modal vector among the P sample modal vectors as candidate sample modal vectors; P is a positive integer;
the first pre-training module is used for carrying out vector change on the target sample modal vector to obtain an auxiliary sample modal vector corresponding to the target sample modal vector, and carrying out fusion learning on the auxiliary sample modal vector and the candidate sample modal vector through the initial feature fusion sub-model to obtain a first fusion sample vector corresponding to the auxiliary sample modal vector;
The first pre-training module is used for pre-training the initial feature fusion sub-model based on the first fusion sample vector and the target sample modal vector to obtain the feature fusion sub-model.
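A hedged sketch of this first pre-training scheme: one modality vector of the sample video is altered (here simply masked to zeros, one possible "vector change"), the altered vector is fused with the unchanged candidate modality vectors, and the fusion output at that position is trained to reconstruct the original target modality vector. The mean-squared-error loss and the assumption that the fusion sub-model returns one output per input position are illustrative choices.

```python
import torch
import torch.nn.functional as F

def masked_modality_loss(fusion_model, modality_vecs, target_idx):
    """modality_vecs: list of P tensors, one per modality, for one sample video."""
    target = modality_vecs[target_idx]                       # target sample modal vector
    altered = torch.zeros_like(target)                       # auxiliary vector (one possible alteration)
    inputs = [altered if i == target_idx else v              # candidate vectors stay unchanged
              for i, v in enumerate(modality_vecs)]
    fused = fusion_model(inputs)[target_idx]                 # first fusion sample vector (assumed output layout)
    return F.mse_loss(fused, target)                         # pre-train by reconstructing the target
```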
The feature fusion sub-model is obtained by pre-training a sample video set; associating at least two video domain types with a sample video in the sample video set, wherein the at least two video domain types comprise target domain types; the sample video set comprises N sample videos, wherein N is a positive integer;
the second pre-training module is used for acquiring an initial classification model; the initial classification model comprises a multi-mode feature sub-model and an initial feature fusion sub-model;
the second pre-training module is used for obtaining P initial modal feature vectors corresponding to each sample video through the multi-modal feature sub-model, and combining initial modal feature vectors belonging to the same mode in the N P initial modal feature vectors into the same initial modal feature vector sequence to obtain P initial modal feature vector sequences; each initial modal feature vector sequence comprises N initial modal feature vectors; p is a positive integer;
the second pre-training module is used for acquiring candidate modal feature vector sequences from the P initial modal feature vector sequences, acquiring R candidate modal feature vectors from the candidate modal feature vector sequences, and adjusting the sequence of the R candidate modal feature vectors to obtain a candidate modal feature vector sequence after the sequence is adjusted; r is a positive integer less than N;
The second pre-training module is used for carrying out fusion learning on the candidate modal feature vectors in the candidate modal feature vector sequence after the sequence adjustment and the initial modal feature vectors in the P-1 initial modal feature vector sequences through the initial feature fusion sub-model to obtain a second fusion sample vector; the P-1 initial modal feature vector sequences are the initial modal feature vector sequences except the candidate modal feature vector sequences in the P initial modal feature vector sequences;
the second pre-training module is used for obtaining a matching tag between the candidate modal feature vector in the candidate modal feature vector sequence after the sequence adjustment and the initial modal feature vectors in the P-1 initial modal feature vector sequences, and pre-training the initial feature fusion sub-model based on the matching tag and the second fusion sample vector to obtain the feature fusion sub-model.
The feature fusion sub-model is obtained by pre-training a sample video set; associating at least two video domain types with a sample video in the sample video set, wherein the at least two video domain types comprise target domain types; the sample video set comprises N sample videos, wherein N is a positive integer; the N sample videos include a sample video S_i; i is a positive integer less than or equal to N;
the third pre-training module is used for acquiring an initial classification model; the initial classification model comprises a multi-mode feature sub-model and an initial feature fusion sub-model;
a third pre-training module, used for obtaining, through the multi-modal feature sub-model, P sample modal vectors corresponding to the sample video S_i, and adjusting the order of the P sample modal vectors to obtain P adjusted sample modal vectors; P is a positive integer;
a third pre-training module, configured to perform fusion learning on the adjusted P sample modal vectors through an initial feature fusion sub-model to obtain a sample video S i Corresponding P fusion sample modal vectors;
the third pre-training module is used for respectively carrying out full connection processing on the P fusion sample modal vectors to obtain fusion modal vectors respectively corresponding to the P fusion sample modal vectors;
and the third pre-training module is used for pre-training the initial feature fusion sub-model based on the P fusion modal vectors and the P sample modal vectors to obtain the feature fusion sub-model.
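A sketch of this third pre-training scheme: the modality order is shuffled, the shuffled vectors are fused, and each fused output is projected through a full-connection head and compared with the original modal vector of the modality placed at that position. The mean-squared-error objective and the output-to-modality alignment are assumptions.

```python
import torch
import torch.nn.functional as F

def order_shuffle_loss(fusion_model, proj_heads, modality_vecs):
    """modality_vecs: list of P tensors for one sample video; proj_heads: P full-connection heads."""
    p = len(modality_vecs)
    order = torch.randperm(p).tolist()                       # adjust the order of the P modal vectors
    shuffled = [modality_vecs[i] for i in order]
    fused = fusion_model(shuffled)                           # P fused sample modal vectors (assumed layout)
    loss = 0.0
    for k, i in enumerate(order):
        recon = proj_heads[i](fused[k])                      # full-connection processing per modality
        loss = loss + F.mse_loss(recon, modality_vecs[i])    # compare with the original sample modal vectors
    return loss / p
```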
Wherein the parameter adjustment module, which carries out parameter adjustment on the initial task sub-model based on the S levels of sample sorting label information of the target sample video, the sample recall feature vector and the S sample fused sorting feature vectors to obtain the task sub-model, comprises:
the first determining unit is used for carrying out normalization processing on the sample recall feature vector to obtain a sample normalization recall vector corresponding to the sample recall feature;
the first determining unit is used for determining a first model loss value of the pre-training classification model based on the sample normalized recall vector and sample recall tag information of the target sample video;
the second determining unit is used for adjusting the S sample fused sorting feature vectors step by step according to the first model loss value to obtain S adjusted sample fused sorting feature vectors, and determining S second model loss values of the pre-training classification model based on the adjusted sample fused sorting feature vectors and the S levels of sample sorting label information of the target sample video;
the loss determination unit is used for determining the total model loss value of the pre-training classification model according to the first model loss value and the S second model loss values, and carrying out parameter adjustment on the initial task sub-model based on the total model loss value to obtain the task sub-model.
The second determining unit is specifically configured to adjust the first of the S sample fused sorting feature vectors according to the first model loss value, to obtain an adjusted first sample fused sorting feature vector;
the second determining unit is specifically configured to normalize the adjusted first sample fused sorting feature vector to obtain a sample normalized sorting vector corresponding to the adjusted first sample fused sorting feature vector;
the second determining unit is specifically configured to determine the first of the S second model loss values of the pre-training classification model based on the sample normalized sorting vector corresponding to the adjusted first sample fused sorting feature vector and the sample sorting label information of the first level of the target sample video;
the second determining unit is specifically configured to adjust the second sample fused sorting feature vector based on the first second model loss value to obtain an adjusted second sample fused sorting feature vector, and determine the remaining S-1 second model loss values of the pre-training classification model based on the adjusted second sample fused sorting feature vector, the S-2 sample fused sorting feature vectors and the S-1 levels of sample sorting label information; the S-2 sample fused sorting feature vectors are the sample fused sorting feature vectors other than the first sample fused sorting feature vector and the second sample fused sorting feature vector among the S sample fused sorting feature vectors; the S-1 levels of sample sorting label information are the sample sorting label information other than the sample sorting label information of the first level among the S levels of sample sorting label information.
The multi-modal feature sub-model comprises a text network layer, an image network layer, a video network layer and an audio network layer;
the second fusion module includes:
the video acquisition unit is used for acquiring a target sample video belonging to the type of the target field;
the feature extraction unit is used for extracting text features of the target text data of the target sample video through the text network layer to obtain a sample sub-text vector corresponding to the target sample video;
the feature extraction unit is used for extracting image features of target image data of the target sample video through the image network layer to obtain sample sub-image vectors corresponding to the target sample video;
the feature extraction unit is used for extracting video features of target video data of the target sample video through the video network layer to obtain sample sub-video vectors corresponding to the target sample video;
the feature extraction unit is used for extracting audio features of the target audio data of the target sample video through the audio network layer to obtain sample sub-audio vectors corresponding to the target sample video;
and the fusion learning unit is used for carrying out fusion learning on the sample sub-text vector, the sample sub-audio vector, the sample sub-image vector and the sample sub-video vector through the feature fusion sub-model to obtain sample multi-mode fusion features corresponding to the target sample video.
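As a rough, non-authoritative sketch of this second fusion module, with linear layers standing in for the four per-modality network layers, a small Transformer encoder standing in for the feature fusion sub-model, and mean pooling producing the sample multi-mode fusion feature (all illustrative choices):

```python
import torch
import torch.nn as nn

class MultiModalEncoder(nn.Module):
    """Sketch only: four per-modality towers feeding a shared fusion sub-model."""
    def __init__(self, text_dim=768, image_dim=2048, video_dim=1024, audio_dim=128, dim=512):
        super().__init__()
        self.text_net = nn.Linear(text_dim, dim)     # text network layer (stand-in)
        self.image_net = nn.Linear(image_dim, dim)   # image network layer (stand-in)
        self.video_net = nn.Linear(video_dim, dim)   # video network layer (stand-in)
        self.audio_net = nn.Linear(audio_dim, dim)   # audio network layer (stand-in)
        self.fusion = nn.TransformerEncoder(         # feature fusion sub-model (one common choice)
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True), num_layers=2)

    def forward(self, text_feat, image_feat, video_feat, audio_feat):
        tokens = torch.stack([self.text_net(text_feat), self.image_net(image_feat),
                              self.video_net(video_feat), self.audio_net(audio_feat)], dim=1)
        fused = self.fusion(tokens)                  # fusion learning over the four sub-vectors
        return fused.mean(dim=1)                     # sample multi-mode fusion feature
```

For example, MultiModalEncoder()(torch.randn(1, 768), torch.randn(1, 2048), torch.randn(1, 1024), torch.randn(1, 128)) yields one fused feature per sample.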
The feature extraction unit is specifically configured to perform feature extraction on target text data of a target sample video through a text network layer, so as to obtain a text feature vector corresponding to the target text data;
the feature extraction unit is specifically used for performing word segmentation on the target text data to obtain text word segmentation of the target text data, and performing text position coding on the text position of the text word segmentation in the target text data to obtain a text position vector corresponding to the text word segmentation;
the feature extraction unit is specifically configured to obtain a text mode feature corresponding to the target text data, and fuse the text feature vector, the text position vector and the text mode feature to obtain a sample sub-text vector corresponding to the target sample video.
The feature extraction unit is specifically configured to perform feature extraction on target image data of a target sample video through an image network layer to obtain an image feature vector corresponding to the target image data;
the feature extraction unit is specifically configured to obtain an image mode feature corresponding to the target image data, and fuse the image feature vector and the image mode feature to obtain a sample sub-image vector corresponding to the target sample video.
The feature extraction unit is specifically configured to perform feature extraction on target video data of a target sample video through a video network layer, so as to obtain a video feature vector corresponding to the target video data;
the feature extraction unit is specifically used for performing frame extraction processing on the target video data to obtain a key video frame in the target video data, and performing feature extraction on the key video frame through the video network layer to obtain a video frame feature vector corresponding to the key video frame;
the feature extraction unit is specifically configured to obtain a video mode feature corresponding to the target video data, and fuse the video feature vector, the video frame feature vector and the video mode feature to obtain a sample sub-video vector corresponding to the target sample video.
The feature extraction unit is specifically configured to perform frame segmentation processing on the target audio data of the target sample video to obtain at least two audio frames in the target audio data, and perform feature extraction on the at least two audio frames through the audio network layer to obtain audio frame feature vectors respectively corresponding to the at least two audio frames;
the feature extraction unit is specifically used for carrying out audio position coding on the audio position of each audio frame in the target audio data to obtain an audio frame position vector corresponding to each audio frame respectively;
The feature extraction unit is specifically configured to obtain an audio mode feature corresponding to the target audio data, and fuse at least two audio frame feature vectors, at least two audio frame position vectors and the audio mode feature to obtain a sample sub-audio vector corresponding to the target sample video.
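The audio branch above composes, for each audio frame, a frame feature vector, an audio position vector and a shared audio modality feature. A minimal sketch of that composition, assuming a linear frame encoder, learned position and modality embeddings, and simple addition as the combination:

```python
import torch
import torch.nn as nn

class AudioTokenizer(nn.Module):
    """Sketch of combining frame features, position vectors and the audio modality feature."""
    def __init__(self, frame_dim=128, dim=512, max_frames=256, num_modalities=4, audio_modality_id=3):
        super().__init__()
        self.frame_encoder = nn.Linear(frame_dim, dim)              # audio frame feature vectors
        self.pos_embed = nn.Embedding(max_frames, dim)              # audio position coding
        self.modality_embed = nn.Embedding(num_modalities, dim)     # audio modality feature
        self.audio_modality_id = audio_modality_id

    def forward(self, audio_frames):                                # (num_frames, frame_dim)
        n = audio_frames.size(0)
        feats = self.frame_encoder(audio_frames)
        pos = self.pos_embed(torch.arange(n))
        mod = self.modality_embed(torch.tensor(self.audio_modality_id))
        return feats + pos + mod                                    # sample sub-audio vectors, (num_frames, dim)
```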
In one aspect, an embodiment of the present application provides a computer device, including: a processor and a memory;
the processor is connected to the memory, wherein the memory is configured to store a computer program, and when the computer program is executed by the processor, the computer device is caused to execute the method provided by the embodiment of the application.
In one aspect, the present application provides a computer readable storage medium storing a computer program adapted to be loaded and executed by a processor, so that a computer device having the processor performs the method provided by the embodiment of the present application.
In one aspect, embodiments of the present application provide a computer program product comprising a computer program stored on a computer readable storage medium. The processor of the computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program, so that the computer device performs the method provided by the embodiment of the present application.
Therefore, the embodiment of the application can simultaneously carry out feature recall processing and S times of feature sorting processing on the multi-mode fusion features, carry out step-by-step fusion on the S sorting feature vectors obtained by the S times of feature sorting processing according to the recall feature vector obtained by the feature recall processing, and carry out step-by-step sorting on the recall classification information determined from the recall feature vector according to the S fused sorting feature vectors obtained by the step-by-step fusion, so as to obtain S levels of sorting classification information. In other words, the embodiment of the application can process the multi-mode fusion features corresponding to the target video in parallel (namely, carry out layered processing on the multi-mode fusion features corresponding to the target video), that is, extract the features of the recall stage (namely, the recall feature vector) and the features of the sorting stage (namely, the S sorting feature vectors) in parallel, further fuse the features of the recall stage and the features of the sorting stage in a step-by-step fusion manner, and determine the recall classification information and the S levels of sorting classification information on this parallel basis, thereby reducing the time for determining the sorting classification information corresponding to the target video and improving the efficiency of determining the sorting classification information corresponding to the target video.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the related art, the drawings that are required to be used in the embodiments or the related technical descriptions will be briefly described, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to the drawings without inventive effort for those skilled in the art.
Fig. 1 is a schematic structural diagram of a network architecture according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a scenario for data interaction according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 4a is a schematic diagram of a classification model according to an embodiment of the present application;
FIG. 4b is a schematic diagram of a classification model according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 6 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 7 is a flow chart of a method and system for large-scale fine-grained tag identification of video content based on a machine-learned multi-branch network;
FIG. 8 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It should be appreciated that artificial intelligence (AI) is a theory, method, technology and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
The artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning, automatic driving, intelligent traffic and other directions.
The solution provided by the embodiment of the application mainly relates to Computer Vision (CV) technology, machine Learning (ML) technology, voice technology (Speech Technology, ST) and natural language processing (Nature Language processing, NLP) technology of artificial intelligence.
The computer vision is a science for researching how to make a machine "see", and more specifically, a camera and a computer are used to replace human eyes to identify and measure targets and perform graphic processing, so that the computer is processed into images more suitable for human eyes to observe or transmit to an instrument to detect. As a scientific discipline, computer vision research-related theory and technology has attempted to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR (Optical Character Recognition, i.e., optical character recognition), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, synchronous positioning and mapping, autopilot, intelligent transportation, and the like.
Machine learning is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how a computer simulates or implements human learning behaviors to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from instruction. Deep learning is a machine learning technique based on deep neural networks. The concept of deep learning is derived from the study of artificial neural networks; for example, a multi-layer perceptron with multiple hidden layers is a deep learning structure. Deep learning forms more abstract high-level representations of attribute categories or features by combining low-level features, so as to discover distributed feature representations of data.
The key technologies of speech technology include automatic speech recognition, speech synthesis and voiceprint recognition. Enabling computers to listen, see, speak and feel is the future development direction of human-computer interaction, and speech is expected to become one of the most promising modes of human-computer interaction.
Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics. Research in this field involves natural language, that is, the language that people use daily, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graph techniques and the like.
Specifically, referring to fig. 1, fig. 1 is a schematic structural diagram of a network architecture according to an embodiment of the present application. As shown in fig. 1, the network architecture may include a server 2000 and a cluster of terminal devices. Wherein the cluster of terminal devices may in particular comprise one or more terminal devices, the number of terminal devices in the cluster of terminal devices will not be limited here. As shown in fig. 1, the plurality of terminal devices may specifically include a terminal device 3000a, a terminal device 3000b, terminal devices 3000c, …, a terminal device 3000n; the terminal devices 3000a, 3000b, 3000c, …, 3000n may be directly or indirectly connected to the server 2000 through a wired or wireless communication manner, respectively, so that each terminal device may interact with the server 2000 through the network connection.
Wherein each terminal device in the terminal device cluster may include: smart phones, tablet computers, notebook computers, desktop computers, intelligent voice interaction devices, intelligent home appliances (e.g., smart televisions), wearable devices, vehicle terminals, aircraft and other intelligent terminals with data processing functions. It should be understood that each terminal device in the terminal device cluster shown in fig. 1 may be installed with an application client having a multimedia data processing function, and when the application client runs in each terminal device, data interaction may be performed between each terminal device and the server 2000 shown in fig. 1. The application client may specifically include: vehicle clients, smart home clients, entertainment clients (e.g., game clients), multimedia clients (e.g., video clients), social clients, and information-based clients (e.g., news clients), etc. The application client in the embodiment of the present application may be integrated in a certain client (for example, a social client), and the application client may also be an independent client (for example, a news client).
For easy understanding, the embodiment of the present application may select one terminal device from the plurality of terminal devices shown in fig. 1 as the target terminal device. For example, in the embodiment of the present application, the terminal device 3000a shown in fig. 1 may be used as a target terminal device, and an application client having a multimedia data processing function may be installed in the target terminal device. At this time, the target terminal device may implement data interaction between the application client and the server 2000.
The server 2000 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Network, content delivery networks), basic cloud computing services such as big data and artificial intelligence platforms, and the like.
It should be understood that, in the embodiment of the present application, the computer device may extract, through the target classification model, the classification information corresponding to the target video (i.e., the recall classification information and the S levels of sorting classification information), and further add the target video to the video recommendation pool corresponding to the application client according to the recall classification information corresponding to the target video and the S levels of sorting classification information corresponding to the target video. For ease of understanding, the embodiment of the application may also refer to the classification information corresponding to the target video as the label information corresponding to the target video, and extracting the classification information corresponding to the target video can be understood as labeling the target video. The recall classification information and the S levels of sorting classification information have a nested label relationship, that is, they are nested level by level, and the S levels of sorting classification information can be understood as fine-grained labels of the target video; the embodiment of the application does not limit the number of pieces of recall classification information, nor the number of pieces of sorting classification information at each level. For example, taking S equal to 1 as an example, the S levels of sorting classification information may include the sorting classification information of the first level, and the recall classification information includes the sorting classification information of the first level, that is, the sorting classification information of the first level is obtained by screening from the recall classification information. For another example, taking S equal to 2 as an example, the S levels of sorting classification information may include the sorting classification information of the first level and the sorting classification information of the second level, the recall classification information includes the sorting classification information of the first level, and the sorting classification information of the first level includes the sorting classification information of the second level, that is, the sorting classification information of the first level is obtained by screening from the recall classification information, and the sorting classification information of the second level is obtained by screening from the sorting classification information of the first level.
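To make the nesting relationship concrete, a small hypothetical example with S equal to 2 (the tag names are invented purely for illustration):

```python
# Hypothetical nested tag sets for one life-domain video (names are invented).
recall_tags  = ["food", "cooking", "baking", "travel", "daily vlog"]   # recall classification info
level_1_tags = ["food", "cooking", "baking"]                           # subset of recall_tags
level_2_tags = ["baking"]                                              # subset of level_1_tags

assert set(level_1_tags) <= set(recall_tags)
assert set(level_2_tags) <= set(level_1_tags)
```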
The data processing method provided by the embodiment of the present application may be executed by the server 2000 (i.e., the computer device may be the server 2000), may be executed by the target terminal device (i.e., the computer device may be the target terminal device), or may be executed by both the server 2000 and the target terminal device. For convenience of understanding, the user corresponding to the terminal device may be referred to as an object in the embodiment of the present application, for example, the user corresponding to the target terminal device may be referred to as a target object.
When the data processing method is jointly executed by the server 2000 and the target terminal device, the server 2000 may perform parameter adjustment on the pre-trained classification model to obtain the target classification model, and then send the target classification model to the target terminal device. Thus, the target object can receive the target classification model through the target terminal equipment, and further, the classification information corresponding to the target video is extracted through the target classification model.
Alternatively, the server 2000 may perform parameter adjustment on the pre-trained classification model to obtain the target classification model when the data processing method is executed by the server 2000. In this way, the target object may send a video classification request to the server 2000 through the application client in the target terminal device, so that the server 2000 obtains the target video from the video classification request, and further extracts classification information corresponding to the target video through the target classification model.
Optionally, when the data processing method is executed by the target terminal device, the target terminal device may perform parameter adjustment on the pre-trained classification model to obtain a target classification model, and further extract classification information corresponding to the target video through the target classification model.
It should be understood that the service scenario applicable to the network framework may specifically include: video distribution scenes, video search scenes, etc., and specific service scenes will not be listed here one by one. For example, in a video distribution scenario, the computer device may obtain, from the video recommendation pool, a distribution video for video distribution according to recall classification information corresponding to the target video and sorting classification information of S levels corresponding to the target video, and further distribute the distribution video to an application client (e.g., a video browsing interface) of the target terminal device. For another example, in the video search scenario, the computer device may obtain, from the video recommendation pool, a search video for performing video recommendation according to recall classification information corresponding to the target video and ranking classification information of S levels corresponding to the target video, and then recommend the search video to an application client (e.g., a search result interface) of the target terminal device.
It will be appreciated that the target terminal device may present the distributed video or the search video to the target object in the form of an information stream (i.e., feeds stream), which is a data format, and the Feeds stream is typically ordered in a time axis (i.e., timeline), i.e., the time axis is the most intuitive and basic presentation form of the Feeds stream. The content recommended to the target object for reading by the information flow can comprise images, texts and videos (for example, distributing videos and searching videos), and the videos can be vertical or horizontal.
It should be appreciated that the application client may also be referred to as an aggregator, which represents software for aggregating feed streams, e.g., an aggregator may be software that is specifically used to subscribe to web sites (different web sites corresponding to different servers), an aggregator may also be referred to as a RSS (Really Simple Syndication) reader, a feed reader, a news reader, etc.
For ease of understanding, further, please refer to fig. 2, fig. 2 is a schematic diagram of a scenario for data interaction according to an embodiment of the present application. The server 20a shown in fig. 2 may be the server 2000 in the embodiment corresponding to fig. 1, the terminal device 20b shown in fig. 2 may be the target terminal device in the embodiment corresponding to fig. 1, an application client may be installed in the terminal device 20b, and the user corresponding to the terminal device 20b may be the object 20c. For ease of understanding, embodiments of the present application will be described in terms of a data processing method performed by server 20 a.
The sample video set 21a (i.e., the content database 21 a) shown in fig. 2 may include a plurality of databases, where the plurality of databases may include databases 21b, …, and 21c shown in fig. 2, and the databases 21b, …, and 21c may be used to store videos corresponding to different video domain types, for example, the database 21b may be used to store videos corresponding to game domain types, and the database 21c may be used to store videos corresponding to life domain types.
It will be appreciated that the server 20a shown in fig. 2 may obtain a target sample video belonging to the target domain type from the sample video set 21a, and perform parameter adjustment on the pre-trained classification model through the target sample video to obtain a target classification model (not shown in the figure). For ease of understanding, the embodiment of the present application is described taking the target domain type as the life domain type as an example, and the target sample video may be a video belonging to the life domain type, so that the server 20a may obtain the target sample video for parameter adjustment of the pre-training classification model from the database 21 c.
It can be understood that, since the target sample video is a video belonging to the life domain type, the target classification model can be used for predicting classification information corresponding to the life domain type video. Similarly, the server 20a may acquire videos of other domain types except the life domain type, and perform parameter adjustment on the pre-trained classification model through the videos of the other domain types to obtain a classification model for predicting classification information corresponding to the videos of the other domain types.
As shown in fig. 2, the object 20c may send a video classification request to the server 20a through the application client in the terminal device 20b, so that after receiving the video classification request sent by the object 20c through the terminal device 20b, the server 20a may obtain the target video from the video classification request and then obtain the classification model corresponding to the domain type to which the target video belongs. For ease of understanding, in the embodiment of the present application, the target video is taken as an example of a video belonging to the life domain type; thus, the server 20a can obtain the target classification model corresponding to the life domain type.
Further, as shown in fig. 2, the server 20a may perform multi-modal feature extraction and fusion on the target video through the target classification model to generate multi-modal fusion features corresponding to the target video, and further perform feature recall processing on the multi-modal fusion features to obtain recall feature vectors corresponding to the target video. Meanwhile, the server 20a may perform the feature sorting process on the multi-mode fusion feature for S times to obtain S sorted feature vectors corresponding to the target video, where S may be a positive integer.
It can be understood that the server 20a may perform the first feature sorting process on the multi-modal fusion feature to obtain a sorting feature vector 22a corresponding to the target video; …; and the server 20a may perform the S-th feature sorting process on the multi-modal fusion feature to obtain a sorting feature vector 22b corresponding to the target video. For ease of understanding, the embodiment of the present application is illustrated with S equal to 2, so the S sorting feature vectors may specifically include the sorting feature vector 22a and the sorting feature vector 22b.
Further, as shown in fig. 2, the server 20a may perform step-by-step fusion on the sorting feature vector 22a and the sorting feature vector 22b through the recall feature vector to obtain S fused sorting feature vectors. In other words, the server 20a may perform vector fusion on the recall feature vector and the sorting feature vector 22a to obtain a fused sorting feature vector corresponding to the sorting feature vector 22a (i.e., fused sorting feature vector 23a), and then perform vector fusion on the fused sorting feature vector 23a and the sorting feature vector 22b to obtain a fused sorting feature vector corresponding to the sorting feature vector 22b (i.e., fused sorting feature vector 23b).
Further, as shown in fig. 2, the server 20a may determine recall classification information corresponding to the target video according to the recall feature vector, and then sort the recall classification information step by step according to the S fused sorting feature vectors (i.e., the fused sorting feature vector 23a and the fused sorting feature vector 23b) to obtain S levels of sorting classification information corresponding to the target video. In other words, the server 20a may screen and sort the recall classification information according to the fused sorting feature vector 23a to obtain the first level of sorting classification information (i.e., sorting classification information 24a), and then screen and sort the sorting classification information 24a according to the fused sorting feature vector 23b to obtain the second level of sorting classification information (i.e., sorting classification information 24b). The S levels of sorting classification information (i.e., sorting classification information 24a and sorting classification information 24b) all belong to the recall classification information; furthermore, sorting classification information 24b belongs to sorting classification information 24a.
Further, as shown in fig. 2, the server 20a may store the target video to the database 21c, and at the same time, return recall classification information corresponding to the target video and sorting classification information of S levels corresponding to the target video to the terminal device 20b. Optionally, the server 20a may store recall classification information corresponding to the target video and S-level sorting classification information corresponding to the target video in the database 21c, where the databases 21b, … and 21c may be used to store not only videos corresponding to different video domain types but also classification information corresponding to videos of different video domain types.
Therefore, after the multi-mode fusion feature corresponding to the target video is obtained, the embodiment of the application can simultaneously perform feature recall processing and feature sorting processing on the multi-mode fusion feature to obtain the recall feature vector corresponding to the recall task and S sorting feature vectors corresponding to the S sorting tasks, and further perform step-by-step fusion on the feature of the recall task and the feature of the S sorting tasks to obtain the S fusion sorting feature vectors. It can be understood that after recall classification information corresponding to the recall task is determined according to the recall feature vector, the recall classification information can be sequenced step by step according to the S fusion sequencing feature vectors, and S-level sequencing classification information corresponding to the S sequencing tasks is determined, so that recall classification information and S-level sequencing classification information can be determined on a parallel basis, and efficiency of determining sequencing classification information corresponding to the target video is improved.
Further, referring to fig. 3, fig. 3 is a flow chart of a data processing method according to an embodiment of the application. The method may be performed by a server, or may be performed by a terminal device, or may be performed by a server and a terminal device together, where the server may be the server 20a in the embodiment corresponding to fig. 2, and the terminal device may be the terminal device 20b in the embodiment corresponding to fig. 2. For ease of understanding, embodiments of the present application will be described in terms of this method being performed by a server. The data processing method may include the following steps S101 to S104:
step S101, multi-modal feature extraction and fusion are performed on a target video to generate a multi-modal fusion feature corresponding to the target video;
it will be appreciated that the target video may be a video uploaded by the target object through the target terminal device. The target object may be a self-media user or a video production institution (e.g., a multi-channel network, MCN), and the target video uploaded by the target object may accordingly be UGC (User Generated Content) from self-media or PGC (Professional Generated Content) from a video production institution. The target video may cover topics such as skill sharing, humor, fashion trends, social hot spots, street interviews, public-interest education, advertising creativity, and business customization; in addition, the target video may be a short video or a long video, and the duration of the target video is not limited here.
The MCN is a product form of the multi-channel network: it aggregates PGC content and, with strong capital support, guarantees the continuous output of content, so as to finally achieve stable monetization of the business. PGC refers to an institution or organization that professionally produces content. UGC emerged together with the Web 2.0 concept, whose main feature is the advocacy of personalization; it is not a specific service but a new way for users to use the Internet, shifting from download-only to both downloading and uploading.
In the embodiment of the present application, multi-modal feature extraction and fusion may be performed on the target video through the multi-modal feature sub-model and the feature fusion sub-model in the target classification model; for the specific process, reference may be made to the description, in the embodiment corresponding to fig. 5 below, of performing multi-modal feature extraction and fusion on the target sample video through the multi-modal feature sub-model and the feature fusion sub-model in the pre-training classification model. The target classification model is obtained by performing parameter adjustment on the pre-training classification model; the multi-modal feature sub-model in the target classification model is identical to the multi-modal feature sub-model in the pre-training classification model, and the feature fusion sub-model in the target classification model is identical to the feature fusion sub-model in the pre-training classification model. In addition, the target classification model further includes a task sub-model, and the pre-training classification model further includes an initial task sub-model.
It can be understood that the feature fusion sub-model may be a BERT (Bidirectional Encoder Representations from Transformers) model, i.e., a Transformer-based bidirectional encoder representation model, or a lightweight BERT model (ALBERT, A Lite BERT for Self-supervised Learning of Language Representations); the embodiment of the present application does not limit the specific type of the feature fusion sub-model.
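To make the composition of the target classification model easier to follow, the following is a minimal sketch of how the three sub-models could be chained, assuming a PyTorch-style implementation; the class and attribute names are illustrative assumptions rather than code provided by the present application.

```python
# Hypothetical PyTorch-style outline of the target classification model; all names
# and signatures here are assumptions, not taken from the patent text.
import torch.nn as nn

class TargetClassificationModel(nn.Module):
    def __init__(self, multimodal_sub_model, feature_fusion_sub_model, task_sub_model):
        super().__init__()
        self.multimodal_sub_model = multimodal_sub_model          # text/image/video/audio network layers
        self.feature_fusion_sub_model = feature_fusion_sub_model  # e.g. a BERT/ALBERT-style encoder
        self.task_sub_model = task_sub_model                      # recall branch + S sorting branches

    def forward(self, video_inputs):
        modal_vectors = self.multimodal_sub_model(video_inputs)       # P modality vectors
        fused_feature = self.feature_fusion_sub_model(modal_vectors)  # multi-modal fusion feature
        recall_vector, sorting_vectors = self.task_sub_model(fused_feature)
        return recall_vector, sorting_vectors
```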
Step S102, carrying out feature recall processing on the multi-mode fusion features to obtain recall feature vectors corresponding to the target video, and carrying out S times of feature sorting processing on the multi-mode fusion features to obtain S sorting feature vectors corresponding to the target video;
specifically, the multi-modal fusion feature is input to the task sub-model in the target classification model. The task sub-model includes a recall fully connected layer and S sorting fully connected layers, and the S sorting fully connected layers include a sorting fully connected layer H_i, where i may be a positive integer less than or equal to S, and S may be a positive integer. Further, in the recall fully connected layer, full-connection processing is performed on the multi-modal fusion feature to obtain a candidate recall feature vector corresponding to the target video, and full-connection processing is performed on the candidate recall feature vector to obtain the recall feature vector output by the recall fully connected layer. Further, in the sorting fully connected layer H_i, full-connection processing is performed on the multi-modal fusion feature to obtain a candidate sorting feature vector corresponding to the target video, and full-connection processing is performed on the candidate sorting feature vector to obtain the sorting feature vector output by the sorting fully connected layer H_i. The S sorting feature vectors corresponding to the target video are the sorting feature vectors respectively output by the S sorting fully connected layers, and the S sorting fully connected layers are separate and independent of each other. Since the task sub-model includes a recall fully connected layer and S sorting fully connected layers, the task sub-model may also be referred to as a multi-branch machine learning network.
The recall fully connected layer may include a first fully connected network layer and a second fully connected network layer: the multi-modal fusion feature may be fully connected through the first fully connected network layer to obtain the candidate recall feature vector corresponding to the target video, and the candidate recall feature vector may be fully connected through the second fully connected network layer to obtain the recall feature vector output by the recall fully connected layer. Similarly, the sorting fully connected layer H_i may include a third fully connected network layer and a fourth fully connected network layer: the multi-modal fusion feature may be fully connected through the third fully connected network layer to obtain the candidate sorting feature vector corresponding to the target video, and the candidate sorting feature vector may be fully connected through the fourth fully connected network layer to obtain the sorting feature vector output by the sorting fully connected layer H_i.
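The multi-branch structure just described can be sketched as follows; this is only an illustrative PyTorch-style rendering under the assumptions that each branch is two stacked linear layers with an activation in between and that all sorting branches share the same output size.

```python
# Illustrative sketch of the task sub-model: one recall fully connected layer and S
# independent sorting fully connected layers, each built from two linear layers.
# Layer sizes, the ReLU activation, and a shared output size are assumptions.
import torch.nn as nn

class TaskSubModel(nn.Module):
    def __init__(self, fused_dim, hidden_dim, recall_classes, sort_classes, s_branches):
        super().__init__()
        # recall fully connected layer = first + second fully connected network layers
        self.recall_fc = nn.Sequential(
            nn.Linear(fused_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, recall_classes))
        # sorting fully connected layers H_1 ... H_S, separate and independent of each other
        self.sorting_fcs = nn.ModuleList([
            nn.Sequential(nn.Linear(fused_dim, hidden_dim), nn.ReLU(),
                          nn.Linear(hidden_dim, sort_classes))
            for _ in range(s_branches)])

    def forward(self, fused_feature):
        recall_vector = self.recall_fc(fused_feature)                     # recall feature vector
        sorting_vectors = [fc(fused_feature) for fc in self.sorting_fcs]  # S sorting feature vectors
        return recall_vector, sorting_vectors
```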
Step S103, determining recall classification information corresponding to the target video according to the recall feature vector;
specifically, the recall feature vector is normalized to obtain a normalized recall vector corresponding to the recall feature vector. The normalized recall vector includes at least two recall vector parameters, and the at least two recall vector parameters can represent the degree of matching between the target video and sample tag information. Further, the at least two recall vector parameters are sorted to obtain at least two sorted recall vector parameters. Further, the top-ranked H_1 recall vector parameters are obtained from the at least two sorted recall vector parameters, and the tag information corresponding to the top-ranked H_1 recall vector parameters is determined as the recall classification information corresponding to the target video. Here, H_1 may be an integer greater than 1, and the recall classification information may be the H_1 pieces of sample tag information that have a high degree of matching with the target video.
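A short sketch of this normalize-then-select step is given below, assuming softmax as the normalization and a simple top-H_1 selection; the function and variable names are illustrative and not taken from the patent.

```python
# Sketch of step S103 under the assumption that normalization is a softmax and the
# recall classification information is the tag set of the top-H_1 parameters.
import torch

def recall_classification(recall_feature_vector, sample_tag_names, h1):
    # recall_feature_vector: 1-D tensor with one score per piece of sample tag information
    normalized = torch.softmax(recall_feature_vector, dim=-1)   # normalized recall vector
    scores, indices = torch.topk(normalized, k=h1)              # top-H_1 recall vector parameters
    return [(sample_tag_names[i], float(s))
            for i, s in zip(indices.tolist(), scores.tolist())]
```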
Step S104, step-by-step fusion is carried out on the S sorting feature vectors through the recall feature vector to obtain S fused sorting feature vectors, and step-by-step sorting is carried out on the recall classification information according to the S fused sorting feature vectors to obtain S levels of sorting classification information corresponding to the target video.
The S levels of sorting classification information all belong to the recall classification information; the first level of sorting classification information is obtained by sorting from the recall classification information, and the (i+1)-th level of sorting classification information is obtained by sorting from the i-th level of sorting classification information, where i may be a positive integer less than S.
It should be appreciated that the specific process of step-by-step fusion of the S sorting feature vectors through the recall feature vector may be described as follows: a j-th sorting feature vector is obtained from the S sorting feature vectors, where j may be a positive integer less than or equal to S. Further, if j is equal to 1, vector fusion is performed on the recall feature vector and the first sorting feature vector to obtain a fused sorting feature vector corresponding to the first sorting feature vector (i.e., the fused sorting feature vector corresponding to the j-th sorting feature vector). Optionally, if j is greater than 1, vector fusion is performed on the j-th sorting feature vector and the fused sorting feature vector corresponding to the (j-1)-th sorting feature vector to obtain the fused sorting feature vector corresponding to the j-th sorting feature vector.
The vector fusion mode may be a vector splicing mode, a weighted summation (e.g., addition) mode, or a mode of taking a maximum value or a minimum value, and the embodiment of the application is not limited to a specific mode of vector fusion.
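Taking vector splicing as the fusion mode (one of the options listed above), the step-by-step fusion can be sketched as follows; this is a hypothetical illustration, not the implementation of the present application.

```python
# Sketch of the step-by-step fusion in step S104, assuming concatenation as the
# vector fusion mode; with concatenation the fused vectors grow in dimension.
import torch

def progressive_fusion(recall_vector, sorting_vectors):
    # recall_vector: recall feature vector; sorting_vectors: the S sorting feature vectors
    fused_sorting_vectors = []
    previous = recall_vector
    for sorting_vector in sorting_vectors:
        fused = torch.cat([previous, sorting_vector], dim=-1)  # j-th fused sorting feature vector
        fused_sorting_vectors.append(fused)
        previous = fused   # the j-th fused vector is fused with the (j+1)-th sorting vector
    return fused_sorting_vectors
```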
The number of pieces of recall classification information is H_1, and H_1 here may be an integer greater than 1. It should be appreciated that the specific process of sorting the recall classification information step by step according to the S fused sorting feature vectors may be described as follows: a k-th fused sorting feature vector is obtained from the S fused sorting feature vectors, where k may be a positive integer less than or equal to S. Further, if k is equal to 1, the H_1 pieces of recall classification information are screened and sorted according to the first fused sorting feature vector to obtain the first level of sorting classification information (i.e., the k-th level of sorting classification information) corresponding to the target video. Optionally, if k is greater than 1, the (k-1)-th level of sorting classification information corresponding to the target video is screened and sorted according to the k-th fused sorting feature vector to obtain the k-th level of sorting classification information corresponding to the target video.
It can be understood that the specific process of screening and sorting the H_1 pieces of recall classification information according to the first fused sorting feature vector may be described as follows: the first fused sorting feature vector is normalized to obtain a normalized sorting vector corresponding to the first fused sorting feature vector. The normalized sorting vector includes H_1 sorting vector parameters, and the H_1 sorting vector parameters may represent the degree of matching between the target video and the H_1 pieces of recall classification information. Further, the H_1 sorting vector parameters are sorted to obtain H_1 sorted sorting vector parameters, and the H_1 sorted sorting vector parameters are used to indicate the order of the H_1 pieces of recall classification information. Further, according to the H_1 sorted sorting vector parameters, H_2 pieces of tag information are obtained from the H_1 pieces of recall classification information, and the obtained H_2 pieces of tag information are determined as the first level of sorting classification information corresponding to the target video. Here, H_2 is a positive integer less than or equal to H_1, and the H_2 pieces of tag information are the tag information corresponding to the top-ranked H_2 sorting vector parameters among the H_1 sorted sorting vector parameters.
For the specific process of screening and sorting the (k-1)-th level of sorting classification information corresponding to the target video according to the k-th fused sorting feature vector, reference may be made to the above description of screening and sorting the H_1 pieces of recall classification information according to the first fused sorting feature vector, which will not be repeated here.
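The screening-and-sorting step can be illustrated with the following sketch, assuming that each fused sorting feature vector has been reduced to one score per current candidate label and that softmax is used for normalization; the H values and label lists are placeholders.

```python
# Hedged sketch of the step-by-step screening and sorting of classification information.
import torch

def screen_and_sort(candidate_scores, candidate_labels, h_keep):
    # candidate_scores: one score per current candidate label (e.g. derived from the
    # corresponding fused sorting feature vector); candidate_labels: current candidates
    normalized = torch.softmax(candidate_scores, dim=-1)   # normalized sorting vector
    order = torch.argsort(normalized, descending=True)     # sort the sorting vector parameters
    return [candidate_labels[i] for i in order[:h_keep].tolist()]

# Level 1 is screened from the recall classification information, level 2 from the
# level-1 result, and so on:
#   level_1 = screen_and_sort(scores_1, recall_classification_info, h2)
#   level_2 = screen_and_sort(scores_2, level_1, h3)
```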
For ease of understanding, please refer to fig. 4a, which is a schematic structural diagram of a classification model according to an embodiment of the present application. As shown in fig. 4a, the target classification model may include a multi-modal feature sub-model 40a, a feature fusion sub-model 40b, and a task sub-model 40c; for ease of understanding, the feature fusion sub-model 40b is illustrated as a BERT model, and the number of levels of sorting classification information is illustrated as 1.
The feature fusion sub-model 40b may include one or more fusion network layers. Here, the number of fusion network layers is illustrated as 3, and the 3 fusion network layers may specifically include a fusion network layer 41a, a fusion network layer 41b, and a fusion network layer 41c. The fusion network layer 41a, the fusion network layer 41b, and the fusion network layer 41c may be encoders in a Transformer model and may perform attention processing on the input features; in other words, the feature fusion sub-model 40b may use a multi-layer Transformer model to perform the attention operation.
The task sub-model 40c may include 1 sorting fully connected layer, that is, the task sub-model 40c may include a recall fully connected layer and a sorting fully connected layer; the recall fully connected layer may include a fully connected network layer 46a and a fully connected network layer 46b, and the sorting fully connected layer may include a fully connected network layer 46c and a fully connected network layer 46d.
As shown in fig. 4a, multi-modal feature extraction and fusion may be performed on the target video through the multi-modal feature sub-model 40a and the feature fusion sub-model 40b to generate the attention feature (i.e., the multi-modal fusion feature) corresponding to the target video. Further, the multi-modal fusion feature may be fully connected through the fully connected network layer 46a to obtain a candidate recall feature vector corresponding to the target video, and the candidate recall feature vector may be fully connected through the fully connected network layer 46b to obtain the recall feature vector output by the recall fully connected layer; similarly, the multi-modal fusion feature may be fully connected through the fully connected network layer 46c to obtain a candidate sorting feature vector corresponding to the target video, and the candidate sorting feature vector may be fully connected through the fully connected network layer 46d to obtain the sorting feature vector output by the sorting fully connected layer.
Further, as shown in fig. 4a, the recall feature vector and the order feature vector may be vector-fused by the task sub-model 40c to obtain a fused order feature vector corresponding to the order feature vector. Further, according to the recall feature vector, recall classification information corresponding to the target video can be determined, and according to the fused sorting feature vector, sorting can be performed on the recall classification information to obtain sorting classification information corresponding to the target video.
For ease of understanding, please refer to fig. 4b, which is a schematic structural diagram of a classification model according to an embodiment of the present application. Fig. 4b takes 2 levels of sorting classification information as an example, in which case the task sub-model may include 2 sorting fully connected layers, that is, the task sub-model may include a recall fully connected layer, a sorting fully connected layer H_1, and a sorting fully connected layer H_2. The recall fully connected layer may include a fully connected network layer 50a and a fully connected network layer 50b, the sorting fully connected layer H_1 may include a fully connected network layer 51a and a fully connected network layer 51b, and the sorting fully connected layer H_2 may include a fully connected network layer 52a and a fully connected network layer 52b.
As shown in fig. 4b, the multi-modal fusion feature may be fully connected through the fully connected network layer 50a to obtain a candidate recall feature vector corresponding to the target video, and the candidate recall feature vector may be fully connected through the fully connected network layer 50b to obtain the recall feature vector output by the recall fully connected layer. Similarly, the multi-modal fusion feature may be fully connected through the fully connected network layer 51a to obtain a first candidate sorting feature vector corresponding to the target video, and the first candidate sorting feature vector may be fully connected through the fully connected network layer 51b to obtain the sorting feature vector output by the sorting fully connected layer H_1 (i.e., sorting feature vector 53a); the multi-modal fusion feature may be fully connected through the fully connected network layer 52a to obtain a second candidate sorting feature vector corresponding to the target video, and the second candidate sorting feature vector may be fully connected through the fully connected network layer 52b to obtain the sorting feature vector output by the sorting fully connected layer H_2 (i.e., sorting feature vector 53b).
Further, as shown in fig. 4b, the recall feature vector and the sorting feature vector 53a may be vector-fused through the task sub-model to obtain a fused sorting feature vector corresponding to the sorting feature vector 53a (i.e., fused sorting feature vector 54a); similarly, the fused sorting feature vector 54a and the sorting feature vector 53b may be vector-fused through the task sub-model to obtain a fused sorting feature vector corresponding to the sorting feature vector 53b (i.e., fused sorting feature vector 54b). Further, the recall classification information corresponding to the target video may be determined according to the recall feature vector, the recall classification information may be screened and sorted according to the fused sorting feature vector 54a to obtain the first level of sorting classification information corresponding to the target video, and the first level of sorting classification information may be screened and sorted according to the fused sorting feature vector 54b to obtain the second level of sorting classification information corresponding to the target video.
It should be understood that when S is equal to 2, the embodiment of the present application can be understood as screening the tags of a large candidate pool of content step by step by means of recall, coarse ranking, and fine ranking, so as to achieve progressive identification: the earlier recall and coarse-ranking stages are used to identify inter-class (coarse-grained) differences, and the later fine-ranking stage is used to identify intra-class (fine-grained) differences.
Therefore, the embodiment of the present application can simultaneously perform feature recall processing and S feature sorting processes on the multi-modal fusion feature, fuse step by step the S sorting feature vectors obtained by the S feature sorting processes according to the recall feature vector obtained by the feature recall processing, and sort step by step the recall classification information determined from the recall feature vector according to the S fused sorting feature vectors obtained by the step-by-step fusion, thereby obtaining S levels of sorting classification information. In this way, the multi-modal fusion feature corresponding to the target video can be processed in parallel (i.e., processed in a layered manner): the features of the recall stage (i.e., the recall feature vector) and the features of the sorting stage (i.e., the S sorting feature vectors) are extracted in parallel and then fused in a step-by-step manner. The recall classification information and the S levels of sorting classification information are thus determined on a parallel basis, which reduces the time required to determine the sorting classification information corresponding to the target video and improves the efficiency of determining it.
Further, referring to fig. 5, fig. 5 is a flow chart of a data processing method according to an embodiment of the application. The method may be performed by a server, or may be performed by a terminal device, or may be performed by a server and a terminal device together, where the server may be the server 20a in the embodiment corresponding to fig. 2, and the terminal device may be the terminal device 20b in the embodiment corresponding to fig. 2. For ease of understanding, embodiments of the present application will be described in terms of this method being performed by a server. The data processing method may include the following steps S201 to S205:
step S201, a pre-training classification model is obtained;
the pre-training classification model comprises a multi-mode feature sub-model, a feature fusion sub-model and an initial task sub-model, wherein the multi-mode feature sub-model comprises a text network layer, an image network layer, a video network layer and an audio network layer. The pre-training classification model is obtained by pre-training an initial classification model.
The feature fusion sub-model is obtained by pre-training with a sample video set. The sample videos in the sample video set are associated with at least two video domain types, and the at least two video domain types include the target domain type. The sample video set includes N sample videos, where N may be a positive integer, and the N sample videos include a sample video S_i, where i may be a positive integer less than or equal to N. It should be appreciated that the specific process of pre-training the initial classification model may be described as follows: an initial classification model is obtained, where the initial classification model includes the multi-modal feature sub-model and an initial feature fusion sub-model. Further, P sample modal vectors corresponding to the sample video S_i are obtained through the multi-modal feature sub-model, a target sample modal vector is obtained from the P sample modal vectors, and the sample modal vectors other than the target sample modal vector among the P sample modal vectors are determined as candidate sample modal vectors. Here, P may be a positive integer. Further, vector change is performed on the target sample modal vector to obtain an auxiliary sample modal vector, and fusion learning is performed on the auxiliary sample modal vector and the candidate sample modal vectors through the initial feature fusion sub-model to obtain a first fusion sample vector corresponding to the auxiliary sample modal vector. Further, based on the first fusion sample vector and the target sample modal vector, the initial feature fusion sub-model is pre-trained to obtain the feature fusion sub-model.
For ease of understanding, the embodiments of the present application are described by taking P sample mode vectors including a text mode vector, an image mode vector, a video mode vector, and an audio mode vector as examples, where the P sample mode vectors may include a sample text vector, a sample image vector, a sample video vector, and a sample audio vector. At this time, the target sample mode vector may be any one or more of a sample text vector, a sample image vector, a sample video vector and a sample audio vector, and the candidate sample mode vector may be a vector other than the target sample mode vector among the sample text vector, the sample image vector, the sample video vector and the sample audio vector. For example, the sample text vector may be a target sample modal vector, and the sample image vector, sample video vector, and sample audio vector may be candidate sample modal vectors; for another example, the sample image vector may be a target sample modality vector, and the sample text vector, the sample video vector, and the sample audio vector may be candidate sample modality vectors.
The auxiliary sample modal vector and the candidate sample modal vectors can be input into the initial feature fusion sub-model as tokens; the first fusion sample vector can be the prediction result for the changed token, and the target sample modal vector is the vector of that token before the vector change. In addition, for the specific process of determining the loss value of the initial feature fusion sub-model based on the first fusion sample vector and the target sample modal vector, reference may be made to the description below of determining the first model loss value based on the sample normalized recall vector and the sample recall tag information.
It is appreciated that the embodiment of the present application may perform the vector change on the target sample modal vector based on a masked language model (Masked Language Modeling, MLM) and a masked frame model (Mask Frame Model, MFM). The masked language model may randomly replace x% of the tokens (i.e., target sample modal vectors) each time; among the selected tokens, there is a y% probability of being replaced with the mask, a z% probability of being replaced with a random token, and a z% probability of keeping the original token. In this case, the loss value used for pre-training the initial feature fusion sub-model may be a first pre-training loss value. The masked frame model may randomly mask x% of the tokens and replace them with all zeros; in this case, the loss value used for pre-training the initial feature fusion sub-model may be a second pre-training loss value. The embodiment of the present application does not limit the specific values of x, y, and z. The loss function used to determine the second pre-training loss value may be the NCE (Noise Contrastive Estimation) loss; the masked frame model resembles contrastive learning in that the predicted masked feature (i.e., the first fusion sample vector) should be as similar as possible to the original feature (i.e., the target sample modal vector) and as dissimilar as possible from the other features (i.e., the candidate sample modal vectors).
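The token-replacement strategy described above can be sketched as follows; since x, y, and z are left unspecified, the common BERT-style 15%/80%/10%/10% split is assumed here purely for illustration.

```python
# Hedged sketch of the MLM-style vector change: a fraction of the token vectors is
# selected, and each selected token is replaced with the mask vector, a random token,
# or kept unchanged. All numeric values and names are illustrative assumptions.
import torch

def mask_tokens(token_vectors, mask_vector, x=0.15, y=0.8, z=0.1):
    # token_vectors: (seq_len, dim) target sample modal vectors; mask_vector: (dim,)
    seq_len = token_vectors.size(0)
    selected = torch.rand(seq_len) < x                  # select x% of the tokens
    changed = token_vectors.clone()
    for i in torch.nonzero(selected).flatten().tolist():
        r = torch.rand(1).item()
        if r < y:                                       # y%: replace with the mask vector
            changed[i] = mask_vector
        elif r < y + z:                                 # z%: replace with a random token
            changed[i] = token_vectors[torch.randint(seq_len, (1,)).item()]
        # remaining z%: keep the original token
    return changed, selected                            # auxiliary vectors + masked positions
```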
Optionally, the feature fusion sub-model is obtained by pre-training with the sample video set; the sample videos in the sample video set are associated with at least two video domain types, and the at least two video domain types include the target domain type; the sample video set includes N sample videos, where N may be a positive integer. The specific process of pre-training the initial classification model can be described as follows: an initial classification model is obtained, where the initial classification model includes the multi-modal feature sub-model and an initial feature fusion sub-model. Further, P initial modal feature vectors corresponding to each sample video are obtained through the multi-modal feature sub-model, and the initial modal feature vectors belonging to the same modality among the N groups of P initial modal feature vectors are combined into the same initial modal feature vector sequence to obtain P initial modal feature vector sequences. Each initial modal feature vector sequence includes N initial modal feature vectors; here, P may be a positive integer. Further, a candidate modal feature vector sequence is obtained from the P initial modal feature vector sequences, R candidate modal feature vectors are obtained from the candidate modal feature vector sequence, and the order of the R candidate modal feature vectors is adjusted to obtain an order-adjusted candidate modal feature vector sequence. Here, R may be a positive integer less than N. Further, fusion learning is performed, through the initial feature fusion sub-model, on the candidate modal feature vectors in the order-adjusted candidate modal feature vector sequence and the initial modal feature vectors in the P-1 initial modal feature vector sequences to obtain a second fusion sample vector. The P-1 initial modal feature vector sequences are the initial modal feature vector sequences other than the candidate modal feature vector sequence among the P initial modal feature vector sequences. Further, matching tags between the candidate modal feature vectors in the order-adjusted candidate modal feature vector sequence and the initial modal feature vectors in the P-1 initial modal feature vector sequences are obtained, and based on the matching tags and the second fusion sample vector, the initial feature fusion sub-model is pre-trained to obtain the feature fusion sub-model.
For easy understanding, the embodiment of the present application is described by taking P initial modality feature vector sequences including a sequence of text modalities, a sequence of image modalities, a sequence of video modalities, and a sequence of audio modalities as an example, where the P initial modality feature vector sequences may specifically include an initial text modality feature sequence, an initial image modality feature sequence, an initial video modality feature sequence, and an initial audio modality feature sequence. At this time, the candidate modality feature vector sequence may be any one or more of an initial text modality feature sequence, an initial image modality feature sequence, an initial video modality feature sequence, and an initial audio modality feature sequence, and the P-1 initial modality feature vector sequences may be sequences other than the candidate modality feature vector sequences among the initial text modality feature sequence, the initial image modality feature sequence, the initial video modality feature sequence, and the initial audio modality feature sequence. For example, the candidate modal feature vector sequence may be an initial text modal feature sequence, and the initial image modal feature sequence, the initial video modal feature sequence and the initial audio modal feature sequence may be P-1 initial modal feature vector sequences; for another example, the candidate modality feature vector sequence may be an initial image modality feature sequence, and the initial text modality feature sequence, the initial video modality feature sequence, and the initial audio modality feature sequence may be P-1 initial modality feature vector sequences.
The initial feature fusion sub-model may perform fusion learning on the unmatched initial modal feature vectors in the candidate modal feature vector sequence and the P-1 initial modal feature vector sequences after the unmatched adjustment sequence, so as to obtain P second fusion sample vectors, where a matching tag corresponding to the unmatched initial modal feature vectors may be 0 (i.e. unmatched). Similarly, the initial feature fusion sub-model can perform fusion learning on the matched initial modal feature vectors in the matched P initial modal feature vector sequences to obtain P second fusion sample vectors, and a matching label corresponding to the matched initial modal feature vectors can be 1 (namely matching). In addition, the specific process of determining the loss value of the initial feature fusion sub-model based on the matching tag and the second fusion sample vector can be referred to as the following description of determining the loss value of the first model based on the sample normalized recall vector and the sample recall tag information.
It can be understood that the embodiment of the application can pretrain the initial feature fusion sub-model based on a V2T (Video To Text) task, and the Video Text matching task can be used for judging whether vectors of different modes input by the initial feature fusion sub-model are matched. At this time, the loss value used for pre-training the initial feature fusion submodel may be a third pre-training loss value.
For example, considering that there are N sample videos in the sample video set, comparison efficiency can be improved by matching the sample videos in pairs; that is, the number of sample videos may be two, and the two sample videos may be a sample video S_1 and a sample video S_2. The P initial modal feature vectors corresponding to the sample video S_1 may be a first sample text vector, a first sample image vector, a first sample video vector, and a first sample audio vector, and the P initial modal feature vectors corresponding to the sample video S_2 may be a second sample text vector, a second sample image vector, a second sample video vector, and a second sample audio vector. Further, vector combination is performed on the first sample text vector, the first sample image vector, the first sample video vector, the first sample audio vector, the second sample text vector, the second sample image vector, the second sample video vector, and the second sample audio vector to obtain a combined text vector, a combined image vector, a combined video vector, and a combined audio vector, where the combined text vector, the combined image vector, the combined video vector, and the combined audio vector correspond to different sample videos. For example, the first sample text vector, the first sample image vector, the first sample video vector, and the second sample audio vector are determined as the combined text vector, the combined image vector, the combined video vector, and the combined audio vector, respectively. Further, fusion learning is performed on the combined text vector, the combined image vector, the combined video vector, and the combined audio vector through the initial feature fusion sub-model to obtain a second fusion sample vector. Further, based on the second fusion sample vector and the matching tag (the matching tag indicates a mismatch, because the first sample text vector, the first sample image vector, the first sample video vector, and the second sample audio vector belong to different sample videos), parameter adjustment is performed on the initial feature fusion sub-model to obtain the feature fusion sub-model.
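The pairwise construction in this example can be sketched as follows; the dictionary keys and the choice of swapping the audio modality are illustrative assumptions.

```python
# Hedged sketch of building a matched and a mismatched modality combination for the
# video-text matching (V2T-style) pre-training task.
def build_matching_samples(modal_vectors_s1, modal_vectors_s2):
    # modal_vectors_*: dicts with keys "text", "image", "video", "audio" holding the
    # initial modality feature vectors of sample videos S_1 and S_2
    matched = (dict(modal_vectors_s1), 1)        # all modalities from the same video -> tag 1
    mixed = dict(modal_vectors_s1)
    mixed["audio"] = modal_vectors_s2["audio"]   # audio swapped in from the other video
    mismatched = (mixed, 0)                      # matching tag 0: not matched
    return [matched, mismatched]
```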
Optionally, the feature fusion sub-model is obtained by pre-training with the sample video set; the sample videos in the sample video set are associated with at least two video domain types, and the at least two video domain types include the target domain type; the sample video set includes N sample videos, where N may be a positive integer, and the N sample videos include a sample video S_i, where i may be a positive integer less than or equal to N. The specific process of pre-training the initial classification model can be described as follows: an initial classification model is obtained, where the initial classification model includes the multi-modal feature sub-model and an initial feature fusion sub-model. Further, P sample modal vectors corresponding to the sample video S_i are obtained through the multi-modal feature sub-model, and the order of the P sample modal vectors is adjusted to obtain P adjusted sample modal vectors. Here, P may be a positive integer. In other words, the embodiment of the present application may obtain W sample modal vectors from the P sample modal vectors, adjust the order of the W sample modal vectors to obtain W adjusted sample modal vectors, and then determine the W adjusted sample modal vectors and the remaining P-W sample modal vectors as the P adjusted sample modal vectors, where the P-W sample modal vectors are the sample modal vectors other than the W sample modal vectors among the P sample modal vectors, and W may be a positive integer less than or equal to P. Further, fusion learning is performed on the P adjusted sample modal vectors through the initial feature fusion sub-model to obtain P fusion sample modal vectors corresponding to the sample video S_i. Further, full-connection processing is performed on the P fusion sample modal vectors respectively to obtain fusion modal vectors corresponding to the P fusion sample modal vectors. Further, based on the P fusion modal vectors and the P sample modal vectors, the initial feature fusion sub-model is pre-trained to obtain the feature fusion sub-model.
For ease of understanding, the embodiments of the present application are described by taking P sample mode vectors including a text mode vector, an image mode vector, a video mode vector, and an audio mode vector as examples, where the P sample mode vectors may include a sample text vector, a sample image vector, a sample video vector, and a sample audio vector. For example, the adjusted P sample mode vectors may be sample text vectors, sample image vectors, sample audio vectors, and sample video vectors; for another example, the adjusted P sample mode vectors may be sample image vectors, sample video vectors, sample text vectors, and sample audio vectors.
It can be understood that, in the embodiment of the present application, the initial feature fusion sub-model may be pre-trained based on a FOM (Frame Order Modeling) task. The frame order modeling task enables the model to learn cross-modal temporal alignment, to grasp the internal order of the video, and to learn the relationships between modalities. For ease of understanding, the correct order of the inputs to the initial feature fusion sub-model is taken to be the text modality, the image modality, the video modality, and the audio modality. In this case, the loss value used for pre-training the initial feature fusion sub-model may be a fourth pre-training loss value.
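A minimal sketch of the order perturbation used by such an order-modeling task is shown below; it permutes W of the P per-video modality vectors, which also covers the R-vector case described earlier. The function name and return values are illustrative assumptions.

```python
# Hedged sketch of adjusting the order of W out of P modality vectors; the model is
# later trained to recover or recognize the correct order.
import random

def adjust_modal_order(modal_vectors, w):
    # modal_vectors: list of P vectors in the correct order, e.g.
    # [text_vec, image_vec, video_vec, audio_vec]
    positions = random.sample(range(len(modal_vectors)), w)   # W positions to permute
    permuted = positions[:]
    random.shuffle(permuted)
    adjusted = list(modal_vectors)
    for src, dst in zip(positions, permuted):
        adjusted[dst] = modal_vectors[src]                    # move vector src -> slot dst
    return adjusted, positions        # adjusted P vectors + which slots were disturbed
```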
It should be understood that, in the embodiment of the present application, the initial feature fusion sub-model may be pre-trained based on the first pre-training loss value, the initial feature fusion sub-model may be pre-trained based on the second pre-training loss value, the initial feature fusion sub-model may be pre-trained based on the third pre-training loss value, and the initial feature fusion sub-model may be pre-trained based on the fourth pre-training loss value; optionally, the embodiment of the present application may further perform pre-training on the initial feature fusion sub-model based on two, three or four loss values among the first pre-training loss value, the second pre-training loss value, the third pre-training loss value and the fourth pre-training loss value.
It can be understood that the order of the R candidate modality feature vectors is adjusted, so that a comparison learning mode is introduced, and the comparison learning can force the model to learn the association relationship between P modalities, and align different modalities, so as to correctly fuse P features (i.e., the candidate modality feature vector in the candidate modality feature vector sequence after the order adjustment and the initial modality feature vector in the P-1 initial modality feature vector sequence) corresponding to the P modalities respectively.
Therefore, the embodiment of the application can pretrain the initial classification model by adopting the pretraining technology, and the pretraining technology can effectively improve the research and development efficiency while saving training samples, wherein the multi-mode pretraining technology used by the application can comprise, but is not limited to, an MLM task, an MFM task, a V2T task and a FOM task, and the pretraining task can reduce the sample requirements of a fine tuning stage. In addition, the multi-mode pre-training mode is utilized to fully utilize the pre-training model, so that a large amount of sample labeling cost is saved, the training cost and period of the model are reduced, and each mode can be better aligned by improving and optimizing contrast learning, so that the effect of classifying the model is effectively improved.
Step S202, performing multi-mode feature extraction and fusion on a target sample video belonging to the target field type through a multi-mode feature sub-model and a feature fusion sub-model to generate a sample multi-mode fusion feature corresponding to the target sample video;
specifically, a target sample video belonging to a target field type is acquired. Further, text feature extraction is carried out on the target text data of the target sample video through the text network layer, and a sample sub-text vector corresponding to the target sample video is obtained. Further, image characteristic extraction is carried out on target image data of the target sample video through the image network layer, and sample sub-image vectors corresponding to the target sample video are obtained. Further, video feature extraction is carried out on target video data of the target sample video through the video network layer, and sample sub-video vectors corresponding to the target sample video are obtained. Further, audio feature extraction is carried out on the target audio data of the target sample video through the audio network layer, and a sample sub-audio vector corresponding to the target sample video is obtained. Further, fusion learning is carried out on the sample sub-text vector, the sample sub-audio vector, the sample sub-image vector and the sample sub-video vector through the feature fusion sub-model, so that sample multi-mode fusion features corresponding to the target sample video are obtained. The P modal feature vectors include a sample sub-text vector, a sample sub-image vector, a sample sub-video vector, and a sample sub-audio vector, in other words, the sample sub-text vector, the sample sub-image vector, the sample sub-video vector, and the sample sub-audio vector may be collectively referred to as P modal feature vectors corresponding to the target sample video, where the sample sub-text vector, the sample sub-image vector, the sample sub-video vector, and the sample sub-audio vector have the same feature dimension; the feature fusion sub-model may support cross-modal feature fusion.
For a specific process of extracting text features from the target text data, refer to the following description of step S2022 in the embodiment corresponding to fig. 6; for a specific process of extracting image features from the target image data through the image network layer, refer to the description of step S2023 in the embodiment corresponding to fig. 6; for a specific process of extracting video features from the target video data, refer to the description of step S2024 in the embodiment corresponding to fig. 6; for a specific process of extracting the audio features from the target audio data, reference may be made to the description of step S2025 in the embodiment corresponding to fig. 6.
Step S203, carrying out feature recall processing on the sample multi-mode fusion features to obtain sample recall feature vectors corresponding to the target sample video, and carrying out S times of feature sorting processing on the sample multi-mode fusion features to obtain S sample sorting feature vectors corresponding to the target sample video;
for a specific process of performing feature recall processing on the sample multi-mode fusion feature, reference may be made to the description of performing feature recall processing on the multi-mode fusion feature in the embodiment corresponding to fig. 3, which will not be described herein. Similarly, for a specific process of performing the S-order feature sorting process on the sample multi-mode fusion feature, reference may be made to the description of performing the S-order feature sorting process on the multi-mode fusion feature in the embodiment corresponding to fig. 3, which will not be described in detail herein. Here, S may be a positive integer.
Step S204, step-by-step fusion is carried out on S sample sequencing feature vectors through the sample recall feature vectors, and S sample fusion sequencing feature vectors are obtained;
for a specific process of step-by-step fusion of the S sample-ordering feature vectors by the sample recall feature vector, reference may be made to the description of step-by-step fusion of the S sample-ordering feature vectors by the recall feature vector in the embodiment corresponding to fig. 3, which will not be described in detail herein.
Step S205, parameter adjustment is performed on the initial task sub-model based on S levels of sample ordering label information of the target sample video, sample recall feature vectors and S sample fusion ordering feature vectors, and a task sub-model is obtained.
Specifically, the sample recall feature vector is normalized to obtain a sample normalized recall vector corresponding to the sample recall feature vector. Further, a first model loss value of the pre-training classification model is determined based on the sample normalized recall vector and the sample recall tag information of the target sample video. Further, the S sample fusion sorting feature vectors are adjusted step by step according to the first model loss value to obtain S adjusted sample fusion sorting feature vectors, and S second model loss values of the pre-training classification model are determined based on the S adjusted sample fusion sorting feature vectors and the S levels of sample sorting tag information of the target sample video. The S levels of sample sorting tag information all belong to the sample recall tag information. Further, a total model loss value of the pre-training classification model is determined according to the first model loss value and the S second model loss values, and parameter adjustment is performed on the initial task sub-model based on the total model loss value to obtain the task sub-model. The first model loss value and the S second model loss values may be weighted and summed to generate the total model loss value of the pre-training classification model. The multi-modal feature sub-model, the feature fusion sub-model, and the task sub-model form the target classification model; the target classification model is used for predicting the recall classification information corresponding to a target video belonging to the target domain type and the S levels of sorting classification information corresponding to the target video.
It should be appreciated that the specific process of determining the S second model loss values of the pre-training classification model may be described as follows: the first sample fusion sorting feature vector among the S sample fusion sorting feature vectors is adjusted according to the first model loss value to obtain an adjusted first sample fusion sorting feature vector. Further, the adjusted first sample fusion sorting feature vector is normalized to obtain a sample normalized sorting vector corresponding to the adjusted first sample fusion sorting feature vector. Further, the first of the S second model loss values of the pre-training classification model is determined based on the sample normalized sorting vector corresponding to the adjusted first sample fusion sorting feature vector and the first level of sample sorting tag information of the target sample video. Further, the second sample fusion sorting feature vector is adjusted based on this first second model loss value to obtain an adjusted second sample fusion sorting feature vector, and the remaining S-1 second model loss values of the pre-training classification model are determined based on the adjusted second sample fusion sorting feature vector, the S-2 sample fusion sorting feature vectors, and the S-1 levels of sample sorting tag information. The S-2 sample fusion sorting feature vectors are the sample fusion sorting feature vectors other than the first and second sample fusion sorting feature vectors among the S sample fusion sorting feature vectors; the S-1 levels of sample sorting tag information are the sample sorting tag information other than the first level of sample sorting tag information among the S levels of sample sorting tag information.
It can be understood that the adjusted second sample fusion sorting feature vector is normalized to obtain a sample normalized sorting vector corresponding to the adjusted second sample fusion sorting feature vector. Further, the second of the S second model loss values of the pre-training classification model is determined based on the sample normalized sorting vector corresponding to the adjusted second sample fusion sorting feature vector and the second level of sample sorting tag information of the target sample video. Further, the third sample fusion sorting feature vector is adjusted based on this second model loss value to obtain an adjusted third sample fusion sorting feature vector, and the remaining S-2 second model loss values of the pre-training classification model are determined based on the adjusted third sample fusion sorting feature vector, the S-3 sample fusion sorting feature vectors, and the S-2 levels of sample sorting tag information. The S-3 sample fusion sorting feature vectors are the sample fusion sorting feature vectors other than the first, second, and third sample fusion sorting feature vectors among the S sample fusion sorting feature vectors; the S-2 levels of sample sorting tag information are the sample sorting tag information other than the first and second levels of sample sorting tag information among the S levels of sample sorting tag information. By analogy, the S second model loss values can be generated.
It can be understood that the specific process of determining the first of the S second model loss values of the pre-training classification model can be described as follows: a classification tag vector corresponding to the first level of sample sorting tag information is generated according to the first level of sample sorting tag information of the target sample video. Further, the first second model loss value of the pre-training classification model is determined according to the sample normalized sorting vector corresponding to the adjusted first sample fusion sorting feature vector and the classification tag vector. It should be appreciated that the loss function in the model training process may be used to represent the degree of difference between the predicted value and the actual value; the smaller the loss value (e.g., the first model loss value) corresponding to the loss function, the better the model, so the goal of training a machine learning model is to find the point at which the model loss function reaches its minimum value. For example, the model loss function may be a cross-entropy loss function.
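The combination of the recall loss and the S sorting losses into a total model loss can be sketched as follows, assuming cross-entropy for every branch and externally supplied weights; the shapes and names are illustrative assumptions.

```python
# Hedged sketch of the weighted total loss in step S205 (PyTorch assumed).
import torch.nn.functional as F

def total_model_loss(recall_logits, recall_target, sorting_logits_list, sorting_targets,
                     weights):
    # recall_logits: (batch, num_recall_classes); recall_target: (batch,)
    first_loss = F.cross_entropy(recall_logits, recall_target)       # first model loss value
    second_losses = [F.cross_entropy(logits, target)                 # S second model loss values
                     for logits, target in zip(sorting_logits_list, sorting_targets)]
    losses = [first_loss] + second_losses
    return sum(w * l for w, l in zip(weights, losses))               # weighted sum
```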
Therefore, after the pre-training classification model is obtained, multi-mode feature extraction and fusion are performed on the target sample video through the pre-training classification model to generate the sample multi-mode fusion feature corresponding to the target sample video; feature recall processing and S rounds of feature sorting processing are then performed on the sample multi-mode fusion feature, the S sample sorting feature vectors obtained through the S rounds of feature sorting processing are fused step by step with the sample recall feature vector obtained through the feature recall processing, and parameter adjustment is performed on the initial task sub-model based on the S sample fusion sorting feature vectors obtained through the step-by-step fusion, the sample recall tag information and the sample sorting tag information. In this way, the embodiment of the application can process the sample multi-mode fusion feature corresponding to the target sample video in parallel, that is, the features of the recall stage and the features of the sorting stage are extracted in parallel and fused in a step-by-step manner, so that the target classification model for executing the recall task and the sorting task is obtained through training. When the sorting classification information corresponding to the target video is determined through the target classification model, the time for determining the sorting classification information can be reduced, thereby improving the efficiency of determining the sorting classification information corresponding to the target video.
Further, referring to fig. 6, fig. 6 is a flow chart of a data processing method according to an embodiment of the application. The data processing method may include the following steps S2021 to S2026, where steps S2021 to S2026 are one embodiment of step S202 in the embodiment corresponding to fig. 5.
Step S2021, obtaining a target sample video belonging to a target field type;
the number of the target sample videos may be one or more, and the embodiment of the present application may acquire one or more target sample videos belonging to the target domain type, and further execute step S2022-step S2026 described below with respect to the one or more target sample videos.
Step S2022, extracting text features of target text data of the target sample video through a text network layer to obtain sample sub-text vectors corresponding to the target sample video;
specifically, feature extraction is performed on target text data of a target sample video through a text network layer, so that text feature vectors corresponding to the target text data are obtained. Further, word segmentation processing is carried out on the target text data to obtain text word segmentation of the target text data, text position coding is carried out on text positions of the text word segmentation in the target text data, and text position vectors corresponding to the text word segmentation are obtained. Further, text modal characteristics corresponding to the target text data are obtained, and the text characteristic vectors, the text position vectors and the text modal characteristics are fused to obtain sample sub-text vectors corresponding to the target sample video. The text mode feature can represent a mode number of a text mode and is used for uniquely identifying one mode; the method for fusing the text feature vector, the text position vector and the text modal feature can be a weighted average method or a feature splicing method, and the embodiment of the application does not limit the specific fusion process.
The target text data may include, but is not limited to, title information, subtitle text information, voice text information, and description keywords (i.e., HashTag) of the target sample video, and may be obtained by splicing the title information, the subtitle text information, the voice text information, and the description keywords. The description keywords may be words entered by the publisher when uploading the target sample video; the voice text information of the target sample video may be recognized through ASR (Automatic Speech Recognition); the subtitle text information of the target sample video may be recognized through OCR (Optical Character Recognition); and the title information is the publisher's subjective description of what the target sample video expresses and generally covers the high-level semantics the target sample video intends to convey.
It can be understood that, because OCR recognition may be inaccurate, fixed-position OCR results need to be de-duplicated across picture switches, spoken-narration OCR results need to be retained, news-ticker OCR results need to be removed, and so on, the embodiment of the application may perform denoising processing on the OCR recognition results (i.e., the subtitle text information). The denoising processing may filter out single-character OCR results, purely numeric OCR results, purely alphabetic OCR results, OCR results whose subtitle positions shift only slightly between two adjacent frames and whose character repetition rate is high, OCR results whose subtitles are at the bottom of the screen with a small height, and the like, and the denoised subtitle text information may then be spliced with the other text information to obtain the target text data.
According to the embodiment of the application, feature extraction may be performed on the target text data by a BERT model further pre-trained on a large-scale information-flow text corpus, so that features of the text modality can be extracted effectively; optionally, the embodiment of the present application may also perform feature extraction on the target text data through a general-purpose BERT model, and the embodiment of the present application does not limit the type of the model used for text feature extraction.
Optionally, the embodiment of the application can also obtain a keyword vector (for example, a word2vec vector corresponding to the description keyword) corresponding to the description keyword, so as to fuse the keyword vector, the text feature vector, the text position vector and the text modal feature, and obtain a sample sub-text vector corresponding to the target sample video. The word2vec vector has side information (auxiliary information) in the global text, which is helpful for the multi-modal feature vector (i.e. sample sub-text vector) to better understand the overall category division space.
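For ease of understanding, the following is a minimal sketch of how the sample sub-text vector may be assembled from a text feature vector, a text position vector and a text modal feature. The stand-in encoder (a plain embedding layer instead of a BERT model), the concatenation-based fusion and all dimensions are illustrative assumptions; the optional keyword (word2vec) vector mentioned above could be concatenated in the same way.

```python
# Sketch only: building the sample sub-text vector.
import torch
import torch.nn as nn

class TextNetworkLayer(nn.Module):
    def __init__(self, vocab_size=30000, dim=256, max_len=512, num_modalities=4):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)         # stand-in for a BERT encoder
        self.pos_emb = nn.Embedding(max_len, dim)              # text position encoding
        self.modality_emb = nn.Embedding(num_modalities, dim)  # modality-number embedding
        self.text_modality_id = 0                              # assumed id of the text modality

    def forward(self, token_ids):                              # token_ids: LongTensor [seq_len]
        positions = torch.arange(token_ids.size(0))
        text_feat = self.token_emb(token_ids).mean(dim=0)      # text feature vector
        pos_feat = self.pos_emb(positions).mean(dim=0)         # text position vector
        mod_feat = self.modality_emb(torch.tensor(self.text_modality_id))  # text modal feature
        # Fusion by concatenation (a weighted average would equally fit the text above).
        return torch.cat([text_feat, pos_feat, mod_feat], dim=-1)  # sample sub-text vector
```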
Step S2023, extracting image features of target image data of the target sample video through the image network layer to obtain sample sub-image vectors corresponding to the target sample video;
specifically, feature extraction is performed on target image data of a target sample video through an image network layer, and an image feature vector corresponding to the target image data is obtained. Further, image mode characteristics corresponding to the target image data are obtained, and the image characteristic vectors and the image mode characteristics are fused to obtain sample sub-image vectors corresponding to the target sample video. The image mode characteristics can represent the mode number of the image mode and are used for uniquely identifying one mode; the method for fusing the image feature vector and the image mode features can be a weighted average method or a feature stitching method, and the embodiment of the application does not limit the specific fusion process.
The target image data may be a cover map of the target sample video, and the cover map may be an image captured in the target sample video, or may be an image uploaded by a publisher. It should be understood that the embodiment of the present application may perform feature extraction on the target image data through the SwinT (Swin Transformer) model, and may also perform feature extraction on the target image data through the VIT (Vision Transformer) model, and the embodiment of the present application does not limit the type of model used for image feature extraction.
Optionally, in the embodiment of the present application, the target detection model may be used to perform target detection on the target image data to obtain a detection object in the target image data, obtain a detection object vector corresponding to the detection object, and further fuse the detection object vector, the image feature vector and the image mode feature to obtain a sample sub-image vector corresponding to the target sample video. The target detection model may be a FastRCNN (Fast Region Convolutional Neural Networks) model, and the embodiment of the application does not limit the model type of the target detection model.
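For ease of understanding, the following is a minimal sketch of the image branch: the cover image is encoded into an image feature vector (the SwinT/ViT backbone is replaced by a small stand-in), an optional detection object vector obtained from a target detection model is projected to the same dimension, and the vectors are fused with the image modal feature. All module names, dimensions and the mean-pooling fusion are illustrative assumptions.

```python
# Sketch only: building the sample sub-image vector.
import torch
import torch.nn as nn

class ImageNetworkLayer(nn.Module):
    def __init__(self, dim=256, num_modalities=4):
        super().__init__()
        self.backbone = nn.Sequential(                 # stand-in for a SwinT / ViT encoder
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, dim))
        self.detector_proj = nn.Linear(4, dim)          # projects detection boxes (assumed 4-d)
        self.modality_emb = nn.Embedding(num_modalities, dim)
        self.image_modality_id = 1                      # assumed id of the image modality

    def forward(self, cover_image, det_boxes=None):     # cover_image: [1, 3, H, W]
        img_feat = self.backbone(cover_image)[0]         # image feature vector
        mod_feat = self.modality_emb(torch.tensor(self.image_modality_id))
        parts = [img_feat, mod_feat]
        if det_boxes is not None:                         # optional detection object vector
            parts.append(self.detector_proj(det_boxes).mean(dim=0))
        return torch.stack(parts).mean(dim=0)             # sample sub-image vector
```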
Step S2024, extracting video features of target video data of the target sample video through the video network layer to obtain sample sub-video vectors corresponding to the target sample video;
Specifically, feature extraction is performed on target video data of a target sample video through a video network layer, so as to obtain a video feature vector corresponding to the target video data. Further, frame extraction processing is carried out on the target video data to obtain key video frames in the target video data, and feature extraction is carried out on the key video frames through a video network layer to obtain video frame feature vectors corresponding to the key video frames. Further, video mode features corresponding to the target video data are obtained, and the video feature vectors, the video frame feature vectors and the video mode features are fused to obtain sample sub-video vectors corresponding to the target sample video. The video mode characteristics can represent the mode number of the video mode and are used for uniquely identifying one mode; the method for fusing the video feature vector, the video frame feature vector and the video mode feature can be a weighted average method or a feature splicing method, and the embodiment of the application does not limit the specific fusion process.
It may be appreciated that the video feature vectors corresponding to two target sample videos may be used to measure a distance between the two target sample videos, and thus a similarity between the two target sample videos may be calculated; the video feature vector may represent a content-based "implicit" feature. Thus, the video feature vector carries two layers of meaning: first, the video feature vector is a low-dimensional dense feature; second, the video feature vector is a vector suitable for similarity measurement, and the "distance" between two such vectors represents the "similarity" of the two videos.
The embodiment of the application can perform feature extraction on the target video data through Video Swin Transformer, can perform feature extraction on the key video frames through a SwinT model, and can perform feature extraction on the key video frames through a VIT model, and the embodiment of the application does not limit the types of models used for video feature extraction.
Video Swin Transformer is an improvement based on the Swin Transformer; its basic structure is very close to that of the Swin Transformer, with a frame (time) dimension added when the model performs its computation. The most distinctive characteristic of the Swin Transformer is a structure similar to the convolution + pooling structure in a CNN (Convolutional Neural Network); in the Swin Transformer, this structure becomes Swin Transformer Block + Patch Merging, and the specially designed window-based Transformer computation reduces the computational cost of the model.
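For ease of understanding, the frame extraction step mentioned above can be sketched as uniform sampling of key video frames before the frames are fed to a Video Swin Transformer or SwinT-style encoder; the sampling strategy and the number of key frames are assumptions made for illustration only.

```python
# Sketch only: uniform key-frame sampling for the video branch.
def sample_key_frames(num_frames: int, num_key_frames: int = 8) -> list[int]:
    """Return the indices of uniformly spaced key frames in a decoded clip."""
    if num_frames <= num_key_frames:
        return list(range(num_frames))
    step = num_frames / num_key_frames
    return [int(i * step) for i in range(num_key_frames)]

# Example: a 300-frame clip yields the indices [0, 37, 75, 112, 150, 187, 225, 262].
print(sample_key_frames(300))
```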
Step S2025, extracting audio features from the target audio data of the target sample video through the audio network layer to obtain a sample sub-audio vector corresponding to the target sample video;
specifically, frame-dividing processing is performed on target audio data of a target sample video to obtain at least two audio frames in the target audio data, and feature extraction is performed on the at least two audio frames through an audio network layer respectively to obtain audio frame feature vectors corresponding to the at least two audio frames respectively. Further, audio position coding is carried out on the audio position of each audio frame in the target audio data, and an audio frame position vector corresponding to each audio frame is obtained. Further, audio mode characteristics corresponding to the target audio data are obtained, at least two audio frame characteristic vectors, at least two audio frame position vectors and the audio mode characteristics are fused, and sample sub-audio vectors corresponding to the target sample video are obtained. The audio mode feature can represent a mode number of an audio mode and is used for uniquely identifying one mode; the method for fusing the at least two audio frame feature vectors, the at least two audio frame position vectors and the audio mode features can be a weighted average method or a feature splicing method, and the embodiment of the application does not limit the specific fusion process.
The embodiment of the application may perform feature extraction on the at least two audio frames through WavLM (Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing), and does not limit the type of the model used for audio feature extraction. In addition, the sample sub-audio vector can significantly improve the classification accuracy of content types such as emotion, comedy, film and television variety shows, and video courses.
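For ease of understanding, the audio branch can be sketched as follows: the waveform is split into fixed-length audio frames, each frame is encoded (the WavLM encoder is replaced by a linear stand-in), the audio frame position vectors and the audio mode feature are added, and the result is pooled into the sample sub-audio vector. The frame length, dimensions and the averaging-based fusion are illustrative assumptions.

```python
# Sketch only: building the sample sub-audio vector.
import torch
import torch.nn as nn

def frame_audio(waveform: torch.Tensor, frame_len: int = 16000) -> torch.Tensor:
    """waveform: [num_samples] -> [num_frames, frame_len] (the tail is dropped)."""
    num_frames = waveform.size(0) // frame_len
    return waveform[: num_frames * frame_len].view(num_frames, frame_len)

class AudioNetworkLayer(nn.Module):
    def __init__(self, frame_len=16000, dim=256, max_frames=128, num_modalities=4):
        super().__init__()
        self.frame_encoder = nn.Linear(frame_len, dim)   # stand-in for a WavLM encoder
        self.pos_emb = nn.Embedding(max_frames, dim)      # audio frame position vectors
        self.modality_emb = nn.Embedding(num_modalities, dim)
        self.audio_modality_id = 3                        # assumed id of the audio modality

    def forward(self, waveform):                          # waveform shorter than max_frames seconds
        frames = frame_audio(waveform)                                 # [T, frame_len]
        feats = self.frame_encoder(frames)                             # audio frame feature vectors
        feats = feats + self.pos_emb(torch.arange(frames.size(0)))     # add position vectors
        mod = self.modality_emb(torch.tensor(self.audio_modality_id))  # audio mode feature
        return (feats.mean(dim=0) + mod) / 2                           # sample sub-audio vector
```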
In step S2026, fusion learning is performed on the sample sub-text vector, the sample sub-audio vector, the sample sub-image vector and the sample sub-video vector through the feature fusion sub-model, so as to obtain the sample multi-mode fusion feature corresponding to the target sample video.
It can be understood that, in the embodiment of the application, each feature output by the multi-modal feature sub-model (i.e., each of the P modal feature vectors) can be used as a token and input into the feature fusion sub-model as a feature sequence, so that modality fusion is performed through the Transformer-network-based feature fusion sub-model.
Optionally, the embodiment of the application can also perform fusion learning on only one, two or three of the sample sub-text vector, the sample sub-audio vector, the sample sub-image vector and the sample sub-video vector to obtain the sample multi-mode fusion feature corresponding to the target sample video. For example, fusion learning may be performed on the sample sub-text vector, the sample sub-image vector and the sample sub-video vector to obtain the sample multi-mode fusion feature corresponding to the target sample video.
For ease of understanding, referring to fig. 4a again, fig. 4a illustrates that the number of modes is 4, and the 4 modes may specifically include a text mode, an image mode, a video mode, and an audio mode, and optionally, the modes in the embodiment of the present application may also be 1, 2, or 3 of the text mode, the image mode, the video mode, and the audio mode.
As shown in fig. 4a, P modal feature vectors corresponding to the target sample video may be obtained through the multi-modal feature sub-model 40a, where the P modal feature vectors may include a modal feature vector 42d corresponding to a text mode, a modal feature vector 43c corresponding to an image mode, a modal feature vector 44d corresponding to a video mode, and a modal feature vector 45d corresponding to an audio mode; alternatively, the P modal feature vectors may include a modal feature vector 42e corresponding to a text mode, a modal feature vector 43d corresponding to an image mode, a modal feature vector 44e corresponding to a video mode, and a modal feature vector 45e corresponding to an audio mode. The modal feature vector 42d, the modal feature vector 43c, the modal feature vector 44d, and the modal feature vector 45d may be the modal feature vectors of different modes corresponding to the same video, and the modal feature vector 42e, the modal feature vector 43d, the modal feature vector 44e, and the modal feature vector 45e may be the modal feature vectors of different modes corresponding to the same video.
As shown in fig. 4a, the feature fusion sub-model 40b may perform fusion learning on the modal feature vector 42d, the modal feature vector 43c, the modal feature vector 44d and the modal feature vector 45d, so as to obtain a sample multi-modal fusion feature (i.e. attention feature) corresponding to the target sample video. The mode feature vector 42d, the mode feature vector 43c, the mode feature vector 44d and the mode feature vector 45d may be subjected to attention processing by the converged network layer 41a to obtain an output of the converged network layer 41a, and then the output of the converged network layer 41a is subjected to attention processing by the converged network layer 41b to obtain an output of the converged network layer 41b, and then the output of the converged network layer 41b is subjected to attention processing by the converged network layer 41c to obtain an attention feature corresponding to the target sample video.
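For ease of understanding, the attention-based fusion performed by the fusion network layers 41a to 41c can be sketched as a stack of Transformer encoder layers over the modality tokens; the number of layers, heads and dimensions below are illustrative assumptions.

```python
# Sketch only: Transformer-based fusion of the P modal feature vectors.
import torch
import torch.nn as nn

dim = 768
fusion_submodel = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=3,                                    # mirrors fusion network layers 41a-41c
)
modality_tokens = torch.randn(1, 4, dim)             # one token per modality: text, image, video, audio
attention_features = fusion_submodel(modality_tokens)        # attention features, shape [1, 4, dim]
multimodal_fusion_feature = attention_features.mean(dim=1)   # pooled sample multi-mode fusion feature
```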
It will be appreciated that, where the classification model shown in fig. 4a is the pre-training classification model, the recall feature vector shown in fig. 4a may be a sample recall feature vector and the fused ordering feature vector shown in fig. 4a may be a sample fusion ordering feature vector. The first model loss value (i.e., model loss value 47a) of the pre-training classification model may be determined based on the sample recall feature vector and the sample recall tag information of the target sample video, and the second model loss value (i.e., model loss value 47b) of the pre-training classification model may be determined based on the model loss value 47a, the sample fusion ordering feature vector, and the sample ordering tag information of the target sample video.
Wherein, as shown in fig. 4a, the modal feature vector 42d (i.e., sample sub-text vector 42 d) may be obtained by fusing the text feature vector 42a, the text position vector 42b, and the text modal feature 42c, the modal feature vector 43c (i.e., sample sub-image vector 43 c) may be obtained by fusing the image feature vector 43a and the image modal feature 43b, the modal feature vector 44d (i.e., sample sub-video vector 44 d) may be obtained by fusing the video feature vector 44a, the video frame feature vector 44b, and the video modal feature 44c, and the modal feature vector 45d (i.e., sample sub-audio vector 45 d) may be obtained by fusing the at least two audio frame feature vectors 45a, the at least two audio frame position vectors 45b, and the audio modal feature 45 c.
For ease of understanding, referring again to fig. 4b, where the classification model shown in fig. 4b is the pre-training classification model, the recall feature vector shown in fig. 4b may be a sample recall feature vector, the fused ordering feature vector 54a shown in fig. 4b may be a first sample fusion ordering feature vector, and the fused ordering feature vector 54b shown in fig. 4b may be a second sample fusion ordering feature vector. The first model loss value (i.e., model loss value 55a) of the pre-training classification model may be determined based on the sample recall feature vector and the sample recall tag information of the target sample video; the first second model loss value (i.e., model loss value 55b) of the pre-training classification model may be determined based on the model loss value 55a, the sample fusion ordering feature vector 54a, and the first level of sample ordering tag information of the target sample video; and the second second model loss value (i.e., model loss value 55c) of the pre-training classification model may be determined based on the model loss value 55b, the sample fusion ordering feature vector 54b, and the second level of sample ordering tag information of the target sample video.
Therefore, the embodiment of the application can acquire a target sample video belonging to the target field type, perform feature extraction on the target text data, the target image data, the target video data and the target audio data of the target sample video to obtain the sample sub-text vector, sample sub-image vector, sample sub-video vector and sample sub-audio vector corresponding to the target sample video, and then perform fusion learning on these vectors to obtain the sample multi-mode fusion feature corresponding to the target sample video. It can be understood that the embodiment of the application can use a multi-modal content understanding algorithm to fuse the target text data, the target image data, the target video data and the target audio data, and can deepen the understanding of the video content by mining information through multi-modal, multi-level modeling, thereby realizing the downstream recall and sorting tasks and improving the accuracy of the predicted recall classification information corresponding to the target video and the S levels of sorting classification information corresponding to the target video.
Specifically, referring to fig. 7, fig. 7 is a flowchart of a method and a system for identifying a large-scale fine-granularity tag of video content based on a machine learning multi-branch network according to an embodiment of the present application. As shown in fig. 7, step S11 to step S21 may be one execution path, step S31 to step S32 may be one execution path, step S41 to step S42 may be one execution path, step S51 to step S54 may be one execution path, and the 4 execution paths may be synchronously cross-executed.
As shown in fig. 7, in the execution path of step S11 to step S21, the content production end (for example, the application client in the target terminal device) may execute step S11 to upload the target video to the uplink and downlink content interface service through the content upload interface, so that the uplink and downlink content interface service, when acquiring the target video uploaded by the target terminal device, may execute step S12 to send the target video (i.e., the source file) to the content storage service, so that the content storage service stores the target video in the content database. The content storage service can be deployed on a group of widely distributed storage servers close to users, and access can be accelerated at the edge by CDN acceleration servers acting as a distributed cache. Further, the uplink and downlink content interface service may execute step S13 to transcode (i.e., re-transcode) the target video, obtain the meta-information of the target video (e.g., file size, cover map link, code rate, file format, title, release time, author, etc.), and write the meta-information into the content database, which improves the playback compatibility of the target video on each platform.
Further, the uplink and downlink content interface service may execute step S14 to directly submit the target video to the dispatch center service for subsequent content processing and circulation, where the target video triggers the start of the scheduling process and enters the scheduling queue. The dispatch center service may then execute step S15 to call the deduplication service to perform deduplication processing on the target video (the deduplication processing may vectorize the target video and determine the similarity between videos by comparing the distances between the vectors), and write the result of the deduplication processing into the content database. When the deduplication processing is passed, the dispatch center service can schedule the manual auditing system to manually audit the target video through step S17, and update the meta-information in the content database based on the deduplication result through step S16; when the deduplication processing is not passed, the dispatch center service can delete the target video, so that a target video that does not pass the deduplication processing will not be reviewed by the manual auditing system. In addition, for videos that are similar but not filtered out as duplicates, the content similarity and similarity relation chains can be output for the recommendation system to use when scattering content.
Further, when the result of the deduplication processing indicates that the deduplication processing is passed, the manual auditing system may execute step S18 to read the target video and its meta-information from the content database and perform a first content audit (i.e., a primary audit) and a second content audit (i.e., a re-audit) on the read information. It can be appreciated that when the first content audit of the target video passes, the second content audit may be performed on the target video, and when the second content audit of the target video passes, the target video may be used as distributable content; when the first content audit and the second content audit of the target video are not passed, step S18 may be executed to update the meta-information of the target video, and the target video satisfying the integrity requirement is taken as distributable content. In addition, while the manual auditing is performed, the machine can simultaneously extract the classification information corresponding to the target video through an algorithm (such as the multi-modal content fine-granularity label prediction model). The manual auditing system is a system with complex business logic developed on a web database; the first content audit can audit the sensitivity of the target video, and the second content audit can confirm the labels and quality of the target video, so that the accuracy and efficiency of high-level label annotation of the video are improved through human-machine collaboration.
Further, the dispatch center service may execute step S19 to flexibly set different scheduling policies according to the category of the content recorded in the meta-information obtained from the content database. Further, the dispatch center service may execute step S20 to enable content distribution through the content distribution export service, acquire at least one piece of distributable content (e.g., the target video) from the content database, and distribute the at least one piece of distributable content to the content consumption end through step S21. At this time, the content consumption end may consume the target video, and the consumption behavior is accumulated on the classification information (for example, the sorting classification information) corresponding to the target video.
It may be appreciated that the video recommendation may be performed by a video recommendation algorithm; for example, the video recommendation algorithm may be a collaborative recommendation algorithm, a matrix factorization algorithm, a supervised learning algorithm (for example, a logistic regression model), or a deep learning model (for example, a factorization machine or a gradient boosting decision tree (Gradient Boosting Decision Tree, GBDT for short)).
As shown in fig. 7, in the execution path from step S31 to step S32, the content consumption end may execute step S31: when the content consumption end selects to watch a certain video (for example, the target video), index information of the target video may be obtained from the uplink and downlink content interface service, where the index information may be a URL (Uniform Resource Locator) address corresponding to the target video. Further, the content consumption end may execute step S32 to directly download the source file (i.e., the streaming media file) of the target video from the content storage service based on the URL address and play the obtained source file through a local player, where the content consumption end may display the content according to different tag sets by using the classification information of the content (i.e., the target video) provided by the server. Meanwhile, the content consumption end may report the playback stuttering, loading time, consumption behavior and the like observed during the downloading and playback process to the content storage service.
It can be appreciated that related data such as consumption behavior, loading time, and playback stuttering are involved in the present application; when the above embodiments of the present application are applied to specific products or technologies, user permission or consent needs to be obtained, and the collection, use, and processing of the related data need to comply with the relevant laws, regulations and national standards of the relevant country. For example, the content consumption end may display a prompt message "whether to record the current consumption behavior and send the recorded information to the server", and the consumption behavior may be uploaded to the server only after the user corresponding to the content consumption end grants authorization.
As shown in fig. 7, in the execution path from step S41 to step S42, the download file system may execute step S41 to download the source file of the target video from the content storage service, where the download file system can control the speed and progress of the download; the download file system is typically a set of parallel servers composed of related task scheduling and distribution clusters. Further, the download file system may execute step S42 to call the video content frame extraction and audio separation service to process the source file and obtain the video content feature information of the target video; for example, the video content feature information may be the key video frames extracted from the target video data, the audio frames obtained by framing the target audio data, and the subtitle text information obtained by performing OCR text recognition on the extracted video frames, which can serve as additional input dimensions for the classification of the video content. For the download file system, the content storage service serves both as a data source for external services and as a data source for internal services from which the original video data is acquired for related processing, and the paths of the internal and external data sources are usually arranged separately to avoid mutual influence.
As shown in fig. 7, in the execution path from step S51 to step S54, the dispatch center service may execute step S51 to call the multi-modal content fine-granularity tag service to complete the fine-granularity multi-label prediction processing and tagging of the video content, and the multi-modal content fine-granularity tag service may execute step S52 to serve the multi-modal content fine-granularity tag model (i.e., the multi-modal content fine-granularity label prediction model) and complete the identification and tagging of the multi-modal content fine-granularity tags (i.e., the recall classification information and the S levels of sorting classification information) of the main-link content in the content flow. Further, the multi-modal content fine-granularity tag model may acquire the meta-information from the content database through step S53 and acquire the video content feature information through step S54. The multi-modal content fine-granularity label prediction model can make multi-dimensional use of the modality information of the video content, such as the image modality, text modality, video modality, and audio modality, and then use a small amount of supervised data to perform task-level modeling, so as to construct the multi-modal content fine-granularity label prediction model (i.e., the target classification model) of the video content classification system and better understand the video content. This plays a great role in content operation, content recommendation, and content clustering/scattering, and helps improve the distribution effect of the information flow. Meanwhile, the multi-modal content fine-granularity tag service can save the obtained classification result of the content (i.e., the recall classification information and the S levels of sorting classification information) into the content database through the dispatch center service.
It should be understood that after the content production end uploads the target video to the uplink and downlink content interface service, the target video may enter the server through the uplink and downlink content interface service. The content distribution export service, the uplink and downlink content interface service, the content storage service, the dispatch center service, the content duplication elimination service, the multi-modal content fine granularity tagging service, the video content extraction and audio separation service shown in fig. 7 may be a server program deployed on a server, which provides a remote network service specifically for an application client.
Further, referring to fig. 8, fig. 8 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application, where the data processing apparatus 1 may include: the first fusion module 11, the first processing module 12, the recall classification module 13 and the order classification module 14;
the first fusion module 11 is used for carrying out multi-mode feature extraction fusion on the target video and generating multi-mode fusion features corresponding to the target video;
the first processing module 12 is configured to perform feature recall processing on the multimodal fusion feature to obtain a recall feature vector corresponding to the target video, and perform feature ordering processing on the multimodal fusion feature for S times to obtain S ordered feature vectors corresponding to the target video; s is a positive integer;
Wherein the first processing module 12 comprises: a feature input unit 121, a first processing unit 122, a second processing unit 123;
a feature input unit 121 for inputting the multimodal fusion feature to a task sub-model in the target classification model; the task sub-model comprises a recall full-connection layer and S sorting full-connection layers; the S sorting full-connection layers comprise a sorting full-connection layer H_i, where i is a positive integer less than or equal to S;
the first processing unit 122 is configured to perform full-connection processing on the multi-mode fusion feature in the recall full-connection layer to obtain a candidate recall feature vector corresponding to the target video, and perform full-connection processing on the candidate recall feature vector to obtain a recall feature vector output by the recall full-connection layer;
a second processing unit 123 for, in the sorting full-connection layer H_i, performing full connection processing on the multi-mode fusion feature to obtain a candidate sorting feature vector corresponding to the target video, and performing full connection processing on the candidate sorting feature vector to obtain the sorting feature vector output by the sorting full-connection layer H_i; the S sorting feature vectors corresponding to the target video are the sorting feature vectors respectively output by the S sorting full-connection layers.
The specific implementation manner of the feature input unit 121, the first processing unit 122, and the second processing unit 123 may be referred to the description of step S102 in the embodiment corresponding to fig. 3, which will not be described herein.
The recall classifying module 13 is used for determining recall classifying information corresponding to the target video according to the recall characteristic vector;
the recall classifying module 13 is specifically configured to normalize the recall feature vectors to obtain normalized recall vectors corresponding to the recall feature vectors; the normalized recall vector includes at least two recall vector parameters;
the recall classification module 13 is specifically configured to sort at least two recall vector parameters to obtain sorted at least two recall vector parameters;
the recall classification module 13 is specifically configured to acquire, from the sorted at least two recall vector parameters, the H_1 top-ranked recall vector parameters, and determine the label information corresponding to the H_1 recall vector parameters as the recall classification information corresponding to the target video; H_1 is an integer greater than 1.
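For ease of understanding, the processing of the recall classification module 13 can be sketched as follows: the recall feature vector is normalized with softmax, the recall vector parameters are sorted, and the label information of the top-ranked H_1 parameters is returned as the recall classification information. The label list and the value of H_1 below are illustrative assumptions.

```python
# Sketch only: normalization, sorting and top-H_1 selection of recall labels.
import torch
import torch.nn.functional as F

def recall_classify(recall_feature_vector, labels, h1):
    normalized = F.softmax(recall_feature_vector, dim=-1)     # normalized recall vector
    _, indices = torch.sort(normalized, descending=True)      # sorted recall vector parameters
    return [labels[int(i)] for i in indices[:h1]]             # labels of the top-ranked H_1 parameters

# Example with five candidate classes and H_1 = 2.
vec = torch.tensor([0.2, 2.3, -0.5, 1.1, 0.0])
print(recall_classify(vec, ["food", "travel", "sports", "music", "news"], h1=2))
```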
The sorting and classifying module 14 is configured to perform step-by-step fusion on the S sorting feature vectors through the recall feature vectors to obtain S fused sorting feature vectors, and perform step-by-step sorting on the recall classifying information according to the S fused sorting feature vectors to obtain S-level sorting classifying information corresponding to the target video; the S-level sorting classification information belongs to recall classification information; the sorting classification information of the first level is obtained by screening and sorting from recall classification information, the sorting classification information of the (i+1) th level is obtained by screening and sorting from the sorting classification information of the (i) th level, and i is a positive integer less than S.
Wherein the number of pieces of recall classification information is H_1, and H_1 is an integer greater than 1;
the sort module 14 includes: a first acquisition unit 141, a first fusion unit 142, a second fusion unit 143, a second acquisition unit 144, a first sorting unit 145, a second sorting unit 146;
a first obtaining unit 141, configured to obtain a j-th ranking feature vector from the S ranking feature vectors; j is a positive integer less than or equal to S;
a first fusion unit 142, configured to, if j is equal to 1, perform vector fusion on the recall feature vector and the first ordering feature vector to obtain a fused ordering feature vector corresponding to the first ordering feature vector;
and the second fusion unit 143 is configured to, if j is greater than 1, perform vector fusion on the j-th sorting feature vector and the fusion sorting feature vector corresponding to the j-1-th sorting feature vector, to obtain the fusion sorting feature vector corresponding to the j-th sorting feature vector.
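For ease of understanding, the step-by-step fusion performed by the first fusion unit 142 and the second fusion unit 143 can be sketched as follows, assuming that the recall feature vector and all sorting feature vectors share one dimension and that "vector fusion" is element-wise addition (concatenation followed by a projection would also fit the description).

```python
# Sketch only: progressive fusion of the S sorting feature vectors.
import torch

def progressive_fusion(recall_vec, sorting_vecs):
    """recall_vec: [dim]; sorting_vecs: list of S tensors of shape [dim]."""
    fused, prev = [], recall_vec
    for vec in sorting_vecs:
        prev = prev + vec      # j = 1 fuses with the recall vector, j > 1 with the previous fusion result
        fused.append(prev)
    return fused               # the S fused sorting feature vectors
```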
A second obtaining unit 144, configured to obtain a kth ranking feature vector from the S fused ranking feature vectors; k is a positive integer less than or equal to S;
the first sorting unit 145 is configured to, if k is equal to 1, screen and sort the H_1 pieces of recall classification information according to the first fused sorting feature vector to obtain the sorting classification information of the first level corresponding to the target video;
The first sorting unit 145 is specifically configured to normalize the first fused sorting feature vector to obtain a normalized sorting vector corresponding to the first fused sorting feature vector; the normalized sorting vector includes H_1 sorting vector parameters;
the first sorting unit 145 is specifically configured to sort the H_1 sorting vector parameters to obtain the sorted H_1 sorting vector parameters; the sorted H_1 sorting vector parameters are used to indicate the order of the H_1 pieces of recall classification information;
the first sorting unit 145 is specifically configured to acquire, according to the sorted H_1 sorting vector parameters, H_2 pieces of label information from the H_1 pieces of recall classification information, and determine the acquired H_2 pieces of label information as the sorting classification information of the first level corresponding to the target video; H_2 is a positive integer less than or equal to H_1.
And the second sorting unit 146 is configured to, if k is greater than 1, screen and sort the sorting information of the kth-1 level corresponding to the target video according to the kth fusion sorting feature vector, so as to obtain the sorting information of the kth level corresponding to the target video.
The specific implementation manner of the first obtaining unit 141, the first fusing unit 142, the second fusing unit 143, the second obtaining unit 144, the first sorting unit 145 and the second sorting unit 146 may be referred to the description of step S104 in the embodiment corresponding to fig. 3, and will not be repeated here.
The specific implementation manner of the first fusion module 11, the first processing module 12, the recall classifying module 13 and the sort classifying module 14 may refer to the description of step S102 to step S104 in the embodiment corresponding to fig. 3, and will not be repeated here. In addition, the description of the beneficial effects of the same method is omitted.
Further, referring to fig. 9, fig. 9 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application, where the data processing apparatus 2 may include: the system comprises a model acquisition module 21, a second fusion module 22, a second processing module 23, a gradual fusion module 24 and a parameter adjustment module 25; further, the data processing apparatus 2 may further include: a first pre-training module 26, a second pre-training module 27, and a third pre-training module 28;
a model acquisition module 21 for acquiring a pre-trained classification model; the pre-training classification model comprises a multi-mode feature sub-model, a feature fusion sub-model and an initial task sub-model;
the second fusion module 22 is configured to perform multi-modal feature extraction and fusion on a target sample video belonging to the target field type through the multi-modal feature sub-model and the feature fusion sub-model, so as to generate a sample multi-modal fusion feature corresponding to the target sample video;
The multi-modal feature sub-model comprises a text network layer, an image network layer, a video network layer and an audio network layer;
the second fusion module 22 includes: a video acquisition unit 221, a feature extraction unit 222, and a fusion learning unit 223;
a video acquisition unit 221, configured to acquire a target sample video belonging to a target domain type;
the feature extraction unit 222 is configured to perform text feature extraction on target text data of a target sample video through a text network layer, so as to obtain a sample sub-text vector corresponding to the target sample video;
the feature extraction unit 222 is specifically configured to perform feature extraction on target text data of the target sample video through the text network layer, so as to obtain a text feature vector corresponding to the target text data;
the feature extraction unit 222 is specifically configured to perform word segmentation on the target text data to obtain text words of the target text data, and perform text position encoding on text positions of the text words in the target text data to obtain text position vectors corresponding to the text words;
the feature extraction unit 222 is specifically configured to obtain a text mode feature corresponding to the target text data, and fuse the text feature vector, the text position vector and the text mode feature to obtain a sample sub-text vector corresponding to the target sample video.
The feature extraction unit 222 is configured to perform image feature extraction on target image data of a target sample video through an image network layer, so as to obtain a sample sub-image vector corresponding to the target sample video;
the feature extraction unit 222 is specifically configured to perform feature extraction on target image data of the target sample video through the image network layer, so as to obtain an image feature vector corresponding to the target image data;
the feature extraction unit 222 is specifically configured to obtain an image mode feature corresponding to the target image data, and fuse the image feature vector and the image mode feature to obtain a sample sub-image vector corresponding to the target sample video.
The feature extraction unit 222 is configured to perform video feature extraction on target video data of a target sample video through a video network layer, so as to obtain a sample sub-video vector corresponding to the target sample video;
the feature extraction unit 222 is specifically configured to perform feature extraction on target video data of the target sample video through the video network layer, so as to obtain a video feature vector corresponding to the target video data;
the feature extraction unit 222 is specifically configured to perform frame extraction processing on the target video data to obtain a key video frame in the target video data, and perform feature extraction on the key video frame through the video network layer to obtain a video frame feature vector corresponding to the key video frame;
The feature extraction unit 222 is specifically configured to obtain a video mode feature corresponding to the target video data, and fuse the video feature vector, the video frame feature vector and the video mode feature to obtain a sample sub-video vector corresponding to the target sample video.
The feature extraction unit 222 is configured to perform audio feature extraction on target audio data of a target sample video through an audio network layer, so as to obtain a sample sub-audio vector corresponding to the target sample video;
the feature extraction unit 222 is specifically configured to perform frame segmentation on target audio data of a target sample video to obtain at least two audio frames in the target audio data, and perform feature extraction on the at least two audio frames through an audio network layer to obtain audio frame feature vectors corresponding to the at least two audio frames respectively;
the feature extraction unit 222 is specifically configured to perform audio position encoding on an audio position of each audio frame in the target audio data, so as to obtain an audio frame position vector corresponding to each audio frame respectively;
the feature extraction unit 222 is specifically configured to obtain an audio mode feature corresponding to the target audio data, and fuse at least two audio frame feature vectors, at least two audio frame position vectors, and the audio mode feature to obtain a sample sub-audio vector corresponding to the target sample video.
And the fusion learning unit 223 is configured to perform fusion learning on the sample sub-text vector, the sample sub-audio vector, the sample sub-image vector and the sample sub-video vector through the feature fusion sub-model, so as to obtain a sample multi-mode fusion feature corresponding to the target sample video.
For the specific implementation of the video obtaining unit 221, the feature extracting unit 222 and the fusion learning unit 223, refer to the description of step S202 in the embodiment corresponding to fig. 5, and the descriptions of step S2021 to step S2026 in the embodiment corresponding to fig. 6, which will not be repeated here.
The second processing module 23 is configured to perform feature recall processing on the sample multi-mode fusion feature to obtain a sample recall feature vector corresponding to the target sample video, and perform feature sorting processing on the sample multi-mode fusion feature for S times to obtain S sample sorting feature vectors corresponding to the target sample video; s is a positive integer;
the step-by-step fusion module 24 is configured to obtain S sample fusion ordering feature vectors by step-by-step fusing the S sample ordering feature vectors with the sample recall feature vectors;
the parameter adjustment module 25 is configured to perform parameter adjustment on the initial task sub-model based on S levels of sample ordering tag information of the target sample video, sample recall feature vectors, and S sample fusion ordering feature vectors, to obtain a task sub-model; sample ordering tag information of S layers belongs to sample recall tag information; the multi-mode feature sub-model, the feature fusion sub-model and the task sub-model are used for forming a target classification model; the target classification model is used for predicting recall classification information corresponding to target videos belonging to the target field type and S-level sorting classification information corresponding to the target videos.
The method for performing parameter adjustment on the initial task sub-model based on S-level sample ordering label information of the target sample video, sample recall feature vectors and S sample fusion ordering feature vectors to obtain the task sub-model comprises the following steps:
the parameter adjustment module 25 includes: a first determination unit 251, a second determination unit 252, a loss determination unit 253;
a first determining unit 251, configured to normalize the sample recall feature vector to obtain a sample normalized recall vector corresponding to the sample recall feature vector;
a first determining unit 251, configured to determine a first model loss value of the pre-training classification model based on the sample normalized recall vector and the sample recall tag information of the target sample video;
the second determining unit 252 is configured to adjust the S sample fusion and sorting feature vectors step by step according to the first model loss value, obtain adjusted S sample fusion and sorting feature vectors, and determine S second model loss values of the pre-training classification model based on the adjusted S sample fusion and sorting feature vectors and S levels of sample sorting label information of the target sample video;
The second determining unit 252 is specifically configured to adjust a first sample fusion ordering feature vector of the S sample fusion ordering feature vectors according to the first model loss value, so as to obtain an adjusted first sample fusion ordering feature vector;
the second determining unit 252 is specifically configured to normalize the adjusted first sample fusion ordering feature vector to obtain a sample normalized ordering vector corresponding to the adjusted first sample fusion ordering feature vector;
the second determining unit 252 is specifically configured to determine a first second model loss value of the pre-training classification model based on the sample normalized ordering vector corresponding to the adjusted first sample fused ordering feature vector and sample ordering label information of the first level of the target sample video;
the second determining unit 252 is specifically configured to adjust the second sample fusion and sorting feature vector based on the first second model loss value, obtain an adjusted second sample fusion and sorting feature vector, and determine S-1 second model loss values of the pre-training classification model based on the adjusted second sample fusion and sorting feature vector, S-2 sample fusion and sorting feature vectors, and S-1 levels of sample sorting label information; the S-2 sample fusion ordering feature vectors are sample fusion ordering feature vectors except the first sample fusion ordering feature vector and the second sample fusion ordering feature vector in the S sample fusion ordering feature vectors; the S-1 levels of sample ordering tag information are the sample ordering tag information except for the first level of sample ordering tag information in the S levels of sample ordering tag information.
The loss determining unit 253 is configured to determine a total model loss value of the pre-training classification model according to the first model loss value and the S second model loss values, and perform parameter adjustment on the initial task sub-model based on the total model loss value to obtain the task sub-model.
The specific implementation manner of the first determining unit 251, the second determining unit 252 and the loss determining unit 253 may be referred to the description of step S205 in the embodiment corresponding to fig. 5, and will not be described herein.
Optionally, the feature fusion sub-model is obtained by pre-training a sample video set; sample videos in the sample video set are associated with at least two video domain types, and the at least two video domain types include the target domain type; the sample video set comprises N sample videos, wherein N is a positive integer; the N sample videos include a sample video S_i; i is a positive integer less than or equal to N;
a first pre-training module 26 for obtaining an initial classification model; the initial classification model comprises a multi-mode feature sub-model and an initial feature fusion sub-model;
a first pre-training module 26 for acquiring, through the multi-modal feature sub-model, P sample mode vectors corresponding to the sample video S_i, acquiring a target sample mode vector from the P sample mode vectors, and determining the sample mode vectors other than the target sample mode vector in the P sample mode vectors as candidate sample mode vectors; P is a positive integer;
The first pre-training module 26 is configured to perform vector change on the target sample mode vector to obtain an auxiliary sample mode vector corresponding to the target sample mode vector, and perform fusion learning on the auxiliary sample mode vector and the candidate sample mode vector through the initial feature fusion sub-model to obtain a first fusion sample vector corresponding to the auxiliary sample mode vector;
the first pre-training module 26 is configured to pre-train the initial feature fusion sub-model based on the first fusion sample vector and the target sample modal vector to obtain a feature fusion sub-model.
Optionally, the feature fusion sub-model is obtained by pre-training a sample video set; associating at least two video domain types with a sample video in the sample video set, wherein the at least two video domain types comprise target domain types; the sample video set comprises N sample videos, wherein N is a positive integer;
a second pre-training module 27 for obtaining an initial classification model; the initial classification model comprises a multi-mode feature sub-model and an initial feature fusion sub-model;
the second pre-training module 27 is configured to obtain P initial modal feature vectors corresponding to each sample video through the multi-modal feature sub-model, and combine initial modal feature vectors belonging to the same modality in the n×p initial modal feature vectors into the same initial modal feature vector sequence to obtain P initial modal feature vector sequences; each initial modal feature vector sequence comprises N initial modal feature vectors; p is a positive integer;
The second pre-training module 27 is configured to obtain a candidate modal feature vector sequence from the P initial modal feature vector sequences, obtain R candidate modal feature vectors from the candidate modal feature vector sequence, and adjust the order of the R candidate modal feature vectors to obtain a candidate modal feature vector sequence after the order is adjusted; r is a positive integer less than N;
the second pre-training module 27 is configured to perform fusion learning on the candidate modal feature vector in the candidate modal feature vector sequence after the sequence adjustment and the initial modal feature vector in the P-1 initial modal feature vector sequence through the initial feature fusion sub-model to obtain a second fusion sample vector; the P-1 initial modal feature vector sequences are the initial modal feature vector sequences except the candidate modal feature vector sequences in the P initial modal feature vector sequences;
the second pre-training module 27 is configured to obtain a matching tag between the candidate modal feature vector in the candidate modal feature vector sequence after the adjustment sequence and the initial modal feature vectors in the P-1 initial modal feature vector sequence, and pre-train the initial feature fusion sub-model based on the matching tag and the second fusion sample vector to obtain a feature fusion sub-model.
Optionally, the feature fusion sub-model is obtained by pre-training a sample video set; associating at least two video domain types with a sample video in the sample video set, wherein the at least two video domain types comprise the target domain type; the sample video set comprises N sample videos, wherein N is a positive integer; the N sample videos include a sample video S_i; i is a positive integer less than or equal to N;
a third pre-training module 28 for obtaining an initial classification model; the initial classification model comprises a multi-mode feature sub-model and an initial feature fusion sub-model;
a third pre-training module 28, configured to acquire P sample modal vectors corresponding to the sample video S_i through the multi-modal feature sub-model, and adjust the order of the P sample modal vectors to obtain P adjusted sample modal vectors; P is a positive integer;
a third pre-training module 28, configured to perform fusion learning on the adjusted P sample modal vectors through the initial feature fusion sub-model to obtain P fusion sample modal vectors corresponding to the sample video S_i;
a third pre-training module 28, configured to perform full connection processing on the P fused sample modal vectors, so as to obtain fused modal vectors corresponding to the P fused sample modal vectors respectively;
A third pre-training module 28 is configured to pre-train the initial feature fusion sub-model based on the P fusion modal vectors and the P sample modal vectors to obtain a feature fusion sub-model.
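A minimal sketch of this third pre-training task follows, assuming the sequence adjustment is a random permutation of the P modal vectors, the full-connection processing is a single linear layer, and pre-training asks the projected fusion modal vectors to recover the sample modal vectors in their original order; the exact pairing of outputs and targets is not fixed by the text and is an assumption here.

```python
import torch
import torch.nn as nn

dim, P = 256, 4
fusion = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)  # placeholder fusion
project = nn.Linear(dim, dim)          # full-connection processing on the fused vectors

def order_restore_step(sample_modal_vectors, optimizer):
    """sample_modal_vectors: [batch, P, dim]. Shuffle the modality order, fuse, project
    each fused vector, and train the result to match the original (unshuffled) vectors."""
    order = torch.randperm(P)
    shuffled = sample_modal_vectors[:, order, :]         # adjusted P sample modal vectors
    fused = fusion(shuffled)                             # P fusion sample modal vectors
    restored = project(fused)                            # P fusion modal vectors
    loss = nn.functional.mse_loss(restored, sample_modal_vectors)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

optimizer = torch.optim.Adam(list(fusion.parameters()) + list(project.parameters()), lr=1e-4)
order_restore_step(torch.randn(8, P, dim), optimizer)
```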
For the specific implementations of the model obtaining module 21, the second fusing module 22, the second processing module 23, the step-by-step fusing module 24, the parameter adjusting module 25, the first pre-training module 26, the second pre-training module 27 and the third pre-training module 28, reference may be made to the description of step S201 to step S205 in the embodiment corresponding to fig. 5 and the description of step S2021 to step S2026 in the embodiment corresponding to fig. 6, which will not be repeated here. In addition, the description of the beneficial effects of the same method is omitted.
Further, referring to fig. 10, fig. 10 is a schematic structural diagram of a computer device according to an embodiment of the present application, where the computer device may be a terminal device or a server. As shown in fig. 10, the computer device 1000 may include: a processor 1001, a network interface 1004, and a memory 1005. In addition, the computer device 1000 may further include: a user interface 1003 and at least one communication bus 1002, where the communication bus 1002 is used to enable connection and communication between these components. In some embodiments, the user interface 1003 may include a display (Display) and a keyboard (Keyboard); optionally, the user interface 1003 may further include a standard wired interface and a wireless interface. Optionally, the network interface 1004 may include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (non-volatile memory), such as at least one disk memory. Optionally, the memory 1005 may also be at least one storage device located remotely from the aforementioned processor 1001. As shown in fig. 10, the memory 1005, which is a computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
In the computer device 1000 shown in FIG. 10, the network interface 1004 may provide a network communication function, the user interface 1003 is mainly used to provide an input interface for a user, and the processor 1001 may be used to invoke the device control application program stored in the memory 1005.
It should be understood that the computer apparatus 1000 described in the embodiments of the present application may perform the description of the data processing method in the embodiments corresponding to fig. 3, 5 or 6, and may also perform the description of the data processing device 1 in the embodiments corresponding to fig. 8 or the description of the data processing device 2 in the embodiments corresponding to fig. 9, which are not described herein again. In addition, the description of the beneficial effects of the same method is omitted.
Furthermore, it should be noted here that: the embodiment of the present application further provides a computer readable storage medium, in which the aforementioned computer program executed by the data processing apparatus 1 or the data processing apparatus 2 is stored, and when the processor executes the computer program, the description of the data processing method in the embodiment corresponding to fig. 3, 5 or 6 can be executed, and therefore, a detailed description will not be given here. In addition, the description of the beneficial effects of the same method is omitted. For technical details not disclosed in the embodiments of the computer-readable storage medium according to the present application, please refer to the description of the method embodiments of the present application.
In addition, it should be noted that: embodiments of the present application also provide a computer program product, which may include a computer program, which may be stored in a computer readable storage medium. The processor of the computer device reads the computer program from the computer readable storage medium, and the processor may execute the computer program, so that the computer device performs the description of the data processing method in the embodiment corresponding to fig. 3, 5 or 6, and thus, a detailed description will not be given here. In addition, the description of the beneficial effects of the same method is omitted. For technical details not disclosed in the embodiments of the computer program product according to the present application, reference is made to the description of the method embodiments of the present application.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored in a computer-readable storage medium, and the program, when executed, may include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The foregoing disclosure is illustrative of the present application and is not to be construed as limiting the scope of the application, which is defined by the appended claims.

Claims (20)

1. A method of data processing, comprising:
carrying out multi-mode feature extraction and fusion on a target video to generate multi-mode fusion features corresponding to the target video;
performing feature recall processing on the multi-mode fusion features to obtain recall feature vectors corresponding to the target video, and performing feature sorting processing on the multi-mode fusion features for S times to obtain S sorting feature vectors corresponding to the target video; s is a positive integer;
determining recall classification information corresponding to the target video according to the recall feature vector;
step-by-step fusion is carried out on the S sorting feature vectors through the recall feature vector to obtain S fusion sorting feature vectors, and step-by-step sorting is carried out on the recall classification information according to the S fusion sorting feature vectors to obtain sorting classification information of S levels corresponding to the target video; the sorting classification information of the S levels belongs to the recall classification information; the sorting classification information of the first level is obtained by screening and sorting from the recall classification information, the sorting classification information of the (i+1)-th level is obtained by screening and sorting from the sorting classification information of the i-th level, and the i is a positive integer smaller than the S.
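As a rough, non-normative illustration of the recall-then-rank flow in claim 1: the sketch below assumes the recall feature vector and the S sorting feature vectors are already in label space, that step-by-step fusion is simple addition, and that each level keeps a fixed number of labels (H_RECALL, H_LEVELS and the classify helper are invented).

```python
import torch
import torch.nn.functional as F

S = 3                        # number of sorting levels (assumed)
H_RECALL = 50                # labels kept as recall classification information (assumed)
H_LEVELS = [20, 8, 3]        # labels kept at levels 1..S (assumed)

def classify(recall_vec, sort_vecs, label_names):
    """recall_vec: [num_labels]; sort_vecs: S tensors of shape [num_labels].
    Returns the recall classification info and the sorting classification info per level."""
    recall_probs = F.softmax(recall_vec, dim=0)                  # normalize the recall vector
    kept = torch.topk(recall_probs, k=H_RECALL).indices          # recall classification info
    recall_info = [label_names[i] for i in kept.tolist()]
    fused, level_info = recall_vec, []
    for level in range(S):
        fused = fused + sort_vecs[level]                         # step-by-step fusion (assumed additive)
        scores = F.softmax(fused[kept], dim=0)                   # re-score only surviving labels
        kept = kept[torch.topk(scores, k=H_LEVELS[level]).indices]
        level_info.append([label_names[i] for i in kept.tolist()])
    return recall_info, level_info

names = [f"tag_{i}" for i in range(1000)]
recall_info, level_info = classify(torch.randn(1000),
                                   [torch.randn(1000) for _ in range(S)], names)
```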
2. The method of claim 1, wherein the performing feature recall processing on the multi-mode fusion feature to obtain a recall feature vector corresponding to the target video, and performing feature sorting processing on the multi-mode fusion feature S times to obtain S sorting feature vectors corresponding to the target video comprises:
inputting the multi-mode fusion feature into a task sub-model in a target classification model; the task sub-model comprises a recall full-connection layer and S sorting full-connection layers; the S sorting full-connection layers include a sorting full-connection layer H_i, and the i is a positive integer less than or equal to the S;
in the recall full-connection layer, carrying out full-connection processing on the multi-mode fusion feature to obtain a candidate recall feature vector corresponding to the target video, and carrying out full-connection processing on the candidate recall feature vector to obtain a recall feature vector output by the recall full-connection layer;
in the sorting full-connection layer H_i, the multi-mode fusion feature is subjected to full connection processing to obtain a candidate sorting feature vector corresponding to the target video, and the candidate sorting feature vector is subjected to full connection processing to obtain the sorting feature vector output by the sorting full-connection layer H_i; the S sorting feature vectors corresponding to the target video are the sorting feature vectors respectively output by the S sorting full-connection layers.
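Claim 2 describes one recall head and S sorting heads, each applying two rounds of full connection to the multi-mode fusion feature. A structural sketch follows; the hidden size, the ReLU between the two linear layers, and the label-space output size are assumptions.

```python
import torch
import torch.nn as nn

class TaskSubModel(nn.Module):
    """One recall full-connection head plus S sorting full-connection heads,
    each a two-stage fully connected stack (sizes and activation are assumed)."""
    def __init__(self, fusion_dim=512, hidden_dim=256, num_labels=1000, S=3):
        super().__init__()
        def head():
            return nn.Sequential(nn.Linear(fusion_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, num_labels))
        self.recall_head = head()
        self.sort_heads = nn.ModuleList(head() for _ in range(S))

    def forward(self, fusion_feature):                            # [batch, fusion_dim]
        recall_vec = self.recall_head(fusion_feature)             # recall feature vector
        sort_vecs = [h(fusion_feature) for h in self.sort_heads]  # S sorting feature vectors
        return recall_vec, sort_vecs

model = TaskSubModel()
recall_vec, sort_vecs = model(torch.randn(2, 512))
```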
3. The method of claim 1, wherein the determining recall classification information corresponding to the target video based on the recall feature vector comprises:
normalizing the recall feature vector to obtain a normalized recall vector corresponding to the recall feature vector; the normalized recall vector includes at least two recall vector parameters;
sorting the at least two recall vector parameters to obtain the sorted at least two recall vector parameters;
acquiring the top H_1 recall vector parameters from the sorted at least two recall vector parameters, and determining label information corresponding to the top H_1 recall vector parameters as the recall classification information corresponding to the target video; the H_1 is an integer greater than 1.
4. The method of claim 1, wherein the carrying out step-by-step fusion on the S sorting feature vectors through the recall feature vector to obtain S fusion sorting feature vectors comprises:
Acquiring a j-th sorting feature vector from the S sorting feature vectors; j is a positive integer less than or equal to S;
if j is equal to 1, carrying out vector fusion on the recall feature vector and the first ordering feature vector to obtain a fusion ordering feature vector corresponding to the first ordering feature vector;
if j is greater than 1, vector fusion is carried out on the j-th sorting feature vector and the fusion sorting feature vector corresponding to the j-1-th sorting feature vector, and the fusion sorting feature vector corresponding to the j-th sorting feature vector is obtained.
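Claim 4 builds each fusion sorting feature vector from the previous one, seeding the chain with the recall feature vector. The claim does not say how two vectors are fused; the sketch below uses concatenation followed by a linear projection purely as an example (fuse_layer and stepwise_fuse are illustrative names).

```python
import torch
import torch.nn as nn

dim = 1000                                    # assumed vector size
fuse_layer = nn.Linear(2 * dim, dim)          # illustrative fusion operator

def stepwise_fuse(recall_vec, sort_vecs):
    """For j = 1 fuse the recall feature vector with the first sorting feature vector;
    for j > 1 fuse the j-th sorting feature vector with the (j-1)-th fusion result."""
    fused_list = []
    for j, sort_vec in enumerate(sort_vecs, start=1):
        previous = recall_vec if j == 1 else fused_list[-1]
        fused_list.append(fuse_layer(torch.cat([previous, sort_vec], dim=-1)))
    return fused_list                          # the S fusion sorting feature vectors

fused = stepwise_fuse(torch.randn(dim), [torch.randn(dim) for _ in range(3)])
```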
5. The method of claim 1, wherein the number of pieces of recall classification information is H_1, and the H_1 is an integer greater than 1;
and the step-by-step sorting of the recall classification information according to the S fusion sorting feature vectors to obtain sorting classification information of S levels corresponding to the target video comprises:
acquiring a k-th fusion sorting feature vector from the S fusion sorting feature vectors; the k is a positive integer less than or equal to the S;
if the k is equal to 1, screening and sorting the H_1 pieces of recall classification information according to the first fusion sorting feature vector to obtain the sorting classification information of the first level corresponding to the target video;
and if the k is greater than 1, screening and sorting the sorting classification information of the (k-1)-th level corresponding to the target video according to the k-th fusion sorting feature vector to obtain the sorting classification information of the k-th level corresponding to the target video.
6. The method of claim 5, wherein the screening and sorting the H_1 pieces of recall classification information according to the first fusion sorting feature vector to obtain the sorting classification information of the first level corresponding to the target video comprises:
normalizing the first fusion sorting feature vector to obtain a normalized sorting vector corresponding to the first fusion sorting feature vector; the normalized sorting vector includes H_1 sorting vector parameters;
sorting the H_1 sorting vector parameters to obtain the sorted H_1 sorting vector parameters; the sorted H_1 sorting vector parameters are used to indicate the order of the H_1 pieces of recall classification information;
and according to the sorted H_1 sorting vector parameters, acquiring H_2 pieces of label information from the H_1 pieces of recall classification information, and determining the acquired H_2 pieces of label information as the sorting classification information of the first level corresponding to the target video; the H_2 is a positive integer less than or equal to the H_1.
7. A method of data processing, comprising:
obtaining a pre-training classification model; the pre-training classification model comprises a multi-mode feature sub-model, a feature fusion sub-model and an initial task sub-model;
performing multi-mode feature extraction fusion on a target sample video belonging to a target field type through the multi-mode feature sub-model and the feature fusion sub-model to generate a sample multi-mode fusion feature corresponding to the target sample video;
performing feature recall processing on the sample multi-mode fusion features to obtain sample recall feature vectors corresponding to the target sample video, and performing S times of feature sorting processing on the sample multi-mode fusion features to obtain S sample sorting feature vectors corresponding to the target sample video; s is a positive integer;
step-by-step fusion is carried out on the S sample sorting feature vectors through the sample recall feature vector, so that S sample fusion sorting feature vectors are obtained;
based on sample sorting label information of S levels of the target sample video, sample recall label information of the target sample video, the sample recall feature vector and the S sample fusion sorting feature vectors, carrying out parameter adjustment on the initial task sub-model to obtain a task sub-model; the sample sorting label information of the S levels belongs to the sample recall label information; the multi-modal feature sub-model, the feature fusion sub-model and the task sub-model are used for forming a target classification model; the target classification model is used for predicting recall classification information corresponding to a target video belonging to the target field type and sorting classification information of S levels corresponding to the target video.
8. The method of claim 7, wherein the feature fusion sub-model is pre-trained from a sample video set; sample videos in the sample video set are associated with at least two video domain types, wherein the at least two video domain types comprise the target domain type; the sample video set comprises N sample videos, wherein N is a positive integer; the N sample videos comprise a sample video S_i; the i is a positive integer less than or equal to the N;
the method further comprises the steps of:
acquiring an initial classification model; the initial classification model comprises the multi-mode feature sub-model and an initial feature fusion sub-model;
acquiring P sample modal vectors corresponding to the sample video S_i through the multi-mode feature sub-model, obtaining a target sample modal vector from the P sample modal vectors, and determining sample modal vectors except the target sample modal vector in the P sample modal vectors as candidate sample modal vectors; the P is a positive integer;
performing vector change on the target sample modal vector to obtain an auxiliary sample modal vector corresponding to the target sample modal vector, and performing fusion learning on the auxiliary sample modal vector and the candidate sample modal vector through the initial feature fusion sub-model to obtain a first fusion sample vector corresponding to the auxiliary sample modal vector;
And pre-training the initial feature fusion sub-model based on the first fusion sample vector and the target sample modal vector to obtain the feature fusion sub-model.
9. The method of claim 7, wherein the feature fusion sub-model is pre-trained from a sample video set; sample videos in the sample video set are associated with at least two video domain types, wherein the at least two video domain types comprise the target domain type; the sample video set comprises N sample videos, wherein N is a positive integer;
the method further comprises the steps of:
acquiring an initial classification model; the initial classification model comprises the multi-mode feature sub-model and an initial feature fusion sub-model;
P initial modal feature vectors corresponding to each sample video are obtained through the multi-modal feature sub-model, and initial modal feature vectors belonging to the same modality among the N×P initial modal feature vectors are combined into the same initial modal feature vector sequence to obtain P initial modal feature vector sequences; each initial modal feature vector sequence comprises N initial modal feature vectors; the P is a positive integer;
Acquiring candidate modal feature vector sequences from the P initial modal feature vector sequences, acquiring R candidate modal feature vectors from the candidate modal feature vector sequences, and adjusting the sequence of the R candidate modal feature vectors to obtain candidate modal feature vector sequences after the sequence is adjusted; r is a positive integer less than N;
performing fusion learning on the candidate modal feature vectors in the candidate modal feature vector sequence after the sequence adjustment and the initial modal feature vectors in the P-1 initial modal feature vector sequence through the initial feature fusion sub-model to obtain a second fusion sample vector; the P-1 initial modal feature vector sequences are initial modal feature vector sequences except the candidate modal feature vector sequences in the P initial modal feature vector sequences;
and acquiring a matching tag between a candidate modal feature vector in the candidate modal feature vector sequence after the adjustment sequence and an initial modal feature vector in the P-1 initial modal feature vector sequence, and pre-training the initial feature fusion sub-model based on the matching tag and the second fusion sample vector to obtain the feature fusion sub-model.
10. The method of claim 7, wherein the feature fusion sub-model is pre-trained from a sample video set; sample videos in the sample video set are associated with at least two video domain types, wherein the at least two video domain types comprise the target domain type; the sample video set comprises N sample videos, wherein N is a positive integer; the N sample videos comprise a sample video S_i; the i is a positive integer less than or equal to the N;
the method further comprises the steps of:
acquiring an initial classification model; the initial classification model comprises the multi-mode feature sub-model and an initial feature fusion sub-model;
acquiring P sample modal vectors corresponding to the sample video S_i through the multi-mode feature sub-model, and adjusting the sequence of the P sample modal vectors to obtain P adjusted sample modal vectors; the P is a positive integer;
performing fusion learning on the adjusted P sample modal vectors through the initial feature fusion sub-model to obtain P fusion sample modal vectors corresponding to the sample video S_i;
performing full connection processing on the P fusion sample modal vectors respectively to obtain fusion modal vectors corresponding to the P fusion sample modal vectors respectively;
And pre-training the initial feature fusion sub-model based on the P fusion modal vectors and the P sample modal vectors to obtain the feature fusion sub-model.
11. The method of claim 7, wherein the performing parameter adjustment on the initial task sub-model based on the sample sorting label information of S levels of the target sample video, the sample recall label information of the target sample video, the sample recall feature vector, and the S sample fusion sorting feature vectors to obtain a task sub-model comprises:
carrying out normalization processing on the sample recall feature vector to obtain a sample normalized recall vector corresponding to the sample recall feature vector;
determining a first model loss value of the pre-training classification model based on the sample normalized recall vector and the sample recall label information of the target sample video;
step-by-step adjustment is carried out on the S sample fusion sorting feature vectors according to the first model loss value to obtain adjusted S sample fusion sorting feature vectors, and S second model loss values of the pre-training classification model are determined based on the adjusted S sample fusion sorting feature vectors and the sample sorting label information of S levels of the target sample video;
And determining a total model loss value of the pre-training classification model according to the first model loss value and the S second model loss values, and carrying out parameter adjustment on the initial task sub-model based on the total model loss value to obtain a task sub-model.
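A sketch of the loss construction in claims 11 and 12, under the assumptions that both the recall head and each sorting level are trained with cross-entropy against single-label targets and that the step-by-step adjustment scales each fusion sorting feature vector by the preceding loss value; none of these choices is fixed by the claims, and total_loss is an invented helper.

```python
import torch
import torch.nn.functional as F

def total_loss(sample_recall_vec, sample_fused_sort_vecs, recall_label, level_labels):
    """sample_recall_vec: [batch, num_labels]; sample_fused_sort_vecs: S tensors of the
    same shape; recall_label and level_labels hold class ids for the recall stage and
    for each of the S levels."""
    first_loss = F.cross_entropy(sample_recall_vec, recall_label)    # first model loss value
    second_losses, prev_loss = [], first_loss
    for vec, labels in zip(sample_fused_sort_vecs, level_labels):
        adjusted = vec * (1.0 + prev_loss.detach())                  # step-by-step adjustment (assumed form)
        loss = F.cross_entropy(adjusted, labels)                     # one second model loss value
        second_losses.append(loss)
        prev_loss = loss
    return first_loss + sum(second_losses)                           # total model loss value

batch, num_labels, S = 4, 1000, 3
loss = total_loss(torch.randn(batch, num_labels, requires_grad=True),
                  [torch.randn(batch, num_labels, requires_grad=True) for _ in range(S)],
                  torch.randint(0, num_labels, (batch,)),
                  [torch.randint(0, num_labels, (batch,)) for _ in range(S)])
loss.backward()
```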
12. The method of claim 11, wherein the step-by-step adjustment of the S sample fusion sorting feature vectors according to the first model loss value to obtain the adjusted S sample fusion sorting feature vectors, and the determining of the S second model loss values of the pre-training classification model based on the adjusted S sample fusion sorting feature vectors and the sample sorting label information of S levels of the target sample video, comprises:
according to the first model loss value, adjusting a first sample fusion sorting feature vector in the S sample fusion sorting feature vectors to obtain an adjusted first sample fusion sorting feature vector;
normalizing the adjusted first sample fusion sorting feature vector to obtain a sample normalized sorting vector corresponding to the adjusted first sample fusion sorting feature vector;
determining a first second model loss value of the pre-training classification model based on the sample normalized sorting vector corresponding to the adjusted first sample fusion sorting feature vector and the sample sorting label information of the first level of the target sample video;
adjusting a second sample fusion sorting feature vector based on the first second model loss value to obtain an adjusted second sample fusion sorting feature vector, and determining S-1 second model loss values of the pre-training classification model based on the adjusted second sample fusion sorting feature vector, S-2 sample fusion sorting feature vectors and sample sorting label information of S-1 levels; the S-2 sample fusion sorting feature vectors are the sample fusion sorting feature vectors except the first sample fusion sorting feature vector and the second sample fusion sorting feature vector in the S sample fusion sorting feature vectors; the sample sorting label information of the S-1 levels is the sample sorting label information except the sample sorting label information of the first level in the sample sorting label information of the S levels.
13. The method of claim 7, wherein the multimodal feature sub-model includes a text network layer, an image network layer, a video network layer, and an audio network layer;
The multi-modal feature extraction and fusion are carried out on the target sample video belonging to the target field type through the multi-modal feature sub-model and the feature fusion sub-model, and the generation of the sample multi-modal fusion feature corresponding to the target sample video comprises the following steps:
acquiring a target sample video belonging to a target field type;
extracting text features of target text data of the target sample video through the text network layer to obtain sample sub-text vectors corresponding to the target sample video;
extracting image features of target image data of the target sample video through the image network layer to obtain sample sub-image vectors corresponding to the target sample video;
extracting video features of target video data of the target sample video through the video network layer to obtain sample sub-video vectors corresponding to the target sample video;
extracting audio characteristics of target audio data of the target sample video through the audio network layer to obtain a sample sub-audio vector corresponding to the target sample video;
and carrying out fusion learning on the sample sub-text vector, the sample sub-audio vector, the sample sub-image vector and the sample sub-video vector through the feature fusion sub-model to obtain sample multi-mode fusion features corresponding to the target sample video.
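A structural sketch of the multi-modal feature sub-model and feature fusion sub-model arrangement in claim 13. The four per-modality layers are placeholders for real text, image, video and audio encoders, the fusion step is a Transformer over the four modal vectors, and the mean pooling at the end is an assumption; all module names and sizes are invented.

```python
import torch
import torch.nn as nn

class MultiModalClassifierBackbone(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Placeholder per-modality encoders standing in for the text / image /
        # video / audio network layers of the multi-modal feature sub-model.
        self.text_layer  = nn.Linear(768, dim)    # e.g. pooled text-encoder output
        self.image_layer = nn.Linear(2048, dim)   # e.g. pooled feature of the cover image
        self.video_layer = nn.Linear(1024, dim)   # e.g. pooled frame-level features
        self.audio_layer = nn.Linear(128, dim)    # e.g. pooled audio features
        fuse = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(fuse, num_layers=2)   # feature fusion sub-model

    def forward(self, text_feat, image_feat, video_feat, audio_feat):
        modal_vectors = torch.stack([self.text_layer(text_feat),
                                     self.image_layer(image_feat),
                                     self.video_layer(video_feat),
                                     self.audio_layer(audio_feat)], dim=1)  # [batch, 4, dim]
        fused = self.fusion(modal_vectors)         # fusion learning over the 4 modal vectors
        return fused.mean(dim=1)                   # sample multi-modal fusion feature

model = MultiModalClassifierBackbone()
feature = model(torch.randn(2, 768), torch.randn(2, 2048),
                torch.randn(2, 1024), torch.randn(2, 128))       # [2, 256]
```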
14. The method according to claim 13, wherein the extracting text features of the target text data of the target sample video through the text network layer to obtain the sample sub-text vector corresponding to the target sample video includes:
extracting characteristics of target text data of the target sample video through the text network layer to obtain text characteristic vectors corresponding to the target text data;
word segmentation processing is carried out on the target text data to obtain text word segmentation of the target text data, and text position coding is carried out on the text position of the text word segmentation in the target text data to obtain a text position vector corresponding to the text word segmentation;
and acquiring text modal characteristics corresponding to the target text data, and fusing the text characteristic vector, the text position vector and the text modal characteristics to obtain a sample sub-text vector corresponding to the target sample video.
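The text branch of claim 14 combines a text feature vector, text position vectors and a text modality feature. The sketch below assumes whitespace word segmentation, learned position and modality embeddings, and element-wise addition as the fusion; vocab and encode_text are illustrative names, not part of the application.

```python
import torch
import torch.nn as nn

dim, max_len, vocab_size, num_modalities = 256, 32, 30000, 4

token_embed    = nn.Embedding(vocab_size, dim)       # stands in for the text feature extraction
position_embed = nn.Embedding(max_len, dim)          # text position coding
modality_embed = nn.Embedding(num_modalities, dim)   # text / image / video / audio modality features
TEXT_MODALITY = 0

def encode_text(text, vocab):
    tokens = text.split()[:max_len]                  # word segmentation (assumed: whitespace split)
    ids = torch.tensor([vocab.get(t, 0) for t in tokens])
    positions = torch.arange(len(tokens))
    sub_text_vector = (token_embed(ids)                                 # text feature vectors
                       + position_embed(positions)                      # text position vectors
                       + modality_embed(torch.tensor(TEXT_MODALITY)))   # text modality feature
    return sub_text_vector                           # [num_tokens, dim] sample sub-text vector

vocab = {"cooking": 1, "tutorial": 2}
vec = encode_text("cooking tutorial for beginners", vocab)
```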
15. The method according to claim 13, wherein the extracting, by the image network layer, image features of the target image data of the target sample video to obtain a sample sub-image vector corresponding to the target sample video, includes:
Extracting features of target image data of the target sample video through the image network layer to obtain image feature vectors corresponding to the target image data;
and acquiring image mode characteristics corresponding to the target image data, and fusing the image characteristic vectors and the image mode characteristics to obtain sample sub-image vectors corresponding to the target sample video.
16. The method according to claim 13, wherein the performing, by the video network layer, video feature extraction on the target video data of the target sample video to obtain a sample sub-video vector corresponding to the target sample video includes:
extracting characteristics of target video data of the target sample video through the video network layer to obtain video characteristic vectors corresponding to the target video data;
performing frame extraction processing on the target video data to obtain a key video frame in the target video data, and performing feature extraction on the key video frame through the video network layer to obtain a video frame feature vector corresponding to the key video frame;
and acquiring video mode characteristics corresponding to the target video data, and fusing the video characteristic vectors, the video frame characteristic vectors and the video mode characteristics to obtain sample sub-video vectors corresponding to the target sample video.
17. The method according to claim 13, wherein the extracting, by the audio network layer, the audio feature of the target audio data of the target sample video to obtain the sample sub-audio vector corresponding to the target sample video includes:
carrying out frame division processing on target audio data of the target sample video to obtain at least two audio frames in the target audio data, and respectively carrying out feature extraction on at least two audio frames through the audio network layer to obtain at least two audio frame feature vectors respectively corresponding to the audio frames;
performing audio position coding on the audio position of each audio frame in the target audio data to obtain an audio frame position vector corresponding to each audio frame;
and acquiring audio mode characteristics corresponding to the target audio data, and fusing at least two audio frame characteristic vectors, at least two audio frame position vectors and the audio mode characteristics to obtain sample sub-audio vectors corresponding to the target sample video.
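For the audio branch of claim 17, the sketch below frames the waveform into fixed-length windows, extracts a feature per frame with a linear layer standing in for the audio network layer, and adds learned audio position vectors and an audio modality feature; the window length, hop size and all module names are assumptions.

```python
import torch
import torch.nn as nn

frame_len, hop, dim, max_frames, num_modalities = 400, 160, 256, 512, 4
frame_encoder  = nn.Linear(frame_len, dim)           # stand-in for the audio network layer
position_embed = nn.Embedding(max_frames, dim)       # audio position coding
modality_embed = nn.Embedding(num_modalities, dim)
AUDIO_MODALITY = 3

def encode_audio(waveform):
    """waveform: 1-D tensor of audio samples. Returns the sample sub-audio vectors,
    one fused vector per audio frame."""
    frames = waveform.unfold(0, frame_len, hop)[:max_frames]     # frame division processing
    frame_vectors = frame_encoder(frames)                        # audio frame feature vectors
    positions = position_embed(torch.arange(frames.size(0)))     # audio frame position vectors
    modality = modality_embed(torch.tensor(AUDIO_MODALITY))      # audio modality feature
    return frame_vectors + positions + modality                  # fused sub-audio vectors

sub_audio = encode_audio(torch.randn(16000))                     # roughly 1 s of 16 kHz audio
```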
18. A data processing apparatus, comprising:
the first fusion module is used for carrying out multi-mode feature extraction fusion on the target video and generating multi-mode fusion features corresponding to the target video;
The first processing module is used for carrying out feature recall processing on the multi-mode fusion features to obtain recall feature vectors corresponding to the target video, and carrying out S times of feature sorting processing on the multi-mode fusion features to obtain S sorting feature vectors corresponding to the target video; s is a positive integer;
the recall classification module is used for determining recall classification information corresponding to the target video according to the recall feature vector;
the sorting and classifying module is used for carrying out step-by-step fusion on the S sorting feature vectors through the recall feature vector to obtain S fusion sorting feature vectors, and carrying out step-by-step sorting on the recall classification information according to the S fusion sorting feature vectors to obtain sorting classification information of S levels corresponding to the target video; the sorting classification information of the S levels belongs to the recall classification information; the sorting classification information of the first level is obtained by screening and sorting from the recall classification information, the sorting classification information of the (i+1)-th level is obtained by screening and sorting from the sorting classification information of the i-th level, and the i is a positive integer smaller than the S.
19. A data processing apparatus, comprising:
The model acquisition module is used for acquiring a pre-training classification model; the pre-training classification model comprises a multi-mode feature sub-model, a feature fusion sub-model and an initial task sub-model;
the second fusion module is used for carrying out multi-mode feature extraction and fusion on the target sample video belonging to the target field type through the multi-mode feature sub-model and the feature fusion sub-model to generate sample multi-mode fusion features corresponding to the target sample video;
the second processing module is used for carrying out feature recall processing on the sample multi-mode fusion features to obtain sample recall feature vectors corresponding to the target sample video, and carrying out S times of feature sorting processing on the sample multi-mode fusion features to obtain S sample sorting feature vectors corresponding to the target sample video; s is a positive integer;
the step-by-step fusion module is used for carrying out step-by-step fusion on the S sample sorting feature vectors through the sample recall feature vector to obtain S sample fusion sorting feature vectors;
the parameter adjustment module is used for carrying out parameter adjustment on the initial task sub-model based on sample sorting label information of S levels of the target sample video, sample recall label information of the target sample video, the sample recall feature vector and the S sample fusion sorting feature vectors to obtain a task sub-model; the sample sorting label information of the S levels belongs to the sample recall label information; the multi-modal feature sub-model, the feature fusion sub-model and the task sub-model are used for forming a target classification model; the target classification model is used for predicting recall classification information corresponding to a target video belonging to the target field type and sorting classification information of S levels corresponding to the target video.
20. A computer device, comprising: a processor and a memory;
the processor is connected to the memory, wherein the memory is configured to store a computer program, and the processor is configured to invoke the computer program to cause the computer device to perform the method of any of claims 1-17.
CN202211521144.2A 2022-11-30 2022-11-30 Data processing method, device, computer equipment and readable storage medium Pending CN117011745A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211521144.2A CN117011745A (en) 2022-11-30 2022-11-30 Data processing method, device, computer equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211521144.2A CN117011745A (en) 2022-11-30 2022-11-30 Data processing method, device, computer equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN117011745A true CN117011745A (en) 2023-11-07

Family

ID=88562530

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211521144.2A Pending CN117011745A (en) 2022-11-30 2022-11-30 Data processing method, device, computer equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN117011745A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117633208A (en) * 2024-01-26 2024-03-01 北京网藤科技有限公司 Knowledge recall fusion method and system based on reciprocal ordering
CN117851640A (en) * 2024-03-04 2024-04-09 广东智媒云图科技股份有限公司 Video data processing method, device, equipment and medium based on composite characteristics
CN117851640B (en) * 2024-03-04 2024-05-31 广东智媒云图科技股份有限公司 Video data processing method, device, equipment and medium based on composite characteristics

Similar Documents

Publication Publication Date Title
CN111444428B (en) Information recommendation method and device based on artificial intelligence, electronic equipment and storage medium
CN113569088B (en) Music recommendation method and device and readable storage medium
CN111444326B (en) Text data processing method, device, equipment and storage medium
CN112131411A (en) Multimedia resource recommendation method and device, electronic equipment and storage medium
CN112163122A (en) Method and device for determining label of target video, computing equipment and storage medium
CN111444357B (en) Content information determination method, device, computer equipment and storage medium
CN111931061B (en) Label mapping method and device, computer equipment and storage medium
CN116702737B (en) Document generation method, device, equipment, storage medium and product
CN112989212B (en) Media content recommendation method, device and equipment and computer storage medium
CN112257661A (en) Identification method, device and equipment of vulgar image and computer readable storage medium
CN113010703A (en) Information recommendation method and device, electronic equipment and storage medium
CN111258995A (en) Data processing method, device, storage medium and equipment
CN113766299A (en) Video data playing method, device, equipment and medium
CN114443899A (en) Video classification method, device, equipment and medium
CN113688951A (en) Video data processing method and device
CN113395594A (en) Video processing method, device, equipment and medium
CN113469152A (en) Similar video detection method and device
CN114282055A (en) Video feature extraction method, device and equipment and computer storage medium
CN116975615A (en) Task prediction method and device based on video multi-mode information
CN114661951A (en) Video processing method and device, computer equipment and storage medium
Zhu et al. Describing unseen videos via multi-modal cooperative dialog agents
CN117011745A (en) Data processing method, device, computer equipment and readable storage medium
CN116935170B (en) Processing method and device of video processing model, computer equipment and storage medium
CN113822127A (en) Video processing method, video processing device, video processing equipment and storage medium
CN113407778A (en) Label identification method and device

Legal Events

Date Code Title Description
PB01 Publication