CN116976327A - Data processing method, device, computer equipment and readable storage medium - Google Patents

Data processing method, device, computer equipment and readable storage medium

Info

Publication number
CN116976327A
Authority
CN
China
Prior art keywords
sample
classification
target
video
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211521178.1A
Other languages
Chinese (zh)
Inventor
刘刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202211521178.1A
Publication of CN116976327A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application provides a data processing method, a device, a computer device and a readable storage medium, which are applicable to scenarios such as cloud technology, artificial intelligence, intelligent transportation, assisted driving and video. The method comprises the following steps: obtaining a pre-training classification model; obtaining, through a multi-modal feature sub-model, S modal feature vectors corresponding to a target sample video belonging to a target field type, and performing fusion learning on the S modal feature vectors through a feature fusion sub-model to obtain M initial sample classification features; extracting, through an initial task sub-model, M target sample classification features respectively corresponding to the M initial sample classification features, as well as M-1 classification matching features; and performing parameter adjustment on the initial task sub-model based on M pieces of sample label information of the target sample video, the S modal feature vectors, the M target sample classification features and the M-1 classification matching features, to obtain a task sub-model. The method and the device can improve the accuracy of the target classification information predicted for a target video.

Description

Data processing method, device, computer equipment and readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data processing method, a data processing device, a computer device, and a readable storage medium.
Background
With the development of multimedia technology, video has become a primary carrier through which people obtain information and entertainment in daily life. Current video classification algorithms may obtain text data of a target video (for example, the video title or video summary) and extract classification information corresponding to the target video from that text data. When the text data contains entity nouns (for example, "mobile phone"), the classification information is easy to extract: the entity nouns in the text data are used directly as the classification information of the target video. When the text data contains no entity nouns, however, it is difficult to extract classification information for the target video. Moreover, extracting classification information directly from the text data makes it difficult to guarantee the accuracy of the extracted classification information.
Disclosure of Invention
The embodiment of the application provides a data processing method, a data processing device, a computer device and a readable storage medium, which can improve the accuracy of the target classification information predicted for a target video.
In one aspect, an embodiment of the present application provides a data processing method, including:
obtaining a pre-training classification model; the pre-training classification model comprises a multi-mode feature sub-model, a feature fusion sub-model and an initial task sub-model;
s modal feature vectors corresponding to the target sample video belonging to the target field type are obtained through the multi-modal feature sub-model, and fusion learning is carried out on the S modal feature vectors through the feature fusion sub-model, so that M initial sample classification features corresponding to the target sample video are obtained; m and S are positive integers greater than 1;
extracting M-1 classification matching features and target sample classification features respectively corresponding to M initial sample classification features through an initial task sub-model; the M target sample classification features are used for determining M levels of sample classification information, and the M-1 classification matching features are used for determining the matching degree between every two adjacent levels of sample classification information; the categories respectively indicated by every two adjacent levels have category level nesting relations;
based on M sample tag information, S modal feature vectors, M target sample classification features and M-1 classification matching features of the target sample video, carrying out parameter adjustment on the initial task sub-model to obtain a task sub-model; the multi-mode feature sub-model, the feature fusion sub-model and the task sub-model are used for forming a target classification model; the target classification model is used for predicting M levels of target classification information corresponding to target videos belonging to the target field type.
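For illustration only, the following minimal Python sketch shows how one such parameter-adjustment step could be organized; the sub-model interfaces, tensor shapes and the use of a PyTorch optimizer are assumptions and are not part of the claims.

import torch

def train_step(pretrained, target_sample_video, sample_labels, optimizer):
    # Hypothetical pre-training classification model with .multimodal, .fusion and
    # .task sub-models plus a .total_loss helper (assumed interface).
    modal_vectors = pretrained.multimodal(target_sample_video)      # S modal feature vectors
    initial_features = pretrained.fusion(modal_vectors)             # M initial sample classification features
    # M target sample classification features and M-1 classification matching features
    target_features, matching_features = pretrained.task(initial_features)
    # Loss built from the M pieces of sample label information, the S modal feature vectors,
    # the M target sample classification features and the M-1 classification matching features
    loss = pretrained.total_loss(sample_labels, modal_vectors,
                                 target_features, matching_features)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()   # only the task sub-model's parameters are registered in the optimizer
    return loss.item()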
In one aspect, an embodiment of the present application provides a data processing apparatus, including:
the model acquisition module is used for acquiring a pre-training classification model; the pre-training classification model comprises a multi-mode feature sub-model, a feature fusion sub-model and an initial task sub-model;
the fusion learning module is used for acquiring S modal feature vectors corresponding to the target sample video belonging to the target field type through the multi-modal feature sub-model, and carrying out fusion learning on the S modal feature vectors through the feature fusion sub-model to acquire M initial sample classification features corresponding to the target sample video; m and S are positive integers greater than 1;
the feature extraction module is used for extracting M-1 classification matching features and target sample classification features respectively corresponding to the M initial sample classification features through the initial task sub-model; the M target sample classification features are used for determining M levels of sample classification information, and the M-1 classification matching features are used for determining the matching degree between every two adjacent levels of sample classification information; the categories respectively indicated by every two adjacent levels have category level nesting relations;
the parameter adjustment module is used for carrying out parameter adjustment on the initial task sub-model based on M sample label information, S modal feature vectors, M target sample classification features and M-1 classification matching features of the target sample video to obtain a task sub-model; the multi-mode feature sub-model, the feature fusion sub-model and the task sub-model are used for forming a target classification model; the target classification model is used for predicting M levels of target classification information corresponding to target videos belonging to the target field type.
The feature fusion sub-model is obtained by pre-training with a sample video set; the sample videos in the sample video set are associated with at least two video domain types, and the at least two video domain types include the target domain type; the sample video set comprises N sample videos, where N is a positive integer; the N sample videos include a sample video S_i, where i is a positive integer less than or equal to N;
The apparatus further comprises:
the first pre-training module is used for acquiring an initial classification model; the initial classification model comprises the multi-modal feature sub-model and an initial feature fusion sub-model;
the first pre-training module is used for acquiring, through the multi-modal feature sub-model, S sample modal vectors corresponding to the sample video S_i, obtaining a target sample modal vector from the S sample modal vectors, and determining the sample modal vectors other than the target sample modal vector among the S sample modal vectors as candidate sample modal vectors;
the first pre-training module is used for carrying out vector change on the target sample modal vector to obtain an auxiliary sample modal vector corresponding to the target sample modal vector, and carrying out fusion learning on the auxiliary sample modal vector and the candidate sample modal vectors through the initial feature fusion sub-model to obtain a first fusion sample vector corresponding to the auxiliary sample modal vector;
the first pre-training module is used for pre-training the initial feature fusion sub-model based on the first fusion sample vector and the target sample modal vector to obtain the feature fusion sub-model.
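To make this first pre-training scheme concrete, a minimal sketch follows; treating the "vector change" as zero-masking, the mean-squared-error reconstruction objective and the sub-model interfaces are assumptions.

import torch
import torch.nn.functional as F

def first_pretrain_step(multimodal, fusion, sample_video_i, target_idx, optimizer):
    # S sample modal vectors corresponding to the sample video S_i
    sample_modal_vectors = multimodal(sample_video_i)
    target_vec = sample_modal_vectors[target_idx]                  # target sample modal vector
    candidates = [v for j, v in enumerate(sample_modal_vectors)
                  if j != target_idx]                              # candidate sample modal vectors
    # "Vector change": here the target modal vector is simply zero-masked (assumption)
    auxiliary_vec = torch.zeros_like(target_vec)
    # Fusion learning over the auxiliary and candidate modal vectors; the output at the
    # masked (first) position is taken as the first fusion sample vector (assumed interface)
    first_fusion_vec = fusion([auxiliary_vec] + candidates)[0]
    # Pre-train the initial feature fusion sub-model to recover the original modal vector
    loss = F.mse_loss(first_fusion_vec, target_vec.detach())
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()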
The feature fusion sub-model is obtained by pre-training with a sample video set; the sample videos in the sample video set are associated with at least two video domain types, and the at least two video domain types include the target domain type; the sample video set comprises N sample videos, where N is a positive integer;
the apparatus further comprises:
the second pre-training module is used for acquiring an initial classification model; the initial classification model comprises a multi-mode feature sub-model and an initial feature fusion sub-model;
the second pre-training module is used for acquiring S initial modal feature vectors corresponding to each sample video through the multi-modal feature sub-model, and combining initial modal feature vectors belonging to the same mode in the N x S initial modal feature vectors into the same initial modal feature vector sequence to obtain S initial modal feature vector sequences; each initial modal feature vector sequence comprises N initial modal feature vectors;
the second pre-training module is used for acquiring candidate modal feature vector sequences from the S initial modal feature vector sequences, acquiring R candidate modal feature vectors from the candidate modal feature vector sequences, and adjusting the sequence of the R candidate modal feature vectors to obtain a candidate modal feature vector sequence after the sequence is adjusted; r is a positive integer less than N;
The second pre-training module is used for carrying out fusion learning on the candidate modal feature vectors in the candidate modal feature vector sequence after the sequence adjustment and the initial modal feature vectors in the S-1 initial modal feature vector sequences through the initial feature fusion sub-model to obtain a second fusion sample vector; the S-1 initial modal feature vector sequences are initial modal feature vector sequences except candidate modal feature vector sequences in the S initial modal feature vector sequences;
and the second pre-training module is used for pre-training the initial feature fusion sub-model based on the S initial modal feature vector sequences and the second fusion sample vector to obtain a feature fusion sub-model.
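As one possible reading of this second pre-training scheme, the sketch below shuffles the order of R vectors inside one modality's sequence and asks the initial feature fusion sub-model to reconstruct the original sequences; the cyclic re-ordering, the reconstruction objective and the assumption that the fused output matches the concatenated sequence shape are illustrative only.

import random
import torch
import torch.nn.functional as F

def second_pretrain_step(multimodal, fusion, sample_videos, optimizer, R=4):
    # S initial modal feature vector sequences, each containing N vectors
    per_video = [multimodal(v) for v in sample_videos]                # N videos, S vectors each
    S = len(per_video[0])
    sequences = [torch.stack([vecs[s] for vecs in per_video]) for s in range(S)]

    # Candidate modal feature vector sequence: adjust the order of R of its vectors
    cand_idx = random.randrange(S)
    shuffled = sequences[cand_idx].clone()
    pos = random.sample(range(shuffled.size(0)), R)
    shuffled[pos] = shuffled[pos[1:] + pos[:1]]                       # simple cyclic re-ordering

    # Fusion learning over the order-adjusted sequence and the other S-1 sequences
    inputs = [shuffled if s == cand_idx else seq for s, seq in enumerate(sequences)]
    second_fusion_vec = fusion(inputs)

    # Pre-training objective (assumption): reconstruct the original, un-shuffled sequences
    target = torch.cat([seq.detach() for seq in sequences], dim=-1)
    loss = F.mse_loss(second_fusion_vec, target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()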
The multi-modal feature sub-model comprises a text network layer, an image network layer, a video network layer and an audio network layer; the S modal feature vectors comprise sample sub-text vectors, sample sub-image vectors, sample sub-video vectors and sample sub-audio vectors;
the fusion learning module comprises:
the video acquisition unit is used for acquiring a target sample video belonging to the type of the target field;
the feature extraction unit is used for extracting text features of the target text data of the target sample video through the text network layer to obtain a sample sub-text vector corresponding to the target sample video;
The feature extraction unit is used for extracting image features of target image data of the target sample video through the image network layer to obtain sample sub-image vectors corresponding to the target sample video;
the feature extraction unit is used for extracting video features of target video data of the target sample video through the video network layer to obtain sample sub-video vectors corresponding to the target sample video;
the feature extraction unit is used for extracting audio features of the target audio data of the target sample video through the audio network layer to obtain sample sub-audio vectors corresponding to the target sample video;
and the fusion learning unit is used for carrying out fusion learning on the sample sub-text vector, the sample sub-audio vector, the sample sub-image vector and the sample sub-video vector through the feature fusion sub-model to obtain M initial sample classification features corresponding to the target sample video.
The feature extraction unit is specifically configured to perform feature extraction on target text data of a target sample video through a text network layer, so as to obtain a text feature vector corresponding to the target text data;
the feature extraction unit is specifically used for performing word segmentation on the target text data to obtain text word segmentation of the target text data, and performing text position coding on the text position of the text word segmentation in the target text data to obtain a text position vector corresponding to the text word segmentation;
The feature extraction unit is specifically configured to obtain a text mode feature corresponding to the target text data, and fuse the text feature vector, the text position vector and the text mode feature to obtain a sample sub-text vector corresponding to the target sample video.
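A minimal sketch of such a text branch is shown below; the token ids produced by an external tokenizer, the embedding dimensions and fusing the three parts by summation are assumptions.

import torch
import torch.nn as nn

class TextBranch(nn.Module):
    # Sketch of the text network layer: text feature vectors + text position vectors
    # + a learned text modality feature, fused by summation (summation is an assumption).
    def __init__(self, vocab_size=30000, dim=256, max_len=128):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)       # text feature vectors
        self.pos_emb = nn.Embedding(max_len, dim)            # text position vectors
        self.modality_emb = nn.Parameter(torch.zeros(dim))   # text modality feature

    def forward(self, token_ids):                            # (batch, seq_len) token ids
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token_emb(token_ids)
                + self.pos_emb(positions)
                + self.modality_emb)                          # sample sub-text vector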
The feature extraction unit is specifically configured to perform feature extraction on target image data of a target sample video through an image network layer to obtain an image feature vector corresponding to the target image data;
the feature extraction unit is specifically configured to obtain an image mode feature corresponding to the target image data, and fuse the image feature vector and the image mode feature to obtain a sample sub-image vector corresponding to the target sample video.
The feature extraction unit is specifically configured to perform feature extraction on target video data of a target sample video through a video network layer, so as to obtain a video feature vector corresponding to the target video data;
the feature extraction unit is specifically used for performing frame extraction processing on the target video data to obtain a key video frame in the target video data, and performing feature extraction on the key video frame through the video network layer to obtain a video frame feature vector corresponding to the key video frame;
The feature extraction unit is specifically configured to obtain a video mode feature corresponding to the target video data, and fuse the video feature vector, the video frame feature vector and the video mode feature to obtain a sample sub-video vector corresponding to the target sample video.
The feature extraction unit is specifically configured to perform frame segmentation processing on the target audio data of the target sample video to obtain at least two audio frames in the target audio data, and perform feature extraction on the at least two audio frames through the audio network layer to obtain audio frame feature vectors respectively corresponding to the at least two audio frames;
the feature extraction unit is specifically used for carrying out audio position coding on the audio position of each audio frame in the target audio data to obtain an audio frame position vector corresponding to each audio frame respectively;
the feature extraction unit is specifically configured to obtain an audio mode feature corresponding to the target audio data, and fuse at least two audio frame feature vectors, at least two audio frame position vectors and the audio mode feature to obtain a sample sub-audio vector corresponding to the target sample video.
The initial task sub-model comprises M classification network layers respectively corresponding to the M initial sample classification features; the M classification network layers comprise a classification network layer H_k, where k is a positive integer less than or equal to M;
The feature extraction module comprises:
a feature acquisition unit, used for acquiring, from the M initial sample classification features, the initial sample classification feature J_k corresponding to the classification network layer H_k;
a first processing unit, used for: if the classification network layer H_k is the first classification network layer of the M classification network layers, performing full-connection processing on the initial sample classification feature J_k through the classification network layer H_k to obtain an auxiliary sample classification feature corresponding to the classification network layer H_k, and performing full-connection processing on that auxiliary sample classification feature through the classification network layer H_k to obtain the target sample classification feature corresponding to the initial sample classification feature J_k;
a second processing unit, used for: if the classification network layer H_k is not the first classification network layer of the M classification network layers, performing full-connection processing on the initial sample classification feature J_k through the classification network layer H_k to obtain an auxiliary sample classification feature corresponding to the classification network layer H_k, and determining, based on the M initial sample classification features and the auxiliary sample classification feature corresponding to the classification network layer H_k, the target sample classification feature corresponding to the initial sample classification feature J_k and the classification matching feature corresponding to the classification network layer H_k.
Wherein the second processing unit is specifically used for: if the classification network layer H_k is the second classification network layer of the M classification network layers, performing feature splicing on the auxiliary sample classification feature corresponding to the classification network layer H_(k-1) and the auxiliary sample classification feature corresponding to the classification network layer H_k to obtain a spliced sample classification feature corresponding to the classification network layer H_k, performing feature splicing on the auxiliary sample classification feature corresponding to the classification network layer H_(k-1) and the spliced sample classification feature corresponding to the classification network layer H_k to obtain the classification matching feature corresponding to the classification network layer H_k, and performing full-connection processing on the spliced sample classification feature corresponding to the classification network layer H_k to obtain the target sample classification feature corresponding to the initial sample classification feature J_k; the classification network layer H_(k-1) is the previous classification network layer of the classification network layer H_k; the auxiliary sample classification feature corresponding to the classification network layer H_(k-1) is determined based on the initial sample classification feature corresponding to the classification network layer H_(k-1);
the second processing unit is further specifically used for: if the classification network layer H_k is neither the first nor the second classification network layer of the M classification network layers, performing feature splicing on the spliced sample classification feature corresponding to the classification network layer H_(k-1) and the auxiliary sample classification feature corresponding to the classification network layer H_k to obtain a spliced sample classification feature corresponding to the classification network layer H_k, performing feature splicing on the spliced sample classification feature corresponding to the classification network layer H_(k-1) and the spliced sample classification feature corresponding to the classification network layer H_k to obtain the classification matching feature corresponding to the classification network layer H_k, and performing full-connection processing on the spliced sample classification feature corresponding to the classification network layer H_k to obtain the target sample classification feature corresponding to the initial sample classification feature J_k.
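The cascade of classification network layers described above can be sketched as follows; interpreting "full-connection processing" as linear layers, "feature splicing" as concatenation, and the chosen feature dimensions are assumptions.

import torch
import torch.nn as nn

class HierarchicalHeads(nn.Module):
    def __init__(self, dim, num_classes_per_level):
        super().__init__()
        # one auxiliary full-connection layer per classification network layer H_k
        self.aux_fc = nn.ModuleList([nn.Linear(dim, dim) for _ in num_classes_per_level])
        # the spliced feature at level k (0-indexed) concatenates k+1 blocks of size dim
        self.target_fc = nn.ModuleList([nn.Linear((k + 1) * dim, n_cls)
                                        for k, n_cls in enumerate(num_classes_per_level)])

    def forward(self, initial_features):          # list of M tensors of shape (batch, dim)
        target_features, matching_features = [], []
        prev = None                               # previous auxiliary / spliced feature
        for k, feat in enumerate(initial_features):
            aux = self.aux_fc[k](feat)            # auxiliary sample classification feature
            if k == 0:
                target_features.append(self.target_fc[k](aux))
                prev = aux
            else:
                spliced = torch.cat([prev, aux], dim=-1)                       # spliced feature
                matching_features.append(torch.cat([prev, spliced], dim=-1))   # matching feature
                target_features.append(self.target_fc[k](spliced))             # target feature
                prev = spliced
        return target_features, matching_features

With M equal to 3 and feature dimension dim, this sketch returns three target sample classification features and two classification matching features.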
The S modal feature vectors comprise sample sub-text vectors, sample sub-image vectors, sample sub-video vectors and sample sub-audio vectors;
The parameter adjustment module comprises:
a loss determination unit, configured to determine a first model loss value of the pre-training classification model based on the M pieces of sample label information of the target sample video and the associated pairs of sample label information and target sample classification features among the M pieces of sample label information and the M target sample classification features;
the loss determination unit is configured to determine a second model loss value of the pre-training classification model based on the M pieces of sample label information, the sample sub-text vector, the sample sub-image vector, the sample sub-video vector and the sample sub-audio vector;
the loss determination unit is configured to determine a third model loss value of the pre-training classification model based on the M target sample classification features and the M-1 classification matching features;
a parameter adjustment unit, configured to determine the total model loss value of the pre-training classification model according to the first model loss value, the second model loss value and the third model loss value, and perform parameter adjustment on the initial task sub-model based on the total model loss value to obtain the task sub-model.
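As a sketch of how these loss values could be combined, the snippet below uses cross-entropy for the first model loss value and an equally weighted sum for the total model loss value; both choices are assumptions.

import torch.nn.functional as F

def first_model_loss(target_sample_features, sample_labels):
    # cross-entropy between each level's target sample classification feature
    # and the associated sample label information (cross-entropy is an assumption)
    return sum(F.cross_entropy(feat, label)
               for feat, label in zip(target_sample_features, sample_labels))

def total_model_loss(first_loss, second_loss, third_loss, weights=(1.0, 1.0, 1.0)):
    # total model loss value; an equally weighted sum is an assumption
    w1, w2, w3 = weights
    return w1 * first_loss + w2 * second_loss + w3 * third_loss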
The pre-training classification model further comprises a text classifier, an image classifier, a video classifier and an audio classifier;
the loss determination unit is specifically configured to input the sample sub-text vector to the text classifier, determine M predicted text classification vectors through the text classifier, and determine a text loss value of the pre-training classification model according to the associated pairs of predicted text classification vectors and sample label information among the M predicted text classification vectors and the M pieces of sample label information;
the loss determination unit is specifically configured to input the sample sub-image vector to the image classifier, determine M predicted image classification vectors through the image classifier, and determine an image loss value of the pre-training classification model according to the associated pairs of predicted image classification vectors and sample label information among the M predicted image classification vectors and the M pieces of sample label information;
the loss determination unit is specifically configured to input the sample sub-video vector to the video classifier, determine M predicted video classification vectors through the video classifier, and determine a video loss value of the pre-training classification model according to the associated pairs of predicted video classification vectors and sample label information among the M predicted video classification vectors and the M pieces of sample label information;
the loss determination unit is specifically configured to input the sample sub-audio vector to the audio classifier, determine M predicted audio classification vectors through the audio classifier, and determine an audio loss value of the pre-training classification model according to the associated pairs of predicted audio classification vectors and sample label information among the M predicted audio classification vectors and the M pieces of sample label information;
the loss determination unit is specifically configured to determine the second model loss value of the pre-training classification model according to the text loss value, the image loss value, the video loss value and the audio loss value.
The loss determination unit is specifically configured to respectively perform normalization processing on the M target sample classification features to obtain normalized sample classification features corresponding to the M target sample classification features, and determine, according to the M normalized sample classification features, the M levels of sample classification information corresponding to the target sample video;
the loss determination unit is specifically configured to respectively perform normalization processing on the M-1 classification matching features to obtain normalized matching features corresponding to the M-1 classification matching features; the M-1 normalized matching features comprise a normalized matching feature P_b, where b is a positive integer less than or equal to M-1;
the loss determination unit is specifically configured to obtain first sample classification information and second sample classification information associated with the normalized matching feature P_b, and determine, according to the first sample classification information and the second sample classification information, a category matching label corresponding to the normalized matching feature P_b; the first sample classification information is the sample classification information corresponding to the normalized sample classification feature at the same level as the normalized matching feature P_b, and the second sample classification information is the sample classification information corresponding to the normalized sample classification feature at the previous level;
the loss determination unit is specifically configured to determine, according to the normalized matching feature P_b and the category matching label corresponding to the normalized matching feature P_b, a category matching loss corresponding to the normalized matching feature P_b;
the loss determination unit is specifically configured to determine the third model loss value of the pre-training classification model according to the category matching losses corresponding to the M-1 normalized matching features.
Wherein the apparatus further comprises:
the video classification module is used for acquiring S target modal characteristics corresponding to the target video through the multi-modal characteristic sub-model;
the video classification module is used for carrying out fusion learning on the S target modal characteristics through the characteristic fusion sub-model to obtain M initial classification characteristics corresponding to the target video;
The video classification module is used for extracting target classification features corresponding to the M initial classification features respectively through the task sub-model;
the video classification module is used for respectively carrying out normalization processing on the M target classification features to obtain normalized classification features corresponding to the M target classification features, and determining M levels of target classification information corresponding to the target video according to the M normalized classification features.
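For completeness, a sketch of the prediction path with the target classification model; a single un-batched video, softmax normalization and argmax decoding are assumptions.

import torch

@torch.no_grad()
def classify_video(target_model, target_video):
    # S target modal features -> M initial classification features -> M target classification features
    modal_features = target_model.multimodal(target_video)
    initial_features = target_model.fusion(modal_features)
    target_features, _ = target_model.task(initial_features)
    # normalization (softmax is an assumption) and decoding into M levels of classification information
    normalized = [torch.softmax(f, dim=-1) for f in target_features]
    return [int(p.argmax(dim=-1)) for p in normalized]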
In one aspect, an embodiment of the present application provides a computer device, including: a processor and a memory;
the processor is connected to the memory, wherein the memory is configured to store a computer program, and when the computer program is executed by the processor, the computer device is caused to execute the method provided by the embodiment of the application.
In one aspect, the present application provides a computer readable storage medium storing a computer program adapted to be loaded and executed by a processor, so that a computer device having the processor performs the method provided by the embodiment of the present application.
In one aspect, embodiments of the present application provide a computer program product comprising a computer program stored on a computer readable storage medium. The processor of the computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program, so that the computer device performs the method provided by the embodiment of the present application.
It can be seen that the embodiment of the application provides an information-flow video content classification method based on a multi-modal pre-training network. Through the pre-training classification model, the modal feature vectors respectively corresponding to the S modalities of the target sample video can be obtained, so that the target sample video is exploited in multiple dimensions, the information implied by its different modalities is fully utilized, and multi-modal content understanding improves the accuracy and comprehensiveness of the feature representation of the target sample video. In addition, the embodiment of the application can predict the M levels of sample classification information corresponding to the target sample video based on the M target sample classification features; by describing the classification information of the target sample video at different granularity levels through the M levels of sample classification information, the target sample video is accurately classified at multiple levels. Therefore, when a target video is classified based on the target classification model, the information of the S modalities of the target video can be fully utilized to generate the M levels of target classification information corresponding to the target video, thereby improving the accuracy of the target classification information predicted for the target video.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the related art, the drawings required in the description of the embodiments or the related art are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained from these drawings by those skilled in the art without inventive effort.
Fig. 1 is a schematic structural diagram of a network architecture according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a scenario for data interaction according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a multi-layer perceptron provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of a classification model according to an embodiment of the present application;
FIG. 6 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 7 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 8 is a flow chart of a data processing method according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a scenario for performing contrast learning according to an embodiment of the present application;
FIG. 10 is a flowchart of a method and system for classifying information flow video content based on a multi-mode pre-training network according to an embodiment of the present application;
FIG. 11 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without inventive effort fall within the scope of protection of the present application.
It should be appreciated that artificial intelligence (Artificial Intelligence, AI for short) is a theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems and mechatronics. Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, machine learning/deep learning, autonomous driving, intelligent transportation and other directions.
The solution provided by the embodiment of the application mainly relates to the Computer Vision (CV) technology, Machine Learning (ML) technology, Speech Technology (ST) and Natural Language Processing (NLP) technology of artificial intelligence.
Computer vision is a science that studies how to make machines "see"; more specifically, it refers to using cameras and computers, instead of human eyes, to identify and measure targets and to perform further graphic processing, so that the computer produces images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision technologies typically include image processing, image recognition, image semantic understanding, image retrieval, OCR (Optical Character Recognition), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, autonomous driving, intelligent transportation and the like.
Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how computers simulate or implement human learning behavior to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and teaching-based learning. Deep learning is a machine learning technique based on deep neural network systems. The concept of deep learning is derived from the study of artificial neural networks; for example, a multi-layer perceptron with multiple hidden layers is a deep learning structure. Deep learning combines low-level features to form more abstract high-level representations of attribute categories or features, so as to discover distributed feature representations of data.
The key technologies of speech technology include automatic speech recognition technology, speech synthesis technology and voiceprint recognition technology. Enabling computers to listen, see, speak and feel is the development direction of human-computer interaction in the future, and speech is expected to become one of the best modes of human-computer interaction in the future.
Natural language processing technology is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics. Research in this field therefore involves natural language, i.e., the language people use every day, so it is closely related to the study of linguistics. Natural language processing technologies typically include text processing, semantic understanding, machine translation, robot question answering, knowledge graph technology and the like.
Specifically, referring to fig. 1, fig. 1 is a schematic structural diagram of a network architecture according to an embodiment of the present application. As shown in fig. 1, the network architecture may include a server 2000 and a cluster of terminal devices. Wherein the cluster of terminal devices may in particular comprise one or more terminal devices, the number of terminal devices in the cluster of terminal devices will not be limited here. As shown in fig. 1, the plurality of terminal devices may specifically include a terminal device 3000a, a terminal device 3000b, terminal devices 3000c, …, a terminal device 3000n; the terminal devices 3000a, 3000b, 3000c, …, 3000n may be directly or indirectly connected to the server 2000 through a wired or wireless communication manner, respectively, so that each terminal device may interact with the server 2000 through the network connection.
Wherein each terminal device in the terminal device cluster may include: smart phones, tablet computers, notebook computers, desktop computers, intelligent voice interaction devices, intelligent home appliances (e.g., smart televisions), wearable devices, vehicle terminals, aircraft and other intelligent terminals with data processing functions. It should be understood that each terminal device in the terminal device cluster shown in fig. 1 may be installed with an application client having a multimedia data processing function, and when the application client runs in each terminal device, data interaction may be performed between each terminal device and the server 2000 shown in fig. 1. The application client may specifically include: vehicle clients, smart home clients, entertainment clients (e.g., game clients), multimedia clients (e.g., video clients), social clients, and information-based clients (e.g., news clients), etc. The application client in the embodiment of the present application may be integrated in a certain client (for example, a social client), and the application client may also be an independent client (for example, a news client).
For easy understanding, the embodiment of the present application may select one terminal device from the plurality of terminal devices shown in fig. 1 as the target terminal device. For example, in the embodiment of the present application, the terminal device 3000a shown in fig. 1 may be used as a target terminal device, and an application client having a multimedia data processing function may be installed in the target terminal device. At this time, the target terminal device may implement data interaction between the application client and the server 2000.
The server 2000 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Network, content delivery networks), basic cloud computing services such as big data and artificial intelligence platforms, and the like.
It should be understood that, in the embodiment of the present application, the computer device may perform multi-level classification processing on the target video through the target classification model to obtain M levels of target classification information corresponding to the target video, and then add the target video to the video recommendation pool corresponding to the application client according to the M levels of target classification information corresponding to the target video. The categories respectively indicated by every two adjacent levels among the M levels have a category-hierarchy nesting relationship; that is, every two adjacent pieces of target classification information among the M levels of target classification information have a category-hierarchy nesting relationship. For example, taking M as 3, the M pieces of target classification information may specifically include first-level target classification information, second-level target classification information and third-level target classification information; the category-hierarchy nesting relationship runs from coarse granularity to fine granularity, the first-level target classification information is the upper-level classification information of the second-level target classification information, and the second-level target classification information is the upper-level classification information of the third-level target classification information. For example, the first-level target classification information may be "mobile phone", the second-level target classification information may be "domestic mobile phone", and the third-level target classification information may be "brand-A mobile phone".
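To make the nesting example concrete, a small hypothetical taxonomy can be written as follows; the brand and category names below are illustrative only and not taken from the disclosure.

# Hypothetical three-level taxonomy matching the example above (M = 3):
taxonomy = {
    "mobile phone": {
        "domestic mobile phone": ["brand-A mobile phone", "brand-B mobile phone"],
        "imported mobile phone": ["brand-C mobile phone"],
    },
}

def is_nested(level1, level2, level3):
    # True when each level nests under the previous one (category-hierarchy
    # nesting relationship between every two adjacent levels).
    return level3 in taxonomy.get(level1, {}).get(level2, [])

print(is_nested("mobile phone", "domestic mobile phone", "brand-A mobile phone"))  # True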
The data processing method provided by the embodiment of the present application may be executed by the server 2000 (i.e., the computer device may be the server 2000), may be executed by the target terminal device (i.e., the computer device may be the target terminal device), or may be executed by both the server 2000 and the target terminal device. For convenience of understanding, the user corresponding to the terminal device may be referred to as an object in the embodiment of the present application, for example, the user corresponding to the target terminal device may be referred to as a target object.
When the data processing method is jointly executed by the server 2000 and the target terminal device, the server 2000 may perform parameter adjustment on the pre-trained classification model to obtain the target classification model, and then send the target classification model to the target terminal device. Thus, the target object can receive the target classification model through the target terminal equipment, and then the target video is subjected to multi-level classification processing through the target classification model.
Alternatively, the server 2000 may perform parameter adjustment on the pre-trained classification model to obtain the target classification model when the data processing method is executed by the server 2000. In this way, the target object may transmit a video classification request to the server 2000 through the application client in the target terminal device, so that the server 2000 performs a multi-level classification process on the target video in the video classification request through the target classification model.
Optionally, when the data processing method is executed by the target terminal device, the target terminal device may perform parameter adjustment on the pre-trained classification model to obtain a target classification model, and further perform multi-level classification processing on the target video through the target classification model.
It should be understood that the service scenario applicable to the network framework may specifically include: video distribution scenes, video search scenes, etc., and specific service scenes will not be listed here one by one. For example, in a video distribution scenario, the computer device may obtain, from the video recommendation pool, a distribution video for performing video distribution according to M levels of target classification information corresponding to the target video, and further distribute the distribution video to an application client (e.g., a video browsing interface) of the target terminal device. For another example, in the video search scenario, the computer device may obtain, from the video recommendation pool, a search video for performing video recommendation according to M levels of target classification information corresponding to the target video, and then recommend the search video to an application client (e.g., a search result interface) of the target terminal device.
It will be appreciated that the target terminal device may present the distributed video or the search video to the target object in the form of an information stream (i.e., a Feeds stream). A Feeds stream is a data format that is typically ordered along a time axis (i.e., a timeline); the timeline is the most intuitive and basic presentation form of a Feeds stream. The content recommended to the target object for reading through the information stream may include images, text and videos (for example, distributed videos and search videos), and the videos may be in portrait or landscape orientation.
It should be appreciated that the application client may also be referred to as an aggregator, i.e., software for aggregating Feeds streams. For example, an aggregator may be software specifically used to subscribe to websites (different websites corresponding to different servers); an aggregator may also be referred to as an RSS (Really Simple Syndication) reader, a feed reader, a news reader, or the like.
For ease of understanding, further, please refer to fig. 2, fig. 2 is a schematic diagram of a scenario for data interaction according to an embodiment of the present application. The server 20a shown in fig. 2 may be the server 2000 in the embodiment corresponding to fig. 1, the terminal device 20b shown in fig. 2 may be the target terminal device in the embodiment corresponding to fig. 1, an application client may be installed in the terminal device 20b, and the user corresponding to the terminal device 20b may be the object 20c. For ease of understanding, embodiments of the present application will be described in terms of a data processing method performed by server 20 a.
The sample video set 21a (i.e., the content database 21 a) shown in fig. 2 may include a plurality of databases, where the plurality of databases may include databases 21b, …, and 21c shown in fig. 2, and the databases 21b, …, and 21c may be used to store videos corresponding to different video domain types, for example, the database 21b may be used to store videos corresponding to game domain types, and the database 21c may be used to store videos corresponding to life domain types.
As shown in fig. 2, the server 20a may obtain a target sample video belonging to the target domain type from the sample video set 21a, and parameter-adjust the pre-training classification model 22a through the target sample video. For ease of understanding, the embodiment of the present application is described taking the target domain type as the life domain type as an example, and the target sample video may be a video belonging to the life domain type, so that the server 20a may obtain the target sample video for parameter adjustment of the pre-training classification model 22a from the database 21 c.
The pre-trained classification model 22a may include, among other things, a multi-modal feature sub-model, a feature fusion sub-model, and an initial task sub-model. As shown in fig. 2, the server 20a may input the target sample video into a multi-mode feature sub-model of the pre-training classification model 22a, and obtain S modal feature vectors corresponding to the target sample video through the multi-mode feature sub-model, where S may be a positive integer greater than 1, and the S modal feature vectors may specifically include modal feature vectors 23a, …, and modal feature vector 23b. The mode feature vectors may be vectors corresponding to data of different modes, the data of different modes may include, but are not limited to, data of a text mode, data of a video mode and data of an audio mode, and the S mode feature vectors may include, but are not limited to, vectors corresponding to data of a text mode, vectors corresponding to data of a video mode and vectors corresponding to data of an audio mode. For example, the modal feature vector 23a may be a vector corresponding to data of a text modality, and the modal feature vector 23b may be a vector corresponding to data of a video modality.
Further, as shown in fig. 2, the server 20a may input the modal feature vectors 23a, … and 23b into a feature fusion sub-model of the pre-training classification model 22a, and perform fusion learning on the modal feature vectors 23a, … and 23b through the feature fusion sub-model to obtain M initial sample classification features corresponding to the target sample video, where M may be a positive integer greater than 1. For ease of understanding, the 3 initial sample classification features are described herein as M equal to 3, and may specifically include initial sample classification feature 24a, initial sample classification feature 24b, and initial sample classification feature 24c.
Further, as shown in fig. 2, the server 20a may input the initial sample classification feature 24a, the initial sample classification feature 24b, and the initial sample classification feature 24c into an initial task sub-model of the pre-training classification model 22a, and extract target sample classification features corresponding to the M-1 classification matching features and the M initial sample classification features, respectively, through the initial task sub-model. For ease of understanding, the 2 classification matching features may include classification matching feature 25a and classification matching feature 25b, and the 3 target sample classification features may include target sample classification feature 26a, target sample classification feature 26b, and target sample classification feature 26c, as described herein by way of example with M equal to 3.
It will be appreciated, among other things, that the target sample classification feature 26a, the target sample classification feature 26b, and the target sample classification feature 26c may be used to determine M levels of sample classification information, i.e., the target sample classification feature 26a, the target sample classification feature 26b, and the target sample classification feature 26c may be used to determine 3 levels of sample classification information. The categories indicated by each adjacent two levels have a category-level nested relationship, for example, the category indicated by the target sample classification feature 26a (i.e., sample classification information) and the category indicated by the target sample classification feature 26b (i.e., sample classification information) have a category-level nested relationship, and the category indicated by the target sample classification feature 26b (i.e., sample classification information) and the category indicated by the target sample classification feature 26c (i.e., sample classification information) have a category-level nested relationship.
Further, the classification matching feature 25a and the classification matching feature 25b are used to determine the degree of matching between the sample classification information of each adjacent two levels, for example, the classification matching feature 25a is used to determine the degree of matching between the sample classification information indicated by the target sample classification feature 26a and the sample classification information indicated by the target sample classification feature 26b, and the classification matching feature 25b is used to determine the degree of matching between the sample classification information indicated by the target sample classification feature 26b and the sample classification information indicated by the target sample classification feature 26 c.
Further, as shown in fig. 2, the server 20a may obtain the M pieces of sample tag information of the target sample video. For ease of understanding, M equal to 3 is taken as an example here; the 3 pieces of sample tag information may specifically include sample tag information 27a, sample tag information 27b and sample tag information 27c. The M pieces of sample tag information of the target sample video may be tag information obtained by manual labeling, or tag information obtained by a high-accuracy single-classification machine learning model, which is not limited in the present application.
It may be understood that the sample tag information 27a may be tag information corresponding to sample classification information indicated by the target sample classification feature 26a, the sample tag information 27b may be tag information corresponding to sample classification information indicated by the target sample classification feature 26b, and the sample tag information 27c may be tag information corresponding to sample classification information indicated by the target sample classification feature 26 c. Wherein, the sample label information may represent an actual classification result, and the sample classification information may represent a predicted classification result.
Further, as shown in fig. 2, the server 20a may determine a model loss value of the pre-training classification model 22a based on the M sample label information, the S modal feature vectors, the M target sample classification features and the M-1 classification matching features, and perform parameter adjustment on an initial task sub-model in the pre-training classification model 22a according to the model loss value of the pre-training classification model 22a to obtain an initial task sub-model after parameter adjustment, and further determine the initial task sub-model after parameter adjustment as the task sub-model. Wherein the multi-modal feature sub-model in the pre-trained classification model 22a, the feature fusion sub-model and the task sub-model in the pre-trained classification model 22a may be used to construct the target classification model 22b. In other words, the server 20a may perform parameter adjustment on the pre-training classification model 22a according to the model loss value of the pre-training classification model 22a to obtain the pre-training classification model 22a after parameter adjustment, and further determine the pre-training classification model 22a after parameter adjustment as the target classification model 22b.
It will be appreciated that, since the target sample video is a video belonging to the life domain type, the target classification model 22b may be used to predict classification information corresponding to the video of the life domain type, that is, the target classification model 22b is a target classification model corresponding to the life domain type. Similarly, the server 20a may acquire videos of other domain types except for the life domain type, and perform parameter adjustment on the pre-trained classification model 22a through the videos of the other domain types to obtain a target classification model (not shown in the figure) corresponding to the other domain types.
Therefore, after the target classification model 22b is generated, the server 20a may receive a video classification request sent by the object 20c through the terminal device 20b, obtain a target video to be subjected to the multi-level classification processing from the video classification request, and further obtain a target classification model corresponding to a domain type to which the target video belongs. For easy understanding, the embodiment of the present application is described taking the target video as a video belonging to the life domain type as an example, so that the server 20a may obtain the target classification model (i.e. the target classification model 22 b) corresponding to the life domain type, and further perform multi-level classification processing on the target video through the target classification model 22b to obtain the M levels of target classification information corresponding to the target video. Wherein, since the object classification model 22b is generated based on the 3 sample tag information and the 3 sample classification information, the object classification model 22b can generate 3 levels of object classification information corresponding to the object video.
Further, as shown in fig. 2, the server 20a may store the target videos in the database 21c, and at the same time, return M pieces of target classification information corresponding to the target videos to the terminal device 20b. Optionally, the server 20a may store M pieces of object classification information corresponding to the object videos in the database 21c, where the databases 21b, … and 21c may be used to store not only videos corresponding to different video domain types, but also classification information corresponding to videos of different video domain types.
Therefore, the embodiment of the application can understand the content of the multidimensional data of the target sample video to obtain S modal feature vectors corresponding to the target sample video, and further determine M-1 classification matching features and M target sample classification features based on the S modal feature vectors. It can be understood that the M target sample classification features may be used to determine M levels of sample classification information of the target sample video, actual classification information corresponding to the M sample classification information may be M sample tag information, and parameter adjustment may be performed on the pre-training classification model based on the M sample tag information, the S modal feature vectors, the M target sample classification features, and the M-1 classification matching features to obtain the target classification model. In this way, when the target classification model is used for carrying out multi-stage classification processing on the target video, the accuracy of the target classification information corresponding to the predicted target video can be improved.
Further, referring to fig. 3, fig. 3 is a flow chart of a data processing method according to an embodiment of the application. The method may be performed by a server, or may be performed by a terminal device, or may be performed by a server and a terminal device together, where the server may be the server 20a in the embodiment corresponding to fig. 2, and the terminal device may be the terminal device 20b in the embodiment corresponding to fig. 2. For ease of understanding, embodiments of the present application will be described in terms of this method being performed by a server. The data processing method may include the following steps S101 to S104:
step S101, a pre-training classification model is obtained;
the pre-training classification model comprises a multi-modal feature sub-model, a feature fusion sub-model and an initial task sub-model, wherein the multi-modal feature sub-model may comprise a text network layer, an image network layer, a video network layer and an audio network layer. The pre-training classification model is obtained by pre-training an initial classification model. For the process of pre-training the initial classification model, reference may be made to the description of step S201 in the embodiment corresponding to fig. 8 below.
Step S102, S modal feature vectors corresponding to a target sample video belonging to a target field type are obtained through a multi-modal feature sub-model, and fusion learning is carried out on the S modal feature vectors through a feature fusion sub-model to obtain M initial sample classification features corresponding to the target sample video;
Specifically, a target sample video belonging to a target field type is acquired. Further, text feature extraction is carried out on the target text data of the target sample video through the text network layer, and a sample sub-text vector corresponding to the target sample video is obtained. Further, image characteristic extraction is carried out on target image data of the target sample video through the image network layer, and sample sub-image vectors corresponding to the target sample video are obtained. Further, video feature extraction is carried out on target video data of the target sample video through the video network layer, and sample sub-video vectors corresponding to the target sample video are obtained. Further, audio feature extraction is carried out on the target audio data of the target sample video through the audio network layer, and a sample sub-audio vector corresponding to the target sample video is obtained. The S modal feature vectors include a sample sub-text vector, a sample sub-image vector, a sample sub-video vector, and a sample sub-audio vector, in other words, the sample sub-text vector, the sample sub-image vector, the sample sub-video vector, and the sample sub-audio vector may be collectively referred to as S modal feature vectors corresponding to the target sample video, where the sample sub-text vector, the sample sub-image vector, the sample sub-video vector, and the sample sub-audio vector have the same feature dimensions. Further, fusion learning is carried out on the sample sub-text vector, the sample sub-audio vector, the sample sub-image vector and the sample sub-video vector through the feature fusion sub-model, and M initial sample classification features corresponding to the target sample video are obtained. Wherein M and S are positive integers greater than 1, and the feature fusion submodel can support cross-modal feature fusion.
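For intuition, the following is a minimal sketch of step S102 in PyTorch under stated assumptions: the four network layers of the multi-modal feature sub-model are abstracted away (random tensors stand in for the S modal feature vectors), the feature fusion sub-model is approximated by a generic 3-layer Transformer encoder, and the feature dimension, number of attention heads and level count are illustrative rather than taken from the application.

```python
import torch
import torch.nn as nn

D = 768   # assumed shared feature dimension of the S modal feature vectors
S = 4     # text, image, video, audio
M = 3     # assumed number of classification levels

# The feature fusion sub-model is approximated here by a 3-layer Transformer encoder.
fusion = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True),
    num_layers=3,
)

# S modal feature vectors of one target sample video (batch size 1), treated as S tokens.
modal_vectors = torch.randn(1, S, D)

fused = fusion(modal_vectors)              # cross-modal attention over the S tokens

# One pooled feature reused M times, i.e. the case (described below) in which the
# M initial sample classification features are identical.
initial_sample_classification_features = [fused.mean(dim=1)] * M
print([f.shape for f in initial_sample_classification_features])   # M tensors of shape [1, D]
```
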
Wherein, it can be understood that the M initial sample classification features may be M different features output by the feature fusion sub-model; alternatively, the M initial sample classification features may be one feature output by the feature fusion sub-model, with that feature repeated M times, that is, the M initial sample classification features are identical.
For a specific process of extracting text features from the target text data, refer to the following description of step S1022 in the embodiment corresponding to fig. 6; for a specific process of extracting image features from the target image data through the image network layer, refer to the description of step S1023 in the embodiment corresponding to fig. 6; for a specific process of extracting video features from the target video data, refer to the description of step S1024 in the embodiment corresponding to fig. 6; for a specific process of extracting the audio features from the target audio data, reference may be made to the description of step S1025 in the embodiment corresponding to fig. 6.
It can be understood that the feature fusion sub-model can be a transformer-based bi-directional encoder characterization (Bidirectional Encoder Representations from Transformers, abbreviated as BERT) model, and can also be a lightweight BERT model (A Lite BERT for Self-supervised Learning of Language Representations, abbreviated as ALBERT) for language characterization self-supervised learning, and the embodiment of the application does not limit the specific type of the feature fusion sub-model.
The initial task sub-model may include M classification network layers respectively corresponding to the M initial sample classification features, where the M classification network layers include a classification network layer H_k, and k may be a positive integer less than or equal to M; that is, the M classification network layers may include a classification network layer H_1, ..., and a classification network layer H_M.
Step S103, extracting, through the initial task sub-model, M-1 classification matching features and target sample classification features respectively corresponding to the M initial sample classification features;
specifically, an initial sample classification feature J_k corresponding to the classification network layer H_k is obtained from the M initial sample classification features. Further, if the classification network layer H_k is the first classification network layer of the M classification network layers, full connection processing is performed on the initial sample classification feature J_k through the classification network layer H_k to obtain an auxiliary sample classification feature corresponding to the classification network layer H_k, and full connection processing is performed on the auxiliary sample classification feature corresponding to the classification network layer H_k through the classification network layer H_k to obtain a target sample classification feature corresponding to the initial sample classification feature J_k. Alternatively, if the classification network layer H_k is not the first classification network layer of the M classification network layers, full connection processing is performed on the initial sample classification feature J_k through the classification network layer H_k to obtain an auxiliary sample classification feature corresponding to the classification network layer H_k, and the target sample classification feature corresponding to the initial sample classification feature J_k and the classification matching feature corresponding to the classification network layer H_k are determined based on the M initial sample classification features and the auxiliary sample classification feature corresponding to the classification network layer H_k. The M target sample classification features are used for determining M levels of sample classification information, and the M-1 classification matching features are used for determining the matching degree between every two adjacent levels of sample classification information; in other words, the M target sample classification features can represent M levels of conditional probability distributions, and the M-1 classification matching features can represent the dependency relationship between two adjacent levels. Wherein the categories indicated by each two adjacent levels have a category-level nesting relationship. For example, a category-level nesting relationship (i.e., a dependency relationship) exists between a primary category and a secondary category, and a category-level nesting relationship exists between a secondary category and a tertiary category; the primary classification result has a guiding effect on the secondary classification result and the tertiary classification result, and the secondary classification result has a guiding effect on the tertiary classification result.
Therein, it can be appreciated that the specific process of determining, based on the M initial sample classification features and the auxiliary sample classification feature corresponding to the classification network layer H_k, the target sample classification feature corresponding to the initial sample classification feature J_k and the classification matching feature corresponding to the classification network layer H_k can be described as follows: if the classification network layer H_k is the second classification network layer of the M classification network layers, feature splicing is performed on the auxiliary sample classification feature corresponding to the classification network layer H_{k-1} and the auxiliary sample classification feature corresponding to the classification network layer H_k to obtain a spliced sample classification feature corresponding to the classification network layer H_k; feature splicing is performed on the auxiliary sample classification feature corresponding to the classification network layer H_{k-1} and the spliced sample classification feature corresponding to the classification network layer H_k to obtain a classification matching feature corresponding to the classification network layer H_k; and full connection processing is performed on the spliced sample classification feature corresponding to the classification network layer H_k to obtain the target sample classification feature corresponding to the initial sample classification feature J_k. Wherein the classification network layer H_{k-1} is the previous classification network layer of the classification network layer H_k, and the auxiliary sample classification feature corresponding to the classification network layer H_{k-1} is determined based on the initial sample classification feature corresponding to the classification network layer H_{k-1}. Alternatively, if the classification network layer H_k is not the second classification network layer of the M classification network layers, feature splicing is performed on the spliced sample classification feature corresponding to the classification network layer H_{k-1} and the auxiliary sample classification feature corresponding to the classification network layer H_k to obtain a spliced sample classification feature corresponding to the classification network layer H_k; feature splicing is performed on the spliced sample classification feature corresponding to the classification network layer H_{k-1} and the spliced sample classification feature corresponding to the classification network layer H_k to obtain a classification matching feature corresponding to the classification network layer H_k; and full connection processing is performed on the spliced sample classification feature corresponding to the classification network layer H_k to obtain the target sample classification feature corresponding to the initial sample classification feature J_k. For the specific process of obtaining the spliced sample classification feature corresponding to the classification network layer H_{k-1}, reference may be made to the description of obtaining the spliced sample classification feature corresponding to the classification network layer H_k, which will not be repeated here.
It should be appreciated that each classification network layer may include two multi-layer perceptrons (Multilayer Perceptron, MLP), i.e., the classification network layer H_k may include two multi-layer perceptrons. Through the two multi-layer perceptrons in the classification network layer H_k, the above full connection processing can be realized, namely performing full connection processing on the initial sample classification feature J_k, performing full connection processing on the auxiliary sample classification feature corresponding to the classification network layer H_k, and performing full connection processing on the spliced sample classification feature corresponding to the classification network layer H_k, i.e., introducing constraint relations among different classification levels to improve the model performance. In addition, the multi-layer perceptron may also be referred to as an artificial neural network (ANN, Artificial Neural Network); the multi-layer perceptron may include an input layer, a hidden layer, and an output layer, and the number of hidden layers may be at least one.
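As an illustration of the splicing and full-connection logic described above, the sketch below implements a 3-level task sub-model in PyTorch; the hidden size, the class counts per level and the helper names are assumptions, not values from the application.

```python
import torch
import torch.nn as nn

D, H = 768, 256                    # assumed feature / hidden sizes
num_classes = [20, 120, 500]       # assumed class counts for the 1st / 2nd / 3rd level

# First MLP of each classification network layer H_k: initial -> auxiliary feature.
mlp1 = nn.ModuleList(nn.Sequential(nn.Linear(D, H), nn.ReLU()) for _ in range(3))
# Second MLP of H_k: consumes the spliced feature, whose width grows with the level.
mlp2 = nn.ModuleList(nn.Linear(H * (k + 1), c) for k, c in enumerate(num_classes))

def task_submodel(initial_feats):
    """initial_feats: list of 3 tensors [batch, D] (the initial sample classification features)."""
    targets, matches = [], []
    prev = None                              # auxiliary (k=1) or spliced (k>1) feature of H_{k-1}
    for k, feat in enumerate(initial_feats):
        aux = mlp1[k](feat)                  # auxiliary sample classification feature of H_k
        if k == 0:
            spliced = aux
        else:
            spliced = torch.cat([prev, aux], dim=-1)            # spliced sample classification feature
            matches.append(torch.cat([prev, spliced], dim=-1))  # classification matching feature
        targets.append(mlp2[k](spliced))     # target sample classification feature of H_k
        prev = spliced
    return targets, matches

feats = [torch.randn(2, D) for _ in range(3)]
targets, matches = task_submodel(feats)      # 3 target features, 2 matching features
```
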
The specific process of performing the full connection processing through the multi-layer perceptron may refer to fig. 4; fig. 4 is a schematic structural diagram of a multi-layer perceptron provided by an embodiment of the application. As shown in fig. 4, the layers of the multi-layer perceptron are fully connected (i.e., the outputs of all nodes of the upper layer are connected as inputs to every node of the next layer). Fig. 4 shows the basic structure of a three-layer neural network, i.e., fig. 4 illustrates a multi-layer perceptron comprising one hidden layer: the bottom layer is the input layer 40a (i.e., layer L_1), the intermediate layer is the hidden layer 40b (i.e., layer L_2), and the highest layer is the output layer 40c (i.e., layer L_3). The input layer 40a, the hidden layer 40b and the output layer 40c are composed of neurons, and each neuron has a respective weight.
The input layer 40a may be an n-dimensional vector, so that the input layer has n neurons; for example, the input layer 40a may be a 3-dimensional vector X = {X_1, X_2, X_3}, so that the input layer has 3 neurons. It should be understood that the input and output of the input layer 40a are the same: when the input of the input layer 40a is the vector X, the output of the input layer 40a may also be the vector X.
The hidden layer 40b and the input layer 40a are fully connected. When the output of the input layer 40a is the vector X, the input of the hidden layer 40b may be the vector X, and thus the output of the hidden layer 40b may be: f(W_1·X + b_1), where W_1 may be the weight corresponding to each neuron of the input layer 40a (also called connection coefficient 1), b_1 may be the bias term of the input layer 40a, and the function f may be referred to as an activation function; for example, the activation function may be a sigmoid function, a tanh function, etc. The (+1) on the input layer 40a neuron may indicate that the weight of the bias term b_1 is (+1). By introducing nonlinearity into the output of the neuron through the activation function and the bias term, the neuron can learn nonlinear expressions.
The hidden layer 40b to the output layer 40c can be regarded as a type of logistic regression (i.e., softmax regression). The output layer 40c and the hidden layer 40b are fully connected; when the output of the hidden layer 40b is f(W_1·X + b_1), the input of the output layer 40c may be f(W_1·X + b_1), and thus the output 40d of the output layer 40c may be: softmax(W_2·X_1 + b_2), where X_1 may represent the output of the hidden layer 40b, W_2 may be the weight corresponding to each neuron of the hidden layer 40b (also called connection coefficient 2), and b_2 may be the bias term of the hidden layer 40b; for example, the softmax function may be a sigmoid function. The (+1) on the hidden layer 40b neuron may indicate that the weight of the bias term b_2 is (+1).
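A small numerical sketch of this three-layer perceptron (random placeholder weights, assumed layer sizes) may make the forward computation concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 4, 2            # assumed layer sizes

W1, b1 = rng.normal(size=(n_hidden, n_in)), np.zeros(n_hidden)   # connection coefficient 1, bias b_1
W2, b2 = rng.normal(size=(n_out, n_hidden)), np.zeros(n_out)     # connection coefficient 2, bias b_2

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

X = np.array([0.5, -1.0, 2.0])             # the 3-dimensional input vector X
hidden = sigmoid(W1 @ X + b1)              # hidden layer output: f(W_1·X + b_1)
output = softmax(W2 @ hidden + b2)         # output layer: softmax(W_2·X_1 + b_2)
print(output, output.sum())                # probabilities summing to 1
```
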
Step S104, carrying out parameter adjustment on the initial task sub-model based on M sample label information, S modal feature vectors, M target sample classification features and M-1 classification matching features of the target sample video to obtain a task sub-model;
specifically, a first model loss value of the pre-trained classification model is determined based on the associated sample tag information and the target sample classification feature of the M sample tag information and the M target sample classification features of the target sample video. Further, a second model loss value for the pre-trained classification model is determined based on the M sample label information, the sample sub-text vector, the sample sub-image vector, the sample sub-video vector, and the sample sub-audio vector. Further, a third model loss value for the pre-trained classification model is determined based on the M target sample classification features and the M-1 classification matching features. Further, according to the first model loss value, the second model loss value and the third model loss value, determining a total model loss value of the pre-training classification model, and carrying out parameter adjustment on the initial task sub-model based on the total model loss value to obtain the task sub-model. Optionally, the embodiment of the present application may further determine a total model loss value of the pre-trained classification model according to one or two of the first model loss value, the second model loss value and the third model loss value. The multi-mode feature sub-model, the feature fusion sub-model and the task sub-model are used for forming a target classification model, and the target classification model is used for predicting M levels of target classification information corresponding to target videos belonging to the target field types.
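A minimal sketch of how the three partial losses could be combined into the total model loss follows; the equal weighting is an assumption, since the application leaves the combination rule open.

```python
def total_model_loss(first_loss, second_loss, third_loss, w1=1.0, w2=1.0, w3=1.0):
    """Weighted sum of the hierarchical classification loss, the per-modality
    classifier loss and the level-matching loss (weights are assumptions)."""
    return w1 * first_loss + w2 * second_loss + w3 * third_loss

print(total_model_loss(0.9, 0.4, 0.2))   # 1.5
```
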
The specific process of determining the first model loss value may refer to the description of step S1041 in the embodiment corresponding to fig. 7, the specific process of determining the second model loss value may refer to the description of step S1042 in the embodiment corresponding to fig. 7, and the specific process of determining the third model loss value may refer to the description of step S1043 in the embodiment corresponding to fig. 7.
For ease of understanding, please refer to fig. 5, fig. 5 is a schematic structural diagram of a classification model according to an embodiment of the present application. As shown in fig. 5, the pre-training classification model may include a multi-modal feature sub-model 50a, a feature fusion sub-model 50b, and an initial task sub-model 50c, and for ease of understanding, the embodiment of the present application is illustrated with the number of modes being 4, the embodiment of the present application is illustrated with the level of classification information being 3, and the embodiment of the present application is illustrated with the feature fusion sub-model 50b being a BERT model.
The 4 modalities may specifically include a text modality, an image modality, a video modality and an audio modality; optionally, the modalities in the embodiment of the application may also be 1, 2 or 3 of the text modality, the image modality, the video modality and the audio modality. The 3 levels may specifically include a first level, a second level, and a third level. The feature fusion sub-model 50b may include one or more fusion network layers; where the number of fusion network layers is 3, the 3 fusion network layers may specifically include a fusion network layer 51a, a fusion network layer 51b, and a fusion network layer 51c, where the fusion network layer 51a, the fusion network layer 51b, and the fusion network layer 51c may be encoders in a Transformer model and may perform attention processing on the input features; in other words, the feature fusion sub-model 50b may use a multi-layer Transformer model to perform the attention operation.
Wherein the initial task sub-model 50c may include 3 classification network layers, and the 3 classification network layers may include a classification network layer H_1, a classification network layer H_2 and a classification network layer H_3. The classification network layer H_1 may include a multi-layer perceptron 60a and a multi-layer perceptron 60b; the classification network layer H_2 may include a multi-layer perceptron 61a, a splicing layer 63a, and a multi-layer perceptron 61b; and the classification network layer H_3 may include a multi-layer perceptron 62a, a splicing layer 63b, and a multi-layer perceptron 62b.
As shown in fig. 5, S modal feature vectors corresponding to the target sample video may be obtained through the multi-modal feature sub-model 50a, where the S modal feature vectors may include a modal feature vector 56a corresponding to a text mode, a modal feature vector 57a corresponding to an image mode, a modal feature vector 58a corresponding to a video mode, and a modal feature vector 59a corresponding to an audio mode; alternatively, the S modal feature vectors may include a modal feature vector 56b corresponding to a text modality, a modal feature vector 57b corresponding to an image modality, a modal feature vector 58b corresponding to a video modality, and a modal feature vector 59b corresponding to an audio modality. The modal feature vector 56a, the modal feature vector 57a, the modal feature vector 58a, and the modal feature vector 59a may be modal feature vectors of different modalities corresponding to the same video, and the modal feature vector 56b, the modal feature vector 57b, the modal feature vector 58b, and the modal feature vector 59b may be modal feature vectors of different modalities corresponding to the same video.
As shown in fig. 5, the modal feature vector 56a, the modal feature vector 57a, the modal feature vector 58a and the modal feature vector 59a may be subjected to fusion learning through the feature fusion sub-model 50b, so as to obtain M initial sample classification features (i.e., attention features) corresponding to the target sample video. The fusion network layer 51a may perform attention processing on the modal feature vector 56a, the modal feature vector 57a, the modal feature vector 58a, and the modal feature vector 59a to obtain the output of the fusion network layer 51a; attention processing is then performed on the output of the fusion network layer 51a through the fusion network layer 51b to obtain the output of the fusion network layer 51b, and attention processing is further performed on the output of the fusion network layer 51b through the fusion network layer 51c to obtain the attention feature corresponding to the target sample video.
Further, as shown in fig. 5, full connection processing is performed on the initial sample classification feature corresponding to the classification network layer H_1 through the multi-layer perceptron 60a to obtain the auxiliary sample classification feature corresponding to the classification network layer H_1; full connection processing is performed on the initial sample classification feature corresponding to the classification network layer H_2 through the multi-layer perceptron 61a to obtain the auxiliary sample classification feature corresponding to the classification network layer H_2; and full connection processing is performed on the initial sample classification feature corresponding to the classification network layer H_3 through the multi-layer perceptron 62a to obtain the auxiliary sample classification feature corresponding to the classification network layer H_3. Further, feature splicing is performed on the auxiliary sample classification feature corresponding to the classification network layer H_1 and the auxiliary sample classification feature corresponding to the classification network layer H_2 through the splicing layer 63a to obtain the spliced sample classification feature corresponding to the classification network layer H_2; similarly, feature splicing is performed on the spliced sample classification feature corresponding to the classification network layer H_2 and the auxiliary sample classification feature corresponding to the classification network layer H_3 through the splicing layer 63b to obtain the spliced sample classification feature corresponding to the classification network layer H_3.
Further, full connection processing is performed on the auxiliary sample classification feature corresponding to the classification network layer H_1 through the multi-layer perceptron 60b to obtain the target sample classification feature corresponding to the classification network layer H_1; similarly, full connection processing is performed on the spliced sample classification feature corresponding to the classification network layer H_2 through the multi-layer perceptron 61b to obtain the target sample classification feature corresponding to the classification network layer H_2, and full connection processing is performed on the spliced sample classification feature corresponding to the classification network layer H_3 through the multi-layer perceptron 62b to obtain the target sample classification feature corresponding to the classification network layer H_3.
Further, as shown in fig. 5, according to the target sample classification feature corresponding to the classification network layer H_1, the sample classification information of the first level (i.e., the primary classification) corresponding to the target sample video can be determined; according to the target sample classification feature corresponding to the classification network layer H_2, the sample classification information of the second level (i.e., the secondary classification) corresponding to the target sample video can be determined; and according to the target sample classification feature corresponding to the classification network layer H_3, the sample classification information of the third level (i.e., the tertiary classification) corresponding to the target sample video can be determined.
Therefore, the embodiment of the application provides an information flow video content classification method based on a multi-mode pre-training network, which can acquire the modal feature vectors corresponding to S modes of a target sample video respectively through a pre-training classification model, multi-dimensionally utilize the target sample video, fully utilize information implied by different modes of the target sample video, and improve the accuracy and the comprehensiveness of the feature representation of the target sample video through multi-mode content understanding. In addition, the embodiment of the application can predict M levels of sample classification information corresponding to the target sample video based on M target sample classification features, and accurately classify the target sample video in multiple levels by describing the classification information of the target sample video on different granularity levels through the M levels of sample classification information. Therefore, when the target sample video is classified based on the target classification model, the information of S modes of the target video can be fully utilized to generate M levels of target classification information corresponding to the target video, so that the accuracy of predicting the target classification information corresponding to the target video is improved.
Further, referring to fig. 6, fig. 6 is a flow chart of a data processing method according to an embodiment of the application. The data processing method may include the following steps S1021-S1026, where steps S1021-S1026 are a specific embodiment of step S102 in the embodiment corresponding to fig. 3.
Step S1021, obtaining a target sample video belonging to the target field type;
the number of the target sample videos may be one or more, and the embodiment of the present application may acquire one or more target sample videos belonging to the target domain type, and further execute the following steps S1022-S1026 for the one or more target sample videos.
Step S1022, extracting text features of target text data of the target sample video through the text network layer to obtain sample sub-text vectors corresponding to the target sample video;
specifically, feature extraction is performed on target text data of a target sample video through a text network layer, so that text feature vectors corresponding to the target text data are obtained. Further, word segmentation processing is carried out on the target text data to obtain text word segmentation of the target text data, text position coding is carried out on text positions of the text word segmentation in the target text data, and text position vectors corresponding to the text word segmentation are obtained. Further, text modal characteristics corresponding to the target text data are obtained, and the text characteristic vectors, the text position vectors and the text modal characteristics are fused to obtain sample sub-text vectors corresponding to the target sample video. The text mode feature can represent a mode number of a text mode and is used for uniquely identifying one mode; the method for fusing the text feature vector, the text position vector and the text modal feature can be a weighted average method or a feature splicing method, and the embodiment of the application does not limit the specific fusion process.
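The following sketch illustrates one possible reading of step S1022, assuming the text network layer output is approximated by token embeddings, positions are encoded with a learned embedding, and fusion is done by addition (splicing is the other option mentioned above); all names, dimensions and the modality number are illustrative.

```python
import torch
import torch.nn as nn

D, vocab, max_len, num_modalities = 768, 30000, 128, 4
token_emb = nn.Embedding(vocab, D)          # stands in for the text network layer output
pos_emb = nn.Embedding(max_len, D)          # text position vectors
modality_emb = nn.Embedding(num_modalities, D)
TEXT_MODALITY_ID = 0                        # assumed modality number of the text modality

token_ids = torch.randint(0, vocab, (1, 16))          # tokenized target text data (1 video, 16 tokens)
positions = torch.arange(token_ids.size(1)).unsqueeze(0)

sample_sub_text_vector = (
    token_emb(token_ids)                              # text feature vectors
    + pos_emb(positions)                              # text position vectors
    + modality_emb(torch.tensor([TEXT_MODALITY_ID]))  # text modality feature, broadcast over tokens
).mean(dim=1)                                         # pooled to one vector per video
print(sample_sub_text_vector.shape)                   # [1, D]
```
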
The target text data may include, but is not limited to, title information, subtitle text information, voice text information, and description keywords (i.e., HashTag) of the target sample video, where the target text data may be obtained by splicing the title information, the subtitle text information, the voice text information, and the description keywords. The description keywords may be words input when the publisher uploads the target sample video; the voice text information of the target sample video can be recognized through ASR (Automatic Speech Recognition); the subtitle text information of the target sample video can be recognized through OCR (Optical Character Recognition); and the title information is the publisher's subjective description of the content expressed by the target sample video and generally covers the high-level semantics that the target sample video wants to express.
It can be understood that, due to reasons such as inaccurate OCR recognition, fixed-position OCR needing to be deduplicated, spoken-commentary OCR needing to be retained, and news-ticker OCR needing to be deleted during picture switching, the embodiment of the application can perform denoising processing on the OCR recognition results (namely the subtitle text information). The denoising processing may filter single-character OCR, filter pure-digit OCR, filter pure-letter OCR, filter OCR whose subtitle positions in two adjacent frames have a small offset and whose character repetition rate is high, filter OCR whose subtitles are at the bottom of the screen and have a small height, and the like, and the denoised subtitle text information may then be spliced with the other text information to obtain the target text data.
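The denoising rules above can be pictured as simple predicate filters, as in the sketch below; the thresholds are illustrative assumptions, and the cross-frame deduplication rule (small position offset plus high character repetition between adjacent frames) is omitted because it needs state across frames.

```python
def keep_ocr_line(text, height, y_bottom_ratio, screen_height=1.0):
    """Return False for OCR lines that should be filtered out (thresholds are assumptions)."""
    if len(text) <= 1:                       # single-character OCR
        return False
    if text.isdigit():                       # pure-digit OCR
        return False
    if text.isascii() and text.isalpha():    # pure-letter OCR
        return False
    if y_bottom_ratio > 0.9 and height < 0.03 * screen_height:
        return False                         # small subtitle at the bottom of the screen
    return True

lines = [("今天的做法很简单", 0.05, 0.5), ("7", 0.05, 0.5), ("12345", 0.05, 0.5)]
print([t for t, h, y in lines if keep_ocr_line(t, h, y)])   # keeps only the first line
```
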
According to the embodiment of the application, feature extraction may be performed on the target text data through a BERT model pre-trained on a large-scale information-flow text corpus, so that the features of the text modality can be effectively extracted; optionally, the embodiment of the present application may also perform feature extraction on the target text data through the BERT model directly, and the embodiment of the present application does not limit the type of the model used for text feature extraction.
Optionally, the embodiment of the application can also obtain a keyword vector (for example, a word2vec vector corresponding to the description keyword) corresponding to the description keyword, so as to fuse the keyword vector, the text feature vector, the text position vector and the text modal feature, and obtain a sample sub-text vector corresponding to the target sample video. The word2vec vector has side information (auxiliary information) in the global text, which is helpful for the multi-modal feature vector (i.e. sample sub-text vector) to better understand the overall category division space.
Step S1023, extracting image features of target image data of a target sample video through an image network layer to obtain sample sub-image vectors corresponding to the target sample video;
specifically, feature extraction is performed on target image data of a target sample video through an image network layer, and an image feature vector corresponding to the target image data is obtained. Further, image mode characteristics corresponding to the target image data are obtained, and the image characteristic vectors and the image mode characteristics are fused to obtain sample sub-image vectors corresponding to the target sample video. The image mode characteristics can represent the mode number of the image mode and are used for uniquely identifying one mode; the method for fusing the image feature vector and the image mode features can be a weighted average method or a feature stitching method, and the embodiment of the application does not limit the specific fusion process.
The target image data may be a cover map of the target sample video, and the cover map may be an image captured in the target sample video, or may be an image uploaded by a publisher. It should be understood that the embodiment of the present application may perform feature extraction on the target image data through the SwinT (Swin Transformer) model, and may also perform feature extraction on the target image data through the VIT (Vision Transformer) model, and the embodiment of the present application does not limit the type of model used for image feature extraction.
Optionally, in the embodiment of the present application, the target detection model may be used to perform target detection on the target image data to obtain a detection object in the target image data, obtain a detection object vector corresponding to the detection object, and further fuse the detection object vector, the image feature vector and the image mode feature to obtain a sample sub-image vector corresponding to the target sample video. The target detection model may be a FastRCNN (Fast Region Convolutional Neural Networks) model, and the embodiment of the application does not limit the model type of the target detection model.
Step S1024, extracting video features of target video data of the target sample video through the video network layer to obtain sample sub-video vectors corresponding to the target sample video;
Specifically, feature extraction is performed on target video data of a target sample video through a video network layer, so as to obtain a video feature vector corresponding to the target video data. Further, frame extraction processing is carried out on the target video data to obtain key video frames in the target video data, and feature extraction is carried out on the key video frames through a video network layer to obtain video frame feature vectors corresponding to the key video frames. Further, video mode features corresponding to the target video data are obtained, and the video feature vectors, the video frame feature vectors and the video mode features are fused to obtain sample sub-video vectors corresponding to the target sample video. The video mode characteristics can represent the mode number of the video mode and are used for uniquely identifying one mode; the method for fusing the video feature vector, the video frame feature vector and the video mode feature can be a weighted average method or a feature splicing method, and the embodiment of the application does not limit the specific fusion process.
It may be appreciated that the distance between the video feature vectors corresponding to two target sample videos may be computed, and thus the similarity between the two target sample videos may be calculated; the video feature vector may represent a content-based "implicit" feature. Thus, the video feature vector carries a 2-layer meaning: the first layer is that the video feature vector is a low-dimensional dense feature, and the second layer is that the video feature vector is a vector for similarity measurement, where the "distance" between two such vectors represents the "similarity" of the two videos.
The embodiment of the application can perform feature extraction on the target video data through Video Swin Transformer, can perform feature extraction on the key video frames through a SwinT model, and can perform feature extraction on the key video frames through a VIT model, and the embodiment of the application does not limit the types of models used for video feature extraction.
Wherein Video Swin Transformer is an improvement based on the Swin Transformer; the basic structure of Video Swin Transformer is very close to that of the Swin Transformer, with a frame (time) dimension added when the model performs its calculations. The biggest characteristic of the Swin Transformer is that it is similar to the convolution + pooling structure in a CNN (Convolutional Neural Network); in the Swin Transformer, the structure becomes Swin Transformer Block + Patch Merging, and the specially designed window-based Transformer calculation mode reduces the calculation amount of the model.
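A sketch of one possible reading of step S1024 follows: key frames are sampled at a fixed stride, a linear layer stands in for the SwinT / VIT / Video Swin Transformer encoders mentioned above, and the video modality embedding is added before pooling; frame size, stride and the modality number are assumptions.

```python
import torch
import torch.nn as nn

D, num_modalities, VIDEO_MODALITY_ID = 768, 4, 2     # modality id 2 is an assumption
frame_encoder = nn.Linear(3 * 32 * 32, D)            # stands in for a SwinT / VIT / Video Swin encoder
modality_emb = nn.Embedding(num_modalities, D)

video = torch.randn(1, 64, 3, 32, 32)                # target video data: 64 decoded (tiny) frames
key_frames = video[:, ::8]                            # frame extraction: every 8th frame -> 8 key frames
frame_feats = frame_encoder(key_frames.flatten(2))    # video frame feature vectors, [1, 8, D]
video_feat = frame_feats.mean(dim=1, keepdim=True)    # clip-level video feature vector, [1, 1, D]

sample_sub_video_vector = (
    torch.cat([video_feat, frame_feats], dim=1)        # fuse video-level and frame-level features
    + modality_emb(torch.tensor([VIDEO_MODALITY_ID]))  # add the video modality feature
).mean(dim=1)
print(sample_sub_video_vector.shape)                   # [1, D]
```
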
Step S1025, extracting audio features of target audio data of the target sample video through the audio network layer to obtain sample sub-audio vectors corresponding to the target sample video;
specifically, frame-dividing processing is performed on target audio data of a target sample video to obtain at least two audio frames in the target audio data, and feature extraction is performed on the at least two audio frames through an audio network layer respectively to obtain audio frame feature vectors corresponding to the at least two audio frames respectively. Further, audio position coding is carried out on the audio position of each audio frame in the target audio data, and an audio frame position vector corresponding to each audio frame is obtained. Further, audio mode characteristics corresponding to the target audio data are obtained, at least two audio frame characteristic vectors, at least two audio frame position vectors and the audio mode characteristics are fused, and sample sub-audio vectors corresponding to the target sample video are obtained. The audio mode feature can represent a mode number of an audio mode and is used for uniquely identifying one mode; the method for fusing the at least two audio frame feature vectors, the at least two audio frame position vectors and the audio mode features can be a weighted average method or a feature splicing method, and the embodiment of the application does not limit the specific fusion process.
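A sketch of one possible reading of step S1025: the waveform is split into fixed-length frames, a linear layer stands in for the audio network layer, and the frame position and audio modality embeddings are fused by addition before pooling; frame length, frame count and the modality number are assumptions.

```python
import torch
import torch.nn as nn

D, frame_len, num_modalities = 768, 400, 4
AUDIO_MODALITY_ID = 3                        # assumed modality number of the audio modality

frame_encoder = nn.Linear(frame_len, D)      # stands in for the audio network layer
pos_emb = nn.Embedding(512, D)               # audio frame position vectors
modality_emb = nn.Embedding(num_modalities, D)

waveform = torch.randn(1, 8 * frame_len)                    # target audio data
frames = waveform.view(1, 8, frame_len)                     # framing into 8 audio frames
positions = torch.arange(8).unsqueeze(0)

sample_sub_audio_vector = (
    frame_encoder(frames)                                   # audio frame feature vectors
    + pos_emb(positions)                                    # audio frame position vectors
    + modality_emb(torch.tensor([AUDIO_MODALITY_ID]))       # audio modality feature
).mean(dim=1)                                               # pooled to one vector per video
print(sample_sub_audio_vector.shape)                        # [1, D]
```
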
The embodiment of the application can respectively perform feature extraction on the at least two audio frames through WavLM (Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing), and does not limit the type of the model used for audio feature extraction. In addition, the classification accuracy of content types such as emotion, comedy, film and television variety shows, and video courses can be obviously improved through the sample sub-audio vectors.
And step S1026, performing fusion learning on the sample sub-text vector, the sample sub-audio vector, the sample sub-image vector and the sample sub-video vector through the feature fusion sub-model to obtain M initial sample classification features corresponding to the target sample video.
It can be understood that, in the embodiment of the application, each feature output by the multi-modal feature sub-model (i.e., the S modal feature vectors) can be used as a token, and the feature sequence is input into the feature fusion sub-model, so that modality fusion is performed through the Transformer-network-based multi-modal feature fusion sub-model.
Optionally, the embodiment of the application can also perform fusion learning on one, two or three of the sample sub-text vector, the sample sub-audio vector, the sample sub-image vector and the sample sub-video vector to obtain the M initial sample classification features corresponding to the target sample video. For example, fusion learning may be performed on only the sample sub-text vector, the sample sub-image vector and the sample sub-video vector to obtain the M initial sample classification features corresponding to the target sample video.
For ease of understanding, referring back to fig. 5, embodiments of the present application may obtain features of a text mode, features of an image mode, features of a video mode, and features of an audio mode associated with a target sample video, where the features of the text mode may include text feature vector 52a, text position vector 52b, and text mode feature 52c, the features of the image mode may include image feature vector 53a and image mode feature 53b, the features of the video mode may include video feature vector 54a, video frame feature vector 54b, and video mode feature 54c, and the features of the audio mode may include audio frame feature vector 55a, audio frame position vector 55b, and audio mode feature 55c.
Further, as shown in fig. 5, by fusing text feature vector 52a, text position vector 52b, and text modality feature 52c, sample sub-text vector 56a (i.e., modality feature vector 56 a) may be obtained; by fusing the image feature vector 53a and the image modality feature 53b, a sample sub-image vector 57a (i.e., modality feature vector 57 a) may be obtained; by fusing video feature vector 54a, video frame feature vector 54b, and video modality feature vector 54c, sample sub-video vector 58a (i.e., modality feature vector 58 a) may be obtained; by fusing the audio frame feature vector 55a, the audio frame position vector 55b, and the audio modality feature 55c, a sample sub-audio vector 59a (i.e., modality feature vector 59 a) may be obtained.
Therefore, the embodiment of the application can acquire the target sample video belonging to the target field type, perform feature extraction on the target text data, the target image data, the target video data and the target audio data of the target sample video to obtain the sample sub-text vector, the sample sub-image vector, the sample sub-video vector and the sample sub-audio vector corresponding to the target sample video, and further perform fusion learning on the sample sub-text vector, the sample sub-image vector, the sample sub-video vector and the sample sub-audio vector to obtain M initial sample classification features corresponding to the target sample video. It can be understood that the embodiment of the application can use a multi-mode content understanding algorithm to fuse the target text data, the target image data, the target video data and the target audio data, and can deepen understanding of video content by multi-mode multi-level modeling mining information, thereby realizing downstream multi-classification tasks and improving accuracy of target classification information corresponding to a predicted target video.
Further, referring to fig. 7, fig. 7 is a flow chart of a data processing method according to an embodiment of the application. The data processing method may include the following steps S1041 to S1044, where the steps S1041 to S1044 are a specific embodiment of the step S104 in the embodiment corresponding to fig. 3. The pre-training classification model further comprises a text classifier, an image classifier, a video classifier and an audio classifier; optionally, the pre-trained classification model may also include one, two, or three of a text classifier, an image classifier, a video classifier, and an audio classifier.
Step S1041, determining a first model loss value of a pre-training classification model based on the M sample label information of the target sample video and the associated sample label information and target sample classification feature in the M target sample classification features;
specifically, the M target sample classification features are respectively normalized, so as to obtain normalized sample classification features corresponding to the M target sample classification features. Further, according to M sample tag information of the target sample video, a classification tag vector corresponding to each sample tag information is generated. Further, a classification loss value between the associated normalized sample classification feature and the classification label vector is determined based on the M normalized sample classification features and the M classification label vectors. Further, a first model penalty value of the pre-trained classification model is obtained from the classification penalty values between the M associated normalized sample classification features and the classification tag vectors.
If the number of the target sample videos is one, determining the sum of the classification loss values between M associated normalized sample classification features corresponding to the target sample videos and the classification label vectors as a first model loss value of the pre-training classification model. Optionally, if the number of the target sample videos is at least two, determining a sum of classification loss values between M associated normalized sample classification features and classification label vectors corresponding to the at least two target sample videos as a first model loss value of the pre-training classification model.
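A sketch of the first model loss under these definitions: each target sample classification feature is compared, after softmax normalization, against the classification label vector of the same level with cross entropy, and the per-level losses are summed over the batch; the class counts per level are assumptions.

```python
import torch
import torch.nn.functional as F

num_classes = [20, 120, 500]                               # assumed class counts per level
target_feats = [torch.randn(2, c) for c in num_classes]    # M target sample classification features (batch of 2)
labels = [torch.randint(0, c, (2,)) for c in num_classes]  # M sample tag information, as class ids

# F.cross_entropy applies the softmax normalization internally.
first_model_loss = sum(F.cross_entropy(f, y, reduction="sum")
                       for f, y in zip(target_feats, labels))
print(first_model_loss)
```
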
It will be appreciated that the associated sample tag information and the target sample classification feature in the M sample tag information and the M target sample classification feature may be the same level of sample tag information and target sample classification feature, for example, the sample tag information 27a and the target sample classification feature 26a in the embodiment corresponding to fig. 2 may be the associated sample tag information and target sample classification feature.
It should be appreciated that the loss function in the model training process may be used to represent the degree of difference between the predicted value and the actual value, and the smaller the loss value (e.g., the first model loss value) corresponding to the loss function, the better the model, and therefore, the goal of training a machine learning model is to find the point at which the model loss function reaches a minimum. The model loss function may be a cross entropy loss function, and the model loss function may also be a logic loss function, where the model loss function is taken as an example of the cross entropy loss function, and the embodiment of the application does not limit a specific type of the model loss function.
When the sample tag information of the target sample video can be classified into two categories (for example, the sample tag information of the target sample video may be mobile phone or non-mobile phone, i.e., the predicted value may be classified as mobile phone (i.e., positive) or non-mobile phone (i.e., negative)), the present application may use a two-class cross entropy loss function as the model loss function of the pre-training classification model, i.e., the model loss function may refer to the following formula (1):
C = -(1/n) × Σ [ y·ln(a) + (1-y)·ln(1-a) ]     (1)
where n may be the number of target sample videos and the summation runs over the n target sample videos; y may represent the actual value of the sample tag information (y may be 1 (i.e., the classification label vector is equal to (1, 0)) when the actual value is mobile phone, and y may be 0 (i.e., the classification label vector is equal to (0, 1)) when the actual value is non-mobile phone); and a may represent the probability that the predicted value is mobile phone (i.e., positive).
Alternatively, when the sample tag information of the target sample video can be classified into multiple categories (for example, the sample tag information may be brand mobile phone 1, brand mobile phone 2, brand mobile phone 3, etc.), the present application may use a multi-class cross entropy loss function (categorical cross entropy) as the model loss function of the pre-training classification model, i.e., the model loss function may refer to the following formula (2):
C = -(1/n) × Σ Σ_{c=1}^{m} y_c·ln(a_c)     (2)
where n may be the number of target sample videos and the outer summation runs over the n target sample videos; m may represent the number of categories of the sample tag information; y_c may represent an indicator variable (the indicator variable may be 1 when the actual value is the same as the sample tag information c, and may be 0 when the actual value is different from the sample tag information c); and a_c may represent the probability that the predicted value is the sample tag information c. Wherein the classification label vector is determined from the set of indicator variables. For example, the classification loss value between one associated normalized sample classification feature and classification label vector may be: Loss1 = -(0×ln0.3 + 0×ln0.3 + 1×ln0.4) = 0.91; the classification loss value between another associated normalized sample classification feature and classification label vector may be: Loss2 = -(0×ln0.3 + 0×ln0.3 + 1×ln0.4) = 0.91; and the classification loss value between a further associated normalized sample classification feature and classification label vector may be: Loss3 = -(0×ln0.7 + 0×ln0.2 + 1×ln0.1) = 2.30. The sum of the classification loss values between the M associated normalized sample classification features and classification label vectors may then be: C = 0.91 + 0.91 + 2.30 = 4.12. Wherein, in this example, the classification label vector may be represented as (0, 0, 1).
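The arithmetic of this example can be checked with a few lines of Python (the probabilities and one-hot label vectors are taken directly from the example; exact values differ slightly from the rounded figures above):

```python
import math

def ce(label_vec, probs):
    """Per-level classification loss: -sum_c y_c * ln(a_c)."""
    return -sum(y * math.log(a) for y, a in zip(label_vec, probs))

loss1 = ce((0, 0, 1), (0.3, 0.3, 0.4))   # -ln(0.4)
loss2 = ce((0, 0, 1), (0.3, 0.3, 0.4))   # -ln(0.4)
loss3 = ce((0, 0, 1), (0.7, 0.2, 0.1))   # -ln(0.1)
print(round(loss1, 2), round(loss2, 2), round(loss3, 2), round(loss1 + loss2 + loss3, 2))
```
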
It should be appreciated that, for ease of storage and processing, the normalized exponential function (i.e., the softmax function) can normalize a vector of real numbers so that the vector meets certain conditions, e.g., compress the value range of the real numbers in the vector so that each real number lies in (0, 1) and the sum of all the real numbers is 1. Therefore, the normalized exponential function can be understood as a classifier of the model, and the values in the result obtained by the normalization processing are probabilities.
Step S1042, determining a second model loss value of the pre-trained classification model based on the M sample label information, the sample sub-text vector, the sample sub-image vector, the sample sub-video vector, and the sample sub-audio vector;
specifically, the sample sub-text vectors are input to a text classifier, M predicted text classification vectors are determined through the text classifier, and a text loss value of a pre-training classification model is determined according to the associated predicted text classification vectors and sample label information in the M predicted text classification vectors and the M sample label information. Further, the sample sub-image vector is input to an image classifier, M predicted image classification vectors are determined by the image classifier, and an image loss value of the pre-training classification model is determined according to the associated predicted image classification vector and sample label information in the M predicted image classification vectors and the M sample label information. Further, the sample sub-video vectors are input to a video classifier, M predicted video classification vectors are determined by the video classifier, and a video loss value of the pre-training classification model is determined according to the associated predicted video classification vector and sample label information in the M predicted video classification vectors and the M sample label information. Further, the sample sub-audio vectors are input to an audio classifier, M predicted audio classification vectors are determined by the audio classifier, and an audio loss value of the pre-training classification model is determined according to the associated predicted audio classification vector and sample label information in the M predicted audio classification vectors and the M sample label information. Further, a second model penalty value for the pre-trained classification model is determined based on the text penalty value, the image penalty value, the video penalty value, and the audio penalty value.
The specific process of determining the text loss value according to the associated prediction text classification vector and sample label information in the M prediction text classification vectors and M sample label information, determining the image loss value according to the associated prediction image classification vector and sample label information in the M prediction image classification vectors and M sample label information, determining the video loss value according to the associated prediction video classification vector and sample label information in the M prediction video classification vectors and M sample label information, and determining the audio loss value according to the associated prediction audio classification vector and sample label information in the M prediction audio classification vectors and M sample label information can be referred to the above description of determining the first model loss value based on the associated sample label information and the associated sample classification feature in the M sample classification features of the target sample video, and will not be repeated herein.
It will be appreciated that a weighted summation of text loss values, image loss values, video loss values, and audio loss values may generate a second model loss value for the pre-trained classification model; optionally, the embodiment of the present application may further perform weighted summation on one, two or three of the text loss value, the image loss value, the video loss value and the audio loss value to generate a second model loss value of the pre-trained classification model.
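A sketch of the second model loss under the assumption that each per-modality classifier is a set of M linear heads over its modality vector and that the four modality losses are combined by a weighted sum; weights, dimensions and class counts are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D, num_classes = 768, [20, 120, 500]

def modality_loss(classifiers, modal_vec, labels):
    """M predicted classification vectors from one modality vector vs. the M sample labels."""
    return sum(F.cross_entropy(head(modal_vec), y)
               for head, y in zip(classifiers, labels))

heads = {m: nn.ModuleList(nn.Linear(D, c) for c in num_classes)
         for m in ("text", "image", "video", "audio")}        # text/image/video/audio classifiers
vectors = {m: torch.randn(2, D) for m in heads}                # the four sample sub-vectors (batch of 2)
labels = [torch.randint(0, c, (2,)) for c in num_classes]
weights = {"text": 1.0, "image": 1.0, "video": 1.0, "audio": 1.0}

second_model_loss = sum(weights[m] * modality_loss(heads[m], vectors[m], labels)
                        for m in heads)
print(second_model_loss)
```
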
Step S1043, determining a third model loss value of the pre-training classification model based on the M target sample classification features and the M-1 classification matching features;
specifically, the M target sample classification features are respectively normalized to obtain normalized sample classification features corresponding to the M target sample classification features, and M levels of sample classification information corresponding to the target sample video are determined according to the M normalized sample classification features. Further, the M-1 classification matching features are respectively normalized to obtain normalized matching features corresponding to the M-1 classification matching features. The M-1 normalized matching features include a normalized matching feature P_b, where b may be a positive integer less than or equal to M-1. Further, first sample classification information and second sample classification information associated with the normalized matching feature P_b are obtained, and a category matching label corresponding to the normalized matching feature P_b is determined according to the first sample classification information and the second sample classification information. The first sample classification information is the sample classification information corresponding to the normalized sample classification feature of the same level as the normalized matching feature P_b, and the second sample classification information is the sample classification information corresponding to the normalized sample classification feature of the previous level. Further, a category matching loss corresponding to the normalized matching feature P_b is determined according to the normalized matching feature P_b and its corresponding category matching label. Further, a third model loss value of the pre-trained classification model is determined according to the category matching losses corresponding to the M-1 normalized matching features; the sum of the category matching losses corresponding to the M-1 normalized matching features may be determined as the third model loss value of the pre-training classification model.
For example, if b is equal to 1, the normalized matching feature P_b is the normalized matching feature of the second level; the first sample classification information may be the sample classification information of the second level, and the second sample classification information may be the sample classification information of the first level. Further, the category-level nesting relation between the second-level sample classification information and the first-level sample classification information is compared: if the two have a category-level nesting relation, the category matching label may be first label information (namely 1); optionally, if the second-level sample classification information and the first-level sample classification information do not have a category-level nesting relation, the category matching label may be second label information (i.e., 0).
For the specific process of determining the category matching loss corresponding to the normalized matching feature P_b according to the normalized matching feature P_b and its corresponding category matching label, reference may be made to the description of determining the classification loss value between the associated normalized sample classification feature and classification label vector based on the M normalized sample classification features and the M classification label vectors, which will not be described in detail here.
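For intuition only, the sketch below shows one way the category matching labels and category matching losses described above could be computed, assuming each classification matching feature has already been reduced to a single match logit per sample and that a parent-lookup table encodes the category-level nesting relation; both assumptions are illustrative and are not taken from the application.

```python
import torch
import torch.nn.functional as F

def third_model_loss(match_logits, level_preds, taxonomy_parent):
    """match_logits: list of M-1 tensors, one match logit per sample, for levels 2..M.
    level_preds: list of M tensors of predicted class ids, one per level.
    taxonomy_parent: assumed dict mapping a level-k class id to its level-(k-1) parent id."""
    losses = []
    for b, logits in enumerate(match_logits):       # b = 0 .. M-2, i.e. level b+2
        child = level_preds[b + 1]                  # first sample classification info (same level)
        parent = level_preds[b]                     # second sample classification info (previous level)
        # Category matching label: 1 if the two levels have a nesting relation, else 0.
        label = torch.tensor(
            [1.0 if taxonomy_parent.get(int(c)) == int(p) else 0.0
             for c, p in zip(child, parent)], device=logits.device)
        losses.append(F.binary_cross_entropy_with_logits(logits.squeeze(-1), label))
    return torch.stack(losses).sum()                # sum of the M-1 category matching losses
```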
Therefore, the embodiment of the application can introduce constraint and dependency relations among different category hierarchies for the initial task sub-model: by combining hierarchical classifiers with an MLP layer (i.e., a multi-layer perceptron) and introducing a multi-objective loss function based on category mismatch, the dependency relations among the M levels are learned implicitly, so that the levels can reinforce each other and the prediction accuracy of the model is improved.
Step S1044, determining a total model loss value of the pre-training classification model according to the first model loss value, the second model loss value and the third model loss value, and performing parameter adjustment on the initial task sub-model based on the total model loss value to obtain the task sub-model.
It will be appreciated that a weighted sum of the first model loss value, the second model loss value, and the third model loss value may generate a total model loss value for the pre-trained classification model. The multi-modal feature sub-model, feature fusion sub-model, and task sub-model may be used to construct a target classification model.
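A minimal sketch of the weighted combination in step S1044 is shown below; the weights w1 to w3 are illustrative hyper-parameters rather than values from the application. The resulting total loss would then be back-propagated to adjust the parameters of the initial task sub-model.

```python
def total_model_loss(first_loss, second_loss, third_loss, w1=1.0, w2=1.0, w3=1.0):
    # Weighted summation of the three model loss values; the weights are assumed hyper-parameters.
    return w1 * first_loss + w2 * second_loss + w3 * third_loss
```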
It should be appreciated that two issues need attention when training and debugging the pre-trained classification model. On the one hand, for the classification task the model will very quickly learn simple and highly discriminative information that may nevertheless be immaterial, such as subtitles, black edges and logo information; black edges, for example, are extremely discriminative and are learned very easily, and if such information is not handled, the generalization ability of the resulting model will be poor. Therefore, the information to be shielded should be masked during data augmentation, or black edges and logos should be added to all data; that is, the information that needs to be suppressed is made universal so that it is no longer discriminative, which helps model learning. On the other hand, these problems are mainly related to how the training samples are collected. Normally, operation staff collect data by using title keywords; for example, game video data is collected by using game names. When such samples are fed into a multi-modal network, the title information branch converges quickly and the classification performance of the network looks good, but the finally learned network ignores the information of other dimensions, and its classification generalization is poor for videos without those keywords. The training samples may also come from fingerprint matching; the picture diversity of samples obtained by fingerprint matching is slightly worse than that of edited videos, and a problem similar to the title-keyword one may also occur in classification network learning, although it is less serious. Therefore, the application can add a supervision classifier for each type of modal information and train the losses of these supervision classifiers jointly, which has the advantage of avoiding the situation where one branch has a small loss while the other branches have large losses; the final optimization result drives the loss of every information branch to be very low, thereby encouraging the network to make full use of the information of each modality.
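To illustrate the "make the nuisance information universal" idea mentioned above, the following sketch adds black borders to every frame during data augmentation so that black edges stop being a discriminative shortcut; the border ratio and the array layout are assumptions for the example, not details from the application.

```python
import numpy as np

def add_black_edges(frame, border=0.1):
    """Add uniform black borders to a frame (H x W x C uint8 array) so that black edges
    appear in all samples and are no longer discriminative for the classifier."""
    h, w = frame.shape[:2]
    bh, bw = max(1, int(h * border)), max(1, int(w * border))
    out = frame.copy()
    out[:bh, :] = 0      # top border
    out[-bh:, :] = 0     # bottom border
    out[:, :bw] = 0      # left border
    out[:, -bw:] = 0     # right border
    return out
```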
Therefore, the embodiment of the application can determine the first model loss value, the second model loss value and the third model loss value of the pre-training classification model based on the M sample label information, the S modal feature vectors, the M target sample classification features and the M-1 classification matching features of the target sample video, further perform parameter adjustment on the initial task sub-model according to the first model loss value, the second model loss value and the third model loss value to obtain the task sub-model, namely perform parameter adjustment on the pre-training classification model to obtain the target classification model, accelerate the convergence speed of the pre-training classification model, and further improve the accuracy of predicting the target classification information corresponding to the target video when predicting the M levels of target classification information corresponding to the target video based on the target classification model.
Further, referring to fig. 8, fig. 8 is a flow chart of a data processing method according to an embodiment of the application. The method may be performed by a server, or may be performed by a terminal device, or may be performed by a server and a terminal device together, where the server may be the server 20a in the embodiment corresponding to fig. 2, and the terminal device may be the terminal device 20b in the embodiment corresponding to fig. 2. For ease of understanding, embodiments of the present application will be described in terms of this method being performed by a server. The data processing method may include the following steps S201 to S206:
Step S201, pre-training an initial classification model to obtain a pre-trained classification model;
The feature fusion sub-model is obtained by pre-training on a sample video set. The sample videos in the sample video set are associated with at least two video domain types, and the at least two video domain types include the target domain type. The sample video set includes N sample videos, where N may be a positive integer; the N sample videos include a sample video S_i, where i may be a positive integer less than or equal to N. It should be appreciated that the specific process of pre-training the initial classification model may be described as follows. An initial classification model is obtained; the initial classification model comprises the multi-modal feature sub-model and an initial feature fusion sub-model, and optionally also comprises the initial task sub-model. Further, S sample modal vectors corresponding to the sample video S_i are obtained through the multi-modal feature sub-model, a target sample modal vector is obtained from the S sample modal vectors, and the sample modal vectors other than the target sample modal vector among the S sample modal vectors are determined as candidate sample modal vectors. Further, vector change is carried out on the target sample modal vector to obtain an auxiliary sample modal vector corresponding to the target sample modal vector, and fusion learning is carried out on the auxiliary sample modal vector and the candidate sample modal vectors through the initial feature fusion sub-model to obtain a first fusion sample vector corresponding to the auxiliary sample modal vector. Further, based on the first fusion sample vector and the target sample modal vector, the initial feature fusion sub-model is pre-trained to obtain the feature fusion sub-model.
For ease of understanding, the embodiments of the present application are described by taking S sample mode vectors including a text mode vector, an image mode vector, a video mode vector, and an audio mode vector as examples, where the S sample mode vectors may include a sample text vector, a sample image vector, a sample video vector, and a sample audio vector. At this time, the target sample mode vector may be any one or more of a sample text vector, a sample image vector, a sample video vector and a sample audio vector, and the candidate sample mode vector may be a vector other than the target sample mode vector among the sample text vector, the sample image vector, the sample video vector and the sample audio vector. For example, the sample text vector may be a target sample modal vector, and the sample image vector, sample video vector, and sample audio vector may be candidate sample modal vectors; for another example, the sample image vector may be a target sample modality vector, and the sample text vector, the sample video vector, and the sample audio vector may be candidate sample modality vectors.
The auxiliary sample mode vector and the candidate sample mode vector can be used as a token to be input into an initial feature fusion sub-model, the first fusion sample vector can be a prediction result of the token, and the target sample mode vector can be a vector before vector change corresponding to the token. In addition, for a specific process of determining the loss value of the initial feature fusion sub-model based on the first fusion sample vector and the target sample mode vector, reference may be made to the description of determining the classification loss value based on the associated normalized sample classification feature and classification label vector, which will not be described in detail herein.
It can be appreciated that embodiments of the present application may perform vector change on the target sample modal vector based on a mask language model (Masked Language Modeling, MLM) and a mask frame model (MFM). The mask language model may randomly select x% of the tokens (i.e., of the target sample modal vector) each time; among the selected tokens, there is a y% probability of being replaced with the mask token, a z% probability of being replaced with a random token, and a z% probability of keeping the original token. In this case, the loss value used for pre-training the initial feature fusion sub-model may be the first pre-training loss value. The mask frame model may randomly mask x% of the frame tokens and replace them with all zeros; in this case, the loss value used for pre-training the initial feature fusion sub-model may be the second pre-training loss value. The embodiment of the application does not limit the specific values of x, y and z.
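The following is a hedged sketch of the MLM-style vector change described above, operating directly on token vectors; the default probabilities are purely illustrative, since the application does not limit the values of x, y and z, and the MFM variant would instead replace the selected frame tokens with all zeros.

```python
import torch

def mask_tokens(token_vectors, mask_vector, x=0.15, y=0.8, z=0.1):
    """token_vectors: [n, dim] tensor of token embeddings; mask_vector: [dim] mask embedding.
    Randomly select a fraction x of tokens; of those, replace a fraction y with the mask
    vector, a fraction z with a random token, and keep the rest unchanged."""
    n = token_vectors.size(0)
    selected = torch.rand(n) < x
    corrupted = token_vectors.clone()
    r = torch.rand(n)
    for i in range(n):
        if not selected[i]:
            continue
        if r[i] < y:
            corrupted[i] = mask_vector                    # replace with the mask token
        elif r[i] < y + z:
            j = int(torch.randint(0, n, (1,)))
            corrupted[i] = token_vectors[j]               # replace with a random token
        # otherwise: keep the original token
    return corrupted, selected                            # selected marks positions to predict
```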
The feature fusion sub-model is obtained by pre-training on a sample video set. The sample videos in the sample video set are associated with at least two video domain types, and the at least two video domain types include the target domain type; the sample video set includes N sample videos, where N may be a positive integer. It should be appreciated that the specific process of pre-training the initial classification model may also be described as follows. An initial classification model is obtained; the initial classification model comprises the multi-modal feature sub-model and an initial feature fusion sub-model, and optionally also comprises the initial task sub-model. Further, S initial modal feature vectors corresponding to each sample video are obtained through the multi-modal feature sub-model, and the initial modal feature vectors belonging to the same modality among the N×S initial modal feature vectors are combined into the same initial modal feature vector sequence to obtain S initial modal feature vector sequences, where each initial modal feature vector sequence includes N initial modal feature vectors. Further, a candidate modal feature vector sequence is obtained from the S initial modal feature vector sequences, R candidate modal feature vectors are obtained from the candidate modal feature vector sequence, and the order of the R candidate modal feature vectors is adjusted to obtain an order-adjusted candidate modal feature vector sequence, where R may be a positive integer less than N. Further, fusion learning is carried out, through the initial feature fusion sub-model, on the candidate modal feature vectors in the order-adjusted candidate modal feature vector sequence and the initial modal feature vectors in the S-1 initial modal feature vector sequences, so as to obtain a second fusion sample vector; the S-1 initial modal feature vector sequences are the initial modal feature vector sequences other than the candidate modal feature vector sequence among the S initial modal feature vector sequences. Further, based on the S initial modal feature vector sequences and the second fusion sample vector, the initial feature fusion sub-model is pre-trained to obtain the feature fusion sub-model.
For easy understanding, the embodiment of the present application is illustrated by taking the sequence of S initial modality feature vectors including the sequence of text modalities, the sequence of image modalities, the sequence of video modalities and the sequence of audio modalities as an example, and the S initial modality feature vector sequences may specifically include the sequence of initial text modality features, the sequence of initial image modality features, the sequence of initial video modality features and the sequence of initial audio modality features. At this time, the candidate modality feature vector sequence may be any one or more of an initial text modality feature sequence, an initial image modality feature sequence, an initial video modality feature sequence, and an initial audio modality feature sequence, and the S-1 initial modality feature vector sequences may be sequences other than the candidate modality feature vector sequence among the initial text modality feature sequence, the initial image modality feature sequence, the initial video modality feature sequence, and the initial audio modality feature sequence. For example, the candidate modal feature vector sequence may be an initial text modal feature sequence, and the initial image modal feature sequence, the initial video modal feature sequence and the initial audio modal feature sequence may be S-1 initial modal feature vector sequences; for another example, the candidate modality feature vector sequence may be an initial image modality feature sequence, and the initial text modality feature sequence, the initial video modality feature sequence, and the initial audio modality feature sequence may be S-1 initial modality feature vector sequences.
The initial feature fusion sub-model can perform fusion learning on the unmatched initial modal feature vectors in the candidate modal feature vector sequence and the S-1 initial modal feature vector sequences after the unmatched adjustment sequence, so as to obtain N second fusion sample vectors. Similarly, the initial feature fusion sub-model can perform fusion learning on the matched initial modal feature vectors in the matched S initial modal feature vector sequences to obtain N third fusion sample vectors. Therefore, the second fused sample vector may be an output obtained by the unmatched initial modal feature vector, and the third fused sample vector may be an output obtained by the matched initial modal feature vector, and based on the second fused sample vector and the third fused sample vector, the initial feature fusion sub-model may be pre-trained to obtain the feature fusion sub-model.
It can be understood that the embodiment of the application can pretrain the initial feature fusion sub-model based on an Image-Text Matching (ITM) task, and the Image Text Matching task can be used for judging whether vectors of different modes input by the initial feature fusion sub-model are matched. At this time, the loss value used for pre-training the initial feature fusion submodel may be a third pre-training loss value.
For example, considering that there are N sample videos in the sample video set, the comparison efficiency can be improved by matching the sample videos in pairs; that is, the number of sample videos can be two, and the two sample videos can be a sample video S_1 and a sample video S_2. The S initial modal feature vectors corresponding to the sample video S_1 may be a first sample text vector, a first sample image vector, a first sample video vector and a first sample audio vector, and the S initial modal feature vectors corresponding to the sample video S_2 may be a second sample text vector, a second sample image vector, a second sample video vector and a second sample audio vector. Further, vector combination is performed on the first sample text vector, the first sample image vector, the first sample video vector, the first sample audio vector, the second sample text vector, the second sample image vector, the second sample video vector and the second sample audio vector to obtain a combined text vector, a combined image vector, a combined video vector and a combined audio vector, where the combined text vector, the combined image vector, the combined video vector and the combined audio vector do not all correspond to the same sample video. For example, the first sample text vector, the first sample image vector, the first sample video vector and the second sample audio vector are determined as the combined text vector, the combined image vector, the combined video vector and the combined audio vector, respectively. Further, fusion learning is performed on the first sample text vector, the first sample image vector, the first sample video vector and the first sample audio vector through the initial feature fusion sub-model to obtain a third fusion sample vector, and fusion learning is performed on the combined text vector, the combined image vector, the combined video vector and the combined audio vector through the initial feature fusion sub-model to obtain a second fusion sample vector. Further, based on the second fusion sample vector and the third fusion sample vector, parameter adjustment is carried out on the initial feature fusion sub-model to obtain the feature fusion sub-model.
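As a sketch of the matching-style pre-training in this example, the code below fuses one matched combination (all modalities from video A) and one mismatched combination (audio swapped in from video B) and supervises a binary match head; the fusion-model interface, the match head (e.g., a single linear layer producing one logit) and the dictionary layout are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def itm_style_loss(fusion_model, match_head, mods_a, mods_b):
    """mods_a / mods_b: dicts with 'text', 'image', 'video', 'audio' vectors of two sample videos.
    The matched combination gets label 1; the combination with swapped audio gets label 0."""
    matched = fusion_model(mods_a['text'], mods_a['image'], mods_a['video'], mods_a['audio'])
    mismatched = fusion_model(mods_a['text'], mods_a['image'], mods_a['video'], mods_b['audio'])
    logits = torch.cat([match_head(matched), match_head(mismatched)], dim=0).squeeze(-1)
    labels = torch.tensor([1.0, 0.0], device=logits.device)
    return F.binary_cross_entropy_with_logits(logits, labels)
```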
It should be understood that, in the embodiment of the present application, the initial feature fusion sub-model may be pre-trained based on the first pre-training loss value, the initial feature fusion sub-model may be pre-trained based on the second pre-training loss value, and the initial feature fusion sub-model may be pre-trained based on the third pre-training loss value; optionally, the embodiment of the present application may further perform pre-training on the initial feature fusion sub-model based on two or three of the first pre-training loss value, the second pre-training loss value, and the third pre-training loss value.
For ease of understanding, please refer to fig. 9, fig. 9 is a schematic diagram of a scenario for performing contrast learning according to an embodiment of the present application. Fig. 9 illustrates an example in which the S initial modality feature vector sequences include a sequence of two modalities (for example, a sequence of text modalities and a sequence of video modalities), and fig. 9 illustrates an example in which the number of sample videos is 4. Wherein the S initial modal feature vector sequences include an initial text modal feature sequence 90c and an initial video modal feature sequence 90b, the initial video modal feature sequence 90b may be used as a candidate modal feature vector sequence.
As shown in fig. 9, the initial text modality feature sequence 90c may include an initial modality feature vector 95a, an initial modality feature vector 95b, an initial modality feature vector 95c, and an initial modality feature vector 95d, and the initial video modality feature sequence 90b may include an initial modality feature vector 96a, an initial modality feature vector 96b, an initial modality feature vector 96c, and an initial modality feature vector 96d. The initial modal feature vector 95a and the initial modal feature vector 96a may be initial modal feature vectors corresponding to the same sample video, the initial modal feature vector 95b and the initial modal feature vector 96b may be initial modal feature vectors corresponding to the same sample video, the initial modal feature vector 95c and the initial modal feature vector 96c may be initial modal feature vectors corresponding to the same sample video, and the initial modal feature vector 95d and the initial modal feature vector 96d may be initial modal feature vectors corresponding to the same sample video.
Wherein the initial modal feature vector 96a is generated based on the target video data 94a, the initial modal feature vector 96b is generated based on the target video data 94b, the initial modal feature vector 96c is generated based on the target video data 94c, the initial modal feature vector 96d is generated based on the target video data 94d, the target video data 94a, the target video data 94b, the target video data 94c and the target video data 94d may be used to construct the target video data sequence 90a, and the target video data sequence 90a may represent target video data corresponding to the 4 sample videos, respectively. For example, the target video data 94d may include a plurality of video frames, which may include, in particular, a video frame 91a, a video frame 91b, a video frame 91c, and a video frame 91d.
Wherein initial modal feature vector 96d is determined by vector 93a, vector 93b, vector 93c, and vector 93d, and initial modal feature vector 95d is determined by vector 92a, vector 92b, vector 92c, vector 92e, and vector 92 f. For example, vector 92a may be a text feature vector, vector 92b may be a text position vector, and vector 92c may be a text modality feature; vector 93a may be a video feature vector, vector 93b may be a video frame feature vector, and vector 93c may be a video modality feature.
As shown in fig. 9, the embodiment of the present application may perform fusion learning on the initial modal feature vector 95a and the initial modal feature vector 96a to obtain an initial fusion classification feature C_1, where the initial fusion classification feature C_1 may be any one of the M initial fusion classification features. An initial fusion classification feature C_2, ..., and an initial fusion classification feature C_V can be obtained in the same way.
Further, as shown in fig. 9, in the embodiment of the present application, R initial video modality features may be obtained from the initial video modality feature sequence 90b. For example, R is equal to 2, and the 2 initial video modality features may be the initial video modality feature 96a and the initial video modality feature 96d; the order of the initial video modality feature 96a and the initial video modality feature 96d is adjusted to obtain an order-adjusted initial video modality feature sequence 90b. Further, fusion learning is performed on the initial modal feature vector 95a and the initial modal feature vector 96d to obtain an initial fusion classification feature Z_1; fusion learning is performed on the initial modal feature vector 95d and the initial modal feature vector 96a to obtain an initial fusion classification feature Z_V. An initial fusion classification feature Z_2 and so on can be obtained in the same way.
It can be understood that the order of the R candidate modality feature vectors is adjusted, so that a comparison learning mode is introduced, and the comparison learning can force the model to learn the association relationship between the S modalities, and align the different modalities, so as to correctly fuse the S features (i.e., the candidate modality feature vector in the candidate modality feature vector sequence after the order adjustment and the initial modality feature vector in the S-1 initial modality feature vector sequence) corresponding to the S modalities respectively.
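For illustration, the sketch below shows the order adjustment itself, assuming the candidate modal feature vector sequence is stored as an [N, dim] tensor; choosing R positions and permuting only those positions yields the mismatched inputs used for the contrastive learning described above.

```python
import torch

def shuffle_candidate_modality(candidate_seq, r):
    """candidate_seq: [N, dim] candidate modality sequence; r: number of vectors (r < N)
    whose order is adjusted so they no longer line up with the other modalities of the
    same sample video."""
    n = candidate_seq.size(0)
    idx = torch.randperm(n)[:r]                              # choose r positions
    shuffled = candidate_seq.clone()
    shuffled[idx] = candidate_seq[idx[torch.randperm(r)]]    # permute just those positions
    return shuffled, idx
```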
Therefore, the embodiment of the application can pretrain the initial classification model by adopting the pretraining technology, and the pretraining technology can effectively improve the research and development efficiency while saving training samples, wherein the multimodal pretraining technology used by the application can comprise, but is not limited to, an MLM task, an MFM task and an ITM task. The pre-training task can construct tasks such as text interior, image interior, visual (video) interior, audio interior, text-visual-image-audio alignment and the like, so that besides learning the characteristics of each mode, the tasks are directly aligned with each other, and finally, the characterization capability and the subsequent fusion capability of the characteristics are improved, and further, the sample requirement of a fine adjustment stage in the step S202 is reduced. In addition, the multi-mode pre-training mode is utilized to fully utilize the pre-training model, so that a large amount of sample labeling cost is saved, the training cost and period of the model are reduced, and each mode can be better aligned by improving and optimizing contrast learning, so that the effect of classifying the model is effectively improved.
Step S202, carrying out parameter adjustment on a pre-training classification model to obtain a target classification model;
for the specific process of performing parameter adjustment on the pre-training classification model, refer to the description of step S101 to step S104 in the embodiment corresponding to fig. 3, the description of step S1021 to step S1026 in the embodiment corresponding to fig. 6, and the description of step S1041 to step S1044 in the embodiment corresponding to fig. 7, which will not be described herein.
Step S203, S target modal features corresponding to the target video are obtained through the multi-modal feature sub-model;
it will be appreciated that the target video may be a video uploaded by a target object through the target terminal device; the target object may be a self-media account or a video production institution (for example, a Multi-Channel Network, MCN), and the target video uploaded by the target object may be UGC (User Generated Content) from self-media or PGC (Professional Generated Content) from a video production institution. The target video may cover topics such as skill sharing, humor, fashion trends, social hot spots, street interviews, public education, advertising creativity and business customization; in addition, the target video can be a short video or a long video, and the duration of the target video is not limited.
MCN is a multi-channel network product form that aggregates PGC content and, with strong capital support, guarantees continuous content output so that stable commercial monetization can finally be achieved. PGC refers to an institution or organization that produces content professionally. UGC emerged along with the Web 2.0 concept, which advocates personalization as its main feature; it is not a specific service but a new way for users to use the Internet, shifting from download-only to both downloading and uploading.
The specific process of acquiring S target modal features corresponding to the target video through the multi-modal feature sub-model may refer to the description of the S modal feature vectors corresponding to the target sample video acquired through the multi-modal feature sub-model, which will not be described herein.
Step S204, fusion learning is carried out on S target modal features through a feature fusion sub-model, and M initial classification features corresponding to the target video are obtained;
the specific process of performing fusion learning on the S target modal features through the feature fusion sub-model may refer to the description of performing fusion learning on the S modal feature vectors through the feature fusion sub-model, which will not be described herein.
Step S205, extracting target classification features corresponding to the M initial classification features respectively through a task sub-model;
the specific process of extracting the target classification features corresponding to the M initial classification features by the task sub-model may refer to the description of extracting the target sample classification features corresponding to the M initial sample classification features by the initial task sub-model, which will not be described herein.
And S206, respectively carrying out normalization processing on the M target classification features to obtain normalized classification features respectively corresponding to the M target classification features, and determining M levels of target classification information corresponding to the target video according to the M normalized classification features.
The embodiment of the application can determine the classification information corresponding to the maximum probability in each normalized classification feature as the target classification information of the level corresponding to the target video, thereby determining the classification information corresponding to the maximum probability in each normalized classification feature in M normalized classification features as the target classification information of M levels corresponding to the target video.
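A minimal sketch of step S206 follows, assuming the normalization is a softmax over each level's target classification feature and the class with the maximum probability is taken as that level's target classification information; the list layout and the absence of a batch dimension are simplifications for the example.

```python
import torch
import torch.nn.functional as F

def predict_levels(target_classification_features):
    """target_classification_features: list of M tensors, one per level, each of shape
    [num_classes_at_level] for a single target video."""
    results = []
    for feat in target_classification_features:
        probs = F.softmax(feat, dim=-1)           # normalized classification feature
        results.append(int(torch.argmax(probs)))  # classification info with maximum probability
    return results                                # M levels of target classification information
```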
Therefore, the embodiment of the application can pretrain the initial classification model to obtain a pretrained classification model, and further perform parameter adjustment (i.e. fine adjustment) on the pretrained classification model to obtain the target classification model. It can be understood that the target classification model can be used for acquiring target modal characteristics corresponding to S modalities of the target video respectively, utilizing the target video in a multi-dimension manner, fully utilizing information implied by different modalities of the target video, and improving the accuracy and comprehensiveness of characteristic characterization of the target video through content understanding of the multi-modalities. In addition, the embodiment of the application can predict the M levels of target classification information corresponding to the target video based on the M target classification features, and accurately classify the target video in multiple levels by describing the classification information of the target video on different granularity levels through the M levels of target classification information, thereby improving the accuracy of the target classification information corresponding to the predicted target video.
Specifically, referring to fig. 10, fig. 10 is a flowchart of a method and a system for classifying information flow video content based on a multi-mode pre-training network according to an embodiment of the present application. As shown in fig. 10, step S11 to step S21 may be one execution path, step S31 to step S32 may be one execution path, step S41 to step S42 may be one execution path, step S51 to step S54 may be one execution path, and the 4 execution paths may be synchronously cross-executed.
As shown in fig. 10, in the execution path of step S11 to step S21, the content production end (for example, the application client in the target terminal device) may perform step S11 to upload the target video to the uplink and downlink content interface service through the content upload interface. When the uplink and downlink content interface service acquires the target video uploaded by the target terminal device, step S12 may be performed to transmit the target video (i.e., the source file) to the content storage service, so that the content storage service stores the target video in the content database. The content storage service can be deployed on a group of widely distributed storage servers close to users, with CDN acceleration servers acting as a distributed cache at the edge. Further, the uplink and downlink content interface service may execute step S13 to transcode (i.e., re-transcode) the target video, obtain meta-information of the target video (e.g., file size, cover image link, code rate, file format, title, release time, author, etc.), write the meta-information into the content database, and improve the playback compatibility of the target video on each platform.
Further, the uplink and downlink content interface service may execute step S14 to directly submit the target video to the scheduling center service for subsequent content processing and circulation; at this point the scheduling process is started for the target video and the target video enters the scheduling flow. The dispatch center service may then perform step S15 and call the deduplication service to perform deduplication processing on the target video (the deduplication processing may vectorize the target video and determine the similarity between videos by comparing the distances between the vectors), and write the result of the deduplication processing into the content database. When the deduplication processing is passed, the dispatch center service can schedule, through step S17, the manual auditing system to manually audit the target video, and update the meta-information in the content database based on the deduplication result through step S16; when the deduplication processing is not passed, the dispatch center service can delete the target video, so that a target video that does not pass deduplication will not be checked by the manual auditing system. In addition, for videos that are similar but are not filtered out as duplicates, the content similarity and similarity relation chains can be output for the recommendation system to use when scattering content.
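As a sketch of the deduplication idea mentioned in step S15 (vectorize videos and compare vector distances), the following assumes cosine similarity and an illustrative threshold; the real service may use a different embedding, metric or threshold.

```python
import numpy as np

def is_duplicate(candidate_vec, stored_vecs, threshold=0.95):
    """candidate_vec: [dim] embedding of the incoming video; stored_vecs: [K, dim] embeddings
    of already-ingested videos. Returns whether any stored video exceeds the similarity
    threshold, plus the per-video similarities."""
    candidate = candidate_vec / (np.linalg.norm(candidate_vec) + 1e-12)
    stored = stored_vecs / (np.linalg.norm(stored_vecs, axis=1, keepdims=True) + 1e-12)
    sims = stored @ candidate
    return bool((sims >= threshold).any()), sims
```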
Further, when the result of the deduplication processing indicates that the deduplication processing is passed, the manual auditing system may perform step S18 of reading the target video and its meta-information from the content database, and performing a first content audit (i.e., a primary audit) and a second content audit (i.e., a re-audit) on the read information. It can be appreciated that when the first content audit of the target video passes, the second content audit can be performed on the target video, and when the second content audit of the target video passes, the target video can be used as distributable content; when the first content audit and the second content audit of the target video are not passed, step S18 may be executed to update the meta-information of the target video, and the target video satisfying the integrity requirement is taken as distributable content. In addition, while the manual auditing takes place, the machine performs multi-level classification processing on the target video through an algorithm (such as the multi-modal content classification model). The manual auditing system is a business-complex system developed on the basis of a web (webpage) database; the first content audit can audit the sensitivity of the target video, and the second content audit can confirm the labels and quality problems of the target video, so that the accuracy and efficiency of high-level label labeling of videos are improved through man-machine cooperation.
Further, the dispatch center service may execute step S19 and, based on the meta-information obtained from the content database, flexibly set different scheduling policies according to the category of the content. Further, the dispatch center service may perform step S20 to start content distribution via the content distribution outlet service, obtain at least one piece of distribution content (e.g., the target video) from the content database, and distribute the at least one piece of distribution content to the content consumption end via step S21. At this time, the content consumption end can consume the target video, and the consumption behavior is accumulated on the target classification information corresponding to the target video.
It may be appreciated that the video recommendation may be performed by a video recommendation algorithm; for example, the video recommendation algorithm may be a collaborative recommendation algorithm, a matrix factorization algorithm, a supervised learning algorithm (for example, a logistic regression model), a deep learning model (for example, a factorization machine and a gradient boosting decision tree (Gradient Boosting Decision Tree, abbreviated as GBDT)), and the like.
As shown in fig. 10, in the execution path from step S31 to step S32, the content consumption end may execute step S31: when the content consumption end chooses to watch a certain video (for example, the target video), index information of the target video may be obtained from the uplink and downlink content interface service, where the index information may be a URL (Uniform Resource Locator) address corresponding to the target video. Further, the content consumption end may perform step S32 to directly download the source file (i.e., the streaming media file) of the target video from the content storage service based on the URL address and play the obtained source file through a local player; the content consumption end may display content by category and channel according to the server-side multi-level classification of the content (i.e., the target video). Meanwhile, the content consumption end can report playback stutter, loading time, consumption behavior and the like during the downloading process to the content storage service.
It can be appreciated that the present application involves related data such as consumption behavior, loading time, and playback stutter; when the above embodiments of the present application are applied to specific products or technologies, user permission or consent needs to be obtained, and the collection, use, and processing of the related data need to comply with the relevant laws, regulations, and national standards of the relevant country. For example, the content consumption end may display a prompt message "whether to record the current consumption behavior and send the recorded information to the server", and the consumption behavior may be uploaded to the server only after the user corresponding to the content consumption end grants authorization.
As shown in fig. 10, in the execution path from step S41 to step S42, the download file system may execute step S41 to download the source file of the target video from the content storage service; the download file system can control the speed and progress of downloading and is typically a group of parallel servers composed of related task-scheduling and distribution clusters. Further, the download file system may execute step S42 to call the video frame extraction and audio separation services to process the source file and obtain the video content feature information of the target video; for example, the video content feature information may be key video frames obtained from the target video data, audio frames obtained by framing the target audio data, and subtitle text information obtained by OCR text recognition on the extracted video frames, which can serve as additional input dimensions for the multi-level classification of the video content. For the download file system, the content storage service serves as an internal data source (in addition to serving as a data source for external services) from which the original video data is obtained for related processing; the paths for internal and external data sources are usually arranged separately to avoid mutual influence.
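A sketch of the frame-extraction step is given below, using uniform sampling with OpenCV as a stand-in for whatever key-frame selection the service actually applies; the sampling interval is an assumption, and OCR and audio separation are handled by other services.

```python
import cv2

def extract_key_frames(video_path, every_n=30):
    """Return roughly one frame out of every `every_n` frames of the video file."""
    frames = []
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % every_n == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```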
As shown in fig. 10, in the execution path from step S51 to step S54, the dispatch center service may execute step S51 to call the multi-modal content classification service, and the multi-modal content classification service executes step S52 to serve the multi-modal multi-level content classification model, thereby completing the identification and tagging of the multi-modal content classification of the main-link content in the content flow. Further, the multi-modal content classification model may obtain meta-information from the content database through step S53 and video content feature information through step S54. The multi-modal content classification model can utilize the modal information of the video content, such as the visual modality, the text modality, the video modality and the audio modality, and then perform task-level modeling with a small amount of supervised data, so as to construct a multi-modal content classification model (i.e., the target classification model) for the video content classification system and better understand the video content. This plays a great role in content operation, content recommendation and content clustering/scattering, and helps the information flow improve its distribution effect. Meanwhile, the multi-modal content classification service may store the obtained multi-level classification result of the content (i.e., the M levels of target classification information) in the content database through the dispatch center service.
It should be understood that after the content production end uploads the target video to the uplink and downlink content interface service, the target video may enter the server through the uplink and downlink content interface service. The content distribution outlet service, the uplink and downlink content interface service, the content storage service, the dispatch center service, the content duplication elimination service, the multi-modal content classification service, the video content extraction and audio separation service shown in fig. 10 may be server programs deployed on servers and providing remote network services specifically for application clients.
Further, referring to fig. 11, fig. 11 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application, where the data processing apparatus 1 may include: the model acquisition module 11, the fusion learning module 12, the feature extraction module 13 and the parameter adjustment module 14; further, the data processing apparatus 1 may further include: a first pre-training module 15, a second pre-training module 16, a video classification module 17;
a model acquisition module 11 for acquiring a pre-trained classification model; the pre-training classification model comprises a multi-mode feature sub-model, a feature fusion sub-model and an initial task sub-model;
the fusion learning module 12 is configured to obtain S modal feature vectors corresponding to a target sample video belonging to a target field type through a multi-modal feature sub-model, and perform fusion learning on the S modal feature vectors through a feature fusion sub-model to obtain M initial sample classification features corresponding to the target sample video; m and S are positive integers greater than 1;
The multi-modal feature sub-model comprises a text network layer, an image network layer, a video network layer and an audio network layer; the S modal feature vectors comprise sample sub-text vectors, sample sub-image vectors, sample sub-video vectors and sample sub-audio vectors;
the fusion learning module 12 includes: a video acquisition unit 121, a feature extraction unit 122, and a fusion learning unit 123;
a video acquisition unit 121, configured to acquire a target sample video belonging to a target domain type;
the feature extraction unit 122 is configured to perform text feature extraction on target text data of a target sample video through a text network layer, so as to obtain a sample sub-text vector corresponding to the target sample video;
the feature extraction unit 122 is specifically configured to perform feature extraction on target text data of the target sample video through the text network layer, so as to obtain a text feature vector corresponding to the target text data;
the feature extraction unit 122 is specifically configured to perform word segmentation on the target text data to obtain text words of the target text data, and perform text position coding on text positions of the text words in the target text data to obtain text position vectors corresponding to the text words;
The feature extraction unit 122 is specifically configured to obtain a text mode feature corresponding to the target text data, and fuse the text feature vector, the text position vector and the text mode feature to obtain a sample sub-text vector corresponding to the target sample video.
The feature extraction unit 122 is configured to perform image feature extraction on target image data of a target sample video through an image network layer, so as to obtain a sample sub-image vector corresponding to the target sample video;
the feature extraction unit 122 is specifically configured to perform feature extraction on target image data of the target sample video through the image network layer, so as to obtain an image feature vector corresponding to the target image data;
the feature extraction unit 122 is specifically configured to obtain an image mode feature corresponding to the target image data, and fuse the image feature vector and the image mode feature to obtain a sample sub-image vector corresponding to the target sample video.
The feature extraction unit 122 is configured to perform video feature extraction on target video data of a target sample video through a video network layer, so as to obtain a sample sub-video vector corresponding to the target sample video;
the feature extraction unit 122 is specifically configured to perform feature extraction on target video data of the target sample video through the video network layer, so as to obtain a video feature vector corresponding to the target video data;
The feature extraction unit 122 is specifically configured to perform frame extraction processing on the target video data to obtain a key video frame in the target video data, and perform feature extraction on the key video frame through the video network layer to obtain a video frame feature vector corresponding to the key video frame;
the feature extraction unit 122 is specifically configured to obtain a video mode feature corresponding to the target video data, and fuse the video feature vector, the video frame feature vector and the video mode feature to obtain a sample sub-video vector corresponding to the target sample video.
The feature extraction unit 122 is configured to perform audio feature extraction on the target audio data of the target sample video through the audio network layer, so as to obtain a sample sub-audio vector corresponding to the target sample video;
the feature extraction unit 122 is specifically configured to perform frame segmentation on target audio data of a target sample video to obtain at least two audio frames in the target audio data, and perform feature extraction on the at least two audio frames through an audio network layer to obtain audio frame feature vectors corresponding to the at least two audio frames respectively;
the feature extraction unit 122 is specifically configured to perform audio position encoding on an audio position of each audio frame in the target audio data, so as to obtain an audio frame position vector corresponding to each audio frame respectively;
The feature extraction unit 122 is specifically configured to obtain an audio mode feature corresponding to the target audio data, and fuse at least two audio frame feature vectors, at least two audio frame position vectors, and the audio mode feature to obtain a sample sub-audio vector corresponding to the target sample video.
And the fusion learning unit 123 is configured to perform fusion learning on the sample sub-text vector, the sample sub-audio vector, the sample sub-image vector and the sample sub-video vector through the feature fusion sub-model, so as to obtain M initial sample classification features corresponding to the target sample video.
The specific implementation manners of the video acquisition unit 121, the feature extraction unit 122 and the fusion learning unit 123 may be referred to the description of step S102 in the embodiment corresponding to fig. 3 and the descriptions of step S1021 to step S1026 in the embodiment corresponding to fig. 6, which will not be repeated here.
The feature extraction module 13 is used for extracting M-1 classification matching features and target sample classification features respectively corresponding to the M initial sample classification features through the initial task sub-model; the M target sample classification features are used for determining M levels of sample classification information, and the M-1 classification matching features are used for determining the matching degree between every two adjacent levels of sample classification information; the categories respectively indicated by every two adjacent levels have category level nesting relations;
The initial task sub-model comprises M classification network layers respectively corresponding to the M initial sample classification features; the M classification network layers include a classification network layer H_k, where k is a positive integer less than or equal to M;
the feature extraction module 13 includes: a feature acquisition unit 131, a first processing unit 132, a second processing unit 133;
a feature obtaining unit 131, configured to obtain, from the M initial sample classification features, the initial sample classification feature J_k corresponding to the classification network layer H_k;
a first processing unit 132, configured to, if the classification network layer H_k is the first classification network layer of the M classification network layers, perform full-connection processing on the initial sample classification feature J_k through the classification network layer H_k to obtain an auxiliary sample classification feature corresponding to the classification network layer H_k, and perform full-connection processing on the auxiliary sample classification feature corresponding to the classification network layer H_k through the classification network layer H_k to obtain a target sample classification feature corresponding to the initial sample classification feature J_k;
a second processing unit 133, configured to, if the classification network layer H_k is not the first classification network layer of the M classification network layers, perform full-connection processing on the initial sample classification feature J_k through the classification network layer H_k to obtain an auxiliary sample classification feature corresponding to the classification network layer H_k, and determine, based on the M initial sample classification features and the auxiliary sample classification feature corresponding to the classification network layer H_k, the target sample classification feature corresponding to the initial sample classification feature J_k and the classification matching feature corresponding to the classification network layer H_k.
Wherein the second processing unit 133 is specifically configured to, if the classification network layer H_k is the second classification network layer of the M classification network layers, perform feature stitching on the auxiliary sample classification feature corresponding to the classification network layer H_{k-1} and the auxiliary sample classification feature corresponding to the classification network layer H_k to obtain a spliced sample classification feature corresponding to the classification network layer H_k, perform feature stitching on the auxiliary sample classification feature corresponding to the classification network layer H_{k-1} and the spliced sample classification feature corresponding to the classification network layer H_k to obtain the classification matching feature corresponding to the classification network layer H_k, and perform full-connection processing on the spliced sample classification feature corresponding to the classification network layer H_k to obtain the target sample classification feature corresponding to the initial sample classification feature J_k; the classification network layer H_{k-1} is the previous classification network layer of the classification network layer H_k, and the auxiliary sample classification feature corresponding to the classification network layer H_{k-1} is obtained based on the initial sample classification feature corresponding to the classification network layer H_{k-1};
the second processing unit 133 is specifically configured to, if the classification network layer H_k is not the second classification network layer of the M classification network layers, perform feature stitching on the spliced sample classification feature corresponding to the classification network layer H_{k-1} and the auxiliary sample classification feature corresponding to the classification network layer H_k to obtain a spliced sample classification feature corresponding to the classification network layer H_k, perform feature stitching on the spliced sample classification feature corresponding to the classification network layer H_{k-1} and the spliced sample classification feature corresponding to the classification network layer H_k to obtain the classification matching feature corresponding to the classification network layer H_k, and perform full-connection processing on the spliced sample classification feature corresponding to the classification network layer H_k to obtain the target sample classification feature corresponding to the initial sample classification feature J_k.
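To make the cascaded structure described above more concrete, the following PyTorch-style sketch wires up M classification network layers in the way just described; the use of single Linear layers for the full-connection steps, the hidden size, and the mapping of the classification matching feature to a single match logit are assumptions rather than the claimed implementation. For instance, it could be instantiated as HierarchicalTaskHead(in_dim=768, hidden=256, level_classes=[10, 50, 200]) for a three-level taxonomy, where all of these numbers are arbitrary.

```python
import torch
import torch.nn as nn

class HierarchicalTaskHead(nn.Module):
    """Sketch of the cascaded classification network layers H_1..H_M described above."""
    def __init__(self, in_dim, hidden, level_classes):
        super().__init__()
        m = len(level_classes)
        # One full-connection per layer turning the initial feature J_k into its auxiliary feature.
        self.aux_fc = nn.ModuleList([nn.Linear(in_dim, hidden) for _ in range(m)])
        # Level 1 classifies its auxiliary feature; level k>1 classifies a spliced feature of size k*hidden.
        self.cls_fc = nn.ModuleList(
            [nn.Linear(hidden, level_classes[0])] +
            [nn.Linear((k + 1) * hidden, c) for k, c in enumerate(level_classes[1:], start=1)])
        # One assumed MLP per adjacent level pair, mapping the classification matching feature to a logit.
        self.match_fc = nn.ModuleList(
            [nn.Linear((2 * k + 1) * hidden, 1) for k in range(1, m)])

    def forward(self, initial_features):
        # initial_features: list of M tensors of shape [batch, in_dim], one per level.
        aux = [fc(x) for fc, x in zip(self.aux_fc, initial_features)]
        targets = [self.cls_fc[0](aux[0])]          # target sample classification feature, level 1
        matches = []                                # classification matching features (as logits)
        prev = aux[0]                               # for level 2 the previous feature is the auxiliary one
        for k in range(1, len(aux)):                # levels 2..M
            spliced = torch.cat([prev, aux[k]], dim=-1)
            matches.append(self.match_fc[k - 1](torch.cat([prev, spliced], dim=-1)))
            targets.append(self.cls_fc[k](spliced))
            prev = spliced                          # deeper levels splice onto the previous spliced feature
        return targets, matches                     # M target features and M-1 matching logits
```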
The specific implementation manner of the feature obtaining unit 131, the first processing unit 132, and the second processing unit 133 may be referred to the description of step S103 in the embodiment corresponding to fig. 3, which will not be described herein.
The parameter adjustment module 14 is configured to perform parameter adjustment on the initial task sub-model based on M sample tag information, S modal feature vectors, M target sample classification features, and M-1 classification matching features of the target sample video, to obtain a task sub-model; the multi-mode feature sub-model, the feature fusion sub-model and the task sub-model are used for forming a target classification model; the target classification model is used for predicting M levels of target classification information corresponding to target videos belonging to the target field type.
The S modal feature vectors comprise sample sub-text vectors, sample sub-image vectors, sample sub-video vectors and sample sub-audio vectors;
the parameter adjustment module 14 includes: a loss determination unit 141, a parameter adjustment unit 142;
a loss determining unit 141, configured to determine a first model loss value of the pre-training classification model based on the associated sample tag information and target sample classification features among the M sample tag information of the target sample video and the M target sample classification features;
a loss determination unit 141 for determining a second model loss value of the pre-trained classification model based on the M sample label information, the sample sub-text vector, the sample sub-image vector, the sample sub-video vector, and the sample sub-audio vector;
the pre-training classification model further comprises a text classifier, an image classifier, a video classifier and an audio classifier;
the loss determining unit 141 is specifically configured to input the sample sub-text vector to a text classifier, determine M predicted text classification vectors by the text classifier, and determine a text loss value of the pre-training classification model according to the predicted text classification vector and sample label information associated with the M predicted text classification vectors and the M sample label information;
The loss determining unit 141 is specifically configured to input the sample sub-image vector to an image classifier, determine M prediction image classification vectors by the image classifier, and determine an image loss value of the pre-training classification model according to the prediction image classification vector and the sample label information associated with the M prediction image classification vectors and the M sample label information;
the loss determining unit 141 is specifically configured to input the sample sub-video vector to a video classifier, determine M prediction video classification vectors by the video classifier, and determine a video loss value of the pre-training classification model according to the prediction video classification vector and the sample label information associated with the M prediction video classification vectors and the M sample label information;
the loss determining unit 141 is specifically configured to input the sample sub-audio vectors to an audio classifier, determine M predicted audio classification vectors by the audio classifier, and determine an audio loss value of the pre-training classification model according to the predicted audio classification vectors and sample tag information associated with the M predicted audio classification vectors and the M sample tag information;
the loss determination unit 141 is specifically configured to determine a second model loss value of the pre-trained classification model according to the text loss value, the image loss value, the video loss value and the audio loss value.
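As an illustration of how the four single-modality losses above could be computed and combined into the second model loss value, the hedged sketch below gives each modality its own classifier with one linear head per level and sums the four losses; the pooled input vectors, the cross-entropy objective, and the plain sum are assumptions, not details fixed by the application.

```python
import torch.nn as nn

class ModalityClassifier(nn.Module):
    """One classifier per modality, predicting all M hierarchy levels (illustrative)."""
    def __init__(self, in_dim, level_class_counts):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(in_dim, n) for n in level_class_counts])

    def forward(self, modal_vec):
        # modal_vec: pooled modality vector of shape [batch, in_dim] (pooling assumed)
        return [head(modal_vec) for head in self.heads]   # M predicted classification vectors

def modality_loss(classifier, modal_vec, level_labels, ce=nn.CrossEntropyLoss()):
    preds = classifier(modal_vec)
    return sum(ce(p, y) for p, y in zip(preds, level_labels))

def second_model_loss(text_vec, image_vec, video_vec, audio_vec,
                      text_cls, image_cls, video_cls, audio_cls, level_labels):
    text_loss = modality_loss(text_cls, text_vec, level_labels)
    image_loss = modality_loss(image_cls, image_vec, level_labels)
    video_loss = modality_loss(video_cls, video_vec, level_labels)
    audio_loss = modality_loss(audio_cls, audio_vec, level_labels)
    return text_loss + image_loss + video_loss + audio_loss
```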
A loss determination unit 141, configured to determine a third model loss value of the pre-training classification model based on the M target sample classification features and the M-1 classification matching features;
the loss determining unit 141 is specifically configured to respectively normalize the M target sample classification features to obtain normalized sample classification features corresponding to the M target sample classification features, and determine M levels of sample classification information corresponding to the target sample video according to the M normalized sample classification features;
the loss determination unit 141 is specifically configured to respectively normalize the M-1 classification matching features to obtain normalized matching features corresponding to the M-1 classification matching features; the M-1 normalized matching features comprise a normalized matching feature P_b; b is a positive integer less than or equal to M-1;
the loss determination unit 141 is specifically configured to acquire first sample classification information and second sample classification information associated with the normalized matching feature P_b, and determine the category matching label corresponding to the normalized matching feature P_b according to the first sample classification information and the second sample classification information; the first sample classification information is the sample classification information corresponding to the normalized sample classification feature at the same level as the normalized matching feature P_b, and the second sample classification information is the sample classification information corresponding to the normalized sample classification feature at the previous level of the normalized matching feature P_b;
the loss determination unit 141 is specifically configured to determine, according to the normalized matching feature P_b and the category matching label corresponding to the normalized matching feature P_b, the category matching loss corresponding to the normalized matching feature P_b;
the loss determining unit 141 is specifically configured to determine a third model loss value of the pre-trained classification model according to the category matching losses respectively corresponding to the M-1 normalized matching features.
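A hedged sketch of the third model loss value follows. It assumes the category matching label is 1 when the class predicted at a level nests under the class predicted at the previous level and 0 otherwise, and that each classification matching feature is scored by a small binary head; the application only states that the label is derived from the first and second sample classification information, so both choices are illustrative.

```python
import torch
import torch.nn.functional as F

def third_model_loss(target_feats, matching_feats, match_heads, parent_of):
    """target_feats: M tensors [batch, n_cls_k]; matching_feats: M-1 tensors;
    match_heads: M-1 binary scoring heads (e.g. nn.Linear(dim, 1));
    parent_of[b]: dict mapping a class index at level b+2 to its parent index at level b+1
    (an assumed taxonomy table encoding the category level nesting relation)."""
    # Normalized sample classification features -> per-level sample classification information.
    level_preds = [F.softmax(f, dim=-1).argmax(dim=-1) for f in target_feats]
    loss = 0.0
    for b, (feat, head) in enumerate(zip(matching_feats, match_heads)):
        child, parent = level_preds[b + 1], level_preds[b]
        # Category matching label: 1 if the predicted child class nests under the
        # predicted parent class, 0 otherwise (illustrative labeling rule).
        label = torch.tensor(
            [1.0 if parent_of[b][int(c)] == int(p) else 0.0 for c, p in zip(child, parent)],
            device=feat.device,
        )
        logit = head(feat).squeeze(-1)        # score of the normalized matching feature P_b
        loss = loss + F.binary_cross_entropy_with_logits(logit, label)
    return loss
```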
The parameter adjustment unit 142 is configured to determine a total model loss value of the pre-trained classification model according to the first model loss value, the second model loss value, and the third model loss value, and perform parameter adjustment on the initial task sub-model based on the total model loss value, so as to obtain the task sub-model.
For specific implementation manners of the loss determining unit 141 and the parameter adjusting unit 142, refer to the description of step S104 in the embodiment corresponding to fig. 3 and the descriptions of step S1041 to step S1044 in the embodiment corresponding to fig. 7, and will not be repeated here.
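Combining the three losses, the sketch below illustrates one possible parameter adjustment step, reusing the second_model_loss and third_model_loss helpers sketched above; treating the first model loss value as a per-level cross-entropy, summing the three losses with equal weight, and building the optimizer over the task sub-model's parameters only are all assumptions.

```python
import torch.nn as nn

def first_model_loss(target_feats, level_labels, ce=nn.CrossEntropyLoss()):
    # Per-level cross-entropy between the M target sample classification features
    # and the M sample labels (one label per hierarchy level).
    return sum(ce(feat, y) for feat, y in zip(target_feats, level_labels))

def parameter_adjustment_step(optimizer, target_feats, matching_feats, modal_vecs,
                              modal_classifiers, level_labels, match_heads, parent_of):
    # optimizer is assumed to be constructed over the initial task sub-model's parameters.
    loss_1 = first_model_loss(target_feats, level_labels)
    loss_2 = second_model_loss(*modal_vecs, *modal_classifiers, level_labels)
    loss_3 = third_model_loss(target_feats, matching_feats, match_heads, parent_of)
    total_loss = loss_1 + loss_2 + loss_3      # equal weighting is an assumption
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return total_loss.item()
```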
Optionally, the feature fusion sub-model is obtained by pre-training with a sample video set; sample videos in the sample video set are associated with at least two video domain types, wherein the at least two video domain types comprise the target domain type; the sample video set comprises N sample videos, where N is a positive integer; the N sample videos comprise a sample video S_i; i is a positive integer less than or equal to N;
a first pre-training module 15, configured to obtain an initial classification model; the initial classification model comprises a multi-mode feature sub-model and an initial feature fusion sub-model;
a first pre-training module 15, configured to acquire, through the multi-modal feature sub-model, S sample modal vectors corresponding to the sample video S_i, acquire a target sample modal vector from the S sample modal vectors, and determine the sample modal vectors other than the target sample modal vector in the S sample modal vectors as candidate sample modal vectors;
the first pre-training module 15 is configured to perform vector change on the target sample mode vector to obtain an auxiliary sample mode vector corresponding to the target sample mode vector, and perform fusion learning on the auxiliary sample mode vector and the candidate sample mode vector through the initial feature fusion sub-model to obtain a first fusion sample vector corresponding to the auxiliary sample mode vector;
the first pre-training module 15 is configured to pre-train the initial feature fusion sub-model based on the first fusion sample vector and the target sample modal vector, to obtain a feature fusion sub-model.
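The first pre-training routine above can be pictured with the following hedged sketch: one modality vector of sample video S_i is chosen as the target, perturbed (here zero-masked, which is one possible "vector change"), fused with the remaining modality vectors, and the fusion output at that position is trained to reconstruct the original vector. The masking choice, the L2 reconstruction loss, and the assumption that the fusion sub-model returns one fused vector per modality position are all illustrative.

```python
import random
import torch
import torch.nn.functional as F

def first_pretraining_step(fusion_submodel, sample_modal_vectors, optimizer):
    # sample_modal_vectors: list of S tensors [batch, dim], one per modality of sample video S_i.
    S = len(sample_modal_vectors)
    target_idx = random.randrange(S)                    # pick the target sample modal vector
    target_vec = sample_modal_vectors[target_idx]
    aux_vec = torch.zeros_like(target_vec)              # vector change: zero-mask (assumed)
    fused_inputs = [aux_vec if s == target_idx else v
                    for s, v in enumerate(sample_modal_vectors)]
    fused = fusion_submodel(fused_inputs)               # assumed to return one fused vector per modality
    loss = F.mse_loss(fused[target_idx], target_vec)    # first fusion sample vector vs. original
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```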
Optionally, the feature fusion sub-model is obtained by pre-training with a sample video set; sample videos in the sample video set are associated with at least two video domain types, wherein the at least two video domain types comprise the target domain type; the sample video set comprises N sample videos, where N is a positive integer;
A second pre-training module 16 for obtaining an initial classification model; the initial classification model comprises a multi-mode feature sub-model and an initial feature fusion sub-model;
the second pre-training module 16 is configured to obtain S initial modal feature vectors corresponding to each sample video through the multi-modal feature sub-model, and combine initial modal feature vectors belonging to the same modality in the N×S initial modal feature vectors into the same initial modal feature vector sequence to obtain S initial modal feature vector sequences; each initial modal feature vector sequence comprises N initial modal feature vectors;
the second pre-training module 16 is configured to obtain a candidate modal feature vector sequence from the S initial modal feature vector sequences, obtain R candidate modal feature vectors from the candidate modal feature vector sequence, and adjust the order of the R candidate modal feature vectors to obtain a candidate modal feature vector sequence after the order is adjusted; r is a positive integer less than N;
the second pre-training module 16 is configured to perform fusion learning on the candidate modal feature vector in the candidate modal feature vector sequence after the sequence adjustment and the initial modal feature vector in the S-1 initial modal feature vector sequence through the initial feature fusion sub-model, so as to obtain a second fusion sample vector; the S-1 initial modal feature vector sequences are initial modal feature vector sequences except candidate modal feature vector sequences in the S initial modal feature vector sequences;
The second pre-training module 16 is configured to pre-train the initial feature fusion sub-model based on the S initial modal feature vector sequences and the second fusion sample vector, to obtain a feature fusion sub-model.
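Likewise, the second pre-training routine can be sketched as follows, with the perturbed modality sequence trained to be restored through the fusion output; choosing the candidate sequence at random and using an L2 objective against the unshuffled sequence are assumptions, since the application only ties the pre-training to the S sequences and the second fusion sample vector.

```python
import random
import torch
import torch.nn.functional as F

def second_pretraining_step(fusion_submodel, modal_sequences, optimizer, R):
    # modal_sequences: list of S tensors [N, dim]; row n holds the vector of sample video n
    # for that modality. R < N vectors of one chosen modality are shuffled out of order.
    S = len(modal_sequences)
    cand_idx = random.randrange(S)                 # candidate modal feature vector sequence
    candidate = modal_sequences[cand_idx]
    N = candidate.shape[0]
    rows = random.sample(range(N), R)              # R candidate modal feature vectors
    permuted_rows = rows[:]
    random.shuffle(permuted_rows)
    shuffled = candidate.clone()
    shuffled[rows] = candidate[permuted_rows]      # order-adjusted candidate sequence
    fused_inputs = [shuffled if s == cand_idx else seq
                    for s, seq in enumerate(modal_sequences)]
    fused = fusion_submodel(fused_inputs)          # second fusion sample vectors (assumed per-modality)
    loss = F.mse_loss(fused[cand_idx], candidate)  # recover the original ordering
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```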
Optionally, the video classification module 17 is configured to obtain S target modality features corresponding to the target video through the multi-modality feature sub-model;
the video classification module 17 is configured to perform fusion learning on the S target modal features through a feature fusion sub-model, so as to obtain M initial classification features corresponding to the target video;
the video classification module 17 is used for extracting target classification features corresponding to the M initial classification features respectively through the task sub-model;
the video classification module 17 is configured to perform normalization processing on the M target classification features, obtain normalized classification features corresponding to the M target classification features, and determine M levels of target classification information corresponding to the target video according to the M normalized classification features.
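End to end, inference with the assembled target classification model could look like the hedged sketch below: extract the S target modality features, fuse them into M initial classification features, run the task sub-model, normalize each target classification feature with softmax, and take the arg-max per level as the M levels of target classification information; the function names and the softmax/arg-max normalization are placeholders consistent with, but not quoted from, the application.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def classify_video(multimodal_submodel, fusion_submodel, task_submodel, target_video):
    modal_feats = multimodal_submodel(target_video)     # S target modality features
    initial_feats = fusion_submodel(modal_feats)        # M initial classification features
    target_feats, _ = task_submodel(initial_feats)      # M target classification features
    normalized = [F.softmax(f, dim=-1) for f in target_feats]
    # One class index per hierarchy level: the M levels of target classification information.
    return [probs.argmax(dim=-1) for probs in normalized]
```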
The specific implementation manners of the model obtaining module 11, the fusion learning module 12, the feature extraction module 13, and the parameter adjustment module 14 may refer to the descriptions of step S101 to step S104, step S1021 to step S1026, and step S1041 to step S1044 in the embodiments corresponding to fig. 3 and fig. 6, and will not be repeated here. The specific implementation manners of the first pre-training module 15, the second pre-training module 16, and the video classification module 17 may refer to the description of step S201 to step S206 in the embodiment corresponding to fig. 8, and will not be repeated here. In addition, the description of the beneficial effects of the same method is not repeated.
Further, referring to fig. 12, fig. 12 is a schematic structural diagram of a computer device according to an embodiment of the present application, where the computer device may be a terminal device or a server. As shown in fig. 12, the computer device 1000 may include: a processor 1001, a network interface 1004, and a memory 1005; in addition, the computer device 1000 may further include: a user interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to implement connection and communication between these components. In some embodiments, the user interface 1003 may include a display (Display) and a keyboard (Keyboard), and optionally, the user interface 1003 may further include a standard wired interface and a wireless interface. Optionally, the network interface 1004 may include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM or a non-volatile memory, such as at least one disk memory. Optionally, the memory 1005 may also be at least one storage device located away from the aforementioned processor 1001. As shown in fig. 12, the memory 1005, which is one type of computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
In the computer device 1000 shown in fig. 12, the network interface 1004 may provide network communication functions, the user interface 1003 is primarily used to provide an input interface for the user, and the processor 1001 may be used to invoke the device control application program stored in the memory 1005 to implement:
obtaining a pre-training classification model; the pre-training classification model comprises a multi-mode feature sub-model, a feature fusion sub-model and an initial task sub-model;
S modal feature vectors corresponding to the target sample video belonging to the target field type are obtained through the multi-modal feature sub-model, and fusion learning is carried out on the S modal feature vectors through the feature fusion sub-model, so that M initial sample classification features corresponding to the target sample video are obtained; M and S are positive integers greater than 1;
extracting M-1 classification matching features and target sample classification features respectively corresponding to M initial sample classification features through an initial task sub-model; the M target sample classification features are used for determining M levels of sample classification information, and the M-1 classification matching features are used for determining the matching degree between every two adjacent levels of sample classification information; the categories respectively indicated by every two adjacent levels have category level nesting relations;
Based on M sample tag information, S modal feature vectors, M target sample classification features and M-1 classification matching features of the target sample video, carrying out parameter adjustment on the initial task sub-model to obtain a task sub-model; the multi-mode feature sub-model, the feature fusion sub-model and the task sub-model are used for forming a target classification model; the target classification model is used for predicting M levels of target classification information corresponding to target videos belonging to the target field type.
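For the first of these steps, the text, image, video, and audio network layers of the multi-modal feature sub-model could be organized as in the hedged sketch below, where each modality's content features, optional position codes, and a learned modality feature are combined by addition; the specific encoders, dimensions, and the use of addition rather than concatenation are assumptions, not details fixed by the application.

```python
import torch
import torch.nn as nn

class MultiModalFeatureSubModel(nn.Module):
    """Illustrative text / image / video / audio network layers producing sample sub-vectors."""
    def __init__(self, text_encoder, image_encoder, video_encoder, audio_encoder,
                 dim, max_positions=512):
        super().__init__()
        self.text_encoder, self.image_encoder = text_encoder, image_encoder
        self.video_encoder, self.audio_encoder = video_encoder, audio_encoder
        self.position_embed = nn.Embedding(max_positions, dim)   # text / audio position coding
        self.modality_embed = nn.Embedding(4, dim)               # one modality feature per modality

    def forward(self, text_tokens, image, video_frames, audio_frames):
        device = image.device
        pos = lambda n: self.position_embed(torch.arange(n, device=device))
        mod = lambda m: self.modality_embed(torch.tensor(m, device=device))

        # Each sub-vector fuses content features, position codes (where applicable)
        # and the modality feature by addition (an assumed fusion rule).
        text_vec = self.text_encoder(text_tokens) + pos(text_tokens.shape[1]) + mod(0)
        image_vec = self.image_encoder(image) + mod(1)
        video_vec = self.video_encoder(video_frames) + mod(2)
        audio_vec = self.audio_encoder(audio_frames) + pos(audio_frames.shape[1]) + mod(3)
        return text_vec, image_vec, video_vec, audio_vec         # the S = 4 sample sub-vectors
```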
It should be understood that the computer device 1000 described in the embodiments of the present application may perform the description of the data processing method in the embodiments corresponding to fig. 3, 6, 7 or 8, and may also perform the description of the data processing apparatus 1 in the embodiments corresponding to fig. 11, which are not described herein. In addition, the description of the beneficial effects of the same method is omitted.
Furthermore, it should be noted here that: the embodiment of the present application further provides a computer readable storage medium, in which the computer program executed by the aforementioned data processing apparatus 1 is stored, and when the processor executes the computer program, the description of the data processing method in the embodiment corresponding to fig. 3, 6, 7 or 8 can be executed, and therefore, will not be repeated herein. In addition, the description of the beneficial effects of the same method is omitted. For technical details not disclosed in the embodiments of the computer-readable storage medium according to the present application, please refer to the description of the method embodiments of the present application.
In addition, it should be noted that: embodiments of the present application also provide a computer program product, which may include a computer program, which may be stored in a computer readable storage medium. The processor of the computer device reads the computer program from the computer readable storage medium, and the processor may execute the computer program, so that the computer device performs the description of the data processing method in the embodiment corresponding to fig. 3, 6, 7 or 8, and thus, a detailed description thereof will not be provided herein. In addition, the description of the beneficial effects of the same method is omitted. For technical details not disclosed in the embodiments of the computer program product according to the present application, reference is made to the description of the method embodiments of the present application.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by way of a computer program stored in a computer-readable storage medium, which when executed may comprise the steps of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), or the like.
The foregoing disclosure is illustrative of the present application and is not to be construed as limiting the scope of the application, which is defined by the appended claims.

Claims (18)

1. A method of data processing, comprising:
obtaining a pre-training classification model; the pre-training classification model comprises a multi-mode feature sub-model, a feature fusion sub-model and an initial task sub-model;
S modal feature vectors corresponding to a target sample video belonging to a target field type are obtained through the multi-modal feature sub-model, and fusion learning is carried out on the S modal feature vectors through the feature fusion sub-model to obtain M initial sample classification features corresponding to the target sample video; the M and the S are positive integers greater than 1;
extracting M-1 classification matching features and M target sample classification features respectively corresponding to the initial sample classification features through the initial task sub-model; m target sample classification features are used for determining M levels of sample classification information, and M-1 classification matching features are used for determining the matching degree between every two adjacent levels of sample classification information; the categories respectively indicated by every two adjacent levels have category level nesting relations;
Based on M sample label information, S modal feature vectors, M target sample classification features and M-1 classification matching features of the target sample video, carrying out parameter adjustment on the initial task sub-model to obtain a task sub-model; the multi-modal feature sub-model, the feature fusion sub-model and the task sub-model are used for forming a target classification model; the target classification model is used for predicting M levels of target classification information corresponding to the target video belonging to the target field type.
2. The method of claim 1, wherein the feature fusion sub-model is pre-trained from a sample video set; sample videos in the sample video set are associated with at least two video domain types, wherein the at least two video domain types comprise the target domain type; the sample video set comprises N sample videos, wherein N is a positive integer; the N sample videos comprise a sample video S_i; the i is a positive integer less than or equal to the N;
the method further comprises the steps of:
acquiring an initial classification model; the initial classification model comprises the multi-mode feature sub-model and an initial feature fusion sub-model;
acquiring, through the multi-mode feature sub-model, S sample modal vectors corresponding to the sample video S_i, obtaining a target sample modal vector from the S sample modal vectors, and determining sample modal vectors except the target sample modal vector in the S sample modal vectors as candidate sample modal vectors;
performing vector change on the target sample modal vector to obtain an auxiliary sample modal vector corresponding to the target sample modal vector, and performing fusion learning on the auxiliary sample modal vector and the candidate sample modal vector through the initial feature fusion sub-model to obtain a first fusion sample vector corresponding to the auxiliary sample modal vector;
and pre-training the initial feature fusion sub-model based on the first fusion sample vector and the target sample modal vector to obtain the feature fusion sub-model.
3. The method of claim 1, wherein the feature fusion sub-model is pre-trained from a sample video set; sample videos in the sample video set are associated with at least two video domain types, wherein the at least two video domain types comprise the target domain type; the sample video set comprises N sample videos, wherein N is a positive integer;
The method further comprises the steps of:
acquiring an initial classification model; the initial classification model comprises the multi-mode feature sub-model and an initial feature fusion sub-model;
S initial modal feature vectors corresponding to each sample video are obtained through the multi-modal feature sub-model, and initial modal feature vectors belonging to the same mode in the N×S initial modal feature vectors are combined into the same initial modal feature vector sequence to obtain S initial modal feature vector sequences; each initial modal feature vector sequence comprises N initial modal feature vectors;
acquiring candidate modal feature vector sequences from the S initial modal feature vector sequences, acquiring R candidate modal feature vectors from the candidate modal feature vector sequences, and adjusting the sequence of the R candidate modal feature vectors to obtain candidate modal feature vector sequences after the sequence is adjusted; r is a positive integer less than N;
performing fusion learning on the candidate modal feature vectors in the candidate modal feature vector sequence after the sequence adjustment and the initial modal feature vectors in the S-1 initial modal feature vector sequences through the initial feature fusion sub-model to obtain a second fusion sample vector; s-1 initial modal feature vector sequences are initial modal feature vector sequences except the candidate modal feature vector sequences in the S initial modal feature vector sequences;
And pre-training the initial feature fusion sub-model based on the S initial modal feature vector sequences and the second fusion sample vector to obtain the feature fusion sub-model.
4. The method of claim 1, wherein the multimodal feature sub-model includes a text network layer, an image network layer, a video network layer, and an audio network layer; the S modal feature vectors comprise sample sub-text vectors, sample sub-image vectors, sample sub-video vectors and sample sub-audio vectors;
the obtaining, through the multi-modal feature sub-model, S modal feature vectors corresponding to a target sample video belonging to a target field type, and performing fusion learning on the S modal feature vectors through the feature fusion sub-model to obtain M initial sample classification features corresponding to the target sample video, comprises:
acquiring a target sample video belonging to a target field type;
extracting text features of target text data of the target sample video through the text network layer to obtain the sample sub-text vector corresponding to the target sample video;
extracting image features of target image data of the target sample video through the image network layer to obtain the sample sub-image vector corresponding to the target sample video;
Extracting video features of target video data of the target sample video through the video network layer to obtain the sample sub-video vector corresponding to the target sample video;
extracting audio characteristics of target audio data of the target sample video through the audio network layer to obtain the sample sub-audio vector corresponding to the target sample video;
and carrying out fusion learning on the sample sub-text vector, the sample sub-audio vector, the sample sub-image vector and the sample sub-video vector through the feature fusion sub-model to obtain M initial sample classification features corresponding to the target sample video.
5. The method according to claim 4, wherein the extracting text features of the target text data of the target sample video through the text network layer to obtain the sample sub-text vector corresponding to the target sample video includes:
extracting characteristics of target text data of the target sample video through the text network layer to obtain text characteristic vectors corresponding to the target text data;
word segmentation processing is carried out on the target text data to obtain text word segmentation of the target text data, and text position coding is carried out on the text position of the text word segmentation in the target text data to obtain a text position vector corresponding to the text word segmentation;
And acquiring text modal characteristics corresponding to the target text data, and fusing the text characteristic vector, the text position vector and the text modal characteristics to obtain the sample sub-text vector corresponding to the target sample video.
6. The method according to claim 4, wherein the extracting, by the image network layer, image features of the target image data of the target sample video to obtain the sample sub-image vector corresponding to the target sample video includes:
extracting features of target image data of the target sample video through the image network layer to obtain image feature vectors corresponding to the target image data;
and acquiring image mode characteristics corresponding to the target image data, and fusing the image characteristic vectors and the image mode characteristics to obtain the sample sub-image vectors corresponding to the target sample video.
7. The method according to claim 4, wherein the performing, by the video network layer, video feature extraction on the target video data of the target sample video to obtain the sample sub-video vector corresponding to the target sample video includes:
Extracting characteristics of target video data of the target sample video through the video network layer to obtain video characteristic vectors corresponding to the target video data;
performing frame extraction processing on the target video data to obtain a key video frame in the target video data, and performing feature extraction on the key video frame through the video network layer to obtain a video frame feature vector corresponding to the key video frame;
and acquiring video mode characteristics corresponding to the target video data, and fusing the video characteristic vectors, the video frame characteristic vectors and the video mode characteristics to obtain the sample sub-video vectors corresponding to the target sample video.
8. The method according to claim 4, wherein the extracting, by the audio network layer, the audio feature of the target audio data of the target sample video to obtain the sample sub-audio vector corresponding to the target sample video includes:
carrying out frame division processing on target audio data of the target sample video to obtain at least two audio frames in the target audio data, and respectively carrying out feature extraction on at least two audio frames through the audio network layer to obtain at least two audio frame feature vectors respectively corresponding to the audio frames;
Performing audio position coding on the audio position of each audio frame in the target audio data to obtain an audio frame position vector corresponding to each audio frame;
and acquiring audio mode characteristics corresponding to the target audio data, and fusing at least two audio frame characteristic vectors, at least two audio frame position vectors and the audio mode characteristics to obtain the sample sub-audio vectors corresponding to the target sample video.
9. The method of claim 1, wherein the initial task sub-model comprises M classification network layers to which the initial sample classification features respectively correspond; the M classification network layers comprise a classification network layer H_k; k is a positive integer less than or equal to M;
the extracting, by the initial task sub-model, M-1 classification matching features and M target sample classification features respectively corresponding to the initial sample classification features includes:
acquiring, from the M initial sample classification features, the initial sample classification feature J_k corresponding to the classification network layer H_k;
if the classification network layer H_k is the first of the M classification network layers, performing full-connection processing on the initial sample classification feature J_k through the classification network layer H_k to obtain an auxiliary sample classification feature corresponding to the classification network layer H_k, and performing full-connection processing on the auxiliary sample classification feature corresponding to the classification network layer H_k through the classification network layer H_k to obtain the target sample classification feature corresponding to the initial sample classification feature J_k;
if the classification network layer H_k is not the first of the M classification network layers, performing full-connection processing on the initial sample classification feature J_k through the classification network layer H_k to obtain an auxiliary sample classification feature corresponding to the classification network layer H_k, and determining, based on the M initial sample classification features and the auxiliary sample classification feature corresponding to the classification network layer H_k, the target sample classification feature corresponding to the initial sample classification feature J_k and the classification matching feature corresponding to the classification network layer H_k.
10. The method of claim 9, wherein the determining, based on the M initial sample classification features and the auxiliary sample classification feature corresponding to the classification network layer H_k, the target sample classification feature corresponding to the initial sample classification feature J_k and the classification matching feature corresponding to the classification network layer H_k comprises:
if the classification network layer H_k is the second classification network layer of the M classification network layers, performing feature stitching on the auxiliary sample classification feature corresponding to the classification network layer H_{k-1} and the auxiliary sample classification feature corresponding to the classification network layer H_k to obtain a spliced sample classification feature corresponding to the classification network layer H_k, performing feature stitching on the auxiliary sample classification feature corresponding to the classification network layer H_{k-1} and the spliced sample classification feature corresponding to the classification network layer H_k to obtain the classification matching feature corresponding to the classification network layer H_k, and performing full-connection processing on the spliced sample classification feature corresponding to the classification network layer H_k to obtain the target sample classification feature corresponding to the initial sample classification feature J_k; the classification network layer H_{k-1} is the previous classification network layer of the classification network layer H_k; the auxiliary sample classification feature corresponding to the classification network layer H_{k-1} is obtained based on the initial sample classification feature corresponding to the classification network layer H_{k-1};
if the classification network layer H_k is not the second classification network layer of the M classification network layers, performing feature stitching on the spliced sample classification feature corresponding to the classification network layer H_{k-1} and the auxiliary sample classification feature corresponding to the classification network layer H_k to obtain a spliced sample classification feature corresponding to the classification network layer H_k, performing feature stitching on the spliced sample classification feature corresponding to the classification network layer H_{k-1} and the spliced sample classification feature corresponding to the classification network layer H_k to obtain the classification matching feature corresponding to the classification network layer H_k, and performing full-connection processing on the spliced sample classification feature corresponding to the classification network layer H_k to obtain the target sample classification feature corresponding to the initial sample classification feature J_k.
11. The method of claim 1, wherein S of the modal feature vectors include a sample sub-text vector, a sample sub-image vector, a sample sub-video vector, and a sample sub-audio vector;
the parameter adjustment is performed on the initial task sub-model based on M sample label information, S modal feature vectors, M target sample classification features and M-1 classification matching features of the target sample video to obtain a task sub-model, and the method comprises the following steps:
determining a first model loss value of the pre-training classification model based on the M sample tag information of the target sample video and associated sample tag information and target sample classification features of the M target sample classification features;
Determining a second model loss value for the pre-trained classification model based on M sample label information, the sample sub-text vector, the sample sub-image vector, the sample sub-video vector, and the sample sub-audio vector;
determining a third model loss value of the pre-trained classification model based on M target sample classification features and M-1 classification matching features;
and determining a total model loss value of the pre-training classification model according to the first model loss value, the second model loss value and the third model loss value, and carrying out parameter adjustment on the initial task sub-model based on the total model loss value to obtain a task sub-model.
12. The method of claim 11, wherein the pre-trained classification model further comprises a text classifier, an image classifier, a video classifier, and an audio classifier;
the determining a second model loss value for the pre-trained classification model based on M of the sample label information, the sample sub-text vector, the sample sub-image vector, the sample sub-video vector, and the sample sub-audio vector, comprises:
inputting the sample sub-text vectors into the text classifier, determining M predicted text classification vectors by the text classifier, and determining text loss values of the pre-training classification model according to the associated predicted text classification vectors and sample label information in the M predicted text classification vectors and the M sample label information;
Inputting the sample sub-image vector to the image classifier, determining M predicted image classification vectors by the image classifier, and determining an image loss value of the pre-training classification model according to the associated predicted image classification vector and sample label information in the M predicted image classification vectors and the M sample label information;
inputting the sample sub-video vectors to the video classifier, determining M predicted video classification vectors by the video classifier, and determining video loss values of the pre-training classification model according to the associated predicted video classification vectors and sample label information in the M predicted video classification vectors and the M sample label information;
inputting the sample sub-audio vectors to the audio classifier, determining M predicted audio classification vectors by the audio classifier, and determining an audio loss value of the pre-training classification model according to the associated predicted audio classification vectors and sample label information in the M predicted audio classification vectors and the M sample label information;
and determining a second model loss value of the pre-trained classification model according to the text loss value, the image loss value, the video loss value and the audio loss value.
13. The method of claim 11, wherein the determining a third model loss value for the pre-trained classification model based on M of the target sample classification features and M-1 of the classification matching features comprises:
respectively carrying out normalization processing on M target sample classification features to obtain normalized sample classification features corresponding to the M target sample classification features, and determining M levels of sample classification information corresponding to the target sample video according to the M normalized sample classification features;
respectively carrying out normalization processing on the M-1 classification matching features to obtain normalized matching features corresponding to the M-1 classification matching features; the M-1 normalized matching features comprise a normalized matching feature P_b; b is a positive integer less than or equal to M-1;
acquiring first sample classification information and second sample classification information associated with the normalized matching feature P_b, and determining the category matching label corresponding to the normalized matching feature P_b based on the first sample classification information and the second sample classification information; the first sample classification information is the sample classification information corresponding to the normalized sample classification feature at the same level as the normalized matching feature P_b, and the second sample classification information is the sample classification information corresponding to the normalized sample classification feature at the previous level of the normalized matching feature P_b;
determining, according to the normalized matching feature P_b and the category matching label corresponding to the normalized matching feature P_b, the category matching loss corresponding to the normalized matching feature P_b;
and determining a third model loss value of the pre-training classification model according to category matching losses respectively corresponding to the M-1 normalized matching features.
14. The method according to claim 1, wherein the method further comprises:
S target modal features corresponding to the target video are obtained through the multi-modal feature sub-model;
performing fusion learning on S target modal features through the feature fusion sub-model to obtain M initial classification features corresponding to the target video;
extracting target classification features corresponding to the M initial classification features respectively through the task sub-model;
and respectively carrying out normalization processing on the M target classification features to obtain normalized classification features corresponding to the M target classification features, and determining M levels of target classification information corresponding to the target video according to the M normalized classification features.
15. A data processing apparatus, comprising:
the model acquisition module is used for acquiring a pre-training classification model; the pre-training classification model comprises a multi-mode feature sub-model, a feature fusion sub-model and an initial task sub-model;
the fusion learning module is used for acquiring S modal feature vectors corresponding to a target sample video belonging to a target field type through the multi-modal feature sub-model, and carrying out fusion learning on the S modal feature vectors through the feature fusion sub-model to acquire M initial sample classification features corresponding to the target sample video; the M and the S are positive integers greater than 1;
the feature extraction module is used for extracting M-1 classification matching features and M target sample classification features respectively corresponding to the initial sample classification features through the initial task sub-model; m target sample classification features are used for determining M levels of sample classification information, and M-1 classification matching features are used for determining the matching degree between every two adjacent levels of sample classification information; the categories respectively indicated by every two adjacent levels have category level nesting relations;
the parameter adjustment module is used for carrying out parameter adjustment on the initial task sub-model based on M sample label information, S modal feature vectors, M target sample classification features and M-1 classification matching features of the target sample video to obtain a task sub-model; the multi-modal feature sub-model, the feature fusion sub-model and the task sub-model are used for forming a target classification model; the target classification model is used for predicting M levels of target classification information corresponding to the target video belonging to the target field type.
16. A computer device, comprising: a processor and a memory;
the processor is connected to the memory, wherein the memory is configured to store a computer program, and the processor is configured to invoke the computer program to cause the computer device to perform the method of any of claims 1-14.
17. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program adapted to be loaded and executed by a processor to cause a computer device having the processor to perform the method of any of claims 1-14.
18. A computer program product, characterized in that the computer program product comprises a computer program stored in a computer readable storage medium and adapted to be read and executed by a processor to cause a computer device with the processor to perform the method of any of claims 1-14.
CN202211521178.1A 2022-11-30 2022-11-30 Data processing method, device, computer equipment and readable storage medium Pending CN116976327A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211521178.1A CN116976327A (en) 2022-11-30 2022-11-30 Data processing method, device, computer equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211521178.1A CN116976327A (en) 2022-11-30 2022-11-30 Data processing method, device, computer equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN116976327A true CN116976327A (en) 2023-10-31

Family

ID=88478414

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211521178.1A Pending CN116976327A (en) 2022-11-30 2022-11-30 Data processing method, device, computer equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN116976327A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117851640A (en) * 2024-03-04 2024-04-09 广东智媒云图科技股份有限公司 Video data processing method, device, equipment and medium based on composite characteristics
CN117851640B (en) * 2024-03-04 2024-05-31 广东智媒云图科技股份有限公司 Video data processing method, device, equipment and medium based on composite characteristics
CN118035491A (en) * 2024-04-11 2024-05-14 北京搜狐新媒体信息技术有限公司 Training method and using method of video label labeling model and related products

Similar Documents

Publication Publication Date Title
US12001474B2 (en) Information determining method and apparatus, computer device, and storage medium
CN113010703B (en) Information recommendation method and device, electronic equipment and storage medium
KR20180136265A (en) Apparatus, method and computer-readable medium for searching and providing sectional video
CN111258995B (en) Data processing method, device, storage medium and equipment
CN112163122A (en) Method and device for determining label of target video, computing equipment and storage medium
CN116702737B (en) Document generation method, device, equipment, storage medium and product
CN114969316B (en) Text data processing method, device, equipment and medium
CN117011745A (en) Data processing method, device, computer equipment and readable storage medium
CN113469152B (en) Similar video detection method and device
CN116935170B (en) Processing method and device of video processing model, computer equipment and storage medium
CN113766299A (en) Video data playing method, device, equipment and medium
CN112231563A (en) Content recommendation method and device and storage medium
CN113395594A (en) Video processing method, device, equipment and medium
CN114443899A (en) Video classification method, device, equipment and medium
CN113515669A (en) Data processing method based on artificial intelligence and related equipment
CN114282055A (en) Video feature extraction method, device and equipment and computer storage medium
CN115909374B (en) Information identification method, device, equipment, storage medium and program product
CN114661951A (en) Video processing method and device, computer equipment and storage medium
US20200410292A1 (en) Machine learned historically accurate temporal classification of objects
CN116976327A (en) Data processing method, device, computer equipment and readable storage medium
Liu et al. A multimodal approach for multiple-relation extraction in videos
CN113822127A (en) Video processing method, video processing device, video processing equipment and storage medium
CN117033626A (en) Text auditing method, device, equipment and storage medium
CN116977701A (en) Video classification model training method, video classification method and device
CN116980665A (en) Video processing method, device, computer equipment, medium and product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication