CN114462539A - Training method of content classification model, and content classification method and device - Google Patents

Training method of content classification model, and content classification method and device

Info

Publication number
CN114462539A
CN114462539A (Application No. CN202210126390.1A)
Authority
CN
China
Prior art keywords
content
text
modal
image
classification model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210126390.1A
Other languages
Chinese (zh)
Inventor
徐培
黄珊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202210126390.1A
Publication of CN114462539A
Legal status: Pending

Classifications

    • G06F 18/254: Fusion techniques of classification results, e.g. of results related to same input data
    • G06F 18/256: Fusion techniques of classification results relating to different input data, e.g. multimodal recognition
    • G06F 18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N 3/045: Combinations of networks
    • G06N 3/047: Probabilistic or stochastic networks
    • G06N 3/08: Learning methods
    • G06N 7/01: Probabilistic graphical models, e.g. probabilistic networks

Abstract

The application discloses a training method for a content classification model, and relates to artificial-intelligence-based computer vision and image semantic understanding. The method comprises: obtaining M modal feature vectors from a content training sample through a feature extraction network included in the content classification model; obtaining a content probability distribution from the M modal feature vectors through a multi-modal fusion network included in the content classification model; obtaining M modal probability distributions from the M modal feature vectors through M modal classification networks; and updating the model parameters of the content classification model according to the M modal probability distributions, the M single-modal category labels, the content probability distribution, and the content category label. The application also provides a content classification method and apparatus. The method and apparatus optimize the multi-modal fusion network and the feature extraction network simultaneously, balance the networks within the content classification model, and thereby help improve the precision and effect of multi-modal classification results.

Description

Training method of content classification model, and content classification method and device
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a training method of a content classification model, a content classification method and a content classification device.
Background
With the development of information technology, content in various fields is produced, accumulated, shared, and spread in increasingly diverse and flexible ways, and the distribution of content occupies an important position in modern life. To improve the experience of users who consume content, it is often necessary to identify and classify that content.
Currently, mainstream content classification is typically performed with artificial intelligence models. During model training, each single-modal feature extraction model is usually fine-tuned first, the single-modal feature vectors of the content are then extracted separately, and the modal feature vectors are finally used as the input of a multi-modal fusion network to train the classification task.
The inventors found that, in the existing scheme, the effects of the individual single-modal feature extraction models are not necessarily balanced: optimizing each single-modal feature extraction model is only a local optimization, and the multi-modal fusion network depends on each modality to a different degree, so the local optima may not reach a global optimum, and the precision and effect of the multi-modal classification result are therefore poor.
Disclosure of Invention
The embodiments of the application provide a training method for a content classification model, a content classification method, and a content classification apparatus. The method and apparatus optimize the multi-modal fusion network and the feature extraction network simultaneously and balance the networks within the content classification model, which helps improve the precision and effect of multi-modal classification results.
In view of the above, an aspect of the present application provides a method for training a content classification model, including:
acquiring a content training sample, wherein the content training sample corresponds to M labeled single-modal category labels and a content category label, and M is an integer greater than 1;
based on the content training sample, obtaining M modal feature vectors through a feature extraction network included in a content classification model;
acquiring content probability distribution through a multi-mode fusion network included in a content classification model based on the M modal feature vectors;
acquiring M modal probability distributions through M modal classification networks based on M modal feature vectors, wherein each modal feature vector input by each modal classification network has a corresponding relation with the output modal probability distribution;
and updating the model parameters of the content classification model according to the M modal probability distributions, the M single modal class labels, the content probability distribution and the content class labels.
Another aspect of the present application provides a method for content classification, including:
acquiring target content;
based on the target content, acquiring content probability distribution corresponding to the target content through a content classification model, wherein the content classification model is obtained by adopting the method of the aspects;
and determining a classification result of the target content according to the content probability distribution corresponding to the target content.
Another aspect of the present application provides a model training apparatus, including:
an acquisition module, used for acquiring a content training sample, wherein the content training sample corresponds to M labeled single-modal category labels and a content category label, and M is an integer greater than 1;
the acquisition module is also used for acquiring M modal characteristic vectors through a characteristic extraction network included in the content classification model based on the content training sample;
the acquisition module is also used for acquiring content probability distribution through a multi-mode fusion network included by the content classification model based on the M modal feature vectors;
the acquisition module is further used for acquiring M modal probability distributions through M modal classification networks based on the M modal feature vectors, wherein the modal feature vector input by each modal classification network has a corresponding relation with the output modal probability distribution;
and the training module is used for updating the model parameters of the content classification model according to the M modal probability distributions, the M single-modal class labels, the content probability distribution and the content class labels.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the acquisition module is further used for acquiring a text recognition result through a text recognition network included in the content classification model based on the content training sample after the content training sample is acquired;
the acquisition module is specifically used for acquiring text feature vectors through a text feature extraction network included by the content classification model based on the text recognition result, wherein the text feature vectors are contained in the M modal feature vectors;
based on the content training sample, obtaining an image feature vector through an image feature extraction network included in the content classification model, wherein the image feature vector is included in the M modal feature vectors.
In one possible design, in another implementation of another aspect of an embodiment of the present application, the M single-modality category tags include a text category tag and an image category tag;
the acquisition module is specifically used for acquiring text probability distribution through a text classification network based on the text feature vector;
based on the image feature vector, obtaining image probability distribution through an image classification network;
the acquiring module is specifically used for acquiring content probability distribution through a multi-mode fusion network included in the content classification model based on the text feature vector and the image feature vector;
and the training module is specifically used for updating the model parameters of the content classification model according to the text probability distribution, the text category label, the image probability distribution, the image category label, the content probability distribution and the content category label.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the acquisition module is specifically used for acquiring a text detection result through a text detection network included in a text recognition network based on the content training sample, wherein the text recognition network is included in the content classification model;
and acquiring a text recognition result through a character recognition network included in the text recognition network based on the text detection result.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the acquisition module is specifically used for acquiring the probability distribution of each character through a text recognition network included in the content classification model based on the content training sample;
and acquiring a text recognition result according to the probability distribution of each character.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the acquisition module is specifically used for acquiring S symbol coding sequences through a text feature extraction network included in the content classification model based on a text recognition result, wherein S is an integer greater than or equal to 1;
and adding the start character coding sequence and the stop character coding sequence to the S symbol coding sequences to obtain the text characteristic vector.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the acquisition module is specifically used for acquiring a target characteristic diagram through an image characteristic extraction network included in the content classification model based on the content training sample;
converting the target feature map into a first image feature vector and two-dimensional image features, wherein the first image feature vector is contained in the image feature vector;
generating a second image feature vector according to the two-dimensional image features, wherein the second image feature vector comprises N symbol encoding sequences, a start character encoding sequence and a stop character encoding sequence, the second image feature vector is contained in the image feature vector, and N is an integer greater than or equal to 1;
the acquisition module is specifically used for acquiring image probability distribution through an image classification network based on the first image feature vector;
and the obtaining module is specifically used for obtaining the content probability distribution through a multi-mode fusion network included in the content classification model based on the text feature vector and the second image feature vector.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the acquisition module is specifically used for splicing the text characteristic vector and the second image characteristic vector to obtain a target characteristic vector;
based on the target characteristic vector, the position coding vector and the segment coding vector, obtaining an image-text coding result through a coding network included in the multi-mode fusion network, wherein the multi-mode fusion network is contained in a content classification model, the position coding vector represents the position of each symbol in the characteristic vector, and the segment coding vector represents the category to which the characteristic vector belongs;
and acquiring content probability distribution through a classification network included in the multi-mode fusion network based on the image-text coding result.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the acquisition module is further used for acquiring an audio recognition result through an audio recognition network included in the content classification model based on the content training sample after the content training sample is acquired;
the acquisition module is specifically used for acquiring audio text feature vectors through a text feature extraction network included by the content classification model based on the audio recognition result, wherein the audio text feature vectors are contained in the M modal feature vectors;
based on the content training sample, obtaining an image feature vector through an image feature extraction network included in the content classification model, wherein the image feature vector is included in the M modal feature vectors.
In one possible design, in another implementation of another aspect of an embodiment of the present application, the M single-modality category tags include an audio category tag and an image category tag;
the acquisition module is specifically used for acquiring audio text probability distribution through a text classification network based on the audio text feature vector;
based on the image feature vector, obtaining image probability distribution through an image classification network;
the acquisition module is specifically used for acquiring content probability distribution through a multi-mode fusion network included in the content classification model based on the audio text characteristic vector and the image characteristic vector;
and the training module is specifically used for updating the model parameters of the content classification model according to the audio text probability distribution, the audio class label, the image probability distribution, the image class label, the content probability distribution and the content class label.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the acquisition module is further used for acquiring an audio recognition result through an audio recognition network included in the content classification model based on the content training sample after the content training sample is acquired;
the acquisition module is also used for acquiring a text recognition result through a text recognition network included in the content classification model based on the content training sample;
the acquisition module is specifically used for acquiring audio text feature vectors through a first text feature extraction network included in the content classification model based on the audio recognition result, wherein the audio text feature vectors are contained in the M modal feature vectors;
based on the text recognition result, obtaining a text feature vector through a second text feature extraction network included in the content classification model, wherein the text feature vector is contained in the M modal feature vectors;
based on the content training sample, obtaining an image feature vector through an image feature extraction network included in the content classification model, wherein the image feature vector is included in the M modal feature vectors.
In one possible design, in another implementation of another aspect of an embodiment of the present application, the M single-modality category tags include an audio category tag, a text category tag, and an image category tag;
the acquisition module is specifically used for acquiring audio text probability distribution through a first text classification network based on the audio text feature vector;
based on the text feature vector, obtaining text probability distribution through a second text classification network;
based on the image feature vector, obtaining image probability distribution through an image classification network;
the acquisition module is specifically used for acquiring content probability distribution through a multi-mode fusion network included in the content classification model based on the audio text feature vector, the text feature vector and the image feature vector;
and the training module is specifically used for updating the model parameters of the content classification model according to the audio text probability distribution, the audio class label, the text probability distribution, the text class label, the image probability distribution, the image class label, the content probability distribution and the content class label.
Another aspect of the present application provides a content classification apparatus, including:
the acquisition module is used for acquiring target content;
the acquisition module is further used for acquiring content probability distribution corresponding to the target content through a content classification model based on the target content, wherein the content classification model is obtained by adopting the method in each aspect;
and the classification module is used for determining the classification result of the target content according to the content probability distribution corresponding to the target content.
Another aspect of the present application provides a computer device, comprising: a memory, a processor, and a bus system;
wherein, the memory is used for storing programs;
a processor for executing the program in the memory, the processor for performing the above-described aspects of the method according to instructions in the program code;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
Another aspect of the present application provides a computer-readable storage medium having stored therein instructions, which when executed on a computer, cause the computer to perform the method of the above-described aspects.
In another aspect of the application, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided by the above aspects.
According to the technical scheme, the embodiment of the application has the following advantages:
In the embodiment of the application, a training method for a content classification model is provided, which includes obtaining a content training sample, where the content training sample corresponds to M labeled single-modal category labels and a content category label, and then obtaining M modal feature vectors through a feature extraction network included in the content classification model based on the content training sample. On one hand, a content probability distribution is obtained through a multi-modal fusion network included in the content classification model based on the M modal feature vectors. On the other hand, M modal probability distributions are obtained through M modal classification networks based on the M modal feature vectors, where the modal feature vector input to each modal classification network corresponds to the modal probability distribution it outputs. Finally, the model parameters of the content classification model are updated according to the M modal probability distributions, the M single-modal category labels, the content probability distribution, and the content category label. In this way, M single-modal classification branches are added on top of the multi-modal classification task to supervise the corresponding modalities, and multi-branch collaborative training optimizes the multi-modal fusion network and the feature extraction network simultaneously and balances each network in the content classification model, which helps improve the precision and effect of the multi-modal classification results.
Drawings
FIG. 1 is a schematic diagram of an environment of a content classification system in an embodiment of the present application;
FIG. 2 is a block diagram illustrating an architecture of multi-branch cooperative training in an embodiment of the present application;
FIG. 3 is a schematic flow chart of a content classification model training method according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a content classification model in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a text recognition network in an embodiment of the present application;
FIG. 6 is a schematic diagram of another structure of a text recognition network in an embodiment of the present application;
fig. 7 is a schematic structural diagram of a modality fusion network according to an embodiment of the present application;
FIG. 8 is a diagram illustrating another structure of a content classification model according to an embodiment of the present application;
FIG. 9 is a diagram illustrating another structure of a content classification model according to an embodiment of the present application;
FIG. 10 is a flow chart illustrating a content classification method according to an embodiment of the present application;
FIG. 11 is a diagram illustrating a content classification scenario in an embodiment of the present application;
FIG. 12 is another diagram illustrating a content classification scenario in an embodiment of the present application;
FIG. 13 is a schematic view of a model training apparatus according to an embodiment of the present application;
FIG. 14 is a schematic diagram of a content classification apparatus in an embodiment of the present application;
FIG. 15 is a schematic structural diagram of a server in an embodiment of the present application;
fig. 16 is a schematic structural diagram of a terminal device in the embodiment of the present application.
Detailed Description
The embodiments of the application provide a training method for a content classification model, a content classification method, and a content classification apparatus. The method and apparatus optimize the multi-modal fusion network and the feature extraction network simultaneously and balance the networks within the content classification model, which helps improve the precision and effect of multi-modal classification results.
The terms "first," "second," "third," "fourth," and the like in the description and claims of this application and in the above-described drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "corresponding" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Multi-modal classification and recognition technology has been widely applied to the image-text and video understanding capabilities of Artificial Intelligence (AI), for example on video websites and in e-commerce, logistics, social applications, and automatic driving. The content classification method can be used to identify image-text content and video content, to classify them, to label them, and so on. A "modality" refers to the source or form of information; for example, text, audio, and images are each a modality.
The AI is a theory, method, technique and application system that simulates, extends and expands human intelligence, senses the environment, acquires knowledge and uses the knowledge to obtain the best results using a digital computer or a machine controlled by a digital computer. In other words, AI is an integrated technique of computer science that attempts to understand the essence of intelligence and produces a new intelligent machine that can react in a manner similar to human intelligence. AI is to study the design principles and implementation methods of various intelligent machines, so that the machine has the functions of perception, reasoning and decision making. The AI technology is a comprehensive subject, and relates to the field of extensive technology, both hardware level technology and software level technology. The AI base technologies generally include technologies such as sensors, dedicated AI chips, cloud computing, distributed storage, big data processing technologies, operating/interactive systems, mechatronics, and the like. The AI software technology mainly includes Computer Vision (CV) technology, speech processing technology, Natural Language Processing (NLP) technology, Machine Learning (ML)/deep Learning, and so on.
In order to improve the precision and effect of the multi-modal classification result, the present application provides a training method of a content classification model and a content classification method, which are applied to the content classification system shown in fig. 1, as shown in the figure, the content classification system includes a server and a terminal device, and a client is deployed on the terminal device, wherein the client may run on the terminal device in the form of a browser, or run on the terminal device in the form of an independent Application (APP), and a specific presentation form of the client is not limited herein. The server related to the application can be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, and a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, Network service, cloud communication, middleware service, domain name service, security service, Content Delivery Network (CDN), big data and an AI platform. The terminal device may be a smart phone, a tablet computer, a notebook computer, a palm computer, a personal computer, a smart television, a smart watch, a vehicle-mounted device, a wearable device, and the like, but is not limited thereto. The terminal device and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein. The number of servers and terminal devices is not limited. The scheme provided by the application can be independently completed by the terminal device, can also be independently completed by the server, and can also be completed by the cooperation of the terminal device and the server, so that the application is not particularly limited.
Based on the content classification system shown in fig. 1, in the training stage of the content classification model, the server may call the content training samples in the database for model training, and after the training is completed, the content classification model is stored locally. In the inference stage of the content classification model, the server calls the locally trained content classification model, classifies the target content uploaded by the terminal equipment, and performs subsequent processing based on corresponding downstream tasks.
The application provides a multi-modal content classification method based on multi-branch collaborative training, which introduces single-modal category supervision for each modal feature vector. For ease of understanding, please refer to fig. 2, which is a schematic architecture diagram of multi-branch collaborative training in an embodiment of the present application. As shown in the figure, taking the training of a content classification model that supports bimodal classification as an example, the content training sample is used as the input of the content classification model, where the content classification model includes a feature extraction network A and a feature extraction network B. A modal feature vector A is extracted by feature extraction network A, and a modal feature vector B is extracted by feature extraction network B. The modal feature vector A is then input into classification network A, which outputs a modal probability distribution A, and the modal feature vector B is input into classification network B, which outputs a modal probability distribution B. Meanwhile, the modal feature vector A and the modal feature vector B are used as the input of the multi-modal fusion network, which outputs the content probability distribution.
The content classification model is then trained based on the single-modal category labels, the content category label, and the probability distributions obtained for the content training sample.
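For illustration only, the bimodal architecture above can be sketched as follows in PyTorch. The module names, feature dimensions, and the plain concatenation head standing in for the multi-modal fusion network are assumptions made for this sketch; they are not taken from the patent.

```python
import torch
import torch.nn as nn

class BimodalCoTrainModel(nn.Module):
    """Two feature extractors, two single-modal classification heads, and one
    fusion head, mirroring fig. 2. All dimensions are illustrative assumptions."""

    def __init__(self, extractor_a: nn.Module, extractor_b: nn.Module,
                 dim_a: int, dim_b: int, num_classes: int):
        super().__init__()
        self.extractor_a = extractor_a                      # feature extraction network A
        self.extractor_b = extractor_b                      # feature extraction network B
        self.cls_a = nn.Linear(dim_a, num_classes)          # classification network A (FC)
        self.cls_b = nn.Linear(dim_b, num_classes)          # classification network B (FC)
        # simple stand-in for the multi-modal fusion network
        self.fusion_cls = nn.Linear(dim_a + dim_b, num_classes)

    def forward(self, sample_a, sample_b):
        feat_a = self.extractor_a(sample_a)                 # modal feature vector A
        feat_b = self.extractor_b(sample_b)                 # modal feature vector B
        logits_a = self.cls_a(feat_a)                       # -> modal probability distribution A
        logits_b = self.cls_b(feat_b)                       # -> modal probability distribution B
        fused = torch.cat([feat_a, feat_b], dim=-1)
        logits_content = self.fusion_cls(fused)             # -> content probability distribution
        return logits_a, logits_b, logits_content
```

In this sketch the single-modal heads exist only to supervise their respective feature extractors during training; at inference time the content probability distribution is what is used for classification.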
Since this application uses a number of terms from the relevant technical fields, they are explained below to facilitate understanding.
(1) Bidirectional Encoder Representations from Transformers (BERT): the goal of the BERT model is to use large-scale unlabeled corpora to learn a representation of text that contains rich semantic information, i.e., a semantic representation of the text. This semantic representation is then fine-tuned on a particular NLP task and ultimately applied to that task.
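As an illustration of this pre-train-then-use pattern, the sketch below obtains a semantic text representation from a pre-trained BERT encoder. It assumes the Hugging Face transformers library and the bert-base-chinese checkpoint, neither of which is specified by the patent.

```python
# Minimal sketch, assuming the Hugging Face "transformers" package is installed.
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
encoder = BertModel.from_pretrained("bert-base-chinese")

inputs = tokenizer("example text", return_tensors="pt")
outputs = encoder(**inputs)
token_features = outputs.last_hidden_state   # per-token semantic representation
text_vector = token_features[:, 0]           # [CLS] vector, often used as the text feature
```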
(2) NLP: an important direction in the fields of computer science and AI. It studies theories and methods that enable effective communication between humans and computers using natural language. NLP is a science that integrates linguistics, computer science, and mathematics; research in this field involves natural language, i.e., the language people use every day, so it is closely related to linguistics. NLP techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
(3) CV: computer vision is a science that studies how to make machines "see"; more specifically, it replaces human eyes with cameras and computers to recognize and measure targets, and further processes the images so that they are better suited to human observation or to transmission to instruments for detection. CV research develops theories and techniques that attempt to build AI systems capable of obtaining information from images or multidimensional data. CV technologies generally include image processing, image recognition, image semantic understanding, image retrieval, Optical Character Recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, and simultaneous localization and mapping, and also include common biometric technologies such as face recognition and fingerprint recognition.
(4) Speech Technology: its key technologies include Automatic Speech Recognition (ASR), Text To Speech (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is a development direction of future human-computer interaction, and speech is expected to become one of the most promising modes of human-computer interaction.
(5) ML: a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and more. It studies how computers can simulate or implement human learning behavior to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve performance. Machine learning is the core of AI and the fundamental way to make computers intelligent, and it is applied throughout the various areas of AI. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
(6) Transformer: a deep learning model that uses a self-attention mechanism to weight the importance of each part of the input data differently. It is mainly used in natural language processing and computer vision; like recurrent neural networks, the Transformer is designed to process sequential input data for tasks such as translation and text summarization.
(7) Residual Network (ResNet): a deep Convolutional Neural Network (CNN) architecture built from residual (skip-connection) blocks.
(8) Multimodal Bitransformer (MMBT): applies a pre-trained ResNet and a pre-trained BERT to the image modality and the text modality, respectively, and feeds the resulting features into a bidirectional Transformer.
With reference to fig. 3, a method for training a content classification model in the present application will be described below, where an embodiment of the method for training a content classification model in the present application includes:
110. acquiring a content training sample, wherein the content training sample corresponds to M marked single-mode category labels and content category labels, and M is an integer greater than 1;
In one or more embodiments, the model training apparatus obtains a training sample set, which may be one batch of content training samples. For ease of description, an arbitrary content training sample is taken as an example in this application; the other content training samples are processed in a similar manner and are not described again here.
Specifically, the content training sample is a sample with a plurality of modalities, and the content training sample has at least two single-modality category labels and one content category label which are labeled in advance.
It should be noted that the model training apparatus may be deployed in a server, or may be deployed in a terminal device, or may be deployed in a system composed of a server and a terminal device, which is not limited herein.
120. Based on the content training sample, obtaining M modal feature vectors through a feature extraction network included in a content classification model;
in one or more embodiments, the model training apparatus takes the content training sample as an input of the content classification model, invokes a feature extraction network included in the content classification model, and outputs M modal feature vectors. Wherein each modality corresponds to a modality feature vector.
130. Acquiring content probability distribution through a multi-mode fusion network included in a content classification model based on the M modal feature vectors;
In one or more embodiments, the model training device takes the M modal feature vectors as the input of the multi-modal fusion network to obtain the content probability distribution, where the multi-modal fusion network is part of the content classification model. Assume, for example, that there are two content category labels: a "malicious content" category, denoted "1", and a "normal content" category, denoted "0". The content probability distribution can then be represented as a two-dimensional vector.
Specifically, there are various fusion methods of the M modal feature vectors, and two fusion methods will be described below.
Illustratively, the M modal feature vectors may be fused by concatenation (concat) or by weighted summation, and subsequent network layers automatically adapt to this operation. The concat operation combines the input features with each other. The weighted sum uses scalar weights; this approach requires that the vectors produced by the pre-trained models have fixed dimensions, are arranged in a fixed order, and are suitable for element-wise addition. To meet this requirement, a fully connected (FC) layer can be used to control the dimensions and reorder each dimension.
Illustratively, the M modal feature vectors may also be fused using an attention mechanism. Attention generally refers to a weighted sum whose scalar weights are dynamically generated by the model at each time step. Multiple attention heads can dynamically generate the weights used in the summation, so that additional weighting information is preserved when the results are concatenated. Based on the attention mechanism, the semantic relations among the M modal feature vectors are exploited, and the M modal feature vectors are then fused for multi-modal classification.
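The two fusion strategies can be sketched as follows. The helper names, projection layers, and head count are illustrative assumptions; the patent does not prescribe a concrete fusion implementation.

```python
import torch
import torch.nn as nn

def concat_fusion(features, proj_layers):
    """Concatenate modal feature vectors after projecting them to a common size.
    proj_layers is a list of FC layers (one per modality), standing in for the
    dimension-control FC layer mentioned above."""
    projected = [proj(f) for proj, f in zip(proj_layers, features)]
    return torch.cat(projected, dim=-1)

def weighted_sum_fusion(features, proj_layers, weights):
    """Element-wise weighted sum with scalar weights; requires equal dimensions."""
    projected = [proj(f) for proj, f in zip(proj_layers, features)]
    return sum(w * f for w, f in zip(weights, projected))

class AttentionFusion(nn.Module):
    """Attention-based fusion: treat each modal vector as a token and let
    multi-head self-attention weight the modalities dynamically."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, features):                 # list of [batch, dim] tensors
        tokens = torch.stack(features, dim=1)    # [batch, M, dim]
        fused, _ = self.attn(tokens, tokens, tokens)
        return fused.mean(dim=1)                 # pooled multi-modal representation
```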
140. Acquiring M modal probability distributions through M modal classification networks based on M modal feature vectors, wherein each modal feature vector input by each modal classification network has a corresponding relation with the output modal probability distribution;
in one or more embodiments, the model training device inputs each modal feature vector of the M modal feature vectors to a corresponding modal classification network, where the modal classification network is a classifier in general, and the classifier includes an FC layer. Based on this, the respective modal probability distributions are output by the respective modal classification networks.
It should be noted that the present application does not limit the execution sequence between steps 140 and 130.
150. And updating the model parameters of the content classification model according to the M modal probability distributions, the M single modal class labels, the content probability distribution and the content class labels.
In one or more embodiments, since the content training samples are pre-labeled with M single-modality category labels, the M single-modality category labels are "true values". And the M modal probability distributions are classification results obtained by prediction, namely belong to 'predicted values', and on the basis of the 'predicted values', corresponding first loss values can be calculated according to the modal probability distributions and corresponding single-modal class labels respectively. At the same time, the content class label is "true value", and the content probability distribution belongs to "predicted value", so the corresponding second loss value is calculated based on the content probability distribution and the content class label.
Specifically, the first loss values and the second loss value are summed to obtain a target loss value. Stochastic gradient descent is then used: the gradient of the target loss value is back-propagated to obtain update values for all model parameters, and the content classification model is updated. In one case, an exhaustion-type criterion may be used to decide whether the model training condition is satisfied, for example setting an iteration threshold and stopping when the number of iterations reaches it. In another case, an observation-type criterion may be used, for example stopping when the loss has converged.
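A minimal sketch of one such co-training step, continuing the bimodal example sketched earlier (M = 2). The function and argument names are illustrative, and cross-entropy is assumed as the per-branch loss.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, sample_a, sample_b,
                  label_a, label_b, content_label):
    """One co-training step: the single-modal losses and the content loss are
    summed into a target loss and optimized jointly with stochastic gradient
    descent. All names here are illustrative assumptions."""
    logits_a, logits_b, logits_content = model(sample_a, sample_b)

    loss_modal_a = F.cross_entropy(logits_a, label_a)               # first loss values
    loss_modal_b = F.cross_entropy(logits_b, label_b)
    loss_content = F.cross_entropy(logits_content, content_label)   # second loss value

    target_loss = loss_modal_a + loss_modal_b + loss_content        # target loss value

    optimizer.zero_grad()
    target_loss.backward()    # gradient back-propagation
    optimizer.step()          # update all model parameters
    return target_loss.item()

# usage sketch: optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
```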
In the embodiment of the application, a training method of a content classification model is provided. In this way, M single-modal classification branches are added on top of the multi-modal classification task to supervise the corresponding modalities, and multi-branch collaborative training optimizes the multi-modal fusion network and the feature extraction network simultaneously and balances each network in the content classification model, which helps improve the precision and effect of the multi-modal classification results.
Optionally, on the basis of the various embodiments corresponding to fig. 3, in another optional embodiment provided in the embodiment of the present application, after obtaining the content training sample, the method may further include:
based on the content training sample, obtaining a text recognition result through a text recognition network included in the content classification model;
based on the content training sample, obtaining M modal feature vectors through a feature extraction network included in the content classification model, which may specifically include:
based on the text recognition result, obtaining a text feature vector through a text feature extraction network included in the content classification model, wherein the text feature vector is contained in the M modal feature vectors;
based on the content training sample, obtaining an image feature vector through an image feature extraction network included in the content classification model, wherein the image feature vector is included in the M modal feature vectors.
In one or more embodiments, a way to extract the individual modal feature vectors from image-text content is presented. As noted in the foregoing embodiment, the content training sample may be image-text content (in this case M equals 2), where image-text content refers to text displayed on an image. Based on this, OCR techniques may be employed to recognize the content training sample.
Specifically, the content classification model includes a text recognition network, and the text recognition network is used for recognizing characters in the whole image to obtain a text recognition result. The content classification model comprises a feature extraction network, wherein the feature extraction network specifically comprises a text feature extraction network and an image feature extraction network. The text feature extraction network is used for extracting text feature vectors, and the image feature extraction network is used for extracting image feature vectors.
Secondly, in the embodiment of the application, a way to extract the feature vectors of multiple modalities from image-text content is provided. In this way, single-modal supervision is introduced for the text feature vector and the image feature vector respectively, and classification learning is added to the text feature extraction network and the image feature extraction network, which improves the multi-modal image-text classification effect and exploits the advantage of multi-task joint optimization.
Optionally, on the basis of the respective embodiments corresponding to fig. 3, in another optional embodiment provided in this embodiment of the present application, the M single-modality category tags include a text category tag and an image category tag;
based on the M modal feature vectors, obtaining M modal probability distributions through M modal classification networks, which may specifically include:
based on the text feature vector, obtaining text probability distribution through a text classification network;
based on the image feature vector, obtaining image probability distribution through an image classification network;
based on the M modal feature vectors, obtaining content probability distribution through a multi-modal fusion network included in the content classification model may specifically include:
acquiring content probability distribution through a multi-mode fusion network included in a content classification model based on the text feature vector and the image feature vector;
updating the model parameters of the content classification model according to the M modal probability distributions, the M single modal class labels, the content probability distributions, and the content class labels, which may specifically include:
and updating the model parameters of the content classification model according to the text probability distribution, the text category label, the image probability distribution, the image category label, the content probability distribution and the content category label.
In one or more embodiments, a manner of training the content classification model is presented. As noted in the foregoing embodiments, the content training sample may be image-text content, so the M labeled single-modal category labels include a text category label and an image category label. Assume there are two text category labels: "malicious text", denoted "1", and "normal text", denoted "0"; the text probability distribution can then be represented as a two-dimensional vector. Likewise, assume there are two image category labels: "image contains a person", denoted "1", and "image does not contain a person", denoted "0"; the image probability distribution can then also be represented as a two-dimensional vector.
Specifically, the model training can be divided into four parts, for easy understanding, please refer to fig. 4, where fig. 4 is a schematic structural diagram of a content classification model in the embodiment of the present application, and the content training sample is taken as an example for description. The first part is an integral image character recognition module which is mainly used for realizing the OCR character recognition from end to end of the integral image. Namely, the content training sample is used as the input of the text recognition network, and the text recognition result is output through the text recognition network. The second part is an image feature extraction and classification module which is mainly used for carrying out visual feature vector extraction and classification learning of the whole image. That is, the content training samples (i.e., images) are used as input to an image feature extraction network, through which image feature vectors are output. Meanwhile, the image probability distribution is obtained through the image classification network. And the third part is a text feature extraction and classification module which is used for extracting text feature vectors and performing classification learning on the OCR character sequences identified by the images. Namely, the text recognition result is used as the input of the text feature extraction network, and the text feature vector is output through the text feature extraction network. Meanwhile, the text probability distribution is obtained through the text classification network.
The fourth part is a multi-modal feature fusion and classification module, which inputs visual features and character features into a multi-modal fusion network (for example, MMBT), then uses the self-attention mechanism of the multi-modal fusion network to fuse the cross-modal features of images and characters, and finally performs multi-modal label classification learning. That is, the text feature vector and the image feature vector are input to the multimodal fusion network, and the content probability distribution is output via the multimodal fusion network. Therefore, a first loss value is obtained by adopting a cross entropy loss function according to the text probability distribution and the text category label. And calculating a second loss value by adopting a cross entropy loss function according to the image probability distribution and the image category label. And calculating a third loss value by adopting a cross entropy loss function according to the content probability distribution and the content category label. Based on this, the total loss value is calculated as follows:
Loss = Loss1 + Loss2 + Loss3;
where Loss denotes the total loss value, Loss1 denotes the first loss value, Loss2 denotes the second loss value, and Loss3 denotes the third loss value.
Thus, the model parameters of the content classification model are updated based on the total loss value. It can be appreciated that the feature extraction network, the M modal classification networks (including the text classification network and the image classification network), and the multi-modal fusion network need to be jointly trained. In general, a text recognition network may employ a network that has already been trained, and thus, the text recognition network may not be trained.
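A rough sketch of this MMBT-style image-text fusion follows. The hidden size, image feature dimension, layer count, and the simple first-token pooling are assumptions made for illustration; this is not the patent's or MMBT's exact implementation.

```python
import torch
import torch.nn as nn

class SimpleTextImageFusion(nn.Module):
    """MMBT-style fusion sketch: image features are projected to the text
    embedding size, spliced with the text token embeddings, given segment and
    position embeddings, and passed through a Transformer encoder followed by a
    classification head. All dimensions are illustrative assumptions."""

    def __init__(self, hidden: int = 768, img_dim: int = 2048,
                 num_classes: int = 2, num_layers: int = 4):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)        # map image features to token space
        self.segment_emb = nn.Embedding(2, hidden)        # 0 = image tokens, 1 = text tokens
        self.pos_emb = nn.Embedding(512, hidden)          # position encoding
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, text_tokens, image_feats):
        # text_tokens: [batch, T, hidden] embeddings; image_feats: [batch, N, img_dim]
        img_tokens = self.img_proj(image_feats)
        tokens = torch.cat([img_tokens, text_tokens], dim=1)      # splice the two modalities
        seg_ids = torch.cat([torch.zeros(img_tokens.size(1)),
                             torch.ones(text_tokens.size(1))]).long().to(tokens.device)
        pos_ids = torch.arange(tokens.size(1), device=tokens.device)
        tokens = tokens + self.segment_emb(seg_ids) + self.pos_emb(pos_ids)
        encoded = self.encoder(tokens)                            # image-text encoding result
        return self.classifier(encoded[:, 0])                     # first-token pooling (a simplification)
```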
Third, in the embodiment of the present application, a method for training a content classification model is provided. By the mode, end-to-end training of the whole network structure is achieved, the text branch and the image branch are jointly optimized while multi-mode classification branch training is conducted, the text feature extraction network and the image feature extraction network are optimized towards the direction which is beneficial to multi-mode image-text classification through multi-branch collaborative training, single-mode discrimination features which are beneficial to multi-mode classification tasks are learned, and therefore the effect of multi-task joint training is improved.
Optionally, on the basis of each embodiment corresponding to fig. 3, in another optional embodiment provided in the embodiment of the present application, based on the content training sample, obtaining the text recognition result through the text recognition network included in the content classification model may specifically include:
based on the content training sample, obtaining a text detection result through a text detection network included in a text recognition network, wherein the text recognition network is included in the content classification model;
and acquiring a text recognition result through a character recognition network included in the text recognition network based on the text detection result.
In one or more embodiments, a manner of performing OCR based on a cascaded text recognition network is presented. As noted in the foregoing embodiments, the text recognition network may consist of a cascaded text detection network and character recognition network, where the text detection network segments text lines in real time based on a Differentiable Binarization (DB) detection algorithm, i.e., it produces the text detection result. The character recognition network may be a Convolutional Recurrent Neural Network (CRNN) that recognizes the character sequence of each detected text line.
Specifically, for ease of understanding, please refer to fig. 5, which is a schematic structural diagram of a text recognition network in an embodiment of the present application. As shown in the figure, the input content training sample is down-sampled at different stages to obtain feature maps of different sizes, from which a feature map (F) with a unified scale is constructed. The feature map (F) is used to predict a segmentation probability map and a threshold map. Based on the DB algorithm, a binary map is obtained by combining the segmentation probability map and the threshold map, and the text detection result (i.e., the text lines) is thereby detected from the content training sample.
The character recognition network mainly includes convolutional layers, recurrent layers, and a transcription layer. Based on this, the text detection result (i.e., a text line) is taken as the input of the character recognition network, and a feature sequence is obtained through the convolutional layers. The feature sequence is taken as the input of the recurrent layers, which output a prediction for each position of the sequence; the recurrent layers may adopt a deep bidirectional Long Short-Term Memory network (deep BiLSTM). Finally, the transcription layer predicts the text recognition result using a Connectionist Temporal Classification (CTC) algorithm, for example, the text recognition result is "horizontal view ridge side peaking".
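A simplified CRNN-style recognizer can be sketched as follows (assumed PyTorch; the layer sizes and character-set size are illustrative): a convolutional backbone turns a text-line image into a feature sequence, a bidirectional LSTM models the sequence, and the transcription layer produces per-step character distributions for CTC.

```python
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    def __init__(self, num_classes: int = 6000, hidden: int = 256):
        super().__init__()
        # Convolutional layers: text-line image -> feature sequence.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),   # collapse height, keep width as time steps
        )
        # Recurrent layers: deep bidirectional LSTM over the feature sequence.
        self.rnn = nn.LSTM(128, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        # Transcription layer: per-step character distribution for CTC (+1 for blank).
        self.fc = nn.Linear(2 * hidden, num_classes + 1)

    def forward(self, line_image: torch.Tensor) -> torch.Tensor:
        feat = self.cnn(line_image)                 # (B, C, 1, W)
        seq = feat.squeeze(2).permute(0, 2, 1)      # (B, W, C) feature sequence
        out, _ = self.rnn(seq)
        # Transpose to (W, B, classes) before feeding nn.CTCLoss or CTC decoding.
        return self.fc(out).log_softmax(-1)
```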
The text recognition results of different lines are concatenated directly, without separator symbols such as spaces, commas, or line breaks.
Thirdly, in the embodiment of the application, a manner of performing OCR recognition based on a cascaded text recognition network is provided. In this way, OCR character recognition of the whole image can be realized, thereby improving the feasibility and operability of the scheme.
Optionally, on the basis of each embodiment corresponding to fig. 3, in another optional embodiment provided in the embodiment of the present application, based on the content training sample, obtaining the text recognition result through the text recognition network included in the content classification model may specifically include:
based on the content training sample, obtaining the probability distribution of each character through a text recognition network included in a content classification model;
and acquiring a text recognition result according to the probability distribution of each character.
In one or more embodiments, a manner of performing OCR recognition based on an end-to-end text recognition network is presented. As can be seen from the foregoing embodiments, text-line and character-level positioning, as well as character classification and recognition, can be performed end to end on the whole image. After post-processing, the character results are concatenated into line recognition results, and further into the text recognition result of the whole image.
Specifically, for easy understanding, please refer to fig. 6, where fig. 6 is another schematic structural diagram of the text recognition network in the embodiment of the present application. As shown in the figure, the text recognition network may be based on a Feature Pyramid Network (FPN). In this end-to-end single-model whole-image character positioning and recognition algorithm, text lines and characters are first segmented through the FPN, then regions of interest (ROI) corresponding to the characters are extracted, and finally the probability distribution of each character is output through an FC layer, so that the text recognition result is obtained.
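A very rough sketch of this end-to-end idea follows (assumed PyTorch/torchvision; the feature dimensions, box format, and character-set size are illustrative): character regions detected on the FPN features are pooled with RoIAlign and classified by a fully connected layer into per-character probability distributions.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class CharHead(nn.Module):
    """Classify each detected character region into a probability distribution."""
    def __init__(self, feat_channels: int = 256, num_chars: int = 6000):
        super().__init__()
        self.fc = nn.Linear(feat_channels * 7 * 7, num_chars)

    def forward(self, fpn_feature: torch.Tensor, char_boxes: torch.Tensor) -> torch.Tensor:
        # char_boxes: (K, 5) rows of [batch_index, x1, y1, x2, y2] in image coordinates.
        rois = roi_align(fpn_feature, char_boxes, output_size=(7, 7),
                         spatial_scale=fpn_feature.shape[-1] / 224.0)
        logits = self.fc(rois.flatten(1))
        return logits.softmax(-1)   # probability distribution of each character
```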
The text recognition results of different lines are concatenated directly, without separator symbols such as spaces, commas, or line breaks.
In the embodiment of the application, a manner of performing OCR recognition based on an end-to-end text recognition network is provided. In this way, OCR character recognition of the whole image can be realized, thereby improving the feasibility and operability of the scheme.
Optionally, on the basis of each embodiment corresponding to fig. 3, in another optional embodiment provided in the embodiment of the present application, based on the text recognition result, obtaining the text feature vector through the text feature extraction network included in the content classification model may specifically include:
based on the text recognition result, obtaining S symbol coding sequences through a text feature extraction network included in the content classification model, wherein S is an integer greater than or equal to 1;
and adding the start character coding sequence and the stop character coding sequence to the S symbol coding sequences to obtain the text characteristic vector.
In one or more embodiments, a way to extract text feature vectors is presented. It can be seen from the foregoing embodiment that the text recognition result is used as an input of the text feature extraction network, and S symbol encoding sequences can be output through the text feature extraction network. The Backbone (Backbone) of the text feature extraction network can adopt a BERT or a BERT-Tiny encoding network.
Specifically, taking BERT-Tiny as an example, assume that BERT-Tiny has a 3-layer encoding network with 6 attention heads and supports a maximum input sequence length of 128, i.e., an input of up to 128 symbols (tokens). For an input of arbitrary sequence length S, S symbol encoding sequences are extracted after passing through the BERT-Tiny encoding network, i.e., S token sequences are obtained. A token may be a character or a word, e.g., "i" or "my". On this basis, a start symbol encoding sequence "[CLS]" may be added at the beginning of the S symbol encoding sequences, and a stop symbol encoding sequence "[SEP]" may be added at the end. Assuming that the length of each encoding sequence is 384, a text feature vector (i.e., Token T) of size (S+2) × 384 is obtained, which is a character-level feature vector.
The text feature vector is then input into the text classification network, where pooling and fully connected classification are performed to obtain the text probability distribution, and a loss value is calculated using a cross-entropy loss function. Meanwhile, the text feature vector is also used as an input of the multi-modal fusion network for multi-modal content classification.
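The shape bookkeeping described above can be sketched as follows (a lightweight transformer encoder stands in for BERT-Tiny; the vocabulary size and special-token handling are assumptions): S tokens wrapped with [CLS] and [SEP] are encoded into an (S+2) × 384 text feature, which feeds both the text classification head and, later, the multi-modal fusion network.

```python
import torch
import torch.nn as nn

class TextBranch(nn.Module):
    def __init__(self, vocab_size: int = 21128, dim: int = 384, num_classes: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=6, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=3)   # BERT-Tiny-like depth
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, token_ids: torch.Tensor):
        # token_ids: (B, S+2), already wrapped as [CLS] t1 ... tS [SEP].
        token_t = self.encoder(self.embed(token_ids))     # (B, S+2, 384) text feature vector (Token T)
        pooled = token_t.mean(dim=1)                       # pooling before the FC classifier
        text_probs = self.classifier(pooled).softmax(-1)   # text probability distribution
        return token_t, text_probs
```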
Thirdly, in the embodiment of the present application, a method for extracting text feature vectors is provided. In this way, character-level encoding of the text is realized, thereby improving the feasibility and operability of the scheme.
Optionally, on the basis of each embodiment corresponding to fig. 3, in another optional embodiment provided in the embodiments of the present application, based on the content training sample, obtaining an image feature vector through an image feature extraction network included in the content classification model may specifically include:
based on the content training sample, obtaining a target characteristic diagram through an image characteristic extraction network included in a content classification model;
converting the target feature map into a first image feature vector and two-dimensional image features, wherein the first image feature vector is contained in the image feature vector;
generating a second image feature vector according to the two-dimensional image features, wherein the second image feature vector comprises N symbol encoding sequences, a start character encoding sequence and a stop character encoding sequence, the second image feature vector is contained in the image feature vector, and N is an integer greater than or equal to 1;
based on the image feature vector, obtaining the image probability distribution through an image classification network, which may specifically include:
acquiring image probability distribution through an image classification network based on the first image feature vector;
based on the text feature vector and the image feature vector, obtaining content probability distribution through a multi-modal fusion network included in the content classification model may specifically include:
and acquiring content probability distribution through a multi-mode fusion network included in the content classification model based on the text feature vector and the second image feature vector.
In one or more embodiments, a manner of extracting feature vectors for an image is presented. As can be seen from the foregoing embodiments, the content classification model includes an image feature extraction network, and the image feature extraction network may adopt, for example, a ResNet50 network, a ResNet101 network, a ResNet152 network, or the like, which is not limited herein.
Specifically, the content training sample is used as the input of the image feature extraction network, and the target feature map is output through the image feature extraction network. The content training sample is an image of size 3 × W × H, which is scaled (resized) to a fixed size, e.g., 3 × 224 × 224. The resized content training sample is input into the image feature extraction network (for example, a ResNet50 network), and the target feature map is output after multiple layers of convolution and pooling; the size of the target feature map may be 2048 × 7 × 7.
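Using torchvision's ResNet50 as an illustrative backbone (a minimal sketch assuming a recent torchvision; the input size follows the example above), the target feature map can be obtained by dropping the final pooling and classification layers:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

backbone = resnet50(weights=None)
# Keep everything up to the last convolutional stage; drop avgpool and fc.
feature_extractor = nn.Sequential(*list(backbone.children())[:-2])

image = torch.randn(1, 3, 224, 224)            # resized content training sample
target_feature_map = feature_extractor(image)  # shape: (1, 2048, 7, 7)
```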
On the one hand, on this basis, the target feature map needs to be expanded (flattened) to obtain the one-dimensional first image feature vector; the first image feature vector is used as the input of the image classification network, and the image probability distribution is output through the image classification network. The image classification network includes an FC layer, through which the first image feature vector is reduced to the number of image classification categories (for example, 2 or 3).
On the other hand, the target feature map needs to be reshaped into two dimensions, for example to obtain features with dimensions 2048 × 49. To facilitate the image vector embedding and dimension reduction operations on the 2048 dimensions, the features may be transposed, resulting in 49 × 2048 two-dimensional image features. An image vector embedding operation and dimension reduction are performed on the two-dimensional image features, and N symbol encoding sequences (i.e., token sequences) are extracted, where N is a fixed value, for example, N may be equal to 3. Then, a start symbol encoding sequence "[CLS]" may be added at the beginning of the N encoding sequences, and a stop symbol encoding sequence "[SEP]" may be added at the end. Assuming that the length of each encoding sequence is 384, a second image feature vector (i.e., Token I) of size (N+2) × 384 is obtained.
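Both uses of the target feature map can be sketched as follows (dimensions follow the example above; the embedding layer, the pooling used to reduce 49 tokens to N, and N = 3 are assumptions): flattening gives the first image feature vector for the image classifier, while reshaping, transposing, and embedding gives the N image tokens that are wrapped with [CLS]/[SEP] for fusion.

```python
import torch
import torch.nn as nn

B, N, dim = 1, 3, 384
feat = torch.randn(B, 2048, 7, 7)              # target feature map from the backbone

# Branch 1: one-dimensional first image feature vector -> image classification.
first_image_vec = feat.flatten(1)              # (B, 2048*7*7)
image_logits = nn.Linear(2048 * 7 * 7, 2)(first_image_vec)

# Branch 2: image tokens for the multi-modal fusion network.
two_d = feat.flatten(2).transpose(1, 2)        # (B, 49, 2048) two-dimensional image features
token_embed = nn.Linear(2048, dim)             # image vector embedding + dimension reduction
img_tokens = nn.AdaptiveAvgPool1d(N)(token_embed(two_d).transpose(1, 2)).transpose(1, 2)  # (B, N, 384)

# Stand-ins for learned [CLS] / [SEP] embeddings around the N tokens -> (B, N+2, 384).
cls_tok = torch.zeros(B, 1, dim)
sep_tok = torch.zeros(B, 1, dim)
token_i = torch.cat([cls_tok, img_tokens, sep_tok], dim=1)   # second image feature vector (Token I)
```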
And taking the text feature vector and the second image feature vector as the input of the multi-mode fusion network, and outputting the content probability distribution through the multi-mode fusion network.
Further, in the embodiment of the present application, a method for extracting image feature vectors is provided. In this way, the image is converted into the one-dimensional image feature vector and the two-dimensional image feature vector, so that on the one hand the image classification task can be performed, and on the other hand the image features can be fused with the text feature vector, thereby improving the feasibility and operability of the scheme.
Optionally, on the basis of each embodiment corresponding to fig. 3, in another optional embodiment provided in the embodiment of the present application, based on the text feature vector and the second image feature vector, obtaining the content probability distribution through the multimodal fusion network included in the content classification model may specifically include:
splicing the text characteristic vector and the second image characteristic vector to obtain a target characteristic vector;
based on the target characteristic vector, the position coding vector and the segment coding vector, obtaining an image-text coding result through a coding network included in the multi-mode fusion network, wherein the multi-mode fusion network is contained in a content classification model, the position coding vector represents the position of each symbol in the characteristic vector, and the segment coding vector represents the category to which the characteristic vector belongs;
and acquiring content probability distribution through a classification network included in the multi-mode fusion network based on the image-text coding result.
In one or more embodiments, a way to fuse text features and image features is presented. As can be seen from the foregoing embodiments, the multi-modal converged network may employ MMBT, which has the BERT network as the main body and the FC layer as the classifier.
Specifically, for ease of understanding, please refer to fig. 7, where fig. 7 is a schematic structural diagram of a multi-modal fusion network in an embodiment of the present application. As shown in the figure, a concat operation is performed on the text feature vector (i.e., Token T) and the second image feature vector (i.e., Token I) to obtain the target feature vector, which is a joint embedded representation of the image and the text and may have a size of (S+N+4) × 384; the image and the text thus share the same input space. The target feature vector is encoded as a sequence through an encoding network (such as a BERT network), and semantic interaction between the image features and the text features is performed through the global multi-head self-attention mechanism, so that the image and text modalities are fused. After encoding, the result is input into a classification network, where pooling and fully connected classification are performed to obtain the predicted content probability distribution.
It is understood that, taking the encoding network as a BERT network as an example, the target feature vector is formed by splicing a text feature vector (i.e., Token T) and a second image feature vector (i.e., Token I), wherein the text feature vector (i.e., Token T) includes S +2 tokens and the second image feature vector (i.e., Token I) includes N +2 tokens. The position-encoding vector is used to distinguish the position of each Token in the feature vector, and the segment-encoding vector is used to distinguish the class to which the feature vector belongs (e.g., belongs to a text class or an input image class).
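A condensed sketch of this fusion step follows (a generic transformer encoder stands in for the BERT body of MMBT; all sizes are illustrative): the text tokens and image tokens are concatenated into one sequence, position and segment embeddings are added so the model can tell the tokens apart, and the encoded leading token is classified into the content probability distribution.

```python
import torch
import torch.nn as nn

class SimpleMMBT(nn.Module):
    def __init__(self, dim: int = 384, max_len: int = 512, num_classes: int = 2):
        super().__init__()
        self.pos_embed = nn.Embedding(max_len, dim)   # position of each token in the sequence
        self.seg_embed = nn.Embedding(2, dim)         # 0 = text token, 1 = image token
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=6, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, token_t: torch.Tensor, token_i: torch.Tensor) -> torch.Tensor:
        x = torch.cat([token_t, token_i], dim=1)      # (B, S+N+4, 384) joint embedding
        B, L, _ = x.shape
        pos = torch.arange(L, device=x.device).expand(B, L)
        seg = torch.cat([torch.zeros(B, token_t.size(1), dtype=torch.long),
                         torch.ones(B, token_i.size(1), dtype=torch.long)], dim=1)
        x = x + self.pos_embed(pos) + self.seg_embed(seg.to(x.device))
        fused = self.encoder(x)                        # self-attention across both modalities
        return self.classifier(fused[:, 0]).softmax(-1)  # content probability distribution
```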
It should be noted that the encoding network included in the multi-modal fusion network may also use A Lite BERT (ALBERT), a Robustly Optimized BERT (RoBERTa), or the like, which is not limited herein.
Furthermore, in the embodiment of the present application, a way of fusing text features and image features is provided. In this way, based on the input characteristics of MMBT, an input space for multi-modal features is provided to support simultaneous input of image features and character features, and the features of the text and the image are fused using the self-attention mechanism of BERT, so that a better feature fusion effect is achieved, which is beneficial to improving the accuracy and effect of model prediction.
Optionally, on the basis of the various embodiments corresponding to fig. 3, in another optional embodiment provided in the embodiment of the present application, after obtaining the content training sample, the method may further include:
based on the content training sample, obtaining an audio recognition result through an audio recognition network included in the content classification model;
based on the content training sample, obtaining M modal feature vectors through a feature extraction network included in the content classification model, which may specifically include:
based on the audio recognition result, acquiring audio text feature vectors through a text feature extraction network included in the content classification model, wherein the audio text feature vectors are contained in the M modal feature vectors;
based on the content training sample, obtaining an image feature vector through an image feature extraction network included in the content classification model, wherein the image feature vector is included in the M modal feature vectors.
In one or more embodiments, a manner of extracting individual modal feature vectors based on video content is presented. As can be seen from the foregoing embodiments, the content training samples may be video content (in this case, M is equal to 2). Based on this, the content training samples can be recognized using speech recognition techniques.
Specifically, the content classification model includes an audio recognition network, and the audio recognition network is configured to extract audio in the video and convert the audio into text, that is, obtain an audio recognition result. The content classification model comprises a feature extraction network, wherein the feature extraction network specifically comprises a text feature extraction network and an image feature extraction network. The text feature extraction network is used for extracting audio text feature vectors, and the image feature extraction network is used for extracting image feature vectors.
It is to be understood that the key frame may be selected from the content training sample as the input of the image feature extraction network, or any one or more frames of images may be selected from the content training sample as the input of the image feature extraction network, or another strategy may be adopted to select the image input to the image feature extraction network from the content training sample, which is not limited herein.
Secondly, in the embodiment of the present application, a method for extracting the feature vectors of each modality based on video content is provided. In this way, single-modal supervision is introduced for the audio text feature vector and the image feature vector respectively, and classification learning is added for the text feature extraction network and the image feature extraction network, so that the multi-modal video classification effect is improved and the advantage of multi-task joint optimization is exerted.
Optionally, on the basis of the respective embodiments corresponding to fig. 3, in another optional embodiment provided by the embodiments of the present application, the M single-modality category tags include an audio category tag and an image category tag;
based on the M modal feature vectors, obtaining M modal probability distributions through M modal classification networks, which may specifically include:
acquiring audio text probability distribution through a text classification network based on the audio text feature vector;
based on the image feature vector, obtaining image probability distribution through an image classification network;
based on the M modal feature vectors, obtaining content probability distribution through a multi-modal fusion network included in the content classification model may specifically include:
acquiring content probability distribution through a multi-mode fusion network included in a content classification model based on the audio text feature vector and the image feature vector;
updating the model parameters of the content classification model according to the M modal probability distributions, the M single modal class labels, the content probability distributions, and the content class labels, which may specifically include:
and updating the model parameters of the content classification model according to the audio text probability distribution, the audio category label, the image probability distribution, the image category label, the content probability distribution and the content category label.
In one or more embodiments, another way to train a content classification model is presented. As can be seen from the foregoing embodiments, the content training sample may be video content, and thus the labeled M single-modality category labels include an audio category label and an image category label. Assume that there are two audio category labels: one is the "audio malicious category", denoted "1", and the other is the "audio normal category", denoted "0". Based on this, the audio text probability distribution can be represented as a two-dimensional vector. Assume that there are two image category labels: one is "the image contains a person", denoted "1", and the other is "the image does not contain a person", denoted "0". Based on this, the image probability distribution can also be represented as a two-dimensional vector.
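For instance (an illustrative encoding, not a requirement of the application), the two-class labels can be represented as class indices and the corresponding predictions as two-dimensional probability vectors:

```python
import torch

audio_label = torch.tensor(1)                 # "audio malicious category"
image_label = torch.tensor(0)                 # "the image does not contain a person"
audio_text_probs = torch.tensor([0.2, 0.8])   # two-dimensional audio text probability distribution
image_probs = torch.tensor([0.7, 0.3])        # two-dimensional image probability distribution
```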
Specifically, the model training can be divided into four parts, for easy understanding, please refer to fig. 8, where fig. 8 is another schematic structural diagram of the content classification model in the embodiment of the present application, and the content training sample is taken as an example for description. The first part is an audio recognition module which is mainly used for converting audio into text, namely, obtaining an audio recognition result through an audio recognition network. The second part is an image feature extraction and classification module which is mainly used for carrying out visual feature vector extraction and classification learning of the whole image. That is, a frame of image is selected from the content training sample as an input of the image feature extraction network, and the image feature vector is output through the image feature extraction network. Meanwhile, the image probability distribution is obtained through the image classification network. And the third part is a text feature extraction and classification module which is used for extracting text feature vectors and performing classification learning on the audio recognition result. Namely, the audio recognition result is used as the input of the text feature extraction network, and the audio text feature vector is output through the text feature extraction network. Meanwhile, audio text probability distribution is obtained through a text classification network.
The fourth part is a multi-modal feature fusion and classification module, which inputs visual features and character features into a multi-modal fusion network (for example, MMBT), then utilizes the self-attention mechanism of the multi-modal fusion network to fuse the cross-modal features of images and characters, and finally performs multi-modal label classification learning. That is, the audio text feature vector and the image feature vector are input to the multimodal fusion network, and the content probability distribution is output through the multimodal fusion network. Therefore, according to the audio text probability distribution and the audio category label, a first loss value is obtained by adopting a cross entropy loss function. And calculating a second loss value by adopting a cross entropy loss function according to the image probability distribution and the image category label. And calculating a third loss value by adopting a cross entropy loss function according to the content probability distribution and the content category label. Based on this, the total loss value is calculated as follows:
Loss=Loss 1+Loss 2+Loss 3;
wherein, Loss represents the total Loss value, Loss 1 represents the first Loss value, Loss 2 represents the second Loss value, and Loss 3 represents the third Loss value.
Thus, the model parameters of the content classification model are updated based on the total loss value. It can be appreciated that the feature extraction network, the M modal classification networks (including the text classification network and the image classification network), and the multi-modal fusion network need to be jointly trained. In general, the audio recognition network may employ a network that has already been trained, and thus, the audio recognition network may not be trained.
Third, in the embodiment of the present application, a method for training a content classification model is provided. In this way, end-to-end training of the whole network structure is achieved: the text branch and the image branch are jointly optimized while the multi-modal classification branch is trained, and the multi-branch collaborative training optimizes the text feature extraction network and the image feature extraction network in a direction that benefits multi-modal video classification, so that single-modal discriminative features useful for the multi-modal classification task are learned, thereby improving the effect of multi-task joint training.
Optionally, on the basis of the various embodiments corresponding to fig. 3, in another optional embodiment provided in the embodiment of the present application, after obtaining the content training sample, the method may further include:
based on the content training sample, obtaining an audio recognition result through an audio recognition network included in the content classification model;
based on the content training sample, obtaining a text recognition result through a text recognition network included in the content classification model;
based on the content training sample, obtaining M modal feature vectors through a feature extraction network included in the content classification model, which may specifically include:
based on the audio recognition result, obtaining audio text feature vectors through a first text feature extraction network included in the content classification model, wherein the audio text feature vectors are contained in the M modal feature vectors;
based on the text recognition result, obtaining a text feature vector through a second text feature extraction network included in the content classification model, wherein the text feature vector is contained in the M modal feature vectors;
based on the content training sample, obtaining an image feature vector through an image feature extraction network included in the content classification model, wherein the image feature vector is included in the M modal feature vectors.
In one or more embodiments, a manner of extracting individual modal feature vectors based on video content is presented. As can be seen from the foregoing embodiment, the content training sample may be video content (in this case, M is equal to 3). Based on this, speech recognition techniques as well as OCR techniques may be employed to recognize content training samples.
Specifically, the content classification model includes a text recognition network, and the text recognition network is used for recognizing characters in the whole image to obtain a text recognition result. Specifically, the content classification model includes an audio recognition network, and the audio recognition network is configured to extract audio in the video and convert the audio into text, that is, obtain an audio recognition result. The content classification model comprises a feature extraction network, wherein the feature extraction network specifically comprises a first text feature extraction network, a second text feature extraction network and an image feature extraction network. The first text feature extraction network and the second text feature extraction network are used for extracting text feature vectors, and the image feature extraction network is used for extracting image feature vectors.
It is to be understood that the key frame may be selected from the content training sample as the input of the image feature extraction network, or any one or more frames of images may be selected from the content training sample as the input of the image feature extraction network, or another strategy may be adopted to select the image input to the image feature extraction network from the content training sample, which is not limited herein.
Secondly, in the embodiment of the present application, a method for extracting the feature vectors of each modality based on video content is provided. In this way, single-modal supervision is introduced for the audio text feature vector, the text feature vector, and the image feature vector respectively, and classification learning is added for the first text feature extraction network, the second text feature extraction network, and the image feature extraction network, so that the multi-modal video classification effect is improved and the advantage of multi-task joint optimization is exerted.
Optionally, on the basis of the respective embodiments corresponding to fig. 3, in another optional embodiment provided in this embodiment of the present application, the M single-modality category tags include an audio category tag, a text category tag, and an image category tag;
based on the M modal feature vectors, obtaining M modal probability distributions through M modal classification networks, which may specifically include:
acquiring audio text probability distribution through a first text classification network based on the audio text feature vector;
based on the text feature vector, obtaining text probability distribution through a second text classification network;
based on the image feature vector, obtaining image probability distribution through an image classification network;
based on the M modal feature vectors, obtaining content probability distribution through a multi-modal fusion network included in the content classification model may specifically include:
acquiring content probability distribution through a multi-mode fusion network included in a content classification model based on the audio text feature vector, the text feature vector and the image feature vector;
updating the model parameters of the content classification model according to the M modal probability distributions, the M single modal class labels, the content probability distributions, and the content class labels, which may specifically include:
updating the model parameters of the content classification model according to the audio text probability distribution, the audio category label, the text probability distribution, the text category label, the image probability distribution, the image category label, the content probability distribution and the content category label.
In one or more embodiments, a manner of training a content classification model is presented. As can be seen from the foregoing embodiments, the content training sample may be video content, and thus, the labeled M single-modality category labels include an audio category label, a text category label, and an image category label. It is understood that examples of the category labels have been described in the foregoing embodiments, and therefore are not described herein.
Specifically, the model training may be divided into five parts, for easy understanding, please refer to fig. 9, and fig. 9 is another schematic structural diagram of the content classification model in the embodiment of the present application, which is described by taking a content training sample as an example. The first part is an integral image character recognition module which is mainly used for realizing the OCR character recognition from end to end of the integral image. Namely, the content training sample is used as the input of the text recognition network, and the text recognition result is output through the text recognition network. The second part is an audio recognition module which is mainly used for converting audio into text, namely, an audio recognition result is obtained through an audio recognition network. And the third part is an image feature extraction and classification module which is mainly used for carrying out visual feature vector extraction and classification learning of the whole image. That is, the content training samples (i.e., images) are used as input to an image feature extraction network, through which image feature vectors are output. Meanwhile, the image probability distribution is obtained through the image classification network. And the fourth part is a text feature extraction and classification module which respectively extracts text feature vectors and performs classification learning on the text recognition result and the audio recognition result. Namely, the audio recognition result is used as the input of the first text feature extraction network, and the audio text feature vector is output through the first text feature extraction network. And the text recognition result is used as the input of a second text feature extraction network, and the text feature vector is output through the second text feature extraction network. Meanwhile, the audio text probability distribution corresponding to the audio text feature vector is obtained through the first text classification network, and the text probability distribution corresponding to the text feature vector is obtained through the second text classification network.
The fifth part is a multi-modal feature fusion and classification module, which inputs visual features and character features into a multi-modal fusion network (e.g., MMBT), then utilizes the self-attention mechanism of the multi-modal fusion network to fuse the cross-modal features of images and characters, and finally performs multi-modal label classification learning. That is, the audio text feature vector, the text feature vector, and the image feature vector are input to the multimodal fusion network, and the content probability distribution is output through the multimodal fusion network. Therefore, according to the audio text probability distribution and the audio category label, a first loss value is obtained by adopting a cross entropy loss function. And calculating a second loss value by adopting a cross entropy loss function according to the text probability distribution and the text category label. And calculating a third loss value by adopting a cross entropy loss function according to the image probability distribution and the image category label. And calculating a fourth loss value by adopting a cross entropy loss function according to the content probability distribution and the content category label. Based on this, the total loss value is calculated as follows:
Loss=Loss 1+Loss 2+Loss 3+Loss 4;
wherein, Loss represents a total Loss value, Loss 1 represents a first Loss value, Loss 2 represents a second Loss value, Loss 3 represents a third Loss value, and Loss 4 represents a fourth Loss value.
Thus, the model parameters of the content classification model are updated based on the total loss value. It can be appreciated that the feature extraction network, the M modal classification networks (including the first text classification network, the second text classification network, and the image classification network), and the multi-modal fusion network need to be jointly trained. In general, the text recognition network and the audio recognition network may use already trained networks, and thus, the text recognition network and the audio recognition network may not be trained.
Third, in the embodiment of the present application, a method for training a content classification model is provided. In this way, end-to-end training of the whole network structure is achieved: the text branches and the image branch are jointly optimized while the multi-modal classification branch is trained, and the multi-branch collaborative training optimizes the first text feature extraction network, the second text feature extraction network, and the image feature extraction network in a direction that benefits multi-modal video classification, so that single-modal discriminative features useful for the multi-modal classification task are learned, thereby improving the effect of multi-task joint training.
With reference to fig. 10, a method for content classification in the present application will be described below, and an embodiment of the content classification method in the embodiment of the present application includes:
210. acquiring target content;
in one or more embodiments, the content classification device obtains the target content, where the target content may be content uploaded by a user in real time or content stored in a database.
It should be noted that the content classification apparatus may be deployed in a server, or may be deployed in a terminal device, or may be deployed in a system composed of a server and a terminal device, which is not limited herein.
220. Based on the target content, obtaining the content probability distribution corresponding to the target content through a content classification model, wherein the content classification model is obtained by training with the training method provided in the foregoing embodiments;
in one or more embodiments, the content classification device takes the target content as an input to a content classification model, and outputs a corresponding content probability distribution via the content classification model.
It should be noted that the content classification model is obtained by training using the method described in the foregoing embodiment, and details are not described here. Where the content classification model supports classification of multimodal content, it will be explained below with reference to examples.
For example, for easy understanding, please refer to fig. 11, where fig. 11 is a schematic diagram of a content classification scenario in an embodiment of the present application. As shown in the figure, the target content may be "image-text content", and based on this, the target content is used as an input of the content classification model. The content classification model comprises a text recognition network, a text feature extraction network, an image feature extraction network and a multi-modal fusion network. Firstly, the target content is used as the input of the text recognition network, and the text recognition result is output through the text recognition network. Then, the text recognition result can be used as the input of the text feature extraction network, and the text feature vector is output through the text feature extraction network. Meanwhile, the target content is taken as the input of the image feature extraction network, through which an image feature vector (e.g., a second image feature vector) is acquired. Finally, the text feature vector and the image feature vector (e.g., the second image feature vector) are used as the input of the multi-modal fusion network, and the content probability distribution is output through the multi-modal fusion network.
For example, referring to fig. 12 for ease of understanding, fig. 12 is another schematic diagram of a content classification scenario in an embodiment of the present application, and as shown in the figure, specifically, the target content may be "video content", and based on this, the target content is used as an input of a content classification model. The content classification model comprises a text recognition network, an audio recognition network, a first text feature extraction network, a second text feature extraction network, an image feature extraction network and a multi-mode fusion network. Firstly, the target content is used as the input of the audio recognition network, and the audio recognition result is output through the audio recognition network. And the target content is used as the input of the text recognition network, and the text recognition result is output through the text recognition network. The audio recognition result may then be used as an input to a first text feature extraction network through which a first text feature vector is output. The text recognition result may be used as an input to a second text feature extraction network, through which a second text feature vector is output. Meanwhile, the target content is taken as an input of an image feature extraction network through which an image feature vector (e.g., a second image feature vector) is acquired. Finally, the first text feature vector, the second text feature vector and the image feature vector (for example, the second image feature vector) are used as the input of the multi-modal fusion network, and the content probability distribution is output through the multi-modal fusion network.
230. And determining a classification result of the target content according to the content probability distribution corresponding to the target content.
In one or more embodiments, the content classification device may determine a classification result to which the target content belongs according to a content probability distribution output by the content classification model. Based on this, it is assumed that the content probability distribution is (0.9,0.1), where "0.9" indicates that the probability that the target content belongs to the "content malicious category" is 0.9, and "0.1" indicates that the probability that the target content belongs to the "content normal category" is 0.1, and thus, the classification result of the target content is the "content malicious category".
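In code, this last step is a simple argmax over the content probability distribution (the example values and category names below are taken from this scenario):

```python
import torch

content_probs = torch.tensor([0.9, 0.1])     # output of the content classification model
categories = ["content malicious category", "content normal category"]
classification_result = categories[int(content_probs.argmax())]
print(classification_result)                 # -> "content malicious category"
```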
In the embodiment of the application, a method for content classification is provided. In this way, M single-modal classification branches are added on top of the multi-modal classification task to serve as supervision for the corresponding modalities, and the multi-branch collaborative training method optimizes the multi-modal fusion network and the feature extraction network simultaneously, balancing each network in the content classification model, which is beneficial to improving the precision and effect of the multi-modal classification results.
Referring to fig. 13, fig. 13 is a schematic view of an embodiment of the model training device in the embodiment of the present application, and the model training device 30 includes:
an obtaining module 310, configured to obtain a content training sample, where the content training sample corresponds to labeled M single-mode category labels and content category labels, and M is an integer greater than 1;
the obtaining module 310 is further configured to obtain M modal feature vectors through a feature extraction network included in the content classification model based on the content training sample;
the obtaining module 310 is further configured to obtain content probability distribution through a multi-modal fusion network included in the content classification model based on the M modal feature vectors;
the obtaining module 310 is further configured to obtain M modal probability distributions through M modal classification networks based on the M modal feature vectors, where an input modal feature vector of each modal classification network and an output modal probability distribution have a corresponding relationship;
the training module 320 is configured to update the model parameters of the content classification model according to the M modal probability distributions, the M single modal class labels, the content probability distribution, and the content class labels.
In the embodiment of the application, a model training device is provided. By adopting the device, M single-mode classification branches are added on the basis of a multi-mode classification task to serve as supervision of corresponding modes, and a multi-branch collaborative training method is adopted, so that the purpose of simultaneously optimizing a multi-mode fusion network and a feature extraction network is achieved, the effect of balancing each network in a content classification model is achieved, and the precision and the effect of multi-mode classification results are favorably improved.
Alternatively, on the basis of the embodiment corresponding to fig. 13, in another embodiment of the model training device 30 provided in the embodiment of the present application,
the obtaining module 310 is further configured to obtain a text recognition result through a text recognition network included in the content classification model based on the content training sample after obtaining the content training sample;
an obtaining module 310, configured to obtain a text feature vector through a text feature extraction network included in the content classification model based on a text recognition result, where the text feature vector is included in the M modal feature vectors;
based on the content training sample, obtaining an image feature vector through an image feature extraction network included in the content classification model, wherein the image feature vector is included in the M modal feature vectors.
In the embodiment of the application, a model training device is provided. By adopting the device, single-mode supervision is respectively introduced into the text characteristic vector and the image characteristic vector, and classification learning of the text characteristic extraction network and the image characteristic extraction network is increased, so that the multi-mode image-text classification effect is improved, and the advantage of multi-task joint optimization is exerted.
Optionally, on the basis of the embodiment corresponding to fig. 13, in another embodiment of the model training apparatus 30 provided in the embodiment of the present application, the M single-mode category labels include a text category label and an image category label;
an obtaining module 310, specifically configured to obtain text probability distribution through a text classification network based on a text feature vector;
based on the image feature vector, obtaining image probability distribution through an image classification network;
an obtaining module 310, configured to obtain content probability distribution through a multi-modal fusion network included in the content classification model based on the text feature vector and the image feature vector;
the training module 320 is specifically configured to update the model parameters of the content classification model according to the text probability distribution, the text category label, the image probability distribution, the image category label, the content probability distribution, and the content category label.
In the embodiment of the application, a model training device is provided. By adopting the device, the end-to-end training of the whole network structure is realized, the text branch and the image branch are jointly optimized while the multi-mode classification branch training is carried out, the multi-branch cooperative training enables the text feature extraction network and the image feature extraction network to be optimized towards the direction beneficial to multi-mode image-text classification, the single-mode discrimination features beneficial to multi-mode classification tasks are learned, and therefore the multi-task joint training effect is improved.
Alternatively, on the basis of the embodiment corresponding to fig. 13, in another embodiment of the model training device 30 provided in the embodiment of the present application,
an obtaining module 310, specifically configured to obtain a text detection result through a text detection network included in a text recognition network based on the content training sample, where the text recognition network is included in the content classification model;
and acquiring a text recognition result through a character recognition network included in the text recognition network based on the text detection result.
In the embodiment of the application, a model training device is provided. By adopting the device, OCR character recognition of the whole image can be realized, so that the feasibility and operability of the scheme are improved.
Alternatively, on the basis of the embodiment corresponding to fig. 13, in another embodiment of the model training device 30 provided in the embodiment of the present application,
an obtaining module 310, specifically configured to obtain probability distribution of each character through a text recognition network included in a content classification model based on a content training sample;
and acquiring a text recognition result according to the probability distribution of each character.
In the embodiment of the application, a model training device is provided. By adopting the device, OCR character recognition of the whole image can be realized, so that the feasibility and operability of the scheme are improved.
Optionally, on the basis of the embodiment corresponding to fig. 13, in another embodiment of the model training device 30 provided in the embodiment of the present application,
an obtaining module 310, configured to obtain S symbol encoding sequences through a text feature extraction network included in the content classification model based on a text recognition result, where S is an integer greater than or equal to 1;
and adding the start character coding sequence and the stop character coding sequence to the S symbol coding sequences to obtain the text characteristic vector.
In the embodiment of the application, a model training device is provided. By adopting the device, the character-level coding of the text is realized, so that the feasibility and operability of the scheme are improved.
Alternatively, on the basis of the embodiment corresponding to fig. 13, in another embodiment of the model training device 30 provided in the embodiment of the present application,
an obtaining module 310, specifically configured to obtain a target feature map through an image feature extraction network included in a content classification model based on a content training sample;
converting the target feature map into a first image feature vector and two-dimensional image features, wherein the first image feature vector is contained in the image feature vector;
generating a second image feature vector according to the two-dimensional image features, wherein the second image feature vector comprises N symbol encoding sequences, a start character encoding sequence and a stop character encoding sequence, the second image feature vector is contained in the image feature vector, and N is an integer greater than or equal to 1;
an obtaining module 310, specifically configured to obtain, based on the first image feature vector, an image probability distribution through an image classification network;
the obtaining module 310 is specifically configured to obtain the content probability distribution through a multi-modal fusion network included in the content classification model based on the text feature vector and the second image feature vector.
In the embodiment of the application, a model training device is provided. By adopting the device, the image is converted into the one-dimensional image feature vector and the two-dimensional image feature vector, so that the image classification task can be carried out on one hand, and on the other hand, the image classification task can be fused with the text feature vector, thereby improving the feasibility and operability of the scheme.
Alternatively, on the basis of the embodiment corresponding to fig. 13, in another embodiment of the model training device 30 provided in the embodiment of the present application,
the obtaining module 310 is specifically configured to perform stitching processing on the text feature vector and the second image feature vector to obtain a target feature vector;
based on the target characteristic vector, the position coding vector and the segment coding vector, obtaining an image-text coding result through a coding network included in the multi-mode fusion network, wherein the multi-mode fusion network is contained in a content classification model, the position coding vector represents the position of each symbol in the characteristic vector, and the segment coding vector represents the category to which the characteristic vector belongs;
and acquiring content probability distribution through a classification network included in the multi-mode fusion network based on the image-text coding result.
In the embodiment of the application, a model training device is provided. By adopting the device, based on the input characteristic of the MMBT, the input space of multi-mode characteristics can be provided to support the simultaneous input of image characteristics and character characteristics, and the characteristics of texts and images are fused by using the self-attention mechanism of BERT, so that a better characteristic fusion effect is achieved, and the accuracy of model prediction and the effect of model prediction are favorably improved.
Alternatively, on the basis of the embodiment corresponding to fig. 13, in another embodiment of the model training device 30 provided in the embodiment of the present application,
the obtaining module 310 is further configured to obtain, after obtaining the content training sample, an audio recognition result through an audio recognition network included in the content classification model based on the content training sample;
an obtaining module 310, configured to obtain, based on the audio recognition result, an audio text feature vector through a text feature extraction network included in the content classification model, where the audio text feature vector is included in the M modal feature vectors;
based on the content training sample, obtaining an image feature vector through an image feature extraction network included in the content classification model, wherein the image feature vector is included in the M modal feature vectors.
In the embodiment of the application, a model training device is provided. By adopting the device, the audio text characteristic vector and the image characteristic vector are respectively introduced into the single-mode supervision, and the classification learning of the text characteristic extraction network and the image characteristic extraction network is increased, so that the multi-mode video classification effect is improved, and the advantage of multi-task joint optimization is exerted.
Optionally, on the basis of the embodiment corresponding to fig. 13, in another embodiment of the model training apparatus 30 provided in the embodiment of the present application, the M single-modality category labels include an audio category label and an image category label;
an obtaining module 310, specifically configured to obtain probability distribution of an audio text through a text classification network based on an audio text feature vector;
based on the image feature vector, obtaining image probability distribution through an image classification network;
an obtaining module 310, configured to obtain content probability distribution through a multi-modal fusion network included in the content classification model based on the audio text feature vector and the image feature vector;
the training module 320 is specifically configured to update the model parameters of the content classification model according to the audio text probability distribution, the audio class label, the image probability distribution, the image class label, the content probability distribution, and the content class label.
In the embodiment of the application, a model training device is provided. By adopting the device, the end-to-end training of the whole network structure is realized, the text branch and the image branch are jointly optimized while the multi-mode classification branch is trained, the text feature extraction network and the image feature extraction network are optimized towards the direction beneficial to multi-mode video classification through multi-branch collaborative training, the single-mode discrimination feature beneficial to the multi-mode classification task is learned, and therefore the multi-task joint training effect is improved.
Alternatively, on the basis of the embodiment corresponding to fig. 13, in another embodiment of the model training device 30 provided in the embodiment of the present application,
the obtaining module 310 is further configured to obtain, after obtaining the content training sample, an audio recognition result through an audio recognition network included in the content classification model based on the content training sample;
the obtaining module 310 is further configured to obtain a text recognition result through a text recognition network included in the content classification model based on the content training sample;
an obtaining module 310, configured to obtain, based on the audio recognition result, an audio text feature vector through a first text feature extraction network included in the content classification model, where the audio text feature vector is included in the M modal feature vectors;
based on the text recognition result, obtaining a text feature vector through a second text feature extraction network included in the content classification model, wherein the text feature vector is contained in the M modal feature vectors;
based on the content training sample, obtaining an image feature vector through an image feature extraction network included in the content classification model, wherein the image feature vector is included in the M modal feature vectors.
In the embodiment of the application, a model training device is provided. By adopting the device, single-modal supervision is respectively introduced for the audio text feature vector, the text feature vector and the image feature vector, and classification learning is added to the first text feature extraction network, the second text feature extraction network and the image feature extraction network, so that the multi-modal video classification effect is improved and the advantage of multi-task joint optimization is exerted.
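Purely for illustration (the encoders below are simple placeholders rather than the networks of the application; the vocabulary size and embedding dimension are assumptions), the wiring of the three feature-extraction branches, taking the recognized audio text, the recognized on-screen text and the video frames as inputs, might be sketched as follows:

```python
import torch
import torch.nn as nn

class ThreeBranchFeatureExtractor(nn.Module):
    """Illustrative wiring of the three feature-extraction branches; all encoders are placeholders."""
    def __init__(self, vocab_size=30000, embed_dim=256):
        super().__init__()
        self.audio_text_encoder = nn.Embedding(vocab_size, embed_dim)  # first text feature extraction network (speech transcript)
        self.ocr_text_encoder = nn.Embedding(vocab_size, embed_dim)    # second text feature extraction network (recognized on-screen text)
        self.image_encoder = nn.Sequential(                            # image feature extraction network
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, embed_dim),
        )

    def forward(self, audio_text_ids, ocr_text_ids, frames):
        # Mean-pool token embeddings into fixed-length modal feature vectors.
        audio_text_vec = self.audio_text_encoder(audio_text_ids).mean(dim=1)
        text_vec = self.ocr_text_encoder(ocr_text_ids).mean(dim=1)
        image_vec = self.image_encoder(frames)  # frames: (batch, 3, H, W)
        return audio_text_vec, text_vec, image_vec  # the M = 3 modal feature vectors
```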
Optionally, on the basis of the embodiment corresponding to fig. 13, in another embodiment of the model training apparatus 30 provided in the embodiment of the present application, the M single-modality category labels include an audio category label, a text category label, and an image category label;
an obtaining module 310, specifically configured to obtain an audio text probability distribution through a first text classification network based on the audio text feature vector;
based on the text feature vector, obtaining text probability distribution through a second text classification network;
based on the image feature vector, obtaining image probability distribution through an image classification network;
an obtaining module 310, configured to obtain content probability distribution through a multi-modal fusion network included in the content classification model based on the audio text feature vector, the text feature vector, and the image feature vector;
the training module 320 is specifically configured to update the model parameters of the content classification model according to the audio text probability distribution, the audio category label, the text probability distribution, the text category label, the image probability distribution, the image category label, the content probability distribution, and the content category label.
In the embodiment of the application, a model training device is provided. By adopting the device, end-to-end training of the whole network structure is realized, and the text branches and the image branch are jointly optimized while the multi-modal classification branch is trained. Through multi-branch collaborative training, the first text feature extraction network, the second text feature extraction network and the image feature extraction network are optimized in a direction beneficial to multi-modal video classification, and single-modal discriminative features beneficial to the multi-modal classification task are learned, thereby improving the effect of multi-task joint training.
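A minimal training-step sketch (illustrative only; the batch keys and loss weights are assumptions, and the model is assumed to return logits for the audio-text, text, image and fused content branches) shows how the four supervision signals contribute to one end-to-end parameter update:

```python
import torch.nn as nn

def training_step(model, optimizer, batch, weights=(1.0, 1.0, 1.0, 1.0)):
    """One end-to-end update: the four branch losses are summed so that a single
    backward pass sends gradients into every feature extraction network."""
    ce = nn.CrossEntropyLoss()
    optimizer.zero_grad()
    # The model is assumed to return logits for the audio-text, text, image and fused content branches.
    audio_logits, text_logits, image_logits, content_logits = model(
        batch["audio_text_ids"], batch["ocr_text_ids"], batch["frames"])
    w_a, w_t, w_i, w_c = weights
    loss = (w_a * ce(audio_logits, batch["audio_label"])
            + w_t * ce(text_logits, batch["text_label"])
            + w_i * ce(image_logits, batch["image_label"])
            + w_c * ce(content_logits, batch["content_label"]))
    loss.backward()   # gradients flow back into every feature extraction network
    optimizer.step()
    return loss.item()
```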
Referring to fig. 14, fig. 14 is a schematic diagram of an embodiment of a content classification device in an embodiment of the present application, and the content classification device 40 includes:
an obtaining module 410, configured to obtain target content;
The obtaining module 410 is further configured to obtain, based on the target content, content probability distribution corresponding to the target content through a content classification model, where the content classification model is obtained by training through the methods in the foregoing aspects;
the classification module 420 is configured to determine a classification result of the target content according to the content probability distribution corresponding to the target content.
In the embodiment of the application, a content classification device is provided. By adopting the device, M single-modal classification branches are added on the basis of the multi-modal classification task to serve as supervision for the corresponding modalities, and a multi-branch collaborative training method is adopted, so that the multi-modal fusion network and the feature extraction network are optimized simultaneously and the networks in the content classification model are balanced, which is beneficial to improving the precision and effect of the multi-modal classification result.
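For illustration only (assuming a trained model that maps the target content to class logits; the function and argument names are hypothetical), determining the classification result reduces to taking the most probable class from the content probability distribution:

```python
import torch

@torch.no_grad()
def classify(model, target_content, class_names):
    """Run the trained content classification model and return the top class (illustrative sketch)."""
    model.eval()
    logits = model(target_content)                 # shape: (num_classes,) or (1, num_classes)
    probs = torch.softmax(logits, dim=-1)          # content probability distribution
    top_prob, top_idx = probs.reshape(-1).max(dim=0)
    return class_names[top_idx.item()], top_prob.item()
```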
For convenience of understanding, please refer to fig. 15, which is a schematic structural diagram of a server provided in this embodiment. The server 500 may vary considerably depending on configuration or performance, and may include one or more Central Processing Units (CPUs) 522 (e.g., one or more processors), a memory 532, and one or more storage media 530 (e.g., one or more mass storage devices) storing an application program 542 or data 544. The memory 532 and the storage medium 530 may be transient storage or persistent storage. The program stored on the storage medium 530 may include one or more modules (not shown), and each module may include a series of instruction operations for the server. Still further, the central processing unit 522 may be configured to communicate with the storage medium 530 and to execute, on the server 500, the series of instruction operations in the storage medium 530.
The server 500 may also include one or more power supplies 526, one or more wired or wireless network interfaces 550, one or more input/output interfaces 558, and/or one or more operating systems 541, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
The steps performed by the server in the above embodiment may be based on the server structure shown in fig. 15.
The model training device and/or the content classification device provided by the present application may be deployed in a terminal device. For convenience of understanding, an embodiment of the present application further provides a terminal device, as shown in fig. 16. For convenience of description, only the parts related to the embodiment of the present application are shown; for specific technical details that are not disclosed, please refer to the method part of the embodiments of the present application. In the embodiment of the present application, a smartphone is taken as an example of the terminal device for description:
Fig. 16 is a block diagram illustrating a partial structure of the smartphone related to the terminal device provided in the embodiment of the present application. Referring to fig. 16, the smartphone includes: a Radio Frequency (RF) circuit 610, a memory 620, an input unit 630, a display unit 640, a sensor 650, an audio circuit 660, a wireless fidelity (WiFi) module 670, a processor 680, and a power supply 690. Those skilled in the art will appreciate that the smartphone structure shown in fig. 16 does not constitute a limitation on the smartphone, which may include more or fewer components than shown, combine some components, or arrange the components differently.
The following describes each component of the smartphone in detail with reference to fig. 16:
The RF circuit 610 may be used for receiving and transmitting signals during information transmission and reception or during a call. In particular, the RF circuit 610 receives downlink information from a base station and delivers it to the processor 680 for processing, and transmits uplink data to the base station. In general, the RF circuit 610 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 610 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to global system for mobile communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), etc.
The memory 620 may be used to store software programs and modules, and the processor 680 executes various functional applications and data processing of the smartphone by running the software programs and modules stored in the memory 620. The memory 620 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data (such as audio data, a phonebook, etc.) created according to the use of the smartphone, and the like. Further, the memory 620 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The input unit 630 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the smartphone. Specifically, the input unit 630 may include a touch panel 631 and other input devices 632. The touch panel 631, also referred to as a touch screen, may collect touch operations of a user (e.g., operations of the user on the touch panel 631 or near the touch panel 631 by using any suitable object or accessory such as a finger or a stylus) thereon or nearby, and drive the corresponding connection device according to a preset program. Alternatively, the touch panel 631 may include two parts of a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 680, and can receive and execute commands sent by the processor 680. In addition, the touch panel 631 may be implemented using various types, such as resistive, capacitive, infrared, and surface acoustic wave. The input unit 630 may include other input devices 632 in addition to the touch panel 631. In particular, other input devices 632 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a mouse, a joystick, and the like.
The display unit 640 may be used to display information input by or provided to the user and various menus of the smartphone. The display unit 640 may include a display panel 641; optionally, the display panel 641 may be configured in the form of a Liquid Crystal Display (LCD), an organic light-emitting diode (OLED) display, or the like. Further, the touch panel 631 may cover the display panel 641; when the touch panel 631 detects a touch operation on or near it, the touch operation is transmitted to the processor 680 to determine the type of the touch event, and the processor 680 then provides a corresponding visual output on the display panel 641 according to the type of the touch event. Although in fig. 16 the touch panel 631 and the display panel 641 are shown as two separate components to implement the input and output functions of the smartphone, in some embodiments the touch panel 631 and the display panel 641 may be integrated to implement the input and output functions of the smartphone.
The smartphone may also include at least one sensor 650, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor that may adjust the brightness of the display panel 641 according to the brightness of ambient light, and a proximity sensor that may turn off the display panel 641 and/or the backlight when the smartphone is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications (such as horizontal and vertical screen switching, related games, magnetometer attitude calibration) for recognizing the attitude of the smartphone, and related functions (such as pedometer and tapping) for vibration recognition; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured on the smart phone, further description is omitted here.
The audio circuit 660, the speaker 661, and the microphone 662 can provide an audio interface between the user and the smartphone. On one hand, the audio circuit 660 may transmit the electrical signal converted from received audio data to the speaker 661, which converts the electrical signal into a sound signal for output; on the other hand, the microphone 662 converts collected sound signals into electrical signals, which are received by the audio circuit 660 and converted into audio data; the audio data is then processed by the processor 680 and sent via the RF circuit 610 to, for example, another smartphone, or output to the memory 620 for further processing.
WiFi is a short-range wireless transmission technology. Through the WiFi module 670, the smartphone can help the user receive and send e-mails, browse web pages, access streaming media, and the like, providing wireless broadband Internet access for the user. Although fig. 16 shows the WiFi module 670, it can be understood that it is not an essential component of the smartphone and may be omitted as needed without changing the essence of the invention.
The processor 680 is a control center of the smart phone, connects various parts of the entire smart phone using various interfaces and lines, and performs various functions of the smart phone and processes data by operating or executing software programs and/or modules stored in the memory 620 and calling data stored in the memory 620. Optionally, processor 680 may include one or more processing units; optionally, the processor 680 may integrate an application processor and a modem processor, wherein the application processor mainly handles operating systems, user interfaces, application programs, and the like, and the modem processor mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into processor 680.
The smart phone further includes a power supply 690 (e.g., a battery) for supplying power to the various components, and optionally, the power supply may be logically connected to the processor 680 through a power management system, so as to implement functions of managing charging, discharging, and power consumption through the power management system.
Although not shown, the smart phone may further include a camera, a bluetooth module, and the like, which are not described herein.
The steps performed by the terminal device in the above-described embodiment may be based on the terminal device configuration shown in fig. 16.
Embodiments of the present application also provide a computer-readable storage medium, in which a computer program is stored, and when the computer program runs on a computer, the computer is caused to execute the method described in the foregoing embodiments.
Embodiments of the present application also provide a computer program product including a program, which, when run on a computer, causes the computer to perform the methods described in the foregoing embodiments.
It can be understood that, in specific implementations of the present application, data related to user information, content samples, and the like requires the user's approval or consent when the above embodiments of the present application are applied to specific products or technologies, and the collection, use and processing of the related data must comply with the relevant laws, regulations and standards of the relevant countries and regions.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (18)

1. A method for training a content classification model, comprising:
obtaining a content training sample, wherein the content training sample corresponds to M marked single-mode category labels and content category labels, and M is an integer greater than 1;
obtaining M modal feature vectors through a feature extraction network included in a content classification model based on the content training sample;
based on the M modal feature vectors, acquiring content probability distribution through a multi-modal fusion network included in the content classification model;
acquiring M modal probability distributions through M modal classification networks based on the M modal feature vectors, wherein the modal feature vector input by each modal classification network has a corresponding relation with the output modal probability distribution;
updating the model parameters of the content classification model according to the M modal probability distributions, the M single modal class labels, the content probability distribution and the content class labels.
2. The training method of claim 1, wherein after the obtaining of the content training samples, the method further comprises:
based on the content training sample, obtaining a text recognition result through a text recognition network included in the content classification model;
the obtaining of M modal feature vectors through a feature extraction network included in a content classification model based on the content training samples includes:
based on the text recognition result, obtaining a text feature vector through a text feature extraction network included in the content classification model, wherein the text feature vector is included in the M modal feature vectors;
based on the content training samples, obtaining image feature vectors through an image feature extraction network included in the content classification model, wherein the image feature vectors are included in the M modal feature vectors.
3. The training method of claim 2, wherein the M single-modality category labels comprise a text category label and an image category label;
the obtaining of M modal probability distributions through M modal classification networks based on the M modal feature vectors includes:
based on the text feature vector, obtaining text probability distribution through a text classification network;
based on the image feature vector, obtaining image probability distribution through an image classification network;
the obtaining of the content probability distribution through the multi-modal fusion network included in the content classification model based on the M modal feature vectors includes:
acquiring the content probability distribution through a multi-modal fusion network included in the content classification model based on the text feature vector and the image feature vector;
updating the model parameters of the content classification model according to the M modal probability distributions, the M single modal class labels, the content probability distributions, and the content class labels, including:
updating the model parameters of the content classification model according to the text probability distribution, the text category label, the image probability distribution, the image category label, the content probability distribution and the content category label.
4. The training method according to claim 2, wherein the obtaining of the text recognition result through the text recognition network included in the content classification model based on the content training sample comprises:
based on the content training sample, obtaining a text detection result through a text detection network included in the text recognition network, wherein the text recognition network is included in the content classification model;
and acquiring the text recognition result through a character recognition network included in the text recognition network based on the text detection result.
5. The training method according to claim 2, wherein the obtaining of the text recognition result through the text recognition network included in the content classification model based on the content training sample comprises:
based on the content training sample, obtaining the probability distribution of each character through a text recognition network included in the content classification model;
and acquiring the text recognition result according to the probability distribution of each character.
6. The training method according to claim 2, wherein the obtaining of the text feature vector through the text feature extraction network included in the content classification model based on the text recognition result comprises:
based on the text recognition result, obtaining S symbol coding sequences through a text feature extraction network included in the content classification model, wherein S is an integer greater than or equal to 1;
and adding a start character coding sequence and a stop character coding sequence to the S symbol coding sequences to obtain the text characteristic vector.
7. The training method according to any one of claims 3 to 6, wherein the obtaining of the image feature vector through the image feature extraction network included in the content classification model based on the content training sample comprises:
based on the content training sample, obtaining a target feature map through an image feature extraction network included in the content classification model;
converting the target feature map into a first image feature vector and a two-dimensional image feature, wherein the first image feature vector is included in the image feature vector;
generating a second image feature vector according to the two-dimensional image features, wherein the second image feature vector comprises N symbol encoding sequences, a start character encoding sequence and a stop character encoding sequence, the second image feature vector is contained in the image feature vector, and N is an integer greater than or equal to 1;
the obtaining of the image probability distribution through the image classification network based on the image feature vector comprises:
obtaining the image probability distribution through the image classification network based on the first image feature vector;
the obtaining of the content probability distribution through the multi-modal fusion network included in the content classification model based on the text feature vector and the image feature vector includes:
and acquiring the content probability distribution through a multi-modal fusion network included by the content classification model based on the text feature vector and the second image feature vector.
8. The training method according to claim 7, wherein the obtaining of the content probability distribution through the multi-modal fusion network included in the content classification model based on the text feature vector and the second image feature vector comprises:
splicing the text characteristic vector and the second image characteristic vector to obtain a target characteristic vector;
based on the target feature vector, the position coding vector and the segment coding vector, obtaining an image-text encoding result through a coding network included in the multi-modal fusion network, wherein the multi-modal fusion network is included in the content classification model, the position coding vector represents the position of each symbol in the feature vector, and the segment coding vector represents the category to which the feature vector belongs;
and acquiring the content probability distribution through a classification network included in the multi-modal fusion network based on the image-text encoding result.
9. The training method of claim 1, wherein after the obtaining of the content training samples, the method further comprises:
based on the content training sample, obtaining an audio recognition result through an audio recognition network included in the content classification model;
the obtaining of M modal feature vectors through a feature extraction network included in a content classification model based on the content training samples includes:
based on the audio recognition result, obtaining audio text feature vectors through a text feature extraction network included in the content classification model, wherein the audio text feature vectors are included in the M modal feature vectors;
based on the content training samples, obtaining image feature vectors through an image feature extraction network included in the content classification model, wherein the image feature vectors are included in the M modal feature vectors.
10. The training method according to claim 9, wherein the M single-modality category labels comprise an audio category label and an image category label;
the obtaining of M modal probability distributions through M modal classification networks includes:
based on the audio text feature vector, acquiring audio text probability distribution through a text classification network;
based on the image feature vector, obtaining image probability distribution through an image classification network;
the obtaining of the content probability distribution through the multi-modal fusion network included in the content classification model based on the M modal feature vectors includes:
acquiring the content probability distribution through a multi-modal fusion network included in the content classification model based on the audio text feature vector and the image feature vector;
the updating the model parameters of the content classification model according to the M modal probability distributions, the M single modal category labels, the content probability distributions, and the content category labels includes:
updating the model parameters of the content classification model according to the audio text probability distribution, the audio category label, the image probability distribution, the image category label, the content probability distribution and the content category label.
11. The training method of claim 1, wherein after the obtaining of the content training samples, the method further comprises:
based on the content training sample, obtaining an audio recognition result through an audio recognition network included in the content classification model;
based on the content training sample, obtaining a text recognition result through a text recognition network included in the content classification model;
the obtaining of M modal feature vectors through a feature extraction network included in a content classification model based on the content training samples includes:
based on the audio recognition result, obtaining an audio text feature vector through a first text feature extraction network included in the content classification model, wherein the audio text feature vector is included in the M modal feature vectors;
based on the text recognition result, obtaining a text feature vector through a second text feature extraction network included in the content classification model, wherein the text feature vector is included in the M modal feature vectors;
based on the content training samples, obtaining image feature vectors through an image feature extraction network included in the content classification model, wherein the image feature vectors are included in the M modal feature vectors.
12. The training method according to claim 11, wherein the M single-modality category labels comprise an audio category label, a text category label and an image category label;
the obtaining of M modal probability distributions through M modal classification networks based on the M modal feature vectors includes:
based on the audio text feature vector, acquiring audio text probability distribution through a first text classification network;
based on the text feature vector, obtaining text probability distribution through a second text classification network;
based on the image feature vector, obtaining image probability distribution through an image classification network;
the obtaining of the content probability distribution through the multi-modal fusion network included in the content classification model based on the M modal feature vectors includes:
acquiring the content probability distribution through a multi-modal fusion network included in the content classification model based on the audio text feature vector, the text feature vector and the image feature vector;
updating the model parameters of the content classification model according to the M modal probability distributions, the M single modal class labels, the content probability distributions, and the content class labels, including:
updating the model parameters of the content classification model according to the audio text probability distribution, the audio category label, the text probability distribution, the text category label, the image probability distribution, the image category label, the content probability distribution and the content category label.
13. A method of content classification, comprising:
acquiring target content;
based on the target content, obtaining content probability distribution corresponding to the target content through a content classification model, wherein the content classification model is obtained by training through the training method of any one of claims 1 to 12;
and determining a classification result of the target content according to the content probability distribution corresponding to the target content.
14. A model training apparatus, comprising:
the content training system comprises an acquisition module, a storage module and a display module, wherein the acquisition module is used for acquiring a content training sample, the content training sample corresponds to M marked single-mode class labels and content class labels, and M is an integer greater than 1;
the obtaining module is further configured to obtain M modal feature vectors through a feature extraction network included in a content classification model based on the content training sample;
the obtaining module is further configured to obtain content probability distribution through a multi-modal fusion network included in the content classification model based on the M modal feature vectors;
the obtaining module is further configured to obtain M modal probability distributions through M modal classification networks based on the M modal feature vectors, where a modal feature vector input by each modal classification network has a corresponding relationship with an output modal probability distribution;
and the training module is used for updating the model parameters of the content classification model according to the M modal probability distributions, the M single modal class labels, the content probability distribution and the content class labels.
15. A content classification apparatus, comprising:
the acquisition module is used for acquiring target content;
the obtaining module is further configured to obtain, based on the target content, content probability distribution corresponding to the target content through a content classification model, where the content classification model is obtained by training using the training method according to any one of claims 1 to 12;
and the classification module is used for determining the classification result of the target content according to the content probability distribution corresponding to the target content.
16. A computer device, comprising: a memory, a processor, and a bus system;
wherein the memory is used for storing programs;
the processor is configured to execute the program in the memory and to perform, according to instructions in the program code, the training method of any one of claims 1 to 12 or the method of claim 13;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
17. A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the training method of any one of claims 1 to 12, or to perform the method of claim 13.
18. A computer program product comprising a computer program and instructions, characterized in that the computer program or instructions, when executed by a processor, implement the training method according to any one of claims 1 to 12 or implement the method according to claim 13.
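For illustration only, and not forming part of the claims (layer sizes, sequence lengths and the use of a Transformer encoder are assumptions), the fusion described in claims 7 and 8 (splicing the text feature vector with the second image feature vector, adding position and segment encodings, encoding the result, and classifying) can be sketched as follows:

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Illustrative sketch of the fusion in claims 7 and 8: concatenate text and image token
    sequences, add position and segment encodings, encode, then classify."""
    def __init__(self, dim=256, max_len=512, num_classes=30, num_layers=2):
        super().__init__()
        self.position_embed = nn.Embedding(max_len, dim)   # position coding vector
        self.segment_embed = nn.Embedding(2, dim)          # segment coding vector: 0 = text, 1 = image
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text_tokens, image_tokens):
        # Splice the text feature vector and the second image feature vector along the sequence axis.
        x = torch.cat([text_tokens, image_tokens], dim=1)   # (batch, L_text + L_image, dim)
        positions = torch.arange(x.size(1), device=x.device)
        segments = torch.cat([torch.zeros(text_tokens.size(1), dtype=torch.long, device=x.device),
                              torch.ones(image_tokens.size(1), dtype=torch.long, device=x.device)])
        x = x + self.position_embed(positions) + self.segment_embed(segments)
        encoded = self.encoder(x)                            # image-text encoding result
        logits = self.classifier(encoded[:, 0])              # classify from the first (start) token
        return torch.softmax(logits, dim=-1)                 # content probability distribution
```

Classifying from the first token mirrors the start symbol coding sequence mentioned in the claims; any pooled representation of the encoded sequence would serve the same illustrative purpose.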


SE01 Entry into force of request for substantive examination