CN115269781A - Modal association degree prediction method, device, equipment, storage medium and program product - Google Patents

Modal association degree prediction method, device, equipment, storage medium and program product

Info

Publication number
CN115269781A
Authority
CN
China
Prior art keywords
content
relevance
sample
classification
modal
Prior art date
Legal status
Pending
Application number
CN202210933409.3A
Other languages
Chinese (zh)
Inventor
邓文超
Current Assignee
Tencent Technology Wuhan Co Ltd
Original Assignee
Tencent Technology Wuhan Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Wuhan Co Ltd filed Critical Tencent Technology Wuhan Co Ltd
Priority to CN202210933409.3A priority Critical patent/CN115269781A/en
Publication of CN115269781A publication Critical patent/CN115269781A/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/55 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Library & Information Science (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a modal relevance prediction method, apparatus, device, storage medium and program product, and relates to the field of machine learning. The method includes: acquiring a sample content set; extracting the second modality feature vector corresponding to the second modality data of each sample content; determining the feature vector centers corresponding to a plurality of sample classifications; determining the distance between a second modality feature vector and its feature vector center as a relevance label between the second modality data and the first modality data in the sample content; and training a candidate relevance identification model based on the relevance labels to obtain a relevance identification model. The relevance identification model is used to identify the relevance between first modality data and second modality data in target content and to extract a semantic feature representation of the target content based on that relevance. The relevance identification model can better learn the association relationships among the modalities in multi-modal content, thereby assisting and enhancing the comprehension of article semantics.

Description

Modal association degree prediction method, device, equipment, storage medium and program product
Technical Field
The embodiments of the present application relate to the field of machine learning, and in particular, to a method, an apparatus, a device, a storage medium, and a program product for predicting a modal relevance.
Background
An information flow article contains content in multiple modalities, such as a text modality and an image modality. Multi-modal pre-training models are continuously being developed and applied to information flow articles to help enhance the understanding of article semantics and improve the accuracy of downstream tasks such as article classification, article label extraction and article quality prediction.
In the related art, the training tasks of a multi-modal pre-training model for information flow articles mainly include a mask restoration task in the text modality and a matching task between the text modality and the image modality.
However, the matching task between the text modality and the image modality depends on cross-modal data in which the text modality and the image modality are highly correlated; data with low correlation impairs the modeling capability of the model, so the training effect of the model is poor.
Disclosure of Invention
The embodiment of the application provides a method, a device, equipment, a storage medium and a program product for predicting modal relevance, which can improve the semantic relevance among the modalities of multi-modal content. The technical scheme is as follows:
in one aspect, a method for predicting modal relevance is provided, where the method includes:
acquiring a sample content set, wherein sample content in the sample content set comprises text modality data and image modality data (as first modality data and second modality data, respectively), and the sample content is labeled with a classification label which is used to indicate the sample classification to which the sample content belongs;
extracting a second modality feature vector corresponding to the second modality data of the sample content;
determining feature vector centers respectively corresponding to the multiple sample classifications based on second modal feature vectors corresponding to sample contents belonging to the same sample classification;
determining a distance between a second modality feature vector corresponding to the second modality data and a feature vector center of a sample classification corresponding to the second modality data, wherein the distance is used as an association degree label between the second modality data and the first modality data in the sample content;
training a candidate relevance identification model based on the sample content and relevance labels corresponding to the sample content to obtain a relevance identification model, wherein the relevance identification model is used for identifying the relevance between first modal data and second modal data in target content, and extracting semantic feature representation of the target content based on the relevance, and the semantic feature representation is used for representing the semantics of the target content.
In another aspect, a modality association degree prediction apparatus is provided, the apparatus including:
an acquisition module, configured to acquire a sample content set, where the sample content in the sample content set includes first modality data and second modality data, and the sample content is labeled with a classification label used to indicate the sample classification to which the sample content belongs;
the extraction module is used for extracting a second modality feature vector corresponding to the second modality data of the sample content;
the determining module is used for determining characteristic vector centers corresponding to a plurality of sample classifications respectively based on second modal characteristic vectors corresponding to sample contents belonging to the same sample classification; determining a distance between a second modality feature vector corresponding to the second modality data and a feature vector center of a sample classification corresponding to the second modality data as an association degree label between the second modality data and the first modality data in the sample content;
the training module is used for training a candidate relevance identification model based on the sample content and relevance labels corresponding to the sample content to obtain a relevance identification model, the relevance identification model is used for identifying relevance between first modal data and second modal data in target content, semantic feature representation of the target content is extracted based on the relevance, and the semantic feature representation is used for representing semantics of the target content.
In another aspect, a computer device is provided, which includes a processor and a memory, where at least one instruction, at least one program, a set of codes, or a set of instructions is stored in the memory, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by the processor to implement the modal relevance prediction method according to any of the embodiments of the present application.
In another aspect, a computer-readable storage medium is provided, in which at least one instruction, at least one program, a set of codes, or a set of instructions is stored, which is loaded and executed by a processor to implement the method for predicting a modality association degree as described in any one of the embodiments of the present application.
In another aspect, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the modal relevance prediction method in any of the above embodiments.
The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:
the candidate relevance degree recognition model is trained by assisting classification information of the multi-modal content and combining cross-modal content corpus data to obtain a relevance degree recognition model, the relevance degree between first modal data and second modal data in the multi-modal content is recognized, and semantic feature representation of the multi-modal content is extracted based on the relevance degree. The semantic comprehension capability of multi-modal content taking information flow articles as an example is enhanced, and the accuracy of tasks such as article classification, article label extraction, article quality prediction and the like is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present application, and that those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a diagram illustrating a process of training a relevance recognition model in the related art according to an exemplary embodiment of the present application;
FIG. 2 is a schematic diagram of a training process in a context of an information flow article in the related art according to an exemplary embodiment of the present application;
FIG. 3 is a schematic illustration of an implementation environment provided by an exemplary embodiment of the present application;
FIG. 4 is a flowchart of a correlation recognition model training process in the related art according to an exemplary embodiment of the present application;
FIG. 5 is a flow diagram of an iterative training process for a relevance recognition model provided based on the embodiment shown in FIG. 4;
FIG. 6 is a flow chart illustrating the extraction and application of semantic features of multimodal content as provided by another exemplary embodiment of the present application;
FIG. 7 is a flowchart of a process for constructing a category label and a hierarchical construction process of a multi-level category sub-label based on the category label in the related art according to an embodiment of the present application;
FIG. 8 is a diagram of semantic vectors for outputting multimodal content in the related art as provided by an exemplary embodiment of the present application;
fig. 9 is a block diagram illustrating a configuration of a modality association degree prediction apparatus according to an exemplary embodiment of the present application;
fig. 10 is a block diagram illustrating a configuration of a modality association degree prediction apparatus according to another exemplary embodiment of the present application;
FIG. 11 is a block diagram of a computer device provided in an exemplary embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level technologies and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, extraction technologies for large-scale feature representations, operation/interaction systems and mechatronics. Artificial intelligence software technologies mainly include computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
In the embodiment of the present application, in the field of artificial intelligence, multimodal content refers to content containing data of multiple modalities, such as: taking the information flow article as an example, the information flow article generally includes text modality data and image modality data, where there is strong or weak correlation between the text modality data and the image modality data. Such as: when the text modality data is about a constellation and the image modality data is about an animal, the correlation between the text modality data and the image modality data is weak; when the text mode data is related to a constellation and the image mode data is related to a star map, the correlation between the text mode data and the image mode data is strong.
In the related art, when feature extraction is performed on multi-modal content, multiple modality data in the multi-modal content is input into a feature extraction network for feature extraction, and the extracted features are combined, so that the multi-modal content is analyzed, for example: predicting a topic corresponding to the multimodal content, analyzing a recommendation probability between the multimodal content and a specified user, and the like.
However, the feature extracted from the multi-modal content is less effective due to the weak correlation between the modal data in the multi-modal content, and thus the downstream application of the extracted feature is affected, such as: the topic prediction accuracy is low, the user recommendation accuracy is low, and the like.
In the embodiment of the application, the relevance among the modal data in the multi-modal content is predicted, the feature extraction is carried out on the multi-modal content according to the relevance among the modal data, and the problem that the feature extraction effect on the multi-modal content is poor under the condition that the relevance among the modal data is low is solved.
Schematically, as shown in fig. 1, taking analysis of an information flow article as an example, in the embodiment of the present application, a sample corpus 100 is first obtained, where the sample corpus 100 includes an article corpus labeled with a classification label, and the article corpus includes text modal data and image modal data meeting a requirement of relevance.
Image feature vectors 110 corresponding to the image modality data in each article corpus are extracted; the image feature vectors are clustered in the feature space according to the classification labels of the article corpora to obtain the feature vector center 120 corresponding to each classification, and the degree of association 130 between the image modality data and the text modality data in an article corpus is determined according to the distance between its image feature vector 110 and the feature vector center 120.
Based on the degree of association 130 between the image modality data and the text modality data, a machine learning model is trained for downstream applications including, but not limited to: article classification, article keyword extraction, article quality prediction, article recommendation and the like.
Schematically, as shown in fig. 2, an information flow article entitled "Capricorn fortune today" is taken as an example. The information flow article 200 includes text modality content 210 and image modality content 220. For the pre-training model, the input content of the text modality generally uses the article title and the article body of the information flow article, truncated according to a maximum length threshold set in advance for the model; in this example article, the threshold is set so that the first 256 characters of the input text content are retained.
Schematically, fig. 2 is a schematic diagram of the relevance analysis process proposed in the present application. The text modality content 210 is "Capricorn today", that is, the title of the information flow article, and the image modality content 220 is a captured image from the information flow article. Generally, a specified article illustration of the information flow article is selected; the cover picture or other article illustrations of the information flow article can also be selected.
As shown in fig. 2, the left part is the input part of the text modality content 210 and the right part is the input part of the image modality content 220. In the text modality content 210, the 7 characters of "Capricorn today" are split and, after passing through a feature conversion model 230, become the character features 240: C, T1, T2, T3, T4, T5 and T6. The 2nd character of the text modality content 210 "Capricorn today" is covered by a mask, and the masked character is predicted and restored by the feature conversion model 230.
The right part is the input part of the image modality content 220. First, two representative pictures in the information flow article are passed through a pre-trained model 250 to obtain the corresponding picture modality vector 251 and picture modality vector 252; after the two representative pictures are analyzed by the feature conversion model 230, a relevance analysis result between the text modality content 210 and each image modality content 220 is obtained.
Next, the implementation environment related to the embodiments of the present application is described. Schematically, referring to fig. 3, the implementation environment involves a terminal 310, a server 320, and a communication network 330 connecting the terminal 310 and the server 320.
In some embodiments, the terminal 310 is used to send multimodal content to the server 320. In some embodiments, the terminal 310 has an application program with a multi-modal content analysis function (e.g., a topic prediction function for an information stream article including multi-modal content), and illustratively, the terminal 310 has an application program with a news topic prediction function. Such as: the terminal 310 is installed with a search engine program, a travel application program, a life support application program, an instant messaging application program, a video program, a game program, a news application program, and the like, which are not limited in the embodiment of the present application.
After obtaining the multi-modal content, the server 320 obtains a correlation analysis result by analyzing the correlation between the modal data in the multi-modal content, and extracts a feature vector corresponding to the multi-modal content based on the correlation analysis result, so as to be applied to a downstream multi-modal content analysis task, such as: extraction of multimodal content tags, multimodal content classification, multimodal content recommendation, and the like.
The relevance analysis refers to analyzing relevance between different modal data in the multi-modal content, such as: and performing relevance analysis on text modality data and image modality data in the information flow article.
The terminal may be a mobile phone, a tablet computer, a desktop computer, a portable notebook computer, an intelligent television, a vehicle-mounted terminal, an intelligent home device, and other terminal devices in various forms, which is not limited in the embodiment of the present application.
It should be noted that the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.
Cloud technology is a hosting technology that unifies resources such as hardware, software and networks in a wide area network or a local area network to realize the calculation, storage, processing and sharing of data. Cloud technology is the general term for the network technology, information technology, integration technology, management platform technology, application technology and the like applied in the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support. Background services of technical network systems, such as video websites, picture websites and other web portals, require a large amount of computing and storage resources. With the rapid development and application of the internet industry, each article may have its own identification mark and needs to be transmitted to a background system for logical processing; data of different levels are processed separately, and all kinds of industry data require strong system background support, which can only be realized through cloud computing.
In some embodiments, the servers described above may also be implemented as nodes in a blockchain system.
In combination with the above introduction of terms and application scenarios, the modal relevance prediction method provided by the present application is described below. The method may be executed by a server or a terminal, or by the server and the terminal together.
Step 401, a sample content set is obtained, sample content in the sample content set includes first modality data and second modality data, and the sample content is labeled with a classification label.
In some embodiments, the sample content set includes a plurality of sample contents, and each sample content is composed of multi-modal data, including first-modal data and second-modal data. Optionally, the first modality data and the second modality data belong to two different modalities.
Optionally, the sample content in the sample content set may be implemented as an information flow article, where the information flow article includes text content as a main body and includes image content as a matching graph, where the text content is text modality data and the image content is image modality data.
Optionally, the sample content in the sample content set may also be implemented as social published content on a social platform, that is, public content published on the social platform by a social account of that platform, such as: social video and social text published by the social account on the social platform; or a social image and social text published by the social account on the social platform; or social voice, social images and the like published by the social account on the social platform, which is not limited in the embodiment of the present application. The social video is composed of a plurality of image frames; the image frames or the social images serve as image modality data, the social text serves as text modality data, and the social voice can serve as voice modality data or, after speech-to-text processing, as text modality data.
It should be noted that the form of the sample content is only an illustrative example, and the present embodiment does not limit the form.
Optionally, the first modality data and the second modality data are included in sample content in the sample content set. In this embodiment, a sample content set is implemented as an information flow article corpus for explanation, where the information flow article corpus includes a plurality of information flow articles, and each information flow article includes text modality data as first modality data and image modality data as second modality data.
In some embodiments, the sample content in the sample content set is pre-selected content, and the pre-selected sample content meets the modal relevance requirement. That is, in the pre-selected sample content, the first modality data and the second modality data meet the requirement on the degree of association; for example, contents in which the text modality data and the image modality data are strongly correlated are pre-selected to constitute the sample contents and added to the sample content set.
Optionally, when sample content is obtained, a classification label is labeled to the sample content, where the classification label is used to represent a classification to which the sample content belongs, and taking text modality data and image modality data as an example, when there is a strong correlation between the text modality data and the image modality data, the classification label can express both a text topic classification of the text modality data and an image content classification of the image modality data. Such as: the method comprises the steps of obtaining text content with a theme surrounding a constellation as text modal data, obtaining a constellation star map as image modal content, forming sample content by the text modal data and the image modal data, and labeling a classification label of the constellation so as to represent that the text modal data and the image modal data in the sample content are related to the constellation.
In the embodiment of the application, the sample content set is screened in advance, the sample content in the sample content set not only contains the first modality data and the second modality data, but also has the sample classification label, and the classification label can clearly represent the sample content, so that the sample content can be better understood, and the relation between the first modality data and the second modality data of the sample content can be more conveniently researched in the subsequent training process.
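For illustration only, the following is a minimal Python sketch of how a pre-screened sample content record of the kind described in step 401 might be organized. The field names and example values are hypothetical; the embodiment only requires that each sample content carry first modality data, second modality data and a classification label.

from dataclasses import dataclass

@dataclass
class SampleContent:
    first_modality_data: str       # e.g. the text of an information flow article
    second_modality_data: str      # e.g. a path to the article's illustration image
    classification_label: str      # e.g. "constellation"

# A hypothetical sample content set containing one pre-screened sample.
sample_content_set = [
    SampleContent(
        first_modality_data="Capricorn today's fortune ...",
        second_modality_data="article_illustration.jpg",
        classification_label="constellation",
    ),
]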
Step 402, extracting a second modality feature vector corresponding to the second modality data of the sample content.
The second modality data in the sample content is input into a pre-trained feature extraction model, which outputs the second modality feature vector corresponding to the second modality data.
It should be noted that the embodiment of the present application is described by taking the second modality feature vector of the second modality data as an example; in some embodiments, the first modality feature vector of the first modality data may also be extracted, and the first modality feature vectors may be clustered and their centers determined based on the classification labels, which is not limited in this embodiment.
In some embodiments, the feature extraction model is a pre-trained model for feature extraction for second modality data. Illustratively, taking the second modality data as the image modality data as an example, the feature extraction model is an image feature extraction model. Optionally, the image feature extraction model is used for extracting texture features, color features, shape features, spatial relationship features, and the like of the image modality data.
In the embodiment of the application, a second modal feature vector corresponding to second modal data of sample content is extracted, the second modal data in different forms in the sample content is converted into the feature vector, various features of the second modal data are converted into numbers, and the numbers are used for representing the second modal data of the sample content more intuitively and concisely.
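As an illustration of step 402, the following is a minimal sketch that uses a torchvision ResNet-50 as a stand-in for the pre-trained feature extraction model; the embodiment does not prescribe a specific network, so the choice of model, the 2048-dimensional output and the preprocessing steps are assumptions made for illustration.

import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Drop the classification head so the network outputs a 2048-dimensional feature vector.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-1]).eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_second_modality_vector(image_path):
    # Returns the second modality (image) feature vector for one sample content.
    image = Image.open(image_path).convert("RGB")
    with torch.no_grad():
        features = feature_extractor(preprocess(image).unsqueeze(0))
    return features.flatten()   # shape: (2048,)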
Step 403, determining feature vector centers respectively corresponding to the multiple sample classifications based on the second modal feature vectors corresponding to the sample contents belonging to the same sample classification.
Optionally, after the second modality feature vector corresponding to the sample content is extracted, because the sample content itself is labeled with a classification label that indicates its sample classification, the second modality feature vectors of sample contents labeled with the same classification label (that is, belonging to the same sample classification) are clustered in the feature space based on the classification labels, and a clustering result is obtained for each sample classification. For the clustering result of each sample classification, the feature center of the second modality feature vectors in that result is taken as the feature vector center corresponding to the sample classification. That is, the second modality feature vectors corresponding to the sample contents belonging to the same sample classification are averaged to obtain the feature vector centers corresponding to the plurality of sample classifications.
Illustratively, sample content belonging to a classification label "constellation" includes sample content a, sample content B and sample content C, where feature 1 is extracted from an image a in the sample content a, feature 2 is extracted from an image B in the sample content B, feature 3 is extracted from an image C in the sample content C, and the feature 1, the feature 2 and the feature 3 are averaged in a feature space, that is, elements at the same position in 3 feature vectors are calculated and averaged, so as to obtain a feature vector center corresponding to the classification label "constellation".
It should be noted that, in the above example, the sample content includes an image as an example for illustration, in some embodiments, the first modality data and the second modality data in the sample content may be implemented as one or more data, and this embodiment is not limited to this, for example: the information flow article comprises a piece of text content and a plurality of image contents.
In the embodiment of the application, the feature vector center corresponding to each sample classification is determined, so that sample classifications belonging to different fields are represented numerically and presented in vector form, which represents the sample classification to which the sample content belongs more intuitively and concisely.
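The following minimal sketch illustrates step 403 under the assumption that the feature vector center of a sample classification is simply the element-wise mean of the second modality feature vectors belonging to that classification, as described above; the function and variable names are illustrative.

import numpy as np
from collections import defaultdict

def class_feature_centers(feature_vectors, class_labels):
    # feature_vectors: list of 1-D arrays; class_labels: parallel list of label strings.
    grouped = defaultdict(list)
    for vector, label in zip(feature_vectors, class_labels):
        grouped[label].append(vector)
    # Element-wise mean over every second modality feature vector in the same classification.
    return {label: np.mean(np.stack(vectors), axis=0)
            for label, vectors in grouped.items()}

# Toy example mirroring the "constellation" case: three image feature vectors, one label.
centers = class_feature_centers(
    [np.array([0.2, 0.9]), np.array([0.4, 0.7]), np.array([0.3, 0.8])],
    ["constellation", "constellation", "constellation"],
)
# centers["constellation"] -> array([0.3, 0.8])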
Step 404, determining a distance between a second modality feature vector corresponding to the second modality data and a feature vector center of a sample classification corresponding to the second modality data, as an association degree label between the second modality data and the first modality data in the sample content.
Optionally, all sample contents in the sample content set are pre-screened, and each sample content belongs to a respective sample category. The first-modality data and the sample classification in the sample content are strongly correlated, so the degree of association between the first-modality data and the second-modality data in each sample content is converted into the degree of association between the sample classification and the second-modality data.
In the related art, when it is predicted in the multi-modal model whether the first modal data and the second modal data match, a hard tag with a correlation degree of 0 or 1 is marked on the first modal data and the second modal data. Label 0 represents that the first modality data and the second modality data do not match, i.e., the first modality data and the second modality data are not correlated; and tag 1 represents that the first modality data and the second modality data match, i.e., that the first modality data and the second modality data are related. The above method of labeling the relevance degree between the first modality data and the second modality data directly determines whether the first modality data and the second modality data are related, and an error is likely to occur in a scene where the first modality data and the second modality data are strongly related but are not strictly matched.
Therefore, in the embodiment of the present application, instead of directly labeling the first modality data and the second modality data in the sample content with a correlation degree of 0 or 1, a correlation degree value between the first modality data and the second modality data in the sample content is predicted, which is specifically represented by calculating a distance between a second modality feature vector corresponding to the second modality data of the sample content and centers of all second modality feature vectors in a sample classification corresponding to the sample content, and using the distance as the correlation degree between the second modality data and the first modality data.
Optionally, after determining the feature vector center corresponding to each sample classification in the feature space, for the target sample content, since the target sample content is labeled with the classification label, calculating a distance between the feature vector center corresponding to the classification label of the target sample content and the second modality feature vector corresponding to the second modality data in the target sample content, and taking the distance as an association label between the first modality data and the second modality data in the target sample content.
Illustratively, the second modality feature vector corresponding to the second modality data of the target sample content is T1, and the sample to which the second modality feature vector belongs is classified as a "constellation". In the sample content set, all sample contents belonging to the "constellation" are 3, and the second modality feature vectors corresponding to the respective second modality data in the 3 sample contents include T1, T2, and T3.
The vectors T1, T2 and T3 are added and averaged, and the resulting feature vector center of the "constellation" classification is T; the cosine similarity between T1 and T is then calculated, giving a cosine similarity of 0.8. In some embodiments, the cosine similarity may also serve as the distance between T1 and T. This cosine similarity is taken as the degree of association between the first modality data and the second modality data in the target sample content.
Optionally, the first modality data is text modality data and the second modality data is image modality data, and in some embodiments, the first modality data and the second modality data may be data of any modality, but the first modality data and the second modality data are data of different modalities.
It should be noted that the second modal feature vector corresponding to the second modal data in the sample content is a multi-dimensional vector, and a distance between the second modal feature vector corresponding to the second modal data in the sample content and a feature vector center of the sample classification corresponding to the second modal data may be any value, that is, the association degree tag may be any value, which is not limited in this embodiment.
In the embodiment of the application, through screening in advance, the first modal data and the sample classification of all sample contents in the sample content set are strongly correlated, so that the correlation degree between the first modal data and the second modal data of the sample contents is converted into the correlation degree between the second modal data and the sample classification of the sample contents, the second modal data and the sample classification are represented in a vector form, the distance between the second modal feature vector corresponding to the second modal data and the feature vector center of the sample classification corresponding to the second modal data is used as a correlation degree label between the second modal data and the first modal data in the sample contents, the correlation degree is presented in a digital form, and the relationship between the first modal data and the second modal data in the sample contents is more intuitively and concisely represented.
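The following minimal sketch illustrates step 404 under the assumption that, as in the T1/T example above, the cosine similarity between a second modality feature vector and its class feature vector center is used as the relevance label; variable names are illustrative.

import numpy as np

def relevance_label(second_modality_vector, class_center):
    # Cosine similarity between the sample's second modality feature vector and the
    # feature vector center of its sample classification.
    cosine = np.dot(second_modality_vector, class_center) / (
        np.linalg.norm(second_modality_vector) * np.linalg.norm(class_center)
    )
    return float(cosine)

# e.g. relevance_label(T1, T) would play the role of the 0.8 value in the example above.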
Step 405, training the candidate relevance degree recognition model based on the sample content and the relevance degree label corresponding to the sample content to obtain the relevance degree recognition model.
Optionally, the relevance identification model obtained after the training process is used to identify the relevance between the first modality data and the second modality data in the target content, and a semantic feature representation of the target content may also be extracted based on the relevance, where the semantic feature representation is used to represent the semantics of the target content.
Alternatively, the semantic feature representation may have functions including, but not limited to, the following:
1. inputting the semantic feature representation of the target content into a keyword extraction model, and extracting to obtain keywords of the target content;
2. inputting the semantic feature representation of the target content into a classification model, classifying the target content, and outputting to obtain a classification result of the target content, wherein the classification model comprises a preset classification set, and the semantic feature representation of the target content is matched with the classification in the classification set to obtain a classification result corresponding to the target content;
3. and inputting the semantic feature representation of the target content into a recommendation model, and matching the semantic feature representation with the user features, thereby recommending the target content meeting the matching degree requirement to the user.
The target content is multi-modal content including first modal data and second modal data.
Optionally, after the sample content and the relevance label (i.e. a numerical value) corresponding to the sample content are obtained, the relevance label is retained as information about the target sample for later use. The target sample, which includes first modality data and second modality data, is input into the candidate relevance identification model; the candidate relevance identification model analyzes and outputs the predicted relevance between the first modality data and the second modality data in the target sample. The predicted relevance and the relevance label are input into a preset loss function to obtain a relevance loss value, and the relevance loss value is fed back to the relevance identification model to train the model parameters. Illustratively, the relevance label between the first modality data and the second modality data in the target sample content S is 0.8; after the target sample content S is input into the candidate relevance identification model M, the predicted relevance between the first modality data and the second modality data in the target sample is output as 0.7; the relevance label and the predicted relevance are input into a preset loss function F(x), which gives a relevance loss value of 0.1; the relevance loss value 0.1 is fed back to the candidate relevance identification model M, and subsequent training continues.
It should be noted that the predicted relevance output by the candidate relevance recognition model may be any value, the preset loss function F (x) may be any function meeting the actual requirement, and the relevance loss value calculated by the loss function F (x) may also be any value, which is not limited in this embodiment.
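As a concrete illustration of the loss computation described above, the following minimal sketch assumes an absolute-error loss as the preset loss function F(x); the embodiment leaves F(x) unspecified, so this choice is an assumption.

def relevance_loss(predicted_relevance, relevance_label):
    # Difference between the predicted relevance output by the candidate model
    # and the relevance label of the target sample content.
    return abs(predicted_relevance - relevance_label)

# Matches the worked example above: label 0.8, prediction 0.7 -> loss value 0.1.
print(round(relevance_loss(0.7, 0.8), 6))   # 0.1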
In summary, in the method provided by this embodiment, the sample content and the relevance label corresponding to the sample content are used to train the candidate relevance identification model, and the sample content in the sample content set is the content labeled with the classification label, that is, the first modality content and the second modality content are both the content associated with the classification label, so that the sample content in the sample content set is used to train the candidate relevance identification model, and the result of predicting relevance output by the candidate relevance identification model can be more accurate.
Moreover, the loss value is calculated from the predicted relevance and the relevance label between the first modality data and the second modality data, and the loss value is fed back to the candidate relevance identification model for parameter updating and continued training, so that the predicted relevance output by the candidate relevance identification model becomes closer to the relevance label, the error becomes smaller, and a continuous optimization process of the model is realized.
In an optional embodiment, the candidate relevance identification model is trained based on the loss value calculated from the relevance label and the predicted relevance. As shown in fig. 5, step 405 above can also be implemented as the following steps:
step 4051, obtain target sample content in the sample content set, where the target sample content includes target first-modality data and target second-modality data.
In some embodiments, the sample content set includes a plurality of sample contents, and each sample content is composed of multimodal data including first modality data and second modality data. Optionally, the first modality data and the second modality data belong to two different modalities.
Optionally, the sample content in the sample content set may be implemented as an information flow article, where the information flow article includes text content as a main body and includes image content as a matching graph, where the text content is text modality data and the image content is image modality data.
Optionally, the first modality data and the second modality data are both included in sample content in the sample content set. In this embodiment, a sample content set is implemented as an information flow article corpus, which is taken as an example for explanation, and then the information flow article corpus includes a plurality of information flow articles, and each information flow article includes text mode data as first mode data and image mode data as second mode data.
It should be noted that the first modality data and the second modality data of the sample content are only exemplary, and the embodiment does not limit the present invention.
Step 4052, inputting the target sample content into the candidate relevance degree identification model, and outputting to obtain the predicted relevance degree between the target first modality data and the target second modality data.
The candidate relevance identification model is a preset model to be trained currently, first modal data and second modal data of the target sample content are input, cosine similarity is output by the model, and the cosine similarity represents the similarity between the first modal data and the second modal data in the target sample content and serves as the predicted relevance between the first modal data and the second modal data in the target sample content. The cosine similarity is to measure the similarity between two characteristic vectors by measuring a cosine value of an included angle between the two characteristic vectors, wherein the two characteristic vectors are respectively a characteristic vector extracted from the first modality data and a characteristic vector extracted from the second modality data.
Optionally, the first modality data and the second modality data of the sample content are presented in a vector form after being converted by the candidate relevance degree identification model, and a cosine similarity between second modality feature vectors corresponding to the first modality data content and the second modality data is calculated, and the cosine similarity is used as output content of the candidate relevance degree identification model, that is, a predicted relevance degree between the first modality data and the second modality data of the target sample content.
Illustratively, the first modality data content in the target sample content is the text content "Capricorn today's fortune", and the second modality data content in the target sample content is the image content "picture 1". The candidate relevance identification model has a first feature extraction network, which is used to convert the image content "picture 1" of the second modality data in the target sample content into the corresponding second modality feature vector "IMG1". The relevance identification model also has a second feature extraction network, such as a Transformer network, which is used to extract the feature vector of the first modality data and to calculate the cosine similarity between the first modality data and the second modality data of the target sample content, specifically: the cosine similarity between the first modality text content "Capricorn today's fortune" and the second modality data "picture 1" is calculated. This cosine similarity is the output content, i.e. the predicted relevance between the first modality data and the second modality data of the target sample content.
It should be noted that, in the above example, the target sample content includes a text content and an image content, in some embodiments, the first modality data and the second modality data in the target sample content may be implemented as one or more data, and this embodiment is not limited thereto. Furthermore, in the above example, the first modality data in the target sample content is taken as a text content and the second modality data is taken as an image content, in some embodiments, the first modality data may be any content such as a text, an image, and a video, and the second modality data may also be any content such as a text, an image, and a video, which is not limited in this embodiment. However, the first modality data and the second modality data are specific to different types of modality data. If the first modality data or the second modality data is the video content, performing a key frame extraction operation on the video content, that is, performing screenshot on the video key frame, converting the screenshot into an image content form, and performing the operation according to the method of the embodiment.
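The following minimal PyTorch sketch illustrates step 4052: the candidate relevance identification model encodes the first modality (text) data and the second modality (image) data into a shared vector space and outputs their cosine similarity as the predicted relevance. The concrete encoders (a small Transformer encoder for text, a linear projection for a pre-extracted image feature vector) and all dimensions are assumptions made for illustration.

import torch
import torch.nn as nn

class CandidateRelevanceModel(nn.Module):
    def __init__(self, vocab_size=10000, dim=256, image_feature_dim=2048):
        super().__init__()
        self.text_embedding = nn.Embedding(vocab_size, dim)
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # Projects a pre-extracted image feature vector (e.g. "IMG1") into the shared space.
        self.image_projection = nn.Linear(image_feature_dim, dim)

    def forward(self, token_ids, image_feature):
        text_vector = self.text_encoder(self.text_embedding(token_ids)).mean(dim=1)
        image_vector = self.image_projection(image_feature)
        # Predicted relevance = cosine similarity between the two modality vectors.
        return nn.functional.cosine_similarity(text_vector, image_vector, dim=-1)

model = CandidateRelevanceModel()
predicted_relevance = model(
    torch.randint(0, 10000, (1, 7)),   # e.g. the 7 characters of "Capricorn today's fortune"
    torch.randn(1, 2048),              # e.g. the feature vector extracted from "picture 1"
)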
Step 4053, obtaining a relevance loss value based on the relevance label and the predicted relevance labeled on the target sample content, where the relevance loss value is used to represent a difference between the relevance label and the predicted relevance.
Optionally, the relevance loss value is obtained by inputting the relevance label between the first modality data and the second modality data in the target sample content and the predicted relevance between the first modality data and the second modality data output by the relevance identification model into a preset loss function F(x); the output result is the relevance loss value.
Illustratively, the relevance degree label between the first modality data and the second modality data in the target sample content is 0.8, the predicted relevance degree between the first modality data and the second modality data output by the relevance degree identification model is 0.7, and the relevance degree label and the predicted relevance degree are input into a preset loss function F (x) to obtain a relevance degree loss value of 0.1.
It should be noted that, in the above example, the relevance label between the first modality data and the second modality data is 0.8, and the predicted relevance between the first modality data and the second modality data output through the relevance identification model is 0.7, in some embodiments, the relevance label and the predicted relevance may be any values, which is not limited in this embodiment. The correlation degree loss value obtained based on the correlation degree label and the predicted correlation degree may be any value, and the preset loss function may be any function meeting the actual requirement, which is not limited in this embodiment.
Step 4054, training the candidate relevance degree recognition model based on the relevance degree loss value to obtain a relevance degree recognition model.
Optionally, training the candidate relevance recognition model based on the relevance loss value means updating and iteratively training model parameters in the candidate relevance recognition model based on the relevance loss value. Optionally, the model parameters of the candidate relevance recognition model are iteratively updated and trained based on a gradient descent method.
Illustratively, a relevance loss value is obtained based on a relevance label labeled by target sample content and the predicted relevance, model parameters in the candidate relevance recognition model are adjusted based on the relevance loss value under the condition that the relevance loss value does not meet the training requirement, and further iterative training is carried out based on the candidate relevance recognition model after the parameters are adjusted until the loss value corresponding to the predicted relevance between the first modal data and the second modal data of the output target sample content meets the training requirement. Optionally, the loss value meeting the training requirement comprises: 1. the loss value is converged; 2. the loss value is less than a preset threshold value; 3. the number of iterative training reaches the required number, and the like, which is not limited in this embodiment.
It should be noted that, the relevance degree label between the first modality data and the second modality data of the target sample content, and the predicted relevance degree of the first modality data and the second modality data obtained by the target sample content through the candidate relevance degree identification model, and the relevance degree loss value obtained based on the relevance degree label and the predicted relevance degree may be any value; in addition, the predicted relevance in the above example converges to a certain value or a preset threshold, and the converged value and the threshold may be any values, which is not limited in this embodiment.
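The following minimal sketch illustrates steps 4053 to 4054 as an iterative gradient-descent training loop, assuming the illustrative CandidateRelevanceModel sketched above and an absolute-error loss; the stopping criterion follows the options listed above (here, a loss value below a preset threshold or a maximum number of iterations).

import torch

def train_relevance_model(model, samples, relevance_labels,
                          learning_rate=1e-4, loss_threshold=0.05, max_iterations=1000):
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)   # gradient descent
    for iteration in range(max_iterations):
        total_loss = 0.0
        for (token_ids, image_feature), label in zip(samples, relevance_labels):
            predicted = model(token_ids, image_feature)
            loss = torch.abs(predicted - label).mean()   # relevance loss value
            optimizer.zero_grad()
            loss.backward()      # feed the relevance loss value back into the model
            optimizer.step()     # update the model parameters
            total_loss += loss.item()
        if total_loss / len(samples) < loss_threshold:
            break                # training requirement met: loss below the preset threshold
    return model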
In summary, in the method provided in this embodiment, a relevance loss value is calculated through a preset loss function based on the relevance label between the first modality data and the second modality data in the target sample content and the predicted relevance output for the target sample content by the candidate relevance identification model, and the relevance loss value is fed back to the candidate relevance identification model. The error between the result predicted by the model and the actual result is thereby fully taken into account: feeding the loss value back to the candidate relevance identification model updates its parameters, and an iterative training process is added, so that the model is continuously optimized and the error between the predicted relevance and the relevance label becomes smaller and smaller. This further enhances the model's comprehension of the multi-modal target sample content and improves the modeling of the target sample content, thereby improving the content comprehension accuracy for the whole information flow article.
In an optional embodiment, the relevance recognition model in the embodiment of the present application may further extract semantic feature representations of multi-modal content, output semantic vectors of the multi-modal content, and apply those semantic vectors. Fig. 6 is a schematic flow chart of extracting and applying the semantic feature representations of multi-modal content in the embodiment of the present application; the specific process is as follows:
Step 601, obtaining semantic similarity distribution corresponding to the classification label library based on the semantic similarity relation among the classification labels in the classification label library.
Optionally, the classification label library includes the classification labels corresponding to the multi-modal contents in the multi-modal content set. Classification prediction is performed on the target multi-modal content, the probability of the target multi-modal content appearing on each dimension and each classification label is counted, and the topic representation content corresponding to each classification label is determined; the topic representation content is used to represent the implicit topic semantics corresponding to the classification label. The target multi-modal content comprises first modality data and second modality data.
Illustratively, information flow articles are selected, and 5 information flow articles with the highest occurrence probability under each classification label are used as the topic representation content of the classification label, as shown in table 1 below.
TABLE 1

[Table 1 is reproduced as an image in the original publication; it lists the five information flow articles with the highest occurrence probability under the classification label "2000-dimensional topic - topic 1 oral health".]
These five articles are the five information flow articles with the highest probability of occurrence under the classification label "2000-dimensional topic - topic 1 oral health" and serve as the topic representation content of that classification label.

The 2000-dimensional topic refers to topics at a granularity of 2000 dimensions; the classification label may correspond to a topic at any granularity, which this embodiment does not limit. The topic representation contents under each classification label correspond to that classification label: in the above example, with "2000-dimensional topic - topic 1 oral health" as the classification label, it can be seen from the titles of the corresponding topic representation contents that they all revolve around this classification label. The titles of the topic representation contents are only examples and may be those of any information flow article corresponding to the "2000-dimensional topic - topic 1 oral health" classification label, which this embodiment does not limit.
The semantic feature vectors of the topic representation contents are averaged to obtain the classification label semantic vector corresponding to the classification label. The semantic feature vector of a topic representation content is obtained by inputting the multi-modal content of that topic representation content into the relevance recognition model; the model converts the multi-modal content and outputs a semantic feature vector, which is the semantic feature vector of the topic representation content.

Cosine similarity between the classification label semantic vectors of the classification labels is then computed to obtain the semantic similarity distribution corresponding to the classification label library. The classification label semantic vector is obtained by weighting the semantic feature vectors of the topic representation contents under each classification label; concretely, the average of the semantic feature vectors is taken.
Illustratively, the semantic feature vectors of the topic representation contents are X1 = (1, 2, 3), X2 = (1, 1, 1), X3 = (3, 4, 5), X4 = (1, 1, 2), and X5 = (4, 2, 9); the classification label semantic vector is X = (X1 + X2 + X3 + X4 + X5) / 5 = (2, 2, 4).
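The averaging step of this worked example can be reproduced with a short sketch (NumPy is used here purely for illustration; the vectors are the ones listed above):

```python
import numpy as np

# semantic feature vectors of the five topic representation contents
X = np.array([[1, 2, 3],
              [1, 1, 1],
              [3, 4, 5],
              [1, 1, 2],
              [4, 2, 9]], dtype=float)

label_vector = X.mean(axis=0)   # classification label semantic vector
print(label_vector)             # [2. 2. 4.]
```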
Cosine similarity is then calculated between the classification label semantic vectors of different dimensions to obtain the semantic similarity distribution of each classification label.

Illustratively, the cosine similarities between the classification label semantic vector of "2000-dimensional topic - topic 1 oral health" and the classification label semantic vectors of "5000-dimensional topic - topic 1 caries treatment", "5000-dimensional topic - topic 2 dental knowledge", and "5000-dimensional topic - topic 3 Honor of Kings mage builds" are calculated respectively. The cosine similarity between two classification label semantic vectors is taken as their semantic similarity, and the results are combined to obtain the corresponding semantic similarity distribution, as shown in Table 2 below.
TABLE 2

  2000-dimensional topic     5000-dimensional topic                 Semantic similarity
  topic 1 oral health        topic 1 caries treatment               0.92
  topic 1 oral health        topic 2 dental knowledge               0.85
  topic 1 oral health        topic 3 Honor of Kings mage builds     0.60
As shown in Table 2, the semantic similarity between "2000-dimensional topic - topic 1 oral health" and "5000-dimensional topic - topic 1 caries treatment" is 0.92, the semantic similarity with "5000-dimensional topic - topic 2 dental knowledge" is 0.85, and the semantic similarity with "5000-dimensional topic - topic 3 Honor of Kings mage builds" is 0.60.
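A sketch of the cosine-similarity computation between two classification label semantic vectors follows; the two vectors are hypothetical and are not meant to reproduce the exact values in Table 2:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two classification label semantic vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# hypothetical semantic vectors for two classification labels
v_oral_health = np.array([2.0, 2.0, 4.0])
v_caries_treatment = np.array([2.5, 1.5, 4.5])

# semantically close labels yield a similarity close to 1
print(round(cosine_similarity(v_oral_health, v_caries_treatment), 2))
```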
It should be noted that, in the above example, the topic representation content consists of the five information flow articles that appear most frequently under the classification label; in some embodiments, the topic representation content may be any number of target multi-modal content items that appear most frequently under the classification label, which this embodiment does not limit. In addition, in the above example, the semantic feature vectors of the topic representation contents and the classification label semantic vector are both three-dimensional vectors whose elements are positive integers; in some embodiments, they may be vectors of any dimension, which this embodiment does not limit. The semantic similarity distribution in Table 2 is only an example, and the semantic similarity between different classification labels may be any value, which this embodiment likewise does not limit.

In the embodiment of the application, the relevance recognition model is applied, and the semantic similarity distribution corresponding to the classification label library is obtained based on the semantic similarity relations among the classification labels in the classification label library. When multi-modal content is input into the relevance recognition model, the output is a semantic vector that represents the complete semantics of the multi-modal content, which improves both the understanding of the multi-modal content and the accuracy of the semantic vector. The input is not limited to data of a single modality, for example text modality data only.
Step 602, based on the semantic features extracted from the multi-modal target content by the relevance recognition model, obtaining a classification label content probability distribution corresponding to the multi-modal target content, where the classification label content probability distribution is used to indicate semantic relevance between the semantic features of the multi-modal target content and each classification label.
Optionally, a co-occurrence probability of the classification labels among different granularities is calculated to obtain a classification label content probability distribution, where the co-occurrence probability refers to a probability that the target multi-modal content under the first granularity classification label appears under the second granularity classification label.
Granularity is the coarseness of data statistics within the same dimension. The first granularity and the second granularity both serve to assist in partitioning the classification labels of information flow articles; each may be any granularity, provided the two are different, which this embodiment does not limit.
Illustratively, taking the first granularity as 2000 dimensions and the second granularity as 5000 dimensions as an example, table 3 below is a table of probability distributions of contents of an example classification label.
TABLE 3

  2000-dimensional topic     5000-dimensional topic                 Co-occurrence probability
  topic 1 oral health        topic 1 caries treatment               0.60
  topic 1 oral health        topic 2 dental knowledge               0.30
  topic 1 oral health        topic 3 Honor of Kings mage builds     0
Specifically, there are 10000 articles whose classification label under the 2000-dimensional topic is "topic 1 oral health". Among them, 6000 articles carry the classification label "topic 1 caries treatment" under the 5000-dimensional topic, so the co-occurrence probability of this pair of classification labels is 0.60.

Similarly, 3000 of the 10000 articles carry the classification label "topic 2 dental knowledge" under the 5000-dimensional topic, giving a co-occurrence probability of 0.30;

and 0 of the 10000 articles carry the classification label "topic 3 Honor of Kings mage builds" under the 5000-dimensional topic, giving a co-occurrence probability of 0. Combining the co-occurrence probabilities of the classification labels yields the classification label content probability distribution.
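The co-occurrence counting described above can be sketched as follows; the label strings and the way per-article labels are stored are illustrative assumptions:

```python
from collections import defaultdict

def co_occurrence_distribution(articles):
    """articles: iterable of (first_granularity_label, second_granularity_label) pairs.
    Returns, for each first-granularity label, the probability that its content
    also appears under each second-granularity label."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for coarse, fine in articles:
        counts[coarse][fine] += 1
        totals[coarse] += 1
    return {coarse: {fine: n / totals[coarse] for fine, n in fines.items()}
            for coarse, fines in counts.items()}

# 10000 articles under "topic 1 oral health": 6000 also under "topic 1 caries treatment",
# 3000 under "topic 2 dental knowledge", and the remaining 1000 under other labels
articles = ([("oral health", "caries treatment")] * 6000
            + [("oral health", "dental knowledge")] * 3000
            + [("oral health", "other")] * 1000)
dist = co_occurrence_distribution(articles)
print(dist["oral health"]["caries treatment"])   # 0.6
print(dist["oral health"]["dental knowledge"])   # 0.3
```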
It should be noted that, in the above example, ten thousand information flow articles are used as the multi-modal content. In some embodiments, the multi-modal content may take a form other than information flow articles, such as social content on a social platform, and the number of multi-modal content items may be arbitrary, provided it is greater than or equal to 2.

In the embodiment of the application, based on the semantic features extracted from the target multi-modal content by the relevance recognition model, the co-occurrence probabilities of the target multi-modal content on the classification labels are taken as the classification label content probability distribution, which further improves the semantic understanding of the multi-modal content.
Step 603, fusing the semantic similarity distribution and the classification label content probability distribution to obtain a multi-level classification sub-label corresponding to the target multi-modal content.
Optionally, the multi-level classification sub-tag is a classification tag obtained by further dividing the hierarchy based on the existing classification tag of the target multi-modal content.
The classification label content probability distribution and the semantic similarity distribution are weighted and fused to obtain a hierarchical classification label probability distribution; when a value in the hierarchical classification label probability distribution meets the preset probability, the corresponding multi-level classification sub-labels of the target multi-modal content have a hierarchical relationship.
The weighted fusion of the content probability distribution and the semantic similarity distribution of the classification labels means that the semantic similarity of the classification labels is weighted by using the co-occurrence probability of the classification labels and then normalized.
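A sketch of this weighted fusion, under the assumption that fusion multiplies each label pair's semantic similarity by its co-occurrence probability and then normalizes over the candidate pairs, with a hypothetical preset probability of 0.5 for forming a hierarchy (the exact weighting scheme and threshold are not fixed by this embodiment):

```python
def fuse_distributions(co_occurrence, semantic_similarity, hierarchy_threshold=0.5):
    """co_occurrence / semantic_similarity: dicts keyed by (coarse_label, fine_label).
    Returns the hierarchical classification label probability distribution and the
    label pairs retained as levels of the multi-level classification sub-label."""
    weighted = {pair: co_occurrence.get(pair, 0.0) * sim
                for pair, sim in semantic_similarity.items()}
    total = sum(weighted.values()) or 1.0
    hierarchy_probs = {pair: w / total for pair, w in weighted.items()}   # normalization
    # pairs whose value meets the preset probability form a hierarchical relationship
    hierarchy = [pair for pair, p in hierarchy_probs.items() if p >= hierarchy_threshold]
    return hierarchy_probs, hierarchy

# values taken from Tables 2 and 3 above
semantic_similarity = {("oral health", "caries treatment"): 0.92,
                       ("oral health", "dental knowledge"): 0.85,
                       ("oral health", "Honor of Kings mage builds"): 0.60}
co_occurrence = {("oral health", "caries treatment"): 0.60,
                 ("oral health", "dental knowledge"): 0.30,
                 ("oral health", "Honor of Kings mage builds"): 0.0}
probs, hierarchy = fuse_distributions(co_occurrence, semantic_similarity)
```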
It should be noted that, in the above embodiments, the target multi-modal content corresponds to three classification labels. In some embodiments, the target multi-modal content may correspond to any number of classification labels; if there is only one classification label, there is no hierarchical relationship. This embodiment does not limit this. Moreover, the classification label content probability distribution and the semantic similarity distribution of the target multi-modal content may take any values, which this embodiment also does not limit.

In the embodiment of the application, the multi-level classification sub-labels corresponding to the classification labels are obtained in this way and serve as important identification information of the target multi-modal content. They provide a highly condensed summary of the semantic features of the target multi-modal content, strengthening the semantic understanding of the target multi-modal content and improving the overall understanding of multi-modal content.
Fig. 7 is a flowchart of a method for predicting modal relevance according to an exemplary embodiment of the present application, in which the classification label of the sample content includes multi-level classification sub-labels. As shown in fig. 7, before step 401 shown in fig. 4, the method further includes the following steps:
Step 701, obtaining semantic similarity distribution corresponding to the classification label library based on the semantic similarity relation between the classification labels in the classification label library.
Optionally, the classification label library includes the classification labels corresponding to the multi-modal contents in the multi-modal content set. Classification prediction is performed on the contents in the multi-modal content set, the occurrence probability of the target multi-modal content on each classification label is counted, and the multi-modal content whose occurrence probability under a classification label is higher than a preset threshold is taken as the topic representation content corresponding to that classification label.

The topic representation contents are input into a preset feature extraction model, and the semantic feature vector corresponding to each topic representation content is extracted. The semantic feature vector of a topic representation content is obtained by inputting the title content and the text content of the topic representation content into a preset feature extraction model (such as a BERT model); the feature extraction model converts the title content and the text content and outputs a semantic feature vector, which is the semantic feature vector of the topic representation content.

The semantic feature vectors of the topic representation contents are averaged to obtain the classification label semantic vector corresponding to each classification label, and the cosine similarity between the classification label semantic vectors of the classification labels is computed to obtain the semantic similarity distribution corresponding to the classification label library. The classification label semantic vector is obtained by weighting the semantic feature vectors of the topic representation contents under the classification label; concretely, the average of the semantic feature vectors is taken.
Illustratively, take an information flow article whose title is "health science popularization: children's oral care" as an example; the information flow article is input into the feature extraction model as the target multi-modal content. The information flow article includes text modality data. As shown in fig. 8, the text modality data 810 of the information flow article is input first. A conversion network 820 in the feature extraction model converts the text modality data 810 into vectors of different types, namely token features, segment features, and position features. These vectors are then transformed by a Transformer network 830 to obtain a semantic representation vector 840, which represents the whole content of the information flow article. That is, inputting the text modality content of the multi-modal content into the feature extraction model outputs a semantic representation vector.
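The flow of fig. 8 can be approximated with an off-the-shelf BERT encoder. The sketch below uses the Hugging Face transformers library and takes the [CLS] hidden state as the semantic representation vector; the checkpoint name and the pooling choice are assumptions for illustration, not requirements of this embodiment:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # assumed checkpoint
model = BertModel.from_pretrained("bert-base-chinese")

def semantic_vector(title: str, body: str) -> torch.Tensor:
    """Encode the title content and text content of an information flow article
    into a single semantic representation vector."""
    # the tokenizer produces token and segment (token_type) inputs;
    # position features are added inside the model
    inputs = tokenizer(title, body, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :]   # [CLS] vector as the article representation

vec = semantic_vector("health science popularization: children's oral care", "article body ...")
```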
It should be noted that the semantic vector of the topic representation article above is obtained by feature extraction with a preset BERT model; in some embodiments, the preset feature extraction model may be another model that meets the actual requirement, which this embodiment does not limit.

In the embodiment of the application, the semantic similarity distribution corresponding to the classification label library is obtained based on the semantic similarity relations among the classification labels in the classification label library. First, the topic representation articles of the classification labels contained in the classification label library are obtained and represented in vector form, so that a vector representing each classification label is obtained. Cosine similarity between these vectors is then calculated to obtain the semantic similarity relations between the classification labels, forming a semantic similarity distribution. Converting an abstract similarity relation into vectors and numbers makes the relations between classification labels directly visible and facilitates the further division of levels within the classification labels.
Step 702, obtaining a classification label content probability distribution corresponding to the target multi-modal content, where the classification label content probability distribution is used to indicate a semantic association relationship between semantic features of the target multi-modal content and each classification label.
Optionally, classification prediction is performed on the target multi-modal content, and the classification labels whose probability distribution for the target multi-modal content meets the probability requirement are retained. The probability requirement is met when the probability value obtained on a classification label, after the target multi-modal content is predicted by the preset model, reaches the preset threshold.
Illustratively, after the target multi-modal content is predicted by the preset model, the probability value obtained on classification label Q1 is 0.6 and the probability value obtained on classification label Q2 is 0.1. With a preset threshold of 0.5, the probability value on Q1 meets the probability requirement while that on Q2 does not, so the target multi-modal content belongs to classification label Q1. This completes the classification prediction of the target multi-modal content.
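The retention rule can be sketched as follows; the label names and probabilities mirror the example above, and the prediction model that produces the probabilities is assumed to already exist:

```python
def retained_labels(label_probs, threshold=0.5):
    """Keep the classification labels whose predicted probability meets the requirement."""
    return {label: p for label, p in label_probs.items() if p >= threshold}

print(retained_labels({"Q1": 0.6, "Q2": 0.1}))   # {'Q1': 0.6}
```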
It should be noted that, in the above example, 0.5 is used as the preset threshold; in some embodiments, the preset threshold may be any value between zero and one inclusive, which this embodiment does not limit. In addition, only two classification labels are involved in the classification prediction of the above example; in some embodiments, any number of classification labels may be used, which this embodiment also does not limit.
Calculating the co-occurrence probability of the classification labels among different granularities to obtain the content probability distribution of the classification labels, wherein the co-occurrence probability refers to the probability that target multi-modal content under a first granularity classification label appears under a second granularity classification label.
In the embodiment of the application, the frequency with which multi-modal content under the first granularity appears under the second granularity is taken as the co-occurrence probability and forms the classification label content probability distribution, which further indicates the semantic association relations between the semantic features of the target multi-modal content and the classification labels. This facilitates the hierarchical division of the classification labels corresponding to the target multi-modal content and strengthens the semantic understanding of the multi-modal content.
Step 703, fusing the semantic similarity distribution and the classification label content probability distribution to obtain a multi-level classification sub-label corresponding to the target multi-modal content.
Weighted normalization is performed on the semantic similarity of the classification labels using the co-occurrence probability of the classification labels to obtain the hierarchical classification label probability distribution; when a value in the hierarchical classification label probability distribution meets the preset probability, the multi-level classification sub-labels with a hierarchical relationship are obtained for the target multi-modal content.
It should be noted that the number and type of the multi-level classification sub-labels may be arbitrary, and this embodiment is not limited thereto.
Step 704, constructing a sample content set by taking the target multi-modal content as sample content, together with the multi-level classification sub-labels corresponding to the target multi-modal content.
Optionally, the sample content set formed by the target multi-modal content includes, but is not limited to, first-modality data, second-modality data, a classification label of the multi-modal content, and a multi-level classification sub-label corresponding to the multi-modal content.
Illustratively, the sample content set of target multi-modal content includes a plurality of information flow articles, and each information flow article includes: first modality data in the form of text modality data, second modality data in the form of image modality data, the classification label corresponding to the information flow article content, and the multi-level classification sub-labels derived from the information flow article content.

It should be noted that, in the above example, the sample content set composed of the target multi-modal content uses information flow articles as the sample multi-modal content, and each sample multi-modal content includes two kinds of modality data, the corresponding classification labels, and the multi-level classification sub-labels. In some embodiments, the target multi-modal content may be content in any form, such as articles, videos, or social content published on a social platform, and each sample multi-modal content may carry any types of modality data, which this embodiment does not limit. The first modality data and the second modality data may likewise be any modality data, such as text modality data, image modality data, or video modality data, provided that the type of the first modality data differs from the type of the second modality data; this embodiment does not limit them otherwise. The classification labels and multi-level classification sub-labels included in the target multi-modal content may also be of any type, which this embodiment does not limit.
In summary, the method provided by this embodiment constructs the hierarchical relationship between classification labels of different granularities based on the semantic similarity distribution of the classification labels and their co-occurrence probability distribution. Because this construction process is decoupled from any specific classification label scheme, it applies to common classification label systems and effectively ensures both the semantic consistency of the hierarchical classification labels and the consistency of the multi-modal content distribution. The hierarchical classification labels constructed in this way not only achieve high off-line evaluation accuracy, but the hierarchical classification label relations also noticeably improve the on-line classification label recall.
Fig. 9 is a block diagram illustrating a structure of a modality association degree prediction apparatus according to an exemplary embodiment of the present application, where, as shown in fig. 9, the apparatus includes:
an obtaining module 910, configured to obtain a sample content set, where sample content in the sample content set includes first modality data and second modality data, and the sample content is labeled with a classification label, where the classification label is used to indicate a sample classification to which the sample content belongs;
an extracting module 920, configured to extract a second modality feature vector corresponding to the second modality data of the sample content;
a determining module 930, configured to determine, based on the second modal feature vector corresponding to the sample content belonging to the same sample classification, feature vector centers corresponding to the multiple sample classifications, respectively; determining a distance between a second modality feature vector corresponding to the second modality data and a feature vector center of a sample classification corresponding to the second modality data, wherein the distance is used as an association degree label between the second modality data and the first modality data in the sample content;
a training module 940, configured to train a candidate relevance identification model based on the sample content and a relevance label corresponding to the sample content to obtain a relevance identification model, where the relevance identification model is configured to identify relevance between first modality data and second modality data in target content, and extract a semantic feature representation of the target content based on the relevance, where the semantic feature representation is used to represent semantics of the target content.
In an optional embodiment, the obtaining module 910 is further configured to obtain target sample content in the sample content set, where the target sample content includes target first-modality data and target second-modality data; inputting the target sample content into the candidate relevance degree identification model, and outputting to obtain the predicted relevance degree between the target first modality data and the target second modality data;
the training module 940 is further configured to obtain a relevance loss value based on the relevance label labeled on the target sample content and the predicted relevance, where the relevance loss value is used to represent a difference between the relevance label and the predicted relevance; and training the candidate relevance recognition model based on the relevance loss value to obtain the relevance recognition model.
In an optional embodiment, the training module 940 is further configured to perform iterative training on the candidate relevance degree recognition models based on relevance degree loss values respectively corresponding to sample contents in the sample content sets, so as to obtain the relevance degree recognition models.
In an optional embodiment, the determining module 930 is further configured to perform average processing on the second modal feature vectors corresponding to the sample contents belonging to the same sample classification, so as to obtain feature vector centers respectively corresponding to a plurality of sample classifications.
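A sketch of the center computation and the distance-based association degree label produced by the determining module; NumPy and Euclidean distance are used for illustration, and the embodiment does not fix the distance metric:

```python
import numpy as np

def class_centers(feature_vectors, class_ids):
    """Average the second-modality feature vectors of samples in the same
    sample classification to obtain each classification's feature vector center."""
    centers = {}
    for cid in set(class_ids):
        vecs = [v for v, c in zip(feature_vectors, class_ids) if c == cid]
        centers[cid] = np.mean(vecs, axis=0)
    return centers

def relevance_label(second_modality_vec, center):
    """Distance between a sample's second-modality feature vector and the center
    of its sample classification, used as the association degree label."""
    return float(np.linalg.norm(second_modality_vec - center))
```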
In an optional embodiment, the obtaining module 910 is further configured to obtain, based on a semantic similarity relationship between each classification tag in the classification tag library, a semantic similarity distribution corresponding to the classification tag library;
the obtaining module 910 is further configured to obtain, based on the semantic features extracted by the relevance recognition model for the target multi-modal content, a classification tag content probability distribution corresponding to the target multi-modal content, where the classification tag content probability distribution is used to indicate semantic association relationships between the semantic features of the target multi-modal content and each classification tag;
as shown in fig. 10, the apparatus further includes:
and a fusion module 950, configured to fuse the semantic similarity distribution and the classification label content probability distribution to obtain a multi-level classification sub-label corresponding to the target multi-modal content.
In an optional embodiment, the obtaining module 910 is further configured to perform classification prediction on the target multi-modal content, and determine topic representation content corresponding to each classification tag, where the topic representation content is used to represent topic implicit semantics corresponding to the classification tag; averaging the semantic feature vectors of the topic representation content to obtain a classification label semantic vector corresponding to the classification label; and obtaining cosine similarity among the classification label semantic vectors of the classification labels to obtain semantic similarity distribution.
In an optional embodiment, the obtaining module 910 is further configured to perform classification prediction on the target multi-modal content and retain the classification labels whose probability distribution for the target multi-modal content meets the probability requirement; and to calculate the co-occurrence probability of classification labels between different granularities to obtain the classification label content probability distribution, where the co-occurrence probability refers to the probability that content under the first-granularity classification label appears under the second-granularity classification label.
In an optional embodiment, the fusion module 950 is further configured to perform weighted fusion on the content probability distribution of the classification tag and the semantic similarity distribution to obtain a hierarchical classification tag probability distribution; and under the condition that the numerical value corresponding to the hierarchical classification label probability distribution is higher than a preset threshold value, the multi-level classification sub-labels corresponding to the target multi-modal content have a hierarchical relationship.
In summary, in the apparatus provided in this embodiment, the candidate relevance identification model is trained with the sample content and the relevance labels corresponding to the sample content. The sample content in the sample content set is content labeled with classification labels, that is, both the first modality content and the second modality content are associated with classification labels, so training the candidate relevance identification model with this sample content makes the predicted relevance output by the model more accurate.
It should be noted that: the modal relevance degree predicting apparatus provided in the foregoing embodiment is only illustrated by the division of the functional modules, and in practical applications, the functions may be distributed by different functional modules as needed, that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above.
Fig. 11 shows a block diagram of a computer device 1100 provided in an exemplary embodiment of the present application. The computer device 1100 may be: a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III, motion Picture Experts compression standard Audio Layer 3), an MP4 player (Moving Picture Experts Group Audio Layer IV, motion Picture Experts compression standard Audio Layer 4), a notebook computer, or a desktop computer. Computer device 1100 may also be referred to by other names such as user equipment, portable terminals, laptop terminals, desktop terminals, and the like.
Generally, the computer device 1100 includes: a processor 1101 and a memory 1102.
Processor 1101 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 1101 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), or a PLA (Programmable Logic Array). The processor 1101 may also include a main processor and a coprocessor; the main processor, also called a Central Processing Unit (CPU), processes data in the awake state, while the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 1101 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 1101 may also include an AI processor for processing computational operations related to machine learning.
Memory 1102 may include one or more computer-readable storage media, which may be non-transitory. Memory 1102 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in memory 1102 is used to store at least one instruction, which is executed by processor 1101 to implement the modal relevance prediction method provided by the method embodiments of the present application.
In some embodiments, computer device 1100 also includes other components. Those skilled in the art will appreciate that the configuration shown in FIG. 11 does not constitute a limitation of computer device 1100, which may include more or fewer components than shown, combine certain components, or employ a different arrangement of components.
Embodiments of the present application further provide a computer device, which includes a processor and a memory, where at least one instruction, at least one program, a set of codes, or a set of instructions is stored in the memory, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by the processor to implement the modal relevance prediction method provided by the above method embodiments.
Embodiments of the present application further provide a computer-readable storage medium, on which at least one instruction, at least one program, a code set, or a set of instructions is stored, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by a processor to implement the modality association degree prediction method provided by the foregoing method embodiments.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the modal relevance prediction method in any of the above embodiments.
Optionally, the computer-readable storage medium may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a Solid State Drive (SSD), or an optical disc. The Random Access Memory may include a resistive Random Access Memory (ReRAM) and a Dynamic Random Access Memory (DRAM). The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (12)

1. A method for predicting modal relevance, the method comprising:
obtaining a sample content set, wherein sample content in the sample content set comprises first modality data and second modality data, and the sample content is labeled with a classification label, and the classification label is used for indicating a sample classification to which the sample content belongs;
extracting a second modality feature vector corresponding to the second modality data of the sample content;
determining characteristic vector centers respectively corresponding to a plurality of sample classifications based on second modal characteristic vectors corresponding to sample contents belonging to the same sample classification;
determining a distance between a second modality feature vector corresponding to the second modality data and a feature vector center of a sample classification corresponding to the second modality data, wherein the distance is used as an association degree label between the second modality data and the first modality data in the sample content;
training a candidate relevance identification model based on the sample content and relevance labels corresponding to the sample content to obtain a relevance identification model, wherein the relevance identification model is used for identifying the relevance between first modal data and second modal data in target content, and extracting semantic feature representation of the target content based on the relevance, and the semantic feature representation is used for representing the semantics of the target content.
2. The method of claim 1, wherein training a candidate relevance recognition model based on the sample content and the relevance label corresponding to the sample content to obtain a relevance recognition model comprises:
obtaining target sample content in the sample content set, wherein the target sample content comprises target first modality data and target second modality data;
inputting the target sample content into the candidate relevance degree identification model, and outputting to obtain the predicted relevance degree between the target first modality data and the target second modality data;
obtaining a relevance loss value based on the relevance label labeled by the target sample content and the prediction relevance, wherein the relevance loss value is used for representing the difference between the relevance label and the prediction relevance;
and training the candidate relevance recognition model based on the relevance loss value to obtain the relevance recognition model.
3. The method of claim 2, wherein training the candidate relevance recognition model based on the relevance loss value to obtain the relevance recognition model comprises:
and performing iterative training on the candidate relevance recognition model based on relevance loss values respectively corresponding to the sample contents in the sample content set to obtain the relevance recognition model.
4. The method according to any one of claims 1 to 3, wherein the determining the feature vector centers corresponding to the plurality of sample classifications based on the second modal feature vector corresponding to the sample content belonging to the same sample classification comprises:
and averaging the second modal characteristic vectors corresponding to the sample contents belonging to the same sample classification to obtain characteristic vector centers respectively corresponding to a plurality of sample classifications.
5. The method according to any one of claims 1 to 3, wherein after the training of the candidate relevance identification model based on the sample content and the relevance label corresponding to the sample content to obtain the relevance identification model, the method further comprises:
obtaining semantic similarity distribution corresponding to a classification label library based on semantic similarity relation among the classification labels in the classification label library;
based on semantic features extracted from the target multi-modal content by the relevance recognition model, obtaining classification label content probability distribution corresponding to the target multi-modal content, wherein the classification label content probability distribution is used for indicating semantic relevance between the semantic features of the target multi-modal content and each classification label;
and fusing the semantic similarity distribution and the classification label content probability distribution to obtain a multi-level classification sub-label corresponding to the target multi-modal content.
6. The method according to claim 5, wherein the obtaining semantic similarity distribution corresponding to the classification tag library based on the semantic similarity relationship between the classification tags in the classification tag library comprises:
performing classification prediction on the target multi-modal content, and determining topic representation content corresponding to each classification label, wherein the topic representation content is used for representing topic implicit semantics corresponding to the classification label;
averaging the semantic feature vectors of the topic representation content to obtain a classification label semantic vector corresponding to the classification label;
and obtaining cosine similarity among the classification label semantic vectors of the classification labels to obtain semantic similarity distribution.
7. The method of claim 5, wherein obtaining a classification label content probability distribution corresponding to the targeted multimodal content comprises:
carrying out classification prediction on the target multi-modal content, and reserving a classification label with probability distribution corresponding to the target multi-modal content meeting probability requirements;
calculating the co-occurrence probability of the classification labels among different granularities to obtain the content probability distribution of the classification labels, wherein the co-occurrence probability refers to the probability that content under the classification label of the first granularity appears under the classification label of the second granularity.
8. The method according to claim 5, wherein the fusing the semantic similarity distribution and the classification label content probability distribution to obtain a multi-level classification sub-label corresponding to the target multi-modal content comprises:
performing weighted fusion on the content probability distribution of the classification labels and the semantic similarity distribution to obtain hierarchical classification label probability distribution; and under the condition that the numerical value corresponding to the hierarchical classification label probability distribution is higher than a preset threshold value, the multi-level classification sub-labels corresponding to the target multi-modal content have a hierarchical relationship.
9. A modality association degree prediction apparatus, characterized in that the apparatus comprises:
the system comprises an acquisition module, a storage module and a processing module, wherein the acquisition module is used for acquiring a sample content set, the sample content in the sample content set comprises first modality data and second modality data, and the sample content is marked with a classification label which is used for indicating a sample classification to which the sample content belongs;
the extraction module is used for extracting a second modality feature vector corresponding to the second modality data of the sample content;
the determining module is used for determining the characteristic vector centers corresponding to the multiple sample classifications based on the second modal characteristic vector corresponding to the sample content belonging to the same sample classification; determining a distance between a second modality feature vector corresponding to the second modality data and a feature vector center of a sample classification corresponding to the second modality data, wherein the distance is used as an association degree label between the second modality data and the first modality data in the sample content;
the training module is used for training a candidate relevance identification model based on the sample content and relevance labels corresponding to the sample content to obtain a relevance identification model, the relevance identification model is used for identifying relevance between first modal data and second modal data in target content, semantic feature representation of the target content is extracted based on the relevance, and the semantic feature representation is used for representing semantics of the target content.
10. A computer device comprising a processor and a memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by the processor to implement the modal relevance prediction method of any of claims 1 to 8.
11. A computer-readable storage medium, wherein at least one program is stored in the storage medium, and the at least one program is loaded and executed by a processor to implement the modality association degree prediction method according to any one of claims 1 to 8.
12. A computer program product comprising a computer program which, when executed by a processor, implements a modal relevance prediction method as claimed in any one of claims 1 to 8.
CN202210933409.3A 2022-08-04 2022-08-04 Modal association degree prediction method, device, equipment, storage medium and program product Pending CN115269781A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210933409.3A CN115269781A (en) 2022-08-04 2022-08-04 Modal association degree prediction method, device, equipment, storage medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210933409.3A CN115269781A (en) 2022-08-04 2022-08-04 Modal association degree prediction method, device, equipment, storage medium and program product

Publications (1)

Publication Number Publication Date
CN115269781A true CN115269781A (en) 2022-11-01

Family

ID=83749126

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210933409.3A Pending CN115269781A (en) 2022-08-04 2022-08-04 Modal association degree prediction method, device, equipment, storage medium and program product

Country Status (1)

Country Link
CN (1) CN115269781A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116524524A (en) * 2023-04-25 2023-08-01 上海任意门科技有限公司 Content identification method, device, equipment and storage medium
CN116524524B (en) * 2023-04-25 2024-03-15 上海任意门科技有限公司 Content identification method, device, equipment and storage medium
CN117932161A (en) * 2024-03-22 2024-04-26 成都数据集团股份有限公司 Visual search method and system for multi-source multi-mode data
CN117932161B (en) * 2024-03-22 2024-05-28 成都数据集团股份有限公司 Visual search method and system for multi-source multi-mode data

Similar Documents

Publication Publication Date Title
CN111488931B (en) Article quality evaluation method, article recommendation method and corresponding devices
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
KR101754473B1 (en) Method and system for automatically summarizing documents to images and providing the image-based contents
CN114020936B (en) Construction method and system of multi-modal affair map and readable storage medium
Liu et al. Sentiment recognition for short annotated GIFs using visual-textual fusion
CN115269781A (en) Modal association degree prediction method, device, equipment, storage medium and program product
CN112085120B (en) Multimedia data processing method and device, electronic equipment and storage medium
CN116955707A (en) Content tag determination method, device, equipment, medium and program product
CN113392179A (en) Text labeling method and device, electronic equipment and storage medium
CN117216535A (en) Training method, device, equipment and medium for recommended text generation model
CN116541492A (en) Data processing method and related equipment
He et al. Deep learning in natural language generation from images
CN115186085A (en) Reply content processing method and interaction method of media content interaction content
CN113408282B (en) Method, device, equipment and storage medium for topic model training and topic prediction
CN115269828A (en) Method, apparatus, and medium for generating comment reply
CN111445545B (en) Text transfer mapping method and device, storage medium and electronic equipment
CN112926341A (en) Text data processing method and device
CN116543798A (en) Emotion recognition method and device based on multiple classifiers, electronic equipment and medium
CN117034133A (en) Data processing method, device, equipment and medium
CN115018215A (en) Population residence prediction method, system and medium based on multi-modal cognitive map
CN114579876A (en) False information detection method, device, equipment and medium
CN113807920A (en) Artificial intelligence based product recommendation method, device, equipment and storage medium
CN118227910B (en) Media resource aggregation method, device, equipment and storage medium
CN115470325B (en) Message reply method, device and equipment
Shetty et al. Deep Learning Photograph Caption Generator

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40076037

Country of ref document: HK