CN117009798A - Modal alignment model training, modal alignment method, device and storage medium

Info

Publication number: CN117009798A
Application number: CN202210438305.5A
Authority: CN
Inventors: 翟彬旭
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2022-04-25
Filing date: 2022-04-25
Publication date: 2023-11-07

Abstract

The application relates to a modal alignment model training method and apparatus, a computer device, a storage medium and a computer program product. The method comprises: inputting first training modality information and second training modality information into an initial modal alignment model for modality alignment characterization to obtain training modality alignment characterization vectors; calculating the modality similarity between a first modality alignment characterization vector and a second modality alignment characterization vector among the training modality alignment characterization vectors; calculating, based on the modality similarity, the probability distribution distance between the first modality alignment characterization vector and the second modality alignment characterization vector; and updating the initial modal alignment model through back-propagation based on the probability distribution distance and iterating in a loop to obtain a first target modal alignment model, wherein the first target modal alignment model is used to extract semantic representations of information of different modalities, and the semantic representations of the same instance in the different modality information correspond to each other. The method can improve the accuracy of modal alignment.

Description

Modal alignment model training, modal alignment method, device and storage medium

Technical Field

The present application relates to the field of internet technologies, and in particular, to a method, an apparatus, a computer device, a storage medium, and a computer program product for training a modality alignment model.

Background

With the development of artificial intelligence, modal alignment techniques have emerged. Modal alignment refers to finding correspondences between sub-branches or elements of different modality information that come from the same instance, for example aligning the "shoes" described in a text with the "shoes" shown in a picture. A modality is the organizational form or source of information; common modalities include vision, audio and text, and information of different modalities differs in organizational form or source. At present, modal alignment is generally performed with a binary classification machine learning model: the model is trained on data labeled as aligned or not aligned, and the trained model is then used to perform modal alignment.

However, because the quality of the alignment labels cannot be guaranteed, a machine learning model trained in this way performs modal alignment with low accuracy.

Disclosure of Invention

In view of the foregoing, it is desirable to provide a modal alignment model training method, a modal alignment method, an apparatus, a computer device, a computer-readable storage medium, and a computer program product that can improve the accuracy of modal alignment.

In one aspect, the application provides a modal alignment model training method. The method comprises the following steps:

acquiring first training mode information and second training mode information;

inputting the first training modality information and the second training modality information into an initial modal alignment model for modality alignment characterization to obtain training modality alignment characterization vectors, wherein the training modality alignment characterization vectors comprise a first modality alignment characterization vector corresponding to the first training modality information and a second modality alignment characterization vector corresponding to the second training modality information, and the representations of the same instance in the first modality alignment characterization vector and the second modality alignment characterization vector have an initial correspondence;

calculating the similarity between the first modality alignment characterization vector and the second modality alignment characterization vector to obtain the modality similarity;

calculating, based on the modality similarity, the probability distribution distance between the first modality alignment characterization vector and the second modality alignment characterization vector, and obtaining vector loss information based on the probability distribution distance;

updating the initial modal alignment model through back-propagation based on the vector loss information to obtain an updated modal alignment model, taking the updated modal alignment model as the initial modal alignment model, and returning to the step of acquiring the first training modality information and the second training modality information until the alignment model training completion condition is reached, so as to obtain a first target modal alignment model, wherein the first target modal alignment model is used to extract semantic representations of information of different modalities, and the semantic representations of the same instance in the different modality information correspond to each other.

On the other hand, the application also provides a device for training the modal alignment model. The device comprises:

the information acquisition module is used for acquiring the first training mode information and the second training mode information;

the initial alignment module is used for inputting the first training modality information and the second training modality information into an initial modal alignment model for modality alignment characterization to obtain training modality alignment characterization vectors, wherein the training modality alignment characterization vectors comprise a first modality alignment characterization vector corresponding to the first training modality information and a second modality alignment characterization vector corresponding to the second training modality information, and the representations of the same instance in the first modality alignment characterization vector and the second modality alignment characterization vector have an initial correspondence;

the similarity calculation module is used for calculating the similarity between the first modality alignment characterization vector and the second modality alignment characterization vector to obtain the modality similarity;

the loss calculation module is used for calculating, based on the modality similarity, the probability distribution distance between the first modality alignment characterization vector and the second modality alignment characterization vector, and obtaining vector loss information based on the probability distribution distance;

the iteration module is used for updating the initial modal alignment model through back-propagation based on the vector loss information to obtain an updated modal alignment model, taking the updated modal alignment model as the initial modal alignment model, and returning to the step of acquiring the first training modality information and the second training modality information until the alignment model training completion condition is reached, so as to obtain a first target modal alignment model, wherein the first target modal alignment model is used to extract semantic representations of information of different modalities, and the semantic representations of the same instance in the different modality information correspond to each other.

On the other hand, the application also provides computer equipment. The computer device comprises a memory storing a computer program and a processor which when executing the computer program performs the steps of:

acquiring first training mode information and second training mode information;

In another aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:

acquiring first training mode information and second training mode information;

In another aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of:

acquiring first training mode information and second training mode information;

acquiring first to-be-aligned modal information and second to-be-aligned modal information;

inputting the first to-be-aligned modal information and the second to-be-aligned modal information into a first target modal alignment model;

wherein the first target modal alignment model is obtained by: inputting first training modality information and second training modality information into an initial modal alignment model for modality alignment characterization to obtain training modality alignment characterization vectors, the training modality alignment characterization vectors comprising a first modality alignment characterization vector corresponding to the first training modality information and a second modality alignment characterization vector corresponding to the second training modality information; calculating the similarity between the first modality alignment characterization vector and the second modality alignment characterization vector to obtain the modality similarity; calculating, based on the modality similarity, the probability distribution distance between the first modality alignment characterization vector and the second modality alignment characterization vector, and obtaining vector loss information based on the probability distribution distance; and iteratively updating the initial modal alignment model based on the vector loss information until the alignment model training completion condition is reached;

performing modality alignment characterization on the first to-be-aligned modal information and the second to-be-aligned modal information through the first target modal alignment model to obtain target modality alignment characterization vectors.

the to-be-aligned information acquisition module is used for acquiring first to-be-aligned modal information and second to-be-aligned modal information;

the input module is used for inputting the first to-be-aligned modal information and the second to-be-aligned modal information into a first target modal alignment model, wherein the first target modal alignment model is obtained by: inputting first training modality information and second training modality information into an initial modal alignment model for modality alignment characterization to obtain training modality alignment characterization vectors, the training modality alignment characterization vectors comprising a first modality alignment characterization vector corresponding to the first training modality information and a second modality alignment characterization vector corresponding to the second training modality information; calculating the similarity between the first modality alignment characterization vector and the second modality alignment characterization vector to obtain the modality similarity; calculating, based on the modality similarity, the probability distribution distance between the first modality alignment characterization vector and the second modality alignment characterization vector, and obtaining vector loss information based on the probability distribution distance; and iteratively updating the initial modal alignment model based on the vector loss information until the alignment model training completion condition is reached;

the modal alignment module is used for performing modality alignment characterization on the first to-be-aligned modal information and the second to-be-aligned modal information through the first target modal alignment model to obtain target modality alignment characterization vectors.

With the above modal alignment model training method, modal alignment method, apparatus, computer device, storage medium and computer program product, the first training modality information and the second training modality information are input into the initial modal alignment model for modality alignment characterization to obtain training modality alignment characterization vectors, which comprise a first modality alignment characterization vector corresponding to the first training modality information and a second modality alignment characterization vector corresponding to the second training modality information. The similarity between the first modality alignment characterization vector and the second modality alignment characterization vector is then calculated to obtain the modality similarity. The probability distribution distance between the first modality alignment characterization vector and the second modality alignment characterization vector is calculated based on the modality similarity, and vector loss information is obtained based on the probability distribution distance. Because the vector loss information is derived from the probability distribution distance, it is more accurate. The initial modal alignment model is then trained based on the vector loss information until the alignment model training completion condition is reached, yielding a first target modal alignment model, so that the trained model aligns modalities more accurately. The trained first target modal alignment model is then used to perform modality alignment characterization on the first to-be-aligned modal information and the second to-be-aligned modal information to obtain target modality alignment characterization vectors, which improves the accuracy of the resulting target modality alignment characterization vectors.

Drawings

FIG. 1 is a diagram of an application environment for a modal alignment model training method in one embodiment;

FIG. 2 is a flow diagram of a modal alignment model training method in one embodiment;

FIG. 3 is a flow diagram of obtaining a second target modal alignment model in one embodiment;

FIG. 4 is a schematic diagram of a network architecture of a modal alignment classification recognition model in one embodiment;

FIG. 5 is a flow diagram of obtaining training modality alignment characterization vectors in one embodiment;

FIG. 6 is a diagram of a network structure for text feature vector extraction in one embodiment;

FIG. 7 is a diagram illustrating a network structure for extracting image feature vectors in one embodiment;

FIG. 8 is a flow diagram of obtaining training modality alignment characterization vectors in one embodiment;

FIG. 9 is a schematic diagram of a process for probability distribution transitions in one embodiment;

FIG. 10 is a flow diagram of a method of modality alignment in one embodiment;

FIG. 11 is a flow diagram of a method for obtaining a target multi-modal classification recognition model in one embodiment;

FIG. 12 is a flow chart of a modal alignment model training method in one embodiment;

FIG. 13 is a schematic diagram of the modal alignment effect in a specific embodiment;

FIG. 14 is a block diagram of a modal alignment model training device in one embodiment;

FIG. 15 is a block diagram of a modal alignment device in one embodiment;

FIG. 16 is an internal block diagram of a computer device in one embodiment;

FIG. 17 is an internal block diagram of a computer device in one embodiment.

Detailed Description

The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.

Computer Vision (CV) is the science of how to make machines "see": cameras and computers are used in place of human eyes to recognize and measure targets, and the resulting images are further processed so that they are more suitable for human observation or for transmission to instruments for inspection. As a scientific discipline, computer vision studies the theories and techniques for building artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision technologies typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, autonomous driving and intelligent transportation, as well as common biometric technologies such as face recognition and fingerprint recognition.

The key technologies of speech technology (Speech Technology) are automatic speech recognition, speech synthesis and voiceprint recognition. Enabling computers to listen, see, speak and feel is the future direction of human-computer interaction, and speech is expected to become one of the most favored modes of human-computer interaction.

Natural language processing (Natural Language Processing, NLP) is an important direction in computer science and artificial intelligence. It studies theories and methods that enable effective communication between people and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics. Research in this field involves natural language, that is, the language people use every day, so it is closely related to linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering robots, knowledge graphs and the like.

The scheme provided by the embodiment of the application relates to technologies such as image semantic understanding, voice technology and text processing of artificial intelligence, and is specifically described by the following embodiments:

The modal alignment model training method provided by the embodiments of the application can be applied in the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 via a network. A data storage system may store the data that the server 104 needs to process; it may be integrated on the server 104 or located on the cloud or on other servers. The terminal 102 can send a model training instruction to the server 104, and the server 104 acquires the first training modality information and the second training modality information from the data storage system according to that instruction. The server 104 inputs the first training modality information and the second training modality information into an initial modal alignment model for modality alignment characterization to obtain training modality alignment characterization vectors, which comprise a first modality alignment characterization vector corresponding to the first training modality information and a second modality alignment characterization vector corresponding to the second training modality information, where the representations of the same instance in the first and second modality alignment characterization vectors have an initial correspondence. The server 104 calculates the similarity between the first modality alignment characterization vector and the second modality alignment characterization vector to obtain the modality similarity, calculates the probability distribution distance between the two vectors based on the modality similarity, and obtains vector loss information based on the probability distribution distance. The server 104 then updates the initial modal alignment model through back-propagation based on the vector loss information to obtain an updated modal alignment model, takes the updated modal alignment model as the initial modal alignment model, and returns to the step of acquiring the first training modality information and the second training modality information until the alignment model training completion condition is reached, obtaining a first target modal alignment model. The first target modal alignment model is used to extract semantic representations of information of different modalities, and the semantic representations of the same instance in the different modality information correspond to each other. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer, Internet of Things device or portable wearable device; the Internet of Things device may be a smart speaker, smart television, smart air conditioner, smart vehicle-mounted device or the like, and the portable wearable device may be a smart watch, smart bracelet, headset or the like. The server 104 may be implemented as a stand-alone server or as a server cluster composed of multiple servers.

In one embodiment, as shown in FIG. 2, a modal alignment model training method is provided. The method is described as applied to the server in FIG. 1 by way of example; it is understood that the method can also be applied to a terminal, or to a system comprising a terminal and a server, in which case it is implemented through interaction between the terminal and the server. In this embodiment, the modal alignment model training method includes the following steps:

Step 202, acquiring first training modality information and second training modality information.

The first training modality information is modality information used in training; modality information includes, but is not limited to, text information, image information, voice information and video information. The second training modality information is likewise modality information used in training, and the first training modality information and the second training modality information are of different modalities.

In particular, the server may obtain training samples from a database, the training samples comprising first training modality information and second training modality information. The server can also acquire the first training mode information and the second training mode information uploaded by the terminal. The server may acquire the first training mode information and the second training mode information from a server providing the data. The server may obtain the first training modality information and the second training modality information from the business party.

Step 204, inputting the first training modality information and the second training modality information into an initial modal alignment model for modality alignment characterization to obtain training modality alignment characterization vectors, wherein the training modality alignment characterization vectors comprise a first modality alignment characterization vector corresponding to the first training modality information and a second modality alignment characterization vector corresponding to the second training modality information, and the representations of the same instance in the first modality alignment characterization vector and the second modality alignment characterization vector have an initial correspondence.

The initial modal alignment model is a modal alignment model whose parameters have just been initialized; the initialization can be random initialization, zero initialization, Gaussian initialization and the like. A modal alignment model extracts modality-aligned semantic characterization vectors from modality information, that is, the representations of the same instance in the semantic characterization vectors extracted from different modality information correspond to each other. A training modality alignment characterization vector is the modality-aligned semantic characterization vector of modality information that the model outputs during training. The first modality alignment characterization vector is the modality-aligned semantic characterization vector corresponding to the first training modality information, and the second modality alignment characterization vector is the modality-aligned semantic characterization vector corresponding to the second training modality information. A semantic characterization vector is the vector representation obtained by mapping modality information into a semantic space. The same instance refers to the same instance appearing in different modality information; for example, the "clothing" described in a text and the "clothing" region in an image are the same instance. The initial correspondence is the correspondence extracted with the initialized model parameters.

Specifically, the server may build the initial modal alignment model with a deep neural network and then train it. The first training modality information and the second training modality information are input into the initial modal alignment model, and modality alignment characterization is performed with the initialized modality alignment parameters to obtain the output training modality alignment characterization vectors. The modality alignment parameters are the parameters used to align modalities during semantic vector characterization. The training modality alignment characterization vectors comprise a first modality alignment characterization vector corresponding to the first training modality information and a second modality alignment characterization vector corresponding to the second training modality information, and the representations of the same instance in the two vectors have an initial correspondence.

Step 206, calculating the similarity between the first modality alignment characterization vector and the second modality alignment characterization vector to obtain the modality similarity.

The modality similarity characterizes how similar the two vectors are. The higher the modality similarity, the more similar the first training modality information corresponding to the first modality alignment characterization vector is to the second training modality information corresponding to the second modality alignment characterization vector.

Specifically, the server calculates the similarity between the first modality alignment characterization vector and the second modality alignment characterization vector with a similarity algorithm to obtain the modality similarity. The similarity algorithm includes, but is not limited to, a cosine similarity algorithm, a distance-based similarity algorithm and the like.
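As an illustration of the cosine similarity option named above, the following is a minimal sketch, not the implementation of this application. It assumes that each modality alignment characterization vector is available as a list of per-element vectors (for example one vector per text token and one per image region); the names `cosine_similarity`, `similarity_matrix`, `text_vectors` and `image_vectors` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(first: List[Sequence[float]],
                      second: List[Sequence[float]]) -> List[List[float]]:
    """Pairwise similarity: entry [i][j] compares first[i] with second[j]."""
    return [[cosine_similarity(u, v) for v in second] for u in first]


# Toy data (assumed): two text-element vectors and two image-region vectors.
text_vectors = [[1.0, 0.0], [0.0, 1.0]]
image_vectors = [[2.0, 0.0], [1.0, 1.0]]
sim = similarity_matrix(text_vectors, image_vectors)
```

Each row of `sim` describes how strongly one element of the first modality matches every element of the second modality, which is the kind of quantity a later distance computation can build on.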

Step 208, calculating, based on the modality similarity, the probability distribution distance between the first modality alignment characterization vector and the second modality alignment characterization vector, and obtaining vector loss information based on the probability distribution distance.

The probability distribution distance is the minimum moving distance required to transform the first probability distribution, corresponding to the first modality alignment characterization vector, into the second probability distribution, corresponding to the second modality alignment characterization vector. In other words, it is the minimized sum of the distances by which all points of one probability distribution are moved to their nearest points in the other probability distribution. The vector loss information characterizes the modality alignment error between the first modality alignment characterization vector and the second modality alignment characterization vector. The smaller the vector loss information, the smaller the modality alignment error and the more accurately the modal alignment model aligns modalities.

Specifically, the server may use a probability distribution distance measurement algorithm to calculate, based on the modality similarity, the probability distribution distance between the first modality alignment characterization vector and the second modality alignment characterization vector, and may then use the probability distribution distance directly as the training loss, that is, as the vector loss information. A probability distribution distance measurement algorithm is an algorithm that measures the distance between two probability distributions.
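The text above describes the probability distribution distance as a minimum moving distance but does not name a specific algorithm at this point. The sketch below therefore substitutes a well-known approximation, entropy-regularized optimal transport solved with Sinkhorn iterations, purely for illustration. It makes several assumptions that are not taken from the source: the moving cost is `1 - similarity`, both distributions are uniform over their elements, and the names `sinkhorn_distance`, `reg` and `n_iters` are hypothetical.

```python
import math
from typing import List, Tuple


def sinkhorn_distance(similarity: List[List[float]],
                      reg: float = 0.1,
                      n_iters: int = 200) -> Tuple[float, List[List[float]]]:
    """Approximate transport distance between two uniform distributions.

    Assumption: cost[i][j] = 1 - similarity[i][j], so similar pairs are cheap
    to match. Returns (distance, transport_plan).
    """
    n, m = len(similarity), len(similarity[0])
    cost = [[1.0 - s for s in row] for row in similarity]
    a = [1.0 / n] * n  # assumed uniform mass on first-modality elements
    b = [1.0 / m] * m  # assumed uniform mass on second-modality elements
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    distance = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return distance, plan


# Well-aligned pairs give a distance near zero; uninformative similarity does not.
aligned_distance, aligned_plan = sinkhorn_distance([[1.0, 0.0], [0.0, 1.0]])
flat_distance, _ = sinkhorn_distance([[0.5, 0.5], [0.5, 0.5]])
```

Under these assumptions the returned distance can serve directly as a loss value, mirroring the statement that the probability distribution distance may be used as the vector loss information. The regularization strength `reg` trades accuracy of the approximation against numerical stability.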

Step 210, reversely updating the initial modality alignment model based on the vector loss information to obtain an updated modality alignment model, taking the updated modality alignment model as the initial modality alignment model, and returning to execute the step of obtaining the first training modality information and the second training modality information until reaching the training completion condition of the alignment model to obtain a first target modality alignment model, wherein the first target modality alignment model is used for extracting semantic characterizations of different modality information, and the semantic characterizations of the same instance in the semantic characterizations of different modality information have a corresponding relationship.

The updated modal alignment model is the modal alignment model whose model parameters have been updated. The first target modal alignment model is the trained modal alignment model. It is used to extract semantic representations of information of different modalities, and the semantic representations of the same instance in the different modality information correspond to each other.

Specifically, the server uses the vector loss information to update the initialized model parameters in the initial modal alignment model through back-propagation, obtaining an updated modal alignment model. The model parameters can be updated with a gradient descent algorithm; heuristic algorithms such as simulated annealing, or second-order optimization algorithms such as adaptive learning rate methods, may also be used for the update. If the alignment model training completion condition has not been reached, the updated modal alignment model is taken as the initial modal alignment model and the process returns to the step of acquiring the first training modality information and the second training modality information. This is repeated until the alignment model training completion condition is reached, and the modal alignment model at that point is taken as the first target modal alignment model. The alignment model training completion condition is the condition under which training of the first target modal alignment model is considered complete, and includes, but is not limited to, the number of iterations reaching a maximum, the loss information reaching a preset threshold, or the model parameters no longer changing. The first target modal alignment model is used to extract semantic representations of information of different modalities, and the semantic representations of the same instance in the different modality information correspond to each other.
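The iterate-until-complete logic described above can be sketched as follows. This is a toy illustration under stated assumptions, not the training procedure of this application: the "model" is a plain list of parameters, the loss is a one-dimensional quadratic, and the names `train_until_done`, `next_batch`, `loss_and_grad`, `loss_threshold` and `param_tol` are hypothetical. It only shows how the three completion conditions named in the text (maximum iterations, loss threshold, parameters no longer changing) can be checked around a gradient descent update.

```python
from typing import Callable, List, Tuple


def train_until_done(params: List[float],
                     next_batch: Callable[[], float],
                     loss_and_grad: Callable[[List[float], float],
                                             Tuple[float, List[float]]],
                     lr: float = 0.1,
                     max_iters: int = 1000,
                     loss_threshold: float = 1e-6,
                     param_tol: float = 1e-12) -> Tuple[List[float], str]:
    """Gradient descent loop that stops on any of three completion conditions."""
    reason = "max_iters"
    for _ in range(max_iters):
        batch = next_batch()                     # acquire training information
        loss, grad = loss_and_grad(params, batch)
        if loss <= loss_threshold:               # loss reached preset threshold
            reason = "loss_threshold"
            break
        new_params = [p - lr * g for p, g in zip(params, grad)]
        change = max(abs(n - p) for n, p in zip(new_params, params))
        params = new_params                      # updated model becomes current
        if change <= param_tol:                  # parameters no longer change
            reason = "params_converged"
            break
    return params, reason


def toy_loss_and_grad(params: List[float], target: float) -> Tuple[float, List[float]]:
    """Assumed stand-in for the vector loss: (w - target)^2 and its gradient."""
    diff = params[0] - target
    return diff * diff, [2.0 * diff]


final_params, stop_reason = train_until_done([0.0], lambda: 3.0, toy_loss_and_grad)
```

In a real setting `loss_and_grad` would run the modal alignment model, compute the vector loss information and back-propagate; here the quadratic stands in so that the stopping logic can be read in isolation.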

With the above modal alignment model training method, the first training modality information and the second training modality information are input into the initial modal alignment model for modality alignment characterization to obtain training modality alignment characterization vectors, which comprise a first modality alignment characterization vector corresponding to the first training modality information and a second modality alignment characterization vector corresponding to the second training modality information. The similarity between the first modality alignment characterization vector and the second modality alignment characterization vector is then calculated to obtain the modality similarity. The probability distribution distance between the two vectors is calculated based on the modality similarity, and vector loss information is obtained based on the probability distribution distance. Because the vector loss information is derived from the probability distribution distance, it is more accurate. The initial modal alignment model is then trained based on the vector loss information until the alignment model training completion condition is reached, yielding the first target modal alignment model, so that the trained model aligns modalities more accurately.

In one embodiment, as shown in fig. 3, the method for training a modal alignment model further includes:

Step 302, acquiring a modality alignment label corresponding to the first training modality information and the second training modality information.

The modality alignment label is a pre-annotated label indicating whether the first training modality information and the second training modality information are aligned.

Specifically, the server may obtain, from the database, a modality alignment label corresponding to the first training modality information and the second training modality information. The server can also acquire the mode alignment labels corresponding to the first training mode information and the second training mode information uploaded by the terminal.

Step 304, inputting the first training modality information and the second training modality information into an initial modal alignment classification recognition model for modal alignment characterization to obtain a training modality alignment characterization vector, and performing modal alignment classification recognition based on the training modality alignment characterization vector to obtain a modal alignment classification recognition result.

The initial modal alignment classification recognition model is a modal alignment classification recognition model whose model parameters have been initialized; the modal alignment classification recognition model is a classification model used to recognize whether the same instance is aligned between the first training modality information and the second training modality information. The modal alignment classification recognition result is the result of classifying whether the first training modality information and the second training modality information are modality-aligned; it is either a result that the same instance is aligned between the two or a result that the same instance is not aligned between the two.

Specifically, the server inputs the first training modality information and the second training modality information into the initial modal alignment classification recognition model, performs modal alignment characterization using the initialized modal alignment parameters to obtain a training modality alignment characterization vector, and performs modal alignment classification recognition on the training modality alignment characterization vector using the initial classification recognition parameters to obtain a modal alignment classification recognition result. The initialized modal alignment parameters are used for modal alignment during semantic vector characterization. The initial classification recognition parameters are used to classify whether the same instances in the training modality alignment characterization vector are aligned.

And step 306, performing classification loss calculation based on the modal alignment classification recognition result and the modal alignment label to obtain classification loss information.

The classification loss information characterizes the error between the modal alignment classification recognition result and the modality alignment label: the smaller the classification loss information, the smaller the error and the more accurately the trained modal alignment classification recognition model performs modal alignment.

Specifically, the server may calculate an error between the modal alignment classification recognition result and the modal alignment label using a cross entropy loss function, resulting in classification loss information.
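
A minimal sketch of such a cross entropy calculation is given below. It assumes a binary formulation in which the model outputs a probability that each pair is aligned and the label is 1 for aligned and 0 for not aligned; the function name and the clipping constant `eps` are illustrative choices, not taken from the application.

```python
import numpy as np

def binary_cross_entropy(pred_aligned_prob, labels, eps=1e-12):
    """Mean cross entropy between predicted alignment probabilities and 0/1 labels."""
    p = np.clip(np.asarray(pred_aligned_prob, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    y = np.asarray(labels, dtype=float)
    return float(np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))))

# Two pairs: the first is labelled aligned, the second not aligned.
classification_loss = binary_cross_entropy([0.9, 0.2], [1, 0])
```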

Step 308, performing model loss calculation based on the classification loss information and the vector loss information to obtain model loss information.

The model loss information is used for representing errors of modal alignment classification recognition by the modal alignment classification recognition model during training.

Specifically, the server calculates average loss information of the classification loss information and the vector loss information, and uses the average loss information as model loss information.

In one embodiment, the server may weight the classification loss information and the vector loss information to obtain weighted classification loss information and weighted vector loss information, where the weighting weight may be preset. And then calculating the sum of the weighted classified loss information and the weighted vector loss information to obtain model loss information.
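
The two ways of combining the losses described above, a plain average and a preset weighted sum, can be sketched in a few lines. The function name and the example weights are hypothetical.

```python
def model_loss(classification_loss, vector_loss, weights=None):
    """Combine the two losses: average by default, weighted sum when weights are given."""
    if weights is None:
        return (classification_loss + vector_loss) / 2.0      # average loss
    w_cls, w_vec = weights                                      # preset weighting weights
    return w_cls * classification_loss + w_vec * vector_loss   # weighted sum

average = model_loss(0.4, 0.8)
weighted = model_loss(0.4, 0.8, weights=(0.25, 0.75))
```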

Step 310, updating the initial modality alignment classification recognition model reversely based on the model loss information to obtain an updated modality alignment classification recognition model, taking the updated modality alignment classification recognition model as the initial modality alignment classification recognition model, and returning to execute the step of obtaining the first training modality information and the second training modality information until the training completion condition of the classification model is reached, so as to obtain the target modality alignment classification recognition model.

The classification model training completion condition is the condition under which training of the target modal alignment classification recognition model is complete; it includes the classification loss information reaching a preset threshold, the number of iterations reaching the maximum number of iterations, the model parameters no longer changing, and the like. The target modal alignment classification recognition model refers to a trained modal alignment classification recognition model.

Specifically, the server uses the model loss information to update the initialized model parameters in the initial modal alignment classification recognition model by back propagation, obtaining an updated modal alignment classification recognition model. The parameter update may use a gradient descent algorithm, a heuristic algorithm such as simulated annealing, an adaptive-learning-rate optimization algorithm such as AdamW, and the like. The server then judges whether the classification model training completion condition is met. When it is not met, the updated modal alignment classification recognition model is taken as the initial modal alignment classification recognition model and the process returns to the step of acquiring the first training modality information and the second training modality information; this continues until the condition is met, and the modal alignment classification recognition model at that point is taken as the target modal alignment classification recognition model.

Step 312, obtaining a second target modality alignment model based on the target modality alignment classification identification model.

The second target modality alignment model is a target modality alignment model trained by using the classification loss information and the vector loss information.

Specifically, the server may use the trained modal alignment parameters in the target modal alignment classification recognition model and the network structure corresponding to the modal alignment parameters as the second target modal alignment model.

In one embodiment, the initial modality alignment classification recognition model includes an initial modality alignment characterization network and an initial classification network;

step 304, namely inputting the first training modality information and the second training modality information into an initial modal alignment classification recognition model for modal alignment characterization to obtain a training modality alignment characterization vector, and performing modal alignment classification recognition based on the training modality alignment characterization vector to obtain a modal alignment classification recognition result, comprises the steps of:

performing modal alignment characterization on the first training modality information and the second training modality information through the initial modal alignment characterization network in the initial modal alignment classification recognition model to obtain a training modality alignment characterization vector; and performing modal alignment classification recognition through the initial classification network in the initial modal alignment classification recognition model to obtain a modal alignment classification recognition result.

The initial modal alignment characterization network is a semantic characterization neural network whose parameters have been initialized, and is used to extract semantic characterization vectors of the modality information after modal alignment. The initial classification network is a classification neural network whose parameters have been initialized, and is used to classify whether the training modality alignment characterization vector is modality-aligned.

Specifically, when the initial modal alignment classification recognition model in the server receives the input first training modality information and second training modality information, the initial modal alignment characterization network performs modal alignment characterization and outputs a training modality alignment characterization vector. The training modality alignment characterization vector is then input into the initial classification network for modal alignment classification recognition, and the modal alignment classification recognition result is output.

Step 312, i.e. obtaining a second target modality alignment model based on the target modality alignment classification identification model, comprises the steps of:

and taking the target modality alignment characterization network in the target modality alignment classification identification model as a second target modality alignment model.

The target modal alignment characterization network refers to a trained modal alignment characterization network.

Specifically, the server may directly use the target modality alignment characterization network in the target modality alignment classification identification model as the second target modality alignment model.

In a specific embodiment, fig. 4 provides a schematic diagram of the network architecture of a modal alignment classification recognition model. Specifically, the modal alignment classification recognition model includes a modal alignment network, a fully connected layer and a classification layer, wherein the modal alignment network is built using the network architecture of the encoder network of a Transformer network. The input first training modality information and second training modality information are obtained, and modal alignment characterization is performed through the modal alignment network: attention features are extracted by a multi-head attention network in the modal alignment network, the attention features are merged with the input modality information and normalized, forward propagation is performed through a feed-forward neural network to obtain a forward propagation result, and the forward propagation result is merged with the previous normalization result and normalized, yielding the output training modality alignment characterization vector. Modal alignment classification recognition is then performed on the training modality alignment characterization vector through the fully connected layer and the classification layer to obtain the output modal alignment classification recognition result.
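
The data flow just described (multi-head attention, merge and normalize, feed-forward, merge and normalize, then a fully connected layer and a classification layer) can be sketched in NumPy as below. This is a toy forward pass under several assumptions that are not in the application: a single encoder layer, random untrained weights, ReLU in the feed-forward network, mean pooling over tokens before the fully connected layer, and a two-way softmax output. All function and parameter names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, wq, wk, wv, wo, num_heads):
    n, d = x.shape
    dh = d // num_heads
    def split(m):
        # Project, then split the feature dimension into heads: (heads, n, dh).
        return (x @ m).reshape(n, num_heads, dh).transpose(1, 0, 2)
    q, k, v = split(wq), split(wk), split(wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)       # (heads, n, n)
    heads = softmax(scores) @ v                            # (heads, n, dh)
    return heads.transpose(1, 0, 2).reshape(n, d) @ wo    # concatenate heads, project

def encoder_layer(x, p, num_heads):
    attn = multi_head_attention(x, p["wq"], p["wk"], p["wv"], p["wo"], num_heads)
    h = layer_norm(x + attn)                     # merge attention features with input, normalize
    ff = np.maximum(h @ p["w1"], 0.0) @ p["w2"]  # feed-forward network (ReLU is an assumption)
    return layer_norm(h + ff)                    # merge with previous normalized result, normalize

def classify(aligned, w_fc, w_cls):
    pooled = np.tanh(aligned.mean(axis=0) @ w_fc)  # fully connected layer on mean-pooled tokens (assumed)
    return softmax(pooled @ w_cls)                 # [P(aligned), P(not aligned)]

rng = np.random.default_rng(0)
d, heads, n_tokens = 8, 2, 5
p = {name: rng.normal(scale=0.1, size=(d, d)) for name in ("wq", "wk", "wv", "wo")}
p["w1"] = rng.normal(scale=0.1, size=(d, 4 * d))
p["w2"] = rng.normal(scale=0.1, size=(4 * d, d))
x = rng.normal(size=(n_tokens, d))               # stand-in for the input modality features
aligned_vectors = encoder_layer(x, p, heads)
probs = classify(aligned_vectors, rng.normal(size=(d, d)), rng.normal(size=(d, 2)))
```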

In the above embodiment, the modality alignment label corresponding to the first training modality information and the second training modality information is acquired, and the modal alignment classification recognition result is obtained through the initial modal alignment classification recognition model. Classification loss calculation is then performed based on the modal alignment classification recognition result and the modality alignment label to obtain classification loss information, and model loss calculation is performed using the classification loss information and the vector loss information to obtain model loss information, so that the model loss information is more accurate. Finally, the initial modal alignment classification recognition model is trained using the model loss information to obtain the target modal alignment classification recognition model, and the second target modal alignment model is obtained from the target modal alignment classification recognition model, which improves the accuracy of the second target modal alignment model.

In one embodiment, as shown in fig. 5, step 204, namely inputting the first training modality information and the second training modality information into the initial modal alignment model for modal alignment characterization to obtain a training modality alignment characterization vector, includes:

step 502, feature extraction is performed on the first training mode information and the second training mode information respectively, so as to obtain a first mode feature vector and a second mode feature vector.

The first modal feature vector is the feature vector extracted from the first training modality information. The second modal feature vector is the feature vector extracted from the second training modality information. The first modal feature vector and the second modal feature vector are feature vectors that have not yet been modality-aligned.

Specifically, the server performs feature extraction on the first training modality information and the second training modality information to obtain a first modal feature vector and a second modal feature vector. The two extractions may be performed in parallel, or sequentially in either order: the first training modality information first and then the second, or the second training modality information first and then the first. Different feature extraction methods are adopted for different modality information. For example, when the first training modality information is text information, a text feature extraction algorithm can be used to obtain text feature vectors; when it is image information, an image feature extraction algorithm can be used to obtain image feature vectors; and when it is voice information, the voice can be converted into text and the text feature extraction algorithm can then be used to obtain voice feature vectors, and the like.
Likewise, when the second training modality information is text information, a text feature extraction algorithm can be used to obtain text feature vectors; when it is image information, an image feature extraction algorithm can be used to obtain image feature vectors; and when it is voice information, the voice can be converted into text and the text feature extraction algorithm can then be used to obtain voice feature vectors, and the like. The text feature extraction algorithm may extract text features using a BERT (Bidirectional Encoder Representations from Transformers, a pre-trained language representation model) model, a Long Short-Term Memory (LSTM) model, a Convolutional Neural Network (CNN) model, and the like. The image feature extraction algorithm may extract image features using a Vision Transformer model, or using a model such as ResNet (residual network) or Noisy Student (a semi-supervised vision model).

In a specific embodiment, fig. 6 shows a schematic diagram of the network structure for text feature vector extraction. Specifically, the input text information 'lovely me cat' is acquired and input into a BERT model for classification and recognition. The BERT model performs text semantic vectorization on the text information, and a classification layer then performs classification and recognition. All characterization vectors output by the hidden layer immediately preceding the BERT classification layer are obtained, the vector of the CLS token is removed from them to obtain the characterization vectors corresponding to all the words, and these are used as the text feature vectors extracted from the text information.

In a specific embodiment, fig. 7 shows a schematic diagram of the network structure for image feature vector extraction. Specifically, image modality information is acquired and the image is input into a Vision Transformer model. The Vision Transformer model divides the image into blocks, flattens the blocks into a sequence, inputs the sequence into an image vectorization layer for vectorization, then into an encoding layer for encoding, and finally into a fully connected layer for classification to obtain the classification category. All characterization vectors output by the layer preceding the Vision Transformer classifier (the classification layer), namely the hidden layer, are then obtained, the vector of the CLS token is removed from them to obtain the characterization vectors corresponding to all the image blocks, and these are used as the image feature vectors extracted from the image modality information.
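
In both the text and the image case, the extracted feature vectors are the last hidden states with the CLS vector removed. A minimal sketch of that step follows, assuming the hidden states are available as a `(sequence_length, hidden_size)` array and that the CLS token sits at position 0; both the layout and the function name are assumptions made for illustration.

```python
import numpy as np

def token_features(hidden_states, cls_index=0):
    """Drop the CLS row, keeping one characterization vector per word or image block."""
    return np.delete(hidden_states, cls_index, axis=0)

# Hypothetical hidden states: 1 CLS row followed by 3 word (or image block) rows.
hidden = np.arange(12, dtype=float).reshape(4, 3)
features = token_features(hidden)
```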

Step 504, fusing the first modal feature vector and the second modal feature vector to obtain a fused feature vector.

Specifically, the server may fuse the first modal feature vector and the second modal feature vector. The two vectors may be directly spliced to obtain the fused feature vector; when splicing, the second modal feature vector may be spliced after the first modal feature vector, or the first modal feature vector may be spliced after the second modal feature vector. Alternatively, a vector operation, for example a vector sum operation or a vector product operation, may be performed on the first modal feature vector and the second modal feature vector to obtain the fused feature vector.
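
The fusion options above can be sketched as follows. The sketch assumes each modal feature vector is stored as a `(num_items, dim)` array, that splicing means concatenation along the item axis, and that the sum and product are element-wise and therefore need equal shapes; the function name and mode strings are hypothetical.

```python
import numpy as np

def fuse(first, second, mode="concat"):
    """Fuse two modal feature arrays by splicing or by an element-wise vector operation."""
    if mode == "concat":             # second spliced after first
        return np.concatenate([first, second], axis=0)
    if mode == "concat_reversed":    # first spliced after second
        return np.concatenate([second, first], axis=0)
    if mode == "sum":                # element-wise vector sum
        return first + second
    if mode == "product":            # element-wise vector product
        return first * second
    raise ValueError(f"unknown fusion mode: {mode}")

text_features = np.ones((2, 4))          # e.g. 2 word vectors
image_features = 2.0 * np.ones((3, 4))   # e.g. 3 image block vectors
fused = fuse(text_features, image_features)
```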

Step 506, inputting the fused feature vector into the initial modal alignment model for modal alignment characterization to obtain a training modality alignment characterization vector.

Specifically, the server inputs the fused feature vector into the initial modal alignment model for modal alignment characterization, and a training modality alignment characterization vector is obtained.

In a specific embodiment, the server may input the fused feature vector into an initial modal alignment model of the network structure as shown in fig. 3 for modal alignment characterization, to obtain a training modality alignment characterization vector.

In the above embodiment, feature extraction is performed on the different training modality information to obtain the corresponding feature vectors, and modal alignment characterization is performed after these feature vectors are fused to obtain the training modality alignment characterization vector, which improves the accuracy of the training modality alignment characterization vector.

In one embodiment, the first training modality information comprises text information and the second training modality information comprises picture information;

feature extraction is performed on the first training mode information and the second training mode information respectively to obtain a first mode feature vector and a second mode feature vector, and the feature extraction method comprises the following steps:

inputting the text information into a text feature extraction model, obtaining a text global characterization vector and a text character characterization vector through the text feature extraction model, and taking the text character characterization vector as a first modal feature vector;

inputting the picture information into a picture feature extraction model, obtaining a picture global characterization vector and a picture content characterization vector through the picture feature extraction model, and taking the picture content characterization vector as a second mode feature vector.

The text information refers to modality information in text form, and the picture information refers to modality information in picture form. The text global characterization vector is a vector characterizing the text information as a whole. The text character characterization vector is a characterization vector characterizing the characters in the text. The picture global characterization vector is a vector characterizing the picture information as a whole. The picture content characterization vector is a characterization vector characterizing the content in the picture.

Specifically, the server inputs text information into a text feature extraction model, obtains a text global characterization vector and a text character characterization vector through the text feature extraction model, and then takes the text character characterization vector as a first modal feature vector. The server inputs the picture information into a picture feature extraction model, a picture global characterization vector and a picture content characterization vector are extracted through the picture feature extraction model, and the picture content characterization vector is used as a second mode feature vector. In one embodiment, the vector dimensions of the first modality feature vector and the second modality feature vector are the same.

In the embodiment, the text character characterization vector and the picture content characterization vector are used as the modal feature vectors, so that accuracy of the modal feature vectors in the process of modal alignment can be improved.

In one embodiment, as shown in fig. 8, step 204, namely inputting the first training modality information and the second training modality information into the initial modal alignment model for modal alignment characterization to obtain a training modality alignment characterization vector, includes:

step 802, inputting first training modality information and second training modality information into an initial modality alignment model.

And step 804, respectively extracting features of the first training mode information and the second training mode information through the initial mode alignment model to obtain a first mode feature vector and a second mode feature vector.

Specifically, the server may directly input the first training modality information and the second training modality information into the initial modal alignment model, and the initial modal alignment model then performs feature extraction on them to obtain the first modal feature vector and the second modal feature vector that need to be modality-aligned. Feature extraction may be performed through an initial modal feature extraction network in the initial modal alignment model; for example, a Transformer network may be used to perform modal feature extraction and fusion.

Step 806, fusing the first modal feature vector and the second modal feature vector through the initial modal alignment model to obtain a fused feature vector, and performing modal alignment characterization based on the fused feature vector to obtain a training modality alignment characterization vector.

Specifically, the server fuses the first modal feature vector and the second modal feature vector through the initial modal alignment model to obtain a fused feature vector, wherein the fusion may be direct splicing of the first modal feature vector and the second modal feature vector, or a vector operation on the two. Modal alignment characterization is then performed using the fused feature vector to obtain a training modality alignment characterization vector.

In the above embodiment, feature extraction of the modality information is performed through the initial modal alignment model, the features are fused to obtain a fused feature vector, and modal alignment characterization is then performed directly to obtain the training modality alignment characterization vector, which improves the efficiency of obtaining the training modality alignment characterization vector.

In one embodiment, step 206, namely calculating the similarity between the first modality alignment characterization vector and the second modality alignment characterization vector to obtain the modality similarity degree, includes the steps of:

calculating the cosine distance between the first modality alignment characterization vector and the second modality alignment characterization vector to obtain the modality similarity degree.

Specifically, the server may calculate the similarity between the first modality alignment characterization vector and the second modality alignment characterization vector using a cosine distance algorithm to obtain the modality similarity degree. The modality similarity degree can measure the cost of converting from one modality to another.
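
A pairwise cosine distance of this kind can be computed as sketched below. The sketch assumes the first and second modality alignment characterization vectors are stacked row-wise into two arrays of the same dimensionality, and it uses the common convention that cosine distance is one minus cosine similarity; the function name and `eps` guard are illustrative.

```python
import numpy as np

def cosine_distance_matrix(w, v, eps=1e-12):
    """Return C with C[i, j] = 1 - cos(w[i], v[j]) for row vectors of w and v."""
    w_n = w / np.maximum(np.linalg.norm(w, axis=1, keepdims=True), eps)  # unit-normalize rows
    v_n = v / np.maximum(np.linalg.norm(v, axis=1, keepdims=True), eps)
    return 1.0 - w_n @ v_n.T

w = np.array([[1.0, 0.0], [0.0, 1.0]])    # e.g. two text-side vectors
v = np.array([[1.0, 0.0], [-1.0, 0.0]])   # e.g. two picture-side vectors
cost = cosine_distance_matrix(w, v)
```

Identical directions give a cost of 0, orthogonal directions 1, and opposite directions 2, which is why the value can be read as a cost of converting one vector into the other.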

In one embodiment, the first modality alignment characterization vector and the second modality alignment characterization vector may be normalized to obtain a normalized first modality alignment characterization vector and a normalized second modality alignment characterization vector, and the Euclidean distance between the two normalized vectors is then calculated to obtain the modality similarity degree. For example, the modality similarity degree may be calculated using equation (1).

In equation (1), w_i denotes the first modality alignment characterization vector, v_j denotes the second modality alignment characterization vector, and c denotes the modality similarity degree.
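
Equation (1) itself is not reproduced in this text. A form consistent with the cosine distance described in this embodiment and with the symbols w_i, v_j and c would be the following; it is a reconstruction under that assumption, not the application's verbatim formula.

$$
c(w_i, v_j) = 1 - \frac{w_i^{\top} v_j}{\lVert w_i \rVert_2 \, \lVert v_j \rVert_2} \tag{1}
$$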

In the above embodiment, the similarity between the first modality alignment characterization vector and the second modality alignment characterization vector is calculated using the cosine distance to obtain the modality similarity degree, which improves the accuracy of the modality similarity degree.

In one embodiment, step 208, namely calculating the probability distribution distance between the first modality alignment characterization vector and the second modality alignment characterization vector based on the modality similarity degree, and obtaining the vector loss information based on the probability distribution distance, includes the steps of:

acquiring target probability distribution conversion parameter information; calculating the product of the target probability distribution conversion parameter information and the modality similarity degree to obtain the probability distribution distance between the first modality alignment characterization vector and the second modality alignment characterization vector; and taking the probability distribution distance as the vector loss information.

The target probability distribution conversion parameter information refers to the minimum transport mass, i.e. the optimal transport distance, required when the first modality information is converted into the second modality information; for example, it may be a matrix of conversion parameters between modalities. It is used to characterize the minimum average distance that data must move when moving from one distribution to another, i.e., the minimum cost along the optimal movement path.

Specifically, the server acquires initialized probability distribution conversion parameter information, which may be an identity matrix. Iterative optimization is then performed on the initialized probability distribution conversion parameter information to obtain the target probability distribution conversion parameter information. A product operation is performed on the target probability distribution conversion parameter information and the modality similarity degree to obtain the probability distribution distance between the first modality alignment characterization vector and the second modality alignment characterization vector, and the probability distribution distance is taken directly as the vector loss information.

In a specific embodiment, the probability distribution distance is calculated using a probability distribution distance measurement algorithm, including but not limited to the KL (Kullback-Leibler) divergence, which measures the distance between two random distributions, the JS (Jensen-Shannon) divergence, which measures the similarity of two probability distributions, and the Wasserstein distance algorithm. Fig. 9 shows a schematic diagram of the optimal probability distribution conversion process, in which the target probability distribution conversion parameter information is used to convert one probability distribution into another probability distribution.
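
For reference, the KL and JS divergences named above can be computed for discrete distributions as sketched here. The sketch assumes both inputs are probability vectors over the same support and uses natural logarithms; the function names and the `eps` guard are illustrative, and the Wasserstein distance is left out because it requires a transport plan.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; terms with p[i] == 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps))))

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded above by log(2)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)   # mixture distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = [0.7, 0.2, 0.1]
q = [0.1, 0.2, 0.7]
kl_pq = kl_divergence(p, q)
js_pq = js_divergence(p, q)
```

Unlike KL, the JS divergence is symmetric in its arguments, so `js_divergence(p, q)` equals `js_divergence(q, p)`.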

In a specific embodiment, the loss information may be calculated using equation (2) as shown below as a loss function.

In equation (2), L_wpa denotes the target loss function, through which the minimum loss information is obtained by iterative optimization. μ denotes the first probability distribution, ν denotes the second probability distribution, and D_ot denotes the Wasserstein distance between the first probability distribution and the second probability distribution. T is the target probability distribution conversion parameter information and is represented by a transport matrix. a and b are parameters used in the iterative calculation of T; they may be calculated from the initial probability distribution conversion parameter information, the values of the first probability distribution, and the values of the second probability distribution.
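
Equation (2) is likewise not reproduced in this text. Based on the description of the loss as the product of the transport matrix T and the modality similarity degree, subject to the marginal constraints of the two distributions, one plausible form is shown below. It is a reconstruction under those assumptions; in particular, how the parameters a and b appear in the original formula cannot be recovered from the text, so they are omitted here.

$$
L_{wpa} = D_{ot}(\mu, \nu) = \min_{T \ge 0,\; T\mathbf{1} = \mu,\; T^{\top}\mathbf{1} = \nu} \; \sum_{i} \sum_{j} T_{ij} \, c(w_i, v_j) \tag{2}
$$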

In one embodiment, obtaining the target probability distribution transition parameter information includes the steps of:

the method comprises the steps of obtaining initial probability distribution conversion parameter information, first probability distribution information corresponding to a Ji Biaozheng vector of a first modality pair and second probability distribution information corresponding to a Ji Biaozheng vector of a second modality pair.

The initial probability distribution conversion parameter information refers to initialized probability distribution conversion parameter information, and may be an identity matrix. The first probability distribution information refers to the value of the discrete distribution to which the first modality pair Ji Biaozheng vector belongs when taken as a sample, the second probability distribution information refers to the value of the discrete distribution to which the second modality pair Ji Biaozheng vector belongs when taken as a sample, and the first probability distribution and the second probability distribution are respectively subject to corresponding constraint conditions, wherein the constraint conditions of the first probability distribution can be that the sum of samples of the first probability distribution converted into the second probability distribution is the same as the sum of original samples of the first probability distribution. The constraint of the first probability distribution may be that the sum of samples of the second probability distribution obtained after conversion is the same as the original sum of samples of the first probability distribution.

Specifically, the server acquires the initial probability distribution conversion parameter information, the first probability distribution information corresponding to the first modality alignment characterization vector, and the second probability distribution information corresponding to the second modality alignment characterization vector.

Furthermore, iterative computation is carried out on the initial probability distribution conversion parameter information based on the initial probability distribution conversion parameter information, the modal similarity degree, the first probability distribution information and the second probability distribution information, and when the preset iterative computation completion condition is reached, target probability distribution conversion parameter information is obtained.

Specifically, the server performs an exponential operation with the natural constant as the base on the modal similarity degree to obtain an exponential operation result, and then multiplies the initial probability distribution conversion parameter information by the exponential operation result to obtain a product operation result. The server then obtains a preset unit parameter matrix, calculates the product of the preset unit parameter matrix and the product operation result, and calculates the ratio of the first probability distribution information to this product to obtain a first matrix parameter. Next, the server calculates the product of the first matrix parameter and the transpose of the product operation result, and calculates the ratio of the second probability distribution information to this product to obtain a second matrix parameter. A matrix multiplication is then performed on the product operation result, the first matrix parameter and the second matrix parameter to obtain a target matrix, and a vector formed by the diagonal elements of the target matrix is taken as the probability distribution conversion parameter information obtained by the first iterative computation. Alternatively, the diagonal elements of the first matrix parameter and of the second matrix parameter may be taken as a first vector and a second vector, and the first vector, the second vector and the product operation result are multiplied to obtain the probability distribution conversion parameter information of the first iteration. This probability distribution conversion parameter information is then taken as the initial probability distribution conversion parameter information and the loop iteration continues; when the preset iterative computation completion condition is reached, the target probability distribution conversion parameter information is obtained. The preset iterative computation completion condition refers to a preset upper limit on the number of iterations.
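The iteration described above can be sketched in plain Python as follows. This is a minimal illustration that follows the published IPOT scheme rather than the applicant's implementation. Several details are assumptions not stated in the text: the function name `ipot`, the negated and β-scaled exponent `exp(-C / beta)`, the all-ones initial matrix (the text names an identity matrix as one option), and the uniform starting value of `b`.

```python
import math


def ipot(cost, mu, nu, beta=0.5, num_iters=50):
    """Approximate a transport matrix T between discrete distributions mu and nu.

    cost: n x m list of lists (the modal similarity degree C, used as a cost).
    mu:   length-n list, values of the first probability distribution.
    nu:   length-m list, values of the second probability distribution.
    """
    n, m = len(mu), len(nu)
    # Exponential operation on the cost (assumed form: exp(-C / beta)).
    kernel = [[math.exp(-cost[i][j] / beta) for j in range(m)] for i in range(n)]
    # Initial conversion parameters: all-ones matrix (assumption).
    plan = [[1.0] * m for _ in range(n)]
    # Starting value of the second scaling vector (assumption: uniform).
    b = [1.0 / m] * m
    for _ in range(num_iters):
        # Product operation result: element-wise product of kernel and plan.
        q = [[kernel[i][j] * plan[i][j] for j in range(m)] for i in range(n)]
        # First parameter: ratio of mu to (Q b).
        a = [mu[i] / sum(q[i][j] * b[j] for j in range(m)) for i in range(n)]
        # Second parameter: ratio of nu to (Q^T a).
        b = [nu[j] / sum(q[i][j] * a[i] for i in range(n)) for j in range(m)]
        # Updated plan: diag(a) Q diag(b).
        plan = [[a[i] * q[i][j] * b[j] for j in range(m)] for i in range(n)]
    return plan


# Toy usage: two samples per modality, matching pairs have zero cost.
toy_plan = ipot([[0.0, 1.0], [1.0, 0.0]], [0.5, 0.5], [0.5, 0.5])
```

In this toy case the mass concentrates on the zero-cost diagonal as the iterations proceed, which is the behaviour the alignment loss relies on.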

In a specific embodiment, the target probability distribution conversion parameter information may be calculated using the IPOT (Inexact Proximal point method for Optimal Transport, an approximate solution algorithm for optimal transport) algorithm, or using a regularization-based numerical algorithm such as Sinkhorn (an iterative solution algorithm for optimal transport). In a specific embodiment, the target probability distribution conversion parameter information may also be calculated using a third-party library.

In a specific embodiment, the following formula (3) may be used to perform iterative calculation, and finally obtain the target probability distribution transformation parameter information.

Here, T^(t+1) denotes the probability distribution conversion parameter information obtained after the (t+1)-th iteration. T^(t) denotes the initial probability distribution conversion parameter information for that iteration, that is, the result of the t-th iteration. t is a positive integer. β is a preset value and may be 0.5. C denotes the modal similarity degree. T denotes the probability distribution conversion parameter information to be optimized. B denotes the Bregman divergence.
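Formula (3) itself is not reproduced in this text. Based on the symbols defined above and on the published IPOT method, it presumably takes the proximal-point form below. This is a reconstruction under that assumption, not the original formula; the feasible set Π(μ, ν) (transport matrices whose marginals are μ and ν) and the inner-product notation are introduced here for illustration.

```latex
T^{(t+1)} = \operatorname*{arg\,min}_{T \in \Pi(\mu,\nu)}
  \left\{ \langle C, T \rangle + \beta \, B\!\left(T, T^{(t)}\right) \right\}
```

Under this reading, each iteration trades off the transport cost ⟨C, T⟩ against staying close, in the Bregman sense, to the previous iterate T^(t), with β controlling the strength of the proximity term.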

In the above embodiment, by performing iterative computation on the initial probability distribution conversion parameter information by using the initial probability distribution conversion parameter information, the modal similarity degree, the first probability distribution information and the second probability distribution information, when a preset iterative computation completion condition is reached, the target probability distribution conversion parameter information is obtained, and the accuracy of the obtained target probability distribution conversion parameter information is improved.

In one embodiment, as shown in fig. 10, a modality alignment method is provided. The method is described by taking its application to the server in fig. 1 as an example; it is understood that the method may also be applied to a terminal, or to a system including a terminal and a server and implemented through interaction between the terminal and the server. In this embodiment, the modality alignment method includes the following steps:

step 902, acquiring first to-be-aligned mode information and second to-be-aligned mode information.

Specifically, the first to-be-aligned modality information refers to modality information that needs to be modality-aligned with the second to-be-aligned modality information. The first to-be-aligned modality information may be text information, picture information, voice information, video information, and the like, and the second to-be-aligned modality information is the modality information that needs to be aligned with the first to-be-aligned modality information. The first to-be-aligned modality information and the second to-be-aligned modality information are information of different modalities. The server may acquire the first to-be-aligned modality information and the second to-be-aligned modality information from a database, may acquire them as uploaded by a terminal, or may acquire them from a service provider.

Step 904, inputting the first to-be-aligned modality information and the second to-be-aligned modality information into a first target modality alignment model. The first target modality alignment model is obtained as follows: first training modality information and second training modality information are input into an initial modality alignment model for modality alignment characterization to obtain a training modality alignment characterization vector, the training modality alignment characterization vector including a first modality alignment characterization vector corresponding to the first training modality information and a second modality alignment characterization vector corresponding to the second training modality information; the similarity between the first modality alignment characterization vector and the second modality alignment characterization vector is calculated to obtain a modal similarity degree; the probability distribution distance between the first modality alignment characterization vector and the second modality alignment characterization vector is calculated based on the modal similarity degree, vector loss information is obtained based on the probability distribution distance, and the initial modality alignment model is iteratively updated based on the vector loss information until the alignment model training completion condition is reached.

Specifically, the server may call a pre-trained first target modality alignment model to perform modality alignment on the first to-be-aligned modality information and the second to-be-aligned modality information; that is, the server inputs the first to-be-aligned modality information and the second to-be-aligned modality information into the first target modality alignment model. The first target modality alignment model is obtained as follows: first training modality information and second training modality information are input into an initial modality alignment model for modality alignment characterization to obtain a training modality alignment characterization vector, the training modality alignment characterization vector including a first modality alignment characterization vector corresponding to the first training modality information and a second modality alignment characterization vector corresponding to the second training modality information; the similarity between the first modality alignment characterization vector and the second modality alignment characterization vector is calculated to obtain a modal similarity degree; the probability distribution distance between the first modality alignment characterization vector and the second modality alignment characterization vector is calculated based on the modal similarity degree, vector loss information is obtained based on the probability distribution distance, and the initial modality alignment model is iteratively updated based on the vector loss information until the alignment model training completion condition is reached.

In one embodiment, the server may input the first to-be-aligned modality information and the second to-be-aligned modality information into a modality alignment model trained using any of the foregoing modality alignment model training methods to perform modality alignment characterization. For example, the server may input the first to-be-aligned modality information and the second to-be-aligned modality information into the second target modality alignment model to perform modality alignment characterization.

Step 906, performing modality alignment characterization on the first to-be-aligned modality information and the second to-be-aligned modality information through the first target modality alignment model to obtain a target modality alignment characterization vector.

Specifically, the server performs modality alignment characterization on the first to-be-aligned modality information and the second to-be-aligned modality information through the modality alignment characterization parameters in the first target modality alignment model, and obtains the output target modality alignment characterization vector. The modality alignment characterization parameters refer to the trained model parameters in the first target modality alignment model. The target modality alignment characterization vector may include a modality alignment characterization vector corresponding to the first to-be-aligned modality information and a modality alignment characterization vector corresponding to the second to-be-aligned modality information. The target modality alignment characterization vector may then be used in downstream tasks, such as classifying and identifying modality information or making recommendations based on the modality information.

In one embodiment, the server may perform feature extraction on the first to-be-aligned modality information and the second to-be-aligned modality information to obtain a first to-be-aligned feature vector and a second to-be-aligned feature vector, splice the first to-be-aligned feature vector and the second to-be-aligned feature vector to obtain a spliced vector, and input the spliced vector into the first target modality alignment model for modality alignment characterization to obtain the target modality alignment characterization vector.

According to the above modality alignment method, modality alignment characterization is performed on the first to-be-aligned modality information and the second to-be-aligned modality information through the first target modality alignment model to obtain the target modality alignment characterization vector. The first target modality alignment model is obtained by calculating the similarity between the first modality alignment characterization vector and the second modality alignment characterization vector to obtain the modal similarity degree, calculating the probability distribution distance between the two vectors based on the modal similarity degree, obtaining vector loss information based on the probability distribution distance, and performing iterative training with the vector loss information. The resulting target modality alignment characterization vector is therefore more accurate, which improves the accuracy of modality alignment.

In one embodiment, as shown in fig. 11, the modality alignment method further includes:

step 1102, an initial multi-modal information classification recognition model is built based on the first target modality alignment model.

The initial multi-modal information classification recognition model refers to the multi-modal information classification recognition model before it has been fine-tuned for classification recognition. The multi-modal information classification recognition model is used for classifying and recognizing a plurality of pieces of modal information and identifying the categories in the modal information.

Specifically, the server may use the first target modality alignment model as a pre-trained model and add a fully connected classification network on top of the network structure of the first target modality alignment model to obtain the initial multi-modal information classification recognition model. The initial multi-modal information classification recognition model is then fine-tuned using classification recognition training data.

Step 1104, a multi-modal training sample and a corresponding classification identification tag are obtained.

The multi-modal training sample refers to a training sample comprising information of different modalities, and is used for multi-modal classification and identification. The multi-modal training sample includes two types of modal information. When more than two types of modal information need to be classified and identified, they can be split into pairwise combinations of two types of modal information, each combination is classified and identified, and the classification and identification results are finally merged to obtain the classification and identification result for the more than two types of modal information. The classification identification tag refers to the class label corresponding to the modal information in the multi-modal training sample, and is used for training the classification recognition task.

Specifically, the server may obtain the multimodal training sample and the corresponding classification identification tag from the database, or may obtain the multimodal training sample and the corresponding classification identification tag from the server providing the data service, or may obtain the multimodal training sample and the corresponding classification identification tag uploaded by the terminal.

Step 1106, inputting the multi-modal training sample into the initial multi-modal information classification recognition model, performing modality alignment characterization on the multi-modal training sample through the initial multi-modal information classification recognition model to obtain a training modality alignment characterization vector, and performing multi-modal classification recognition based on the training modality alignment characterization vector to obtain an initial classification recognition result.

Specifically, the server inputs the multi-modal training sample into the initial multi-modal information classification recognition model, performs modality alignment characterization on the multi-modal training sample through the modality alignment characterization parameters in the initial multi-modal information classification recognition model to obtain the training modality alignment characterization vector, and performs multi-modal classification recognition on the training modality alignment characterization vector through the initial classification recognition parameters to obtain the initial classification recognition result. The initial classification recognition parameters refer to initialized classification recognition parameters. The modality alignment characterization parameters refer to the model parameters in the first target modality alignment model.

Step 1108, performing multi-modal classification recognition loss calculation based on the initial classification recognition result and the classification identification tag to obtain multi-modal classification recognition loss information.

Specifically, the server calculates an error between the initial classification recognition result and the classification recognition tag using the classification loss function, resulting in multi-modal classification recognition loss information. Wherein the classification loss function may use a cross entropy loss function.

Step 1110, reversely updating the initial multi-modal information classification recognition model based on the multi-modal classification recognition loss information and performing loop iteration to obtain a target multi-modal classification recognition model.

Specifically, the server reversely updates the initial multi-modal information classification recognition model through multi-modal classification recognition loss information by using a gradient descent algorithm to obtain an updated multi-modal information classification recognition model, then takes the updated multi-modal information classification recognition model as the initial multi-modal information classification recognition model, and returns to the step of acquiring the multi-modal training sample and the corresponding classification recognition label to execute until a preset training completion condition is reached, so that the target multi-modal classification recognition model is obtained.
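The fine-tuning loop in the steps above can be illustrated with a small self-contained sketch. It is a simplification under stated assumptions, not the applicant's implementation: the alignment characterization vectors are treated as fixed inputs, so only a single fully connected softmax layer is updated (the text updates the whole model), and the function names, learning rate, epoch count and toy data are all hypothetical.

```python
import math


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    # Cross entropy between predicted probabilities and an integer class label.
    return -math.log(max(probs[label], 1e-12))


def forward(weights, bias, features):
    # Fully connected classification layer followed by softmax.
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


def predict(weights, bias, features):
    probs = forward(weights, bias, features)
    return probs.index(max(probs))


def train_head(samples, labels, num_classes, lr=0.5, epochs=200):
    """Train the classification layer by stochastic gradient descent."""
    dim = len(samples[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    bias = [0.0] * num_classes
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            probs = forward(weights, bias, features)
            for k in range(num_classes):
                # Gradient of cross entropy w.r.t. logit k: p_k - one_hot_k.
                grad = probs[k] - (1.0 if k == label else 0.0)
                for d in range(dim):
                    weights[k][d] -= lr * grad * features[d]
                bias[k] -= lr * grad
    return weights, bias


# Toy alignment vectors (hypothetical) with two classes.
toy_samples = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = [0, 0, 1, 1]
toy_weights, toy_bias = train_head(toy_samples, toy_labels, num_classes=2)
```

In a full implementation the gradient would also flow back into the pre-trained alignment network, which is what makes the procedure fine-tuning rather than training a separate classifier.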

In the above embodiment, the initial multi-modal information classification and identification model is built by using the first target modal alignment model, and then the initial multi-modal information classification and identification model is trained, so that the target multi-modal classification and identification model is obtained, and the obtained target multi-modal classification and identification model can improve the accuracy of classification and identification.

In one embodiment, after the last step above, that is, after the target multi-modal classification recognition model is obtained, the method further includes:

acquiring to-be-classified modality information; inputting the to-be-classified modality information into the target multi-modal classification recognition model; performing modality alignment characterization on the to-be-classified modality information through the target multi-modal classification recognition model to obtain a target modality alignment characterization vector; and performing multi-modal classification recognition on the target modality alignment characterization vector through the target multi-modal classification recognition model to obtain a classification recognition result corresponding to the to-be-classified modality information.

The to-be-classified modality information refers to modality information that needs to be classified and identified. Each piece of to-be-classified modality information comprises information of at least two different modalities.

Specifically, the server acquires the to-be-classified modality information. When the to-be-classified modality information comprises information of two different modalities, that is, two pieces of to-be-classified modality information are obtained, the two pieces of to-be-classified modality information are input directly into the target multi-modal classification recognition model. Modality alignment characterization is performed on them through the target multi-modal classification recognition model to obtain a target modality alignment characterization vector, and multi-modal classification recognition is performed on the target modality alignment characterization vector through the target multi-modal classification recognition model to obtain the classification recognition result corresponding to the two pieces of to-be-classified modality information.

When the to-be-classified modality information comprises information of more than two different modalities, the information is combined in pairs to obtain combined to-be-classified modality information. Each combination is then input into the target multi-modal classification recognition model for classification recognition to obtain the classification recognition result output for that combination, and the classification recognition results of all combinations are finally fused to obtain the classification recognition result corresponding to the information of more than two different modalities.
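The pairwise handling of more than two modalities can be sketched as follows. The text does not specify how the per-pair results are fused, so averaging the class probabilities is an assumption made here for illustration; `classify_multimodal` and `toy_pair_classifier` are hypothetical names, and the toy classifier merely stands in for the target multi-modal classification recognition model.

```python
from itertools import combinations


def classify_multimodal(modalities, classify_pair):
    """Classify more than two modalities by fusing pairwise results.

    modalities:    list of modality inputs (any objects).
    classify_pair: callable taking two modality inputs and returning a list
                   of class probabilities (stand-in for the trained model).
    """
    # Combine the modality information in pairs.
    pair_results = [classify_pair(a, b) for a, b in combinations(modalities, 2)]
    num_classes = len(pair_results[0])
    # Fuse the per-pair results (assumption: simple probability averaging).
    fused = [sum(result[k] for result in pair_results) / len(pair_results)
             for k in range(num_classes)]
    return fused.index(max(fused)), fused


def toy_pair_classifier(first, second):
    # Hypothetical stand-in: returns fixed probabilities depending on the inputs.
    if (len(first) + len(second)) % 2 == 0:
        return [0.8, 0.2]
    return [0.3, 0.7]


label, probabilities = classify_multimodal(
    ["text", "image", "audio"], toy_pair_classifier)
```

Three modalities yield three pairwise combinations; other fusion rules, such as majority voting or a weighted average, would fit the same structure.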

In the above embodiment, the target multi-modal classification recognition model is used to classify and identify the to-be-classified modality information to obtain the corresponding classification recognition result, which improves the accuracy of classification recognition.

In a specific embodiment, as shown in fig. 12, a flow chart of a method for training a modal alignment model includes the following steps:

step 1202, acquiring first training mode information and second training mode information and a mode alignment label. Respectively extracting features of the first training mode information and the second training mode information to obtain a first mode feature vector and a second mode feature vector; and fusing the first modal feature vector and the second modal feature vector to obtain a fused feature vector.

Step 1204, inputting the fused feature vector into an initial modality alignment classification recognition model for modality alignment characterization to obtain a training modality alignment characterization vector, and performing modality alignment classification recognition based on the training modality alignment characterization vector to obtain a modality alignment classification recognition result, where the training modality alignment characterization vector comprises a first modality alignment characterization vector corresponding to the first training modality information and a second modality alignment characterization vector corresponding to the second training modality information.

Step 1206, calculating the similarity between the first modality alignment characterization vector and the second modality alignment characterization vector to obtain a modal similarity degree; acquiring initial probability distribution conversion parameter information, first probability distribution information corresponding to the first modality alignment characterization vector, and second probability distribution information corresponding to the second modality alignment characterization vector.

Step 1208, performing iterative computation on the initial probability distribution transformation parameter information based on the initial probability distribution transformation parameter information, the modal similarity degree, the first probability distribution information and the second probability distribution information, and obtaining target probability distribution transformation parameter information when a preset iterative computation completion condition is reached.

Step 1210, calculating the product of the target probability distribution conversion parameter information and the modal similarity degree to obtain the probability distribution distance between the first modality alignment characterization vector and the second modality alignment characterization vector, and using the probability distribution distance as the vector loss information. Classification loss calculation is performed based on the modality alignment classification recognition result and the modality alignment label to obtain classification loss information, and model loss calculation is performed based on the classification loss information and the vector loss information to obtain model loss information.

Step 1212, updating the initial modal alignment classification recognition model reversely based on the model loss information to obtain an updated modal alignment classification recognition model, taking the updated modal alignment classification recognition model as the initial modal alignment classification recognition model, and returning to execute the step of obtaining the first training modal information and the second training modal information until the training completion condition of the classification model is reached to obtain a target modal alignment classification recognition model;

step 1214, obtaining a second target modality alignment model based on the target modality alignment classification identification model.
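The loss computation in steps 1206 to 1210 can be sketched as below. This is an illustration under assumptions rather than the method as claimed: the modal similarity degree is taken to be a cosine distance (one minus cosine similarity) so that it can act as a transport cost, the probability distribution distance is taken to be the sum of the element-wise product of the transport matrix and that cost, and the model loss is taken to be a weighted sum with a hypothetical `weight` parameter. The transport matrix is passed in ready-made here.

```python
import math


def cosine_cost(first_vectors, second_vectors):
    """Pairwise cost matrix between two sets of alignment vectors."""
    def cosine(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
        return dot / norm
    # Assumption: cost = 1 - cosine similarity, so identical directions cost 0.
    return [[1.0 - cosine(u, v) for v in second_vectors] for u in first_vectors]


def transport_distance(plan, cost):
    # Sum of the element-wise product of transport matrix and cost matrix.
    return sum(t * c
               for plan_row, cost_row in zip(plan, cost)
               for t, c in zip(plan_row, cost_row))


def model_loss(classification_loss, vector_loss, weight=1.0):
    # Assumption: the two losses are combined as a weighted sum.
    return classification_loss + weight * vector_loss


# Toy usage: two text-side vectors and two image-side vectors (hypothetical).
toy_cost = cosine_cost([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
aligned = transport_distance([[0.5, 0.0], [0.0, 0.5]], toy_cost)
unaligned = transport_distance([[0.25, 0.25], [0.25, 0.25]], toy_cost)
```

With these toy vectors the plan that sends each sample to its matching counterpart yields a smaller distance than the uniform plan, so minimizing the vector loss pushes matching instances of the two modalities toward each other.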

In a specific embodiment, the modality alignment method is applied to an e-commerce platform. A merchant uploads information on a commodity to be listed, the commodity information including the title 'new fashion waisted lacing with cap clothing' and an original schematic picture of the commodity. The category of the commodity information then needs to be identified, so the platform server inputs the title text and the original commodity picture into the target modality alignment classification recognition model for modality alignment and classification recognition, and obtains the identified clothing category. In the text characterization vector and the image characterization vector obtained through the target modality alignment classification recognition model, the vectors representing the clothes have a corresponding relationship. Classification recognition is then performed using the text characterization vector and the image characterization vector to obtain the identified clothing category, which improves the accuracy of classification recognition. Fig. 13 is a schematic diagram of the modality alignment effect, in which the left image is the commodity image, the middle image shows the alignment effect before modality alignment, and the right image shows the alignment effect after modality alignment. It can be seen that, in the present application, the clothing commodity in the image is better aligned with the clothes in the text description; that is, the modality alignment is more accurate and the alignment result is clearer, so the accuracy of subsequent task processing is higher.

It should be understood that, although the steps in the flowcharts of the above embodiments are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the steps are not strictly limited in execution order and may be executed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; these sub-steps or stages are not necessarily performed sequentially either, but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

Based on the same inventive concept, the embodiments of the application also provide a modality alignment model training device for implementing the above modality alignment model training method and a modality alignment device for implementing the above modality alignment method. The solution provided by the devices is implemented in a manner similar to that described for the above methods, so for the specific limitations in the one or more embodiments of the modality alignment model training device or the modality alignment device provided below, reference may be made to the limitations on the modality alignment model training method or the modality alignment method above, which are not repeated here.

In one embodiment, as shown in fig. 14, a modality alignment model training device 1400 is provided, comprising: an information acquisition module 1402, an initial alignment module 1404, a similarity calculation module 1406, a loss calculation module 1408, and an iteration module 1410, wherein:

an information acquisition module 1402, configured to acquire first training mode information and second training mode information;

the initial alignment module 1404 is configured to input the first training modality information and the second training modality information into an initial modality alignment model for modality alignment characterization to obtain a training modality alignment characterization vector, where the training modality alignment characterization vector includes a first modality alignment characterization vector corresponding to the first training modality information and a second modality alignment characterization vector corresponding to the second training modality information, and the characterizations of the same instance in the first modality alignment characterization vector and the second modality alignment characterization vector have an initial correspondence;

a similarity calculation module 1406, configured to calculate the similarity between the first modality alignment characterization vector and the second modality alignment characterization vector to obtain a modal similarity degree;

a loss calculation module 1408, configured to calculate the probability distribution distance between the first modality alignment characterization vector and the second modality alignment characterization vector based on the modal similarity degree, and obtain vector loss information based on the probability distribution distance;

The iteration module 1410 is configured to reversely update the initial modality alignment model based on the vector loss information to obtain an updated modality alignment model, take the updated modality alignment model as the initial modality alignment model, and return to the step of acquiring the first training modality information and the second training modality information until the alignment model training completion condition is reached, so as to obtain a first target modality alignment model, where the first target modality alignment model is used to extract semantic characterizations of different modality information, and the semantic characterizations of the same instance in the different modality information have a corresponding relationship.

In one embodiment, the modality alignment model training device 1400 further includes:

the label acquisition module is used for acquiring a modal alignment label corresponding to the first training modal information and the second training modal information;

the classification module is used for inputting the first training modality information and the second training modality information into the initial modality alignment classification recognition model for modality alignment characterization to obtain a training modality alignment characterization vector, and performing modality alignment classification recognition based on the training modality alignment characterization vector to obtain a modality alignment classification recognition result;

the classification loss calculation module is used for carrying out classification loss calculation based on the modal alignment classification recognition result and the modal alignment label to obtain classification loss information;

The model loss obtaining module is used for performing model loss calculation based on the classification loss information and the vector loss information to obtain model loss information;

the classification model iteration module is used for reversely updating the initial modal alignment classification recognition model based on the model loss information to obtain an updated modal alignment classification recognition model, taking the updated modal alignment classification recognition model as the initial modal alignment classification recognition model, and returning to execute the step of acquiring the first training modal information and the second training modal information until the training completion condition of the classification model is reached to obtain a target modal alignment classification recognition model;

and the second model obtaining module is used for obtaining a second target modality alignment model based on the target modality alignment classification recognition model.

the classification module is also used for carrying out modality alignment characterization on the first training modality information and the second training modality information through an initial modal alignment characterization network in the initial modal alignment classification recognition model to obtain the training modality alignment characterization vector; and performing modal alignment classification recognition through an initial classification network in the initial modal alignment classification recognition model to obtain the modal alignment classification recognition result;

The second model obtaining module is further used for taking the target modality alignment characterization network in the target modality alignment classification identification model as a second target modality alignment model.
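As an illustration of how the classification loss calculation module and the model loss obtaining module described above might fit together, the following is a minimal sketch. It assumes that the modal alignment label is binary (1 for an aligned pair, 0 otherwise), that the classification loss is a binary cross-entropy, and that the model loss is a weighted sum; none of these choices is stated in this section, and the names `binary_cross_entropy`, `model_loss` and `vector_weight` are hypothetical.

```python
import math


def binary_cross_entropy(prob_aligned: float, label: int, eps: float = 1e-12) -> float:
    """Classification loss for one pair (assumed form, not taken from the source).

    prob_aligned: predicted probability that the two inputs are aligned.
    label: modal alignment label, 1 if aligned and 0 otherwise.
    """
    # Clamp the probability so that log() never receives 0.
    p = min(max(prob_aligned, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def model_loss(classification_loss: float, vector_loss: float,
               vector_weight: float = 1.0) -> float:
    """Combine the two losses as a weighted sum (an assumption)."""
    if vector_weight < 0:
        raise ValueError("vector_weight must be non-negative")
    return classification_loss + vector_weight * vector_loss


# Example: a pair labelled as aligned, predicted aligned with probability 0.8,
# together with a vector loss of 0.1 weighted by 2.0.
cls_loss = binary_cross_entropy(0.8, 1)
total = model_loss(cls_loss, 0.1, vector_weight=2.0)
```

The weight gives one possible way to balance the supervised classification signal against the vector loss; other combinations would be equally consistent with the description.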

In one embodiment, the initial alignment module 1404 is further configured to perform feature extraction on the first training modality information and the second training modality information respectively, to obtain a first modal feature vector and a second modal feature vector; fuse the first modal feature vector and the second modal feature vector to obtain a fused feature vector; and input the fused feature vector into the initial modality alignment model for modality alignment characterization to obtain the training modality alignment characterization vector.

the initial alignment module 1404 is further configured to input text information into a text feature extraction model, obtain a text global characterization vector and a text character characterization vector through the text feature extraction model, and use the text character characterization vector as the first modal feature vector; and input picture information into a picture feature extraction model, obtain a picture global characterization vector and a picture content characterization vector through the picture feature extraction model, and use the picture content characterization vector as the second modal feature vector.
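The fusing step above can be sketched as follows. This is only an illustration under an assumption: that fusion means concatenating the sequence of text character characterization vectors with the sequence of picture content characterization vectors. The section does not specify the fusion operation, and the function name `fuse_features` and the modality id list are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def fuse_features(first: List[Vector], second: List[Vector]) -> Tuple[List[Vector], List[int]]:
    """Concatenate two sequences of feature vectors into one fused sequence.

    Returns the fused sequence and a parallel list of modality ids
    (0 for the first modality, 1 for the second), so that a downstream
    model can still tell which positions came from which modality.
    """
    vectors = first + second
    if not vectors:
        raise ValueError("at least one feature vector is required")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all feature vectors must share one dimension")
    modality_ids = [0] * len(first) + [1] * len(second)
    return vectors, modality_ids


# Toy inputs: three text character vectors and two picture content vectors.
text_char_vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
picture_content_vectors = [[0.7, 0.8], [0.9, 1.0]]
fused, modality_ids = fuse_features(text_char_vectors, picture_content_vectors)
```

Concatenation requires both feature extraction models to emit vectors of the same dimension; if they do not, a projection layer would be needed first.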

In one embodiment, the initial alignment module 1404 is further configured to input the first training modality information and the second training modality information into the initial modality alignment model; perform feature extraction on the first training modality information and the second training modality information respectively through the initial modality alignment model to obtain a first modal feature vector and a second modal feature vector; and fuse the first modal feature vector and the second modal feature vector through the initial modality alignment model to obtain a fused feature vector, and perform modality alignment characterization based on the fused feature vector to obtain the training modality alignment characterization vector.

In one embodiment, the similarity calculation module 1406 is further configured to calculate the cosine distance between the first modality alignment characterization vector and the second modality alignment characterization vector to obtain the modality similarity degree.
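The cosine distance calculation can be illustrated with a short sketch. It assumes that each alignment characterization vector is a sequence of per-instance vectors, that the cosine distance is defined as one minus the cosine similarity, and that the result is arranged as a pairwise matrix; the function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def cosine_distance_matrix(first: List[Vector], second: List[Vector]) -> List[List[float]]:
    """Pairwise cosine distances: entry [i][j] compares first[i] with second[j]."""
    return [[1.0 - cosine_similarity(u, v) for v in second] for u in first]


# Toy example with two instances per modality.
distances = cosine_distance_matrix([[1.0, 0.0], [0.0, 1.0]],
                                   [[1.0, 0.0], [1.0, 1.0]])
```

Under this definition, identical directions give a distance of 0 and orthogonal directions give a distance of 1.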

In one embodiment, the loss calculation module 1408 is further configured to obtain target probability distribution conversion parameter information; calculate the product of the target probability distribution conversion parameter information and the modality similarity degree to obtain the probability distribution distance between the first modality alignment characterization vector and the second modality alignment characterization vector; and take the probability distribution distance as the vector loss information.
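The product described above can be sketched as follows, under the assumption that the target probability distribution conversion parameter information is a matrix (here called a transport plan) with the same shape as the pairwise modality similarity matrix, and that the "product" is the element-wise product summed over all pairs, as in optimal transport. Both the interpretation and the name `transport_distance` are assumptions.

```python
from typing import List

Matrix = List[List[float]]


def transport_distance(plan: Matrix, cost: Matrix) -> float:
    """Sum over all pairs (i, j) of plan[i][j] * cost[i][j].

    plan: assumed conversion parameters, one weight per instance pair.
    cost: pairwise modality similarity degree (for example, cosine distance).
    """
    if len(plan) != len(cost) or any(len(p) != len(c) for p, c in zip(plan, cost)):
        raise ValueError("plan and cost must have the same shape")
    return sum(t * c
               for plan_row, cost_row in zip(plan, cost)
               for t, c in zip(plan_row, cost_row))


# Toy example: all mass is placed on the diagonal pairs.
plan = [[0.5, 0.0],
        [0.0, 0.5]]
cost = [[0.0, 1.0],
        [1.0, 0.2]]
vector_loss = transport_distance(plan, cost)  # 0.5 * 0.0 + 0.5 * 0.2
```

A smaller value means the two sets of characterization vectors can be matched to each other at low cost, which is why the distance can serve directly as the vector loss.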

In one embodiment, the loss calculation module 1408 is further configured to obtain initial probability distribution conversion parameter information, first probability distribution information corresponding to the first modality alignment characterization vector, and second probability distribution information corresponding to the second modality alignment characterization vector; and iteratively update the initial probability distribution conversion parameter information based on the modality similarity degree, the first probability distribution information and the second probability distribution information, and obtain the target probability distribution conversion parameter information when a preset iterative calculation completion condition is reached.
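One well-known iterative scheme that fits this description is the Sinkhorn iteration for entropy-regularized optimal transport. The sketch below swaps in that technique as an assumption; this section does not name the algorithm. The first and second probability distribution information are taken to be marginal weight vectors `a` and `b`, the initial conversion parameters are represented by scaling vectors initialized to ones, and `epsilon`, `max_iters` and `tol` are hypothetical parameters.

```python
import math
from typing import List

Matrix = List[List[float]]


def sinkhorn_plan(cost: Matrix, a: List[float], b: List[float],
                  epsilon: float = 0.1, max_iters: int = 1000,
                  tol: float = 1e-9) -> Matrix:
    """Iteratively compute a transport plan whose row sums approach a
    and whose column sums equal b (assumed Sinkhorn scheme)."""
    n, m = len(a), len(b)
    # Gibbs kernel; cosine distances lie in [0, 2], so exp() stays well-scaled
    # for the default epsilon.
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u = [1.0] * n  # initial scaling, playing the role of the initial parameters
    v = [1.0] * m
    for _ in range(max_iters):
        u_new = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u_new[i] for i in range(n)) for j in range(m)]
        change = max(abs(x - y) for x, y in zip(u_new, u))
        u = u_new
        if change < tol:  # preset completion condition (assumed form)
            break
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy example: matching instances are cheap, mismatched ones are expensive.
target_plan = sinkhorn_plan([[0.0, 1.0], [1.0, 0.0]], [0.5, 0.5], [0.5, 0.5])
```

In this toy example most of the mass ends up on the low-cost diagonal pairs; a fixed iteration count would be an equally plausible completion condition.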

In one embodiment, as shown in fig. 15, there is provided a modality alignment apparatus 1500, comprising a to-be-aligned information acquiring module 1502, an input module 1504 and a modality alignment module 1506, wherein:

the to-be-aligned information acquiring module 1502 is configured to acquire first to-be-aligned modal information and second to-be-aligned modal information;

an input module 1504, configured to input the first to-be-aligned modality information and the second to-be-aligned modality information into a first target modality alignment model, wherein the first target modality alignment model is obtained by: inputting first training modality information and second training modality information into an initial modality alignment model for modality alignment characterization to obtain a training modality alignment characterization vector, the training modality alignment characterization vector comprising a first modality alignment characterization vector corresponding to the first training modality information and a second modality alignment characterization vector corresponding to the second training modality information; calculating the degree of similarity between the first modality alignment characterization vector and the second modality alignment characterization vector to obtain a modality similarity degree; calculating a probability distribution distance between the first modality alignment characterization vector and the second modality alignment characterization vector based on the modality similarity degree, and obtaining vector loss information based on the probability distribution distance; and iteratively updating the initial modality alignment model based on the vector loss information until the alignment model training completion condition is reached;

the modality alignment module 1506 is configured to perform modality alignment characterization on the first to-be-aligned modality information and the second to-be-aligned modality information through the first target modality alignment model, so as to obtain a target modality alignment characterization vector.

In one embodiment, the modality alignment apparatus 1500 further includes:

the classification recognition model training module is used for establishing an initial multi-modal information classification recognition model based on the first target modality alignment model; acquiring a multi-modal training sample and a corresponding classification recognition label; inputting the multi-modal training sample into the initial multi-modal information classification recognition model, carrying out modality alignment characterization on the multi-modal training sample through the initial multi-modal information classification recognition model to obtain a training modality alignment characterization vector, and carrying out multi-modal classification recognition based on the training modality alignment characterization vector to obtain an initial classification recognition result; performing multi-modal classification recognition loss calculation based on the initial classification recognition result and the classification recognition label to obtain multi-modal classification recognition loss information; and reversely updating the initial multi-modal information classification recognition model based on the multi-modal classification recognition loss information and carrying out loop iteration to obtain a target multi-modal classification recognition model.

In one embodiment, the modality alignment apparatus 1500 further includes:

the classification recognition module is used for acquiring modality information to be classified and recognized; inputting the modality information to be classified and recognized into the target multi-modal classification recognition model; performing modality alignment characterization on the modality information to be classified and recognized through the target multi-modal classification recognition model to obtain a target modality alignment characterization vector; and carrying out multi-modal classification recognition on the target modality alignment characterization vector through the target multi-modal classification recognition model to obtain a classification recognition result corresponding to the modality information to be classified and recognized.

Each module in the above-described modal alignment model training apparatus or modal alignment apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or be independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the above modules.

In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 16. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used to store various modality data used in training. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a modality alignment model training method or a modality alignment method.

In one embodiment, a computer device is provided, which may be a terminal, and the internal structure thereof may be as shown in fig. 17. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input device. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface, the display unit and the input device are connected to the system bus through the input/output interface. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The input/output interface of the computer device is used to exchange information between the processor and an external device. The communication interface of the computer device is used for wired or wireless communication with an external terminal, and the wireless communication may be implemented through Wi-Fi, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by the processor to implement a modality alignment model training method or a modality alignment method.
The display unit of the computer device is used for forming a visual picture, and may be a display screen, a projection device or a virtual reality imaging device, where the display screen may be a liquid crystal display screen or an electronic ink display screen. The input device of the computer device may be a touch layer covering the display screen, may be a key, a track ball or a touch pad arranged on the housing of the computer device, or may be an external keyboard, touch pad, mouse or the like.

It will be appreciated by those skilled in the art that the structures shown in fig. 16 or 17 are merely block diagrams of portions of structures associated with the present inventive arrangements and are not limiting of the computer device to which the present inventive arrangements may be implemented, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.

In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.

In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.

In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.

It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region.

Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high density embedded nonvolatile Memory, resistive random access Memory (ReRAM), magnetic random access Memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric Memory (Ferroelectric Random Access Memory, FRAM), phase change Memory (Phase Change Memory, PCM), graphene Memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like. The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.

The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to be within the scope of this description.

The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.

Claims

1. A method of training a modal alignment model, the method comprising:

acquiring first training modality information and second training modality information;

inputting the first training modality information and the second training modality information into an initial modality alignment model for modality alignment characterization to obtain a training modality alignment characterization vector, wherein the training modality alignment characterization vector comprises a first modality alignment characterization vector corresponding to the first training modality information and a second modality alignment characterization vector corresponding to the second training modality information, and the characterizations of the same instance in the first modality alignment characterization vector and the second modality alignment characterization vector have an initial corresponding relationship;

calculating the degree of similarity between the first modality alignment characterization vector and the second modality alignment characterization vector to obtain a modality similarity degree;

calculating a probability distribution distance between the first modality alignment characterization vector and the second modality alignment characterization vector based on the modality similarity degree, and obtaining vector loss information based on the probability distribution distance;

and reversely updating the initial modal alignment model based on the vector loss information to obtain an updated modal alignment model, taking the updated modal alignment model as the initial modal alignment model, and returning to execute the step of acquiring the first training modal information and the second training modal information until reaching the training completion condition of the alignment model to obtain a first target modal alignment model, wherein the first target modal alignment model is used for extracting semantic representations of different modal information, and the semantic representations of the same instance in the semantic representations of the different modal information have a corresponding relation.

2. The method according to claim 1, characterized in that the method further comprises:

acquiring a modality alignment label corresponding to the first training modality information and the second training modality information;

inputting the first training modality information and the second training modality information into an initial modal alignment classification recognition model for modality alignment characterization to obtain the training modality alignment characterization vector, and carrying out modal alignment classification recognition based on the training modality alignment characterization vector to obtain a modal alignment classification recognition result;

based on the modal alignment classification recognition result and the modal alignment label, performing classification loss calculation to obtain classification loss information;

calculating a model loss based on the classification loss information and the vector loss information to obtain model loss information;

reversely updating the initial modal alignment classification recognition model based on the model loss information to obtain an updated modal alignment classification recognition model, taking the updated modal alignment classification recognition model as the initial modal alignment classification recognition model, and returning to execute the step of acquiring the first training modal information and the second training modal information until the training completion condition of the classification model is reached to obtain a target modal alignment classification recognition model;

and obtaining a second target modality alignment model based on the target modality alignment classification recognition model.

3. The method of claim 2, wherein the initial modality alignment classification recognition model includes an initial modality alignment characterization network and an initial classification network;

the inputting the first training modality information and the second training modality information into an initial modal alignment classification recognition model for modality alignment characterization to obtain the training modality alignment characterization vector, and carrying out modal alignment classification recognition based on the training modality alignment characterization vector to obtain a modal alignment classification recognition result, comprises:

performing modality alignment characterization on the first training modality information and the second training modality information through the initial modality alignment characterization network in the initial modal alignment classification recognition model to obtain the training modality alignment characterization vector;

performing modal alignment classification recognition through an initial classification network in the initial modal alignment classification recognition model to obtain a modal alignment classification recognition result;

the obtaining a second target modality alignment model based on the target modality alignment classification recognition model includes:

and taking the target modality alignment characterization network in the target modality alignment classification identification model as the second target modality alignment model.

4. The method of claim 1, wherein the inputting the first training modality information and the second training modality information into an initial modality alignment model for modality alignment characterization to obtain a training modality alignment characterization vector comprises:

respectively extracting features of the first training modality information and the second training modality information to obtain a first modal feature vector and a second modal feature vector;

fusing the first modal feature vector and the second modal feature vector to obtain a fused feature vector;

and inputting the fused feature vector into the initial modality alignment model for modality alignment characterization to obtain the training modality alignment characterization vector.

5. The method of claim 4, wherein the first training modality information comprises text information and the second training modality information comprises picture information;

the respectively extracting features of the first training modality information and the second training modality information to obtain a first modal feature vector and a second modal feature vector comprises:

inputting the text information into a text feature extraction model, obtaining a text global characterization vector and a text character characterization vector through the text feature extraction model, and taking the text character characterization vector as the first modal feature vector;

and inputting the picture information into a picture feature extraction model, obtaining a picture global characterization vector and a picture content characterization vector through the picture feature extraction model, and taking the picture content characterization vector as the second modal feature vector.

6. The method of claim 1, wherein the inputting the first training modality information and the second training modality information into an initial modality alignment model for modality alignment characterization to obtain a training modality alignment characterization vector comprises:

inputting the first training modality information and the second training modality information into the initial modality alignment model;

respectively extracting features of the first training modality information and the second training modality information through the initial modality alignment model to obtain a first modal feature vector and a second modal feature vector;

and fusing the first modal feature vector and the second modal feature vector through the initial modality alignment model to obtain a fused feature vector, and carrying out modality alignment characterization based on the fused feature vector to obtain the training modality alignment characterization vector.

7. The method of claim 1, wherein the calculating the degree of similarity between the first modality alignment characterization vector and the second modality alignment characterization vector to obtain a modality similarity degree comprises:

calculating the cosine distance between the first modality alignment characterization vector and the second modality alignment characterization vector to obtain the modality similarity degree.

8. The method of claim 1, wherein the calculating a probability distribution distance between the first modality alignment characterization vector and the second modality alignment characterization vector based on the modality similarity degree, and obtaining vector loss information based on the probability distribution distance, comprises:

acquiring target probability distribution conversion parameter information;

calculating the product of the target probability distribution conversion parameter information and the modality similarity degree to obtain the probability distribution distance between the first modality alignment characterization vector and the second modality alignment characterization vector;

and taking the probability distribution distance as the vector loss information.

9. The method of claim 8, wherein the acquiring target probability distribution conversion parameter information comprises:

acquiring initial probability distribution conversion parameter information, first probability distribution information corresponding to the first modality alignment characterization vector, and second probability distribution information corresponding to the second modality alignment characterization vector;

and carrying out iterative computation on the initial probability distribution conversion parameter information based on the initial probability distribution conversion parameter information, the modal similarity degree, the first probability distribution information and the second probability distribution information, and obtaining the target probability distribution conversion parameter information when a preset iterative computation completion condition is reached.

10. A method of modality alignment, the method comprising:

wherein the first target modality alignment model is obtained by: inputting first training modality information and second training modality information into an initial modality alignment model for modality alignment characterization to obtain a training modality alignment characterization vector, the training modality alignment characterization vector comprising a first modality alignment characterization vector corresponding to the first training modality information and a second modality alignment characterization vector corresponding to the second training modality information; calculating the degree of similarity between the first modality alignment characterization vector and the second modality alignment characterization vector to obtain a modality similarity degree; calculating a probability distribution distance between the first modality alignment characterization vector and the second modality alignment characterization vector based on the modality similarity degree, and obtaining vector loss information based on the probability distribution distance; and iteratively updating the initial modality alignment model based on the vector loss information until the alignment model training completion condition is reached;

11. The method according to claim 10, wherein the method further comprises:

establishing an initial multi-modal information classification recognition model based on the first target modal alignment model;

acquiring a multi-modal training sample and a corresponding classification recognition label;

inputting the multi-modal training sample into the initial multi-modal information classification recognition model, carrying out modality alignment characterization on the multi-modal training sample through the initial multi-modal information classification recognition model to obtain a training modality alignment characterization vector, and carrying out multi-modal classification recognition based on the training modality alignment characterization vector to obtain an initial classification recognition result;

performing multi-modal classification recognition loss calculation based on the initial classification recognition result and the classification recognition label to obtain multi-modal classification recognition loss information;

and reversely updating the initial multi-modal information classification recognition model based on the multi-modal classification recognition loss information and carrying out loop iteration to obtain a target multi-modal classification recognition model.

12. The method of claim 11, further comprising, after the obtaining a target multi-modal classification recognition model:

acquiring modality information to be classified and recognized;

inputting the modality information to be classified and recognized into the target multi-modal classification recognition model;

performing modality alignment characterization on the modality information to be classified and recognized through the target multi-modal classification recognition model to obtain a target modality alignment characterization vector;

and carrying out multi-modal classification recognition on the target modality alignment characterization vector through the target multi-modal classification recognition model to obtain a classification recognition result corresponding to the modality information to be classified and recognized.

13. A modality alignment model training device, the device comprising:

the initial alignment module is used for inputting the first training modality information and the second training modality information into an initial modality alignment model for modality alignment characterization to obtain a training modality alignment characterization vector, wherein the training modality alignment characterization vector comprises a first modality alignment characterization vector corresponding to the first training modality information and a second modality alignment characterization vector corresponding to the second training modality information, and the characterizations of the same instance in the first modality alignment characterization vector and the second modality alignment characterization vector have an initial corresponding relationship;

the similarity calculation module is used for calculating the degree of similarity between the first modality alignment characterization vector and the second modality alignment characterization vector to obtain a modality similarity degree;

the loss calculation module is used for calculating a probability distribution distance between the first modality alignment characterization vector and the second modality alignment characterization vector based on the modality similarity degree, and obtaining vector loss information based on the probability distribution distance;

the iteration module is used for reversely updating the initial modal alignment model based on the vector loss information to obtain an updated modal alignment model, taking the updated modal alignment model as the initial modal alignment model, and returning to execute the step of acquiring the first training modal information and the second training modal information until reaching the training completion condition of the alignment model to obtain a first target modal alignment model, wherein the first target modal alignment model is used for extracting semantic representations of different modal information, and the semantic representations of the same instance in the semantic representations of the different modal information have a corresponding relation.

14. A modality alignment device, the device comprising:

15. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 12 when the computer program is executed.

16. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 12.

17. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any one of claims 1 to 12.