CN114898192A - Model training method, prediction method, device, storage medium, and program product - Google Patents

Model training method, prediction method, device, storage medium, and program product

Info

Publication number
CN114898192A
Authority
CN
China
Prior art keywords
image
attention
prediction
loss
modal
Prior art date
Legal status
Pending
Application number
CN202210602521.9A
Other languages
Chinese (zh)
Inventor
田俊峰
蒋勇
孙增辉
Current Assignee
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202210602521.9A priority Critical patent/CN114898192A/en
Publication of CN114898192A publication Critical patent/CN114898192A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/22Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/1444Selective acquisition, locating or processing of specific regions, e.g. highlighted text, fiducial marks or predetermined fields
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/148Segmentation of character regions
    • G06V30/153Segmentation of character regions using recognition of characters or words
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/19007Matching; Proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a model training method, a prediction method, a device, a storage medium, and a program product. The method includes: determining, according to an image and a text to be processed, a visual representation feature corresponding to the image through a visual coding module and a language representation feature corresponding to the text through a language coding module; determining, according to the visual representation feature and the language representation feature, an attention value corresponding to each image block in the image and/or each character in the text, and determining an attention loss according to the attention values, where the attention value of an image block represents the contribution of the image block to text prediction and the attention value of a character represents the contribution of the character to image prediction; determining, according to the visual representation feature and the language representation feature, a prediction result corresponding to the image and/or the text through a fusion module, and determining a prediction loss according to the prediction result; and adjusting parameters of the model according to the attention loss and the prediction loss, which can improve the accuracy of the model.

Description

Model training method, prediction method, device, storage medium, and program product
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a model training method, a prediction method, a device, a storage medium, and a program product.
Background
With the continuous development of artificial intelligence technology, the range of data modalities that artificial intelligence models can process keeps expanding. In some technologies, information of multiple modalities can be processed through a multi-modal interaction model, improving the model's prediction effect.
For example, data of a plurality of modalities such as images and texts may be available for a product, and by comprehensively processing the data through a multi-modal interaction model, interaction between modalities such as images and texts can be realized, which contributes to improvement of a prediction effect on the product. However, the accuracy of the current multi-modal interaction model still needs to be improved.
Disclosure of Invention
The embodiments of the present application mainly aim to provide a model training method, a prediction method, a device, a storage medium, and a program product, so as to improve the accuracy of a multi-modal interaction model.
In a first aspect, an embodiment of the present application provides a model training method, where the model includes a visual coding module, a language coding module, and a fusion module; the method comprises the following steps:
according to the image and the text to be processed, determining the visual representation characteristics corresponding to the image through a visual coding module, and determining the language representation characteristics corresponding to the text through a language coding module;
according to the visual representation features and the language representation features, determining attention values corresponding to all image blocks in the image and/or all characters in the text, and determining attention loss according to the attention values; wherein the attention value of the image block is used for representing the contribution of the image block to the text prediction, and the attention value of the character is used for representing the contribution of the character to the image prediction;
according to the visual representation features and the language representation features, determining a prediction result corresponding to the image and/or the text through a fusion module, and determining prediction loss according to the prediction result;
adjusting parameters of the model based on the attention loss and the predicted loss.
Optionally, determining an attention value corresponding to each image block in the image and/or each character in the text according to the visual representation feature and the language representation feature, including:
calculating according to the visual representation characteristics of a plurality of image blocks in the image and the language representation characteristics of a plurality of characters in the text to obtain a cross attention matrix, wherein elements in the cross attention matrix are used for representing the contribution of the image blocks to the characters and/or the contribution of the characters to the image blocks;
for any image block, adding the contributions of the image block to each character to obtain the attention value corresponding to the image block; and/or, for any character, adding the contributions of the character to each image block to obtain the attention value corresponding to the character.
Optionally, determining the attention loss according to the attention value comprises:
determining attention loss according to the attention value and the corresponding label of each image block and/or the attention value and the corresponding label of each character;
wherein the label used in determining the loss of attention matches the label used in determining the predicted loss.
Optionally, determining attention loss according to the attention value and the corresponding label of each image block, and/or according to the attention value and the corresponding label of each character, includes:
calculating first cross entropy loss according to the attention value and the corresponding label of each image block;
calculating a second cross entropy loss according to the attention value of each character and the corresponding label;
and determining corresponding attention loss according to the first cross entropy loss and the second cross entropy loss.
Optionally, the model further comprises a visual prediction module and/or a language prediction module; determining a prediction result corresponding to the image and/or the text through a fusion module according to the visual representation feature and the language representation feature, wherein the prediction result comprises:
inputting the visual representation features and the language representation features into a fusion module to obtain multi-modal representation features;
and according to the multi-modal representation characteristics, obtaining a prediction result of each image block through a visual prediction module, and/or obtaining a prediction result of each character through a language prediction module.
Optionally, adjusting parameters of the model according to the attention loss and the predicted loss includes:
adjusting parameters of the visual coding module and the language coding module according to the attention loss;
and adjusting parameters of each module in the model according to the predicted loss.
In a second aspect, an embodiment of the present application provides a model training method, where the model includes a visual coding module, a language coding module, and a fusion module; the method comprises the following steps:
according to a commodity image and a commodity title corresponding to a commodity, determining a visual representation characteristic corresponding to the commodity image through a visual coding module, and determining a language representation characteristic corresponding to the commodity title through a language coding module;
according to the visual representation features and the language representation features, determining attention values corresponding to all image blocks in the commodity image and/or all characters in the commodity title, and determining attention loss according to the attention values; wherein the attention value of the image block is used for representing the contribution of the image block to the text prediction, and the attention value of the character is used for representing the contribution of the character to the image prediction;
according to the visual representation features and the language representation features, determining a prediction result corresponding to the commodity image and/or the commodity title through a fusion module, and determining prediction loss according to the prediction result; the prediction result corresponding to the commodity image is used for positioning a commodity main body in the commodity image, and the prediction result corresponding to the commodity title is used for positioning a central word of the commodity title;
adjusting parameters of the model based on the attention loss and the predicted loss.
In a third aspect, an embodiment of the present application provides a model training method, where the model includes a first modality encoding module, a second modality encoding module, and a fusion module; the method comprises the following steps:
determining a first modal representation characteristic corresponding to the first modal information through a first modal coding module, and determining a second modal representation characteristic corresponding to the second modal information through a second modal coding module;
according to the first modal representation features and the second modal representation features, determining attention values corresponding to each first sub-modal information in the first modal information and/or each second sub-modal information in the second modal information, and determining attention loss according to the attention values; the attention value of the first sub-modal information is used for representing the contribution of the first sub-modal information to the prediction of the second modal information, and the attention value of the second sub-modal information is used for representing the contribution of the second sub-modal information to the prediction of the first modal information;
determining a prediction result and a corresponding prediction loss of the first modality information and/or the second modality information through a fusion module according to the first modality representation characteristics and the second modality representation characteristics;
adjusting parameters of the model based on the attention loss and the predicted loss.
In a fourth aspect, an embodiment of the present application provides a prediction method, including:
acquiring first modality information and second modality information to be processed; wherein the first modality information and the second modality information include any two of the following information: image, text, audio, video, sensory information;
obtaining a prediction result corresponding to the first modal information and/or the second modal information through a multi-modal interaction model according to the first modal information and the second modal information to be processed;
the multi-modal interaction model is obtained by training based on the model training method of any one of the above aspects.
In a fifth aspect, an embodiment of the present application provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor to cause the electronic device to perform the method of any of the above aspects.
In a sixth aspect, an embodiment of the present application provides a computer-readable storage medium, where computer-executable instructions are stored, and when a processor executes the computer-executable instructions, the method according to any one of the above aspects is implemented.
In a seventh aspect, the present application provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements the method of any one of the above aspects.
The model training method, prediction method, device, storage medium, and program product provided by the embodiments of the present application determine, according to an image and a text to be processed, a visual representation feature corresponding to the image through a visual coding module and a language representation feature corresponding to the text through a language coding module; determine, according to the visual representation feature and the language representation feature, an attention value corresponding to each image block in the image and/or each character in the text, and determine an attention loss according to the attention values, where the attention value of an image block represents the contribution of the image block to text prediction and the attention value of a character represents the contribution of the character to image prediction; determine, according to the visual representation feature and the language representation feature, a prediction result corresponding to the image and/or the text through a fusion module, and determine a prediction loss according to the prediction result; and adjust the parameters of the model according to the attention loss and the prediction loss. In this way, the model is trained with both the loss corresponding to the prediction result and the intermediate attention loss, so that not only the final prediction result but also the intermediate representation features are supervised. The extracted representation features can then reflect the importance of each image block or character more accurately, the multi-modal fusion module can perform fusion better and give more attention to the important image blocks or characters, and the accuracy of the final prediction result and the overall model effect are improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application;
fig. 2 is a schematic diagram of a training principle provided in an embodiment of the present application;
fig. 3 is a schematic flowchart of a model training method according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating a principle of calculating attention values of image blocks and characters according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a model training process provided in an embodiment of the present application;
FIG. 6 is a schematic flow chart illustrating another model training method according to an embodiment of the present disclosure;
FIG. 7 is a diagram illustrating a conventional image search;
fig. 8 is a schematic diagram of an image search according to an embodiment of the present application;
FIG. 9 is a schematic flow chart diagram illustrating yet another model training method according to an embodiment of the present disclosure;
fig. 10 is a flowchart illustrating a prediction method according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
With the above figures, there are shown specific embodiments of the present application, which will be described in more detail below. These drawings and written description are not intended to limit the scope of the inventive concepts in any manner, but rather to illustrate the inventive concepts to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application.
The term "and/or" is used herein to describe an association relationship of associated objects, and specifically means that there may be three relationships, for example, a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone.
The terms referred to in this application are explained first:
the attention mechanism is a data processing method in machine learning, which is derived from human vision and can focus on key areas in a global image, and more attention is paid to the key areas, so that high-value information is quickly screened out from a large amount of information, attention to other information is reduced, even irrelevant information is filtered out, and the efficiency and accuracy of task processing are improved. One conventional attention mechanism includes query, key and value, the value is scored by the relationship between query and key, and finally the most relevant value is taken out by soft or hard.
Product-picture subject selection: a product picture may contain multiple detected objects, of which only one is the item for sale; subject selection means selecting, from the multiple objects in the product picture, the main body box of the item for sale.
Product-title center-word recognition: identifying the most central word in a product title; for example, for the title "2022 spring ladies' Hong Kong-style vintage design shirt jacket", the center word is "shirt".
The method and the device can be applied to any field requiring cross-modal interaction. For image-text interaction, the embodiments of the present application provide a multi-modal fusion scheme for images and texts, and the resulting multi-modal features can be used to process the text and the image respectively. Illustratively, the scheme can be used in an e-commerce scenario as an important module for product understanding, improving the accuracy of downstream image search and text search.
In an e-commerce scenario, understanding of the goods is the fundamental capability of the e-commerce platform. For a commodity, the corresponding data is usually composed of a title and a picture, and the title and the picture can assist in prediction mutually. In practical applications, a merchant usually stacks a large number of redundant words in a title of a product, and a product subject in a product picture is usually placed in a core position, so that a title headword can be clarified through the product subject. Meanwhile, when a plurality of articles are included in the product picture, the products on sale can be identified by the product titles. Therefore, the understanding capacity of the commodity can be effectively improved through the interaction of the pictures and the characters.
At present, text recognition and image recognition are usually handled by two separate schemes that build a model for each task. For example, when a text recognition model needs to be trained, the text and the image can be used as input and the center word as the prediction target; when an image recognition model needs to be trained, the text and the image can be used as input and the main body box as the prediction target. Although this training scheme exploits image-text interaction, the training result may not be ideal when only a single task is used as the supervision signal.
To address this, a multi-task learning scheme can be adopted, in which the input of the model is the picture and title of the product, and the outputs are the main body box of the product image and the center word of the title. Subject detection and title recognition are jointly trained with a multi-task learning method: the two tasks share the same multi-modal encoder, and a classification loss for the title and a regression loss for subject selection are constructed.
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application. As shown in fig. 1, after the embedding features of each image block in the image and each character in the text are extracted by the respective embedding feature extraction modules, the embedding features of the image blocks (visual patch embeddings) and of the characters (language token embeddings) may be input to a visual-language encoder. At input time, in addition to the embedding features, some special tokens such as REG (regression), CLS (classification) and SEP (separator) may be added so that the input adapts to the upper-layer tasks.
The visual-language encoder can realize cross-modal fusion to obtain multi-modal characteristics after visual and text interaction, and the multi-modal characteristics can be used for realizing an upper-layer image prediction task and a text prediction task.
In the image prediction task, the multi-modal feature obtained after the interaction of vision and text can be represented by the REG token. This feature is input into a coordinate regression module to obtain four values x, y, w and h, which represent the horizontal and vertical coordinates and the width and height respectively; the main body box of the product can be determined from these four values.
In the text prediction task, the multi-modal features after visual-text interaction can be input into a text prediction module to obtain a prediction result for each character. The prediction results fall into three classes, O, B and I, which represent a non-center-word character, the first character of the center word, and a subsequent character of the center word respectively; if a character is predicted as B or I, it belongs to the center word. For example, "spliced frock shorts dress women" is identified as the center word.
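For illustration only, the two prediction heads described above might be sketched as follows; the class names, the 768-dimensional input and the simple linear layers are assumptions, not the actual modules of this application.

```python
import torch
import torch.nn as nn

class CoordinateRegressionHead(nn.Module):
    """Hypothetical head mapping the REG multi-modal feature to (x, y, w, h)."""
    def __init__(self, dim=768):
        super().__init__()
        self.fc = nn.Linear(dim, 4)

    def forward(self, reg_feature):          # (batch, dim)
        return self.fc(reg_feature)          # (batch, 4): x, y, w, h of the main body box

class CenterWordTaggingHead(nn.Module):
    """Hypothetical head classifying each character as O / B / I."""
    def __init__(self, dim=768, num_tags=3):
        super().__init__()
        self.fc = nn.Linear(dim, num_tags)

    def forward(self, char_features):        # (batch, num_chars, dim)
        return self.fc(char_features)        # per-character logits over {O, B, I}
```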
In the multi-task learning scheme, the model training process can be supervised by the center words and the main body boxes of the products, which effectively improves the effect of the model; however, there is still room to improve the accuracy of the model.
In view of this, the embodiments of the present application provide a multi-modal attention loss function that can effectively use the center word and the main body box to guide cross-modal interaction. Product texts and pictures contain a large amount of redundant information, so fusing the text and the picture directly is difficult; the text center word and the picture main body box, on the other hand, provide a strong summary of the two modalities. Using the center word and the main body box as supervision signals can therefore guide the cross-modal interaction between images and texts so that the key information is fused more effectively.
Fig. 2 is a schematic diagram of a training principle provided in an embodiment of the present application. As shown in fig. 2, the input of the model may be the embedding features of the image blocks and the embedding features of the characters in the text. In the modal representation layer, the visual representation features of the image blocks and the language representation features of the characters may be extracted by the visual encoder and the language encoder respectively; in the multi-modal fusion layer, the visual representation features and the language representation features may be fused, and the resulting fused features may be input to the upper application layer to obtain a prediction result.
After the modal representation layer, an attention loss function can be designed, and the representation characteristics output by the modal representation layer can be constrained. Wherein the attention loss function may be determined by the attention values of the image blocks and the characters. In the multi-modal fusion layer and the upper application layer, the visual representation features of the image blocks will play a role in the text prediction process, while the linguistic representation features of the characters will play a role in the image prediction process. In constructing the attention loss function, the attention value of each image block may be used to measure the contribution of the image block in the text prediction, and the attention value of each character may be used to measure the contribution of the character in the image prediction.
If a character in the text belongs to the center word, or a region in the image belongs to the main body box, a higher weight may be given during multi-modal fusion, and a lower weight may be given otherwise. Through the design of the multi-modal attention loss function, the fusion process of picture information and text information becomes part of the training target, and the parameters are updated and learned under this supervision signal.
In the upper application layer, after the prediction result is obtained, the loss function corresponding to the prediction result can be determined from the prediction result and the real label. Between the modality representation layer and the multi-modal fusion layer, the output of the modality representation layer can be constrained by the multi-modal attention loss function. The attention loss function aligns the outputs of the visual coding module and the language coding module into a shared space, so that each modality knows which parts are important to itself and to the other, which eases the burden on the fusion module. Training the model with both the loss function corresponding to the prediction result and the intermediate multi-modal loss function supervises not only the final prediction result but also the intermediate single-modal representation features, so that the extracted single-modal representation features reflect the importance of the image blocks or characters more accurately, fusion in the multi-modal fusion layer is performed better, more attention is given to the important image blocks or characters, and the accuracy of the final prediction result and the model effect are improved.
Besides the e-commerce scenario described above, the scheme of the embodiments of the present application can also be applied to any scenario in which image-text interaction is possible. The image and the text involved in the interaction can be any matched image and text.
Optionally, in a social platform scenario, for each user, the image and the text of any piece of content published by the user on the social platform may be used as the matched image and text. With the scheme provided by the embodiments of the present application, a key position in the image is extracted as the prediction result of the image, and a keyword in the text is extracted as the prediction result of the text, so that the tag corresponding to the content can be determined from the key position of the image and the keyword of the text. For a target user browsing the social platform, content published by other users can then be recommended to the target user based on the target user's profile or real-time preferences.
Optionally, on a fitness platform, when a coach or user publishes a workout video, a cover image and a recommendation text can be set for the video. With the scheme provided by the embodiments of the present application, the cover image and the recommendation text are used as the matched image and text, and the training region in the cover image and the trained body part in the text are extracted. For example, if the main trained part is the waist, "waist" can be used as the keyword of the text and the waist region can be highlighted in the cover image, so that other users can determine the main training target of the video more intuitively.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The features of the embodiments and examples described below may be combined with each other without conflict between the embodiments. In addition, the sequence of steps in each method embodiment described below is only an example and is not strictly limited.
Fig. 3 is a schematic flowchart of a model training method according to an embodiment of the present application. The execution subject of the method in this embodiment may be applied to any device having a data processing function, such as a terminal device or a server. The method can be used to train models, in particular multi-modal interaction models, which can include a visual coding module, a language coding module, and a fusion module. As shown in fig. 3, the method may include:
step 301, according to an image and a text to be processed, determining a visual representation characteristic corresponding to the image through a visual coding module, and determining a language representation characteristic corresponding to the text through a language coding module.
The image and the text to be processed may refer to matched images and texts, for example, the image and the text may correspond to the same commodity, or may be an image and a text in a piece of content published by a user on a social platform.
The image and the text may be passed through a visual coding module and a language coding module, respectively, to extract corresponding visual representation features and language representation features.
In one example, the image may be directly input to the visual coding module to obtain the visual representation characteristics, and the text may be input to the language coding module to obtain the language representation characteristics.
In another example, the visual coding module and the language coding module may each be preceded by a corresponding embedded feature extraction module. The image is input into a visual embedded feature extraction module to obtain the embedding feature corresponding to the image, and the embedding feature corresponding to the image is then input into the visual coding module to obtain the corresponding visual representation feature. The text is input into a language embedded feature extraction module to obtain the embedding feature corresponding to the text, and the embedding feature corresponding to the text is then input into the language coding module to obtain the corresponding language representation feature.
Optionally, the image and the text may be divided into multiple parts, for example, the image may be divided into a plurality of image blocks, and the text may be divided into a plurality of characters. For each image block or each character, the corresponding visual representation feature or language representation feature can be obtained through the visual coding module or the language coding module.
Illustratively, dividing the image into ten equal parts along both the height and width directions yields 10 × 10 image blocks, and these 100 image blocks may be input to the visual coding module to obtain the corresponding 100 visual representation features. If the text contains 20 characters, the 20 characters can be input into the language coding module to obtain the corresponding 20 language representation features.
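A minimal sketch of the block-splitting step described above, assuming the image is a (C, H, W) tensor whose height and width are divisible by the grid size; the function name and the use of PyTorch are illustrative assumptions.

```python
import torch

def split_into_patches(image, grid=10):
    """Split an image tensor (C, H, W) into grid * grid equal blocks."""
    c, h, w = image.shape
    ph, pw = h // grid, w // grid
    patches = image.unfold(1, ph, ph).unfold(2, pw, pw)       # (C, grid, grid, ph, pw)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(grid * grid, -1)
    return patches                                            # (grid*grid, C*ph*pw), one row per image block
```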
Alternatively, the visual coding module and the language coding module may be implemented by any encoder, such as a Transformer, a ResNet, or another CNN (Convolutional Neural Network) model.
Step 302, according to the visual representation features and the language representation features, determining attention values corresponding to each image block in the image and/or each character in the text, and determining attention loss according to the attention values.
Wherein the attention value of the image block is used to represent the contribution of the image block to the text prediction, and the attention value of the character is used to represent the contribution of the character to the image prediction.
The image or text can be predicted through interaction of the image and the text. When the text is predicted, the image plays a certain role, and when the image is predicted, the text also plays a certain role. The present embodiment therefore introduces an attention value to measure the contribution of each image block and each character, which may refer to the importance of the image block or character in the subsequent prediction process.
Specifically, the contribution of an image block may be used to indicate the importance of the image block in the text prediction process. If an image block belongs to a positive result in the image prediction (for example, it lies inside the main body box, in which case its label is generally 1), it belongs to an important area, its importance in text prediction is higher than that of image blocks that are not positive results, and more attention should be given to it, so its attention value should be higher.
Similarly, the contribution of a character may be used to indicate the importance of the character in the image prediction process. If a character belongs to a positive result in the text prediction (for example, it belongs to the center word, in which case its label is generally 1), it is an important character, its importance in image prediction is higher than that of characters that are not positive results, and more attention should be given to it, so its attention value should be higher.
In the model training process, the attention values of the image blocks and the characters can be determined through the visual representation features and the language representation features of the image blocks and the characters, so that the representation features output by the encoding module can be constrained, and the important image blocks or the representation features corresponding to the characters are higher in importance degree when the image blocks or the characters are input into the representation features of the fusion module.
After determining the contribution of the image block and the character by the visual representation feature and the linguistic representation feature, the attention loss value may be determined according to the contribution, thereby enabling the contribution of the image block or the character to approach its own importance. Alternatively, the degree of importance may be determined by the label of the image block or the character.
And step 303, determining a prediction result corresponding to the image and/or the text through a fusion module according to the visual representation feature and the language representation feature, and determining a prediction loss according to the prediction result.
The fusion module can fuse the visual representation features and the language representation features to obtain multi-modal features after visual-language interaction, and the multi-modal features can be used for predicting images or texts to obtain a prediction result.
In one example, the visual representation feature and the language representation feature may be input to a fusion module to directly obtain a prediction result of an image and a prediction result of a text.
In another example, the visual representation feature and the language representation feature may be input to a fusion module, and the obtained fused features may be input to a prediction module to obtain a prediction result.
Optionally, the prediction result may be set according to actual needs. For example, the prediction result of the image may be whether each image block belongs to the main body box, so as to determine the main body box in the image, and the prediction result of the text may be whether each character belongs to the center word, so as to determine the center word in the text.
After the prediction result is obtained, the corresponding prediction loss can be obtained according to the labels of the image and the text. The label of the image may be used to indicate whether each image block belongs to the main body frame, if any image block belongs to the main body frame, the corresponding label is 1, otherwise, the corresponding label is 0; the labels of the text are similar and can be used to indicate whether each character belongs to the central word, if any character belongs to the central word, the corresponding label can be 1, otherwise, the corresponding label is 0.
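For illustration, the 0/1 labels described above might be constructed as follows; the function name and the index-set inputs are hypothetical.

```python
import torch

def make_labels(num_blocks, body_box_blocks, num_chars, center_word_chars):
    """Build 0/1 labels: 1 for image blocks inside the main body box and for
    characters belonging to the center word, 0 otherwise.
    `body_box_blocks` / `center_word_chars` are index sets (illustrative inputs)."""
    image_labels = torch.tensor([1 if i in body_box_blocks else 0 for i in range(num_blocks)])
    text_labels = torch.tensor([1 if j in center_word_chars else 0 for j in range(num_chars)])
    return image_labels, text_labels
```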
And step 304, adjusting parameters of the model according to the attention loss and the predicted loss.
The prediction loss may be used to represent a difference between a prediction result and a label, and the attention loss may be used to represent a difference between an attention value corresponding to a representation feature of a character or an image block output by the encoding module and an importance degree of the character or the image block.
The parameters of the model can be adjusted by attention loss and predictive loss. Optionally, the model parameters may be adjusted by using a gradient descent method, and after the adjustment, steps 301 to 304 may be re-executed until the iteration number or the loss value meets a preset condition, and then the model training is completed.
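As an illustrative sketch of the parameter-adjustment step, the loop below combines the attention loss and the prediction loss and applies gradient descent; the model interface, data format, Adam optimizer, loss weighting and stopping condition are assumptions rather than requirements of this application.

```python
import torch

def train(model, train_loader, pred_loss_fn, attn_loss_fn,
          lr=1e-4, max_steps=10_000, lambda_attn=1.0):
    """Sketch of the adjustment loop: both losses drive gradient descent."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for step, (image, text, image_labels, text_labels) in enumerate(train_loader):
        attn_values, predictions = model(image, text)                 # steps 301-303
        loss = (pred_loss_fn(predictions, image_labels, text_labels)
                + lambda_attn * attn_loss_fn(attn_values, image_labels, text_labels))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                              # step 304: adjust parameters
        if step + 1 >= max_steps:                                     # preset stopping condition
            break
```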
It should be noted that the "and/or" in this embodiment covers three cases: processing only the image, processing only the text, and processing the image and the text simultaneously, which are described below in turn.
In an optional implementation manner, the model may be used to predict an image, and in the training process, a contribution of each character in the text in image prediction may be constructed, and a corresponding attention loss may be obtained according to the contribution, and in addition, a prediction result corresponding to the image may be obtained according to the visual representation feature and the language representation feature, and a prediction loss may be determined according to the prediction result, and a model parameter may be adjusted according to the attention loss and the prediction loss.
In another alternative implementation, the model may be used to predict text, and the attention loss and the prediction loss may be used to service text prediction during training. The contribution of each image block in the image in text prediction can be constructed, the corresponding attention loss can be obtained according to the contribution, in addition, the prediction result corresponding to the text can be obtained according to the visual representation characteristic and the language representation characteristic, and the prediction loss can be determined according to the prediction result.
In yet another alternative implementation, the model may be used to predict both a text and an image. In the training process, the contribution of each image block in the image to text prediction and the contribution of each character in the text to image prediction may be constructed, and the corresponding attention loss may be determined from these contributions. In addition, according to the visual representation feature and the language representation feature, a prediction result corresponding to the image and a prediction result corresponding to the text may be obtained, and the prediction loss may be determined from the prediction results, so that the prediction results and the attention values of the text and the image are supervised at the same time.
In summary, the model training method provided in this embodiment can train the model through the loss corresponding to the prediction result and the intermediate attention loss, and not only can monitor the final prediction result, but also can monitor the intermediate representation features, so that the extracted representation features can more accurately reflect the importance degree of the image blocks or characters, and thus the fusion is better performed in the multi-modal fusion module, more attention is given to the important image blocks or characters, the accuracy of the final prediction result is improved, and the model effect is improved.
In one or more embodiments of the present application, optionally, determining, according to the visual representation feature and the language representation feature, the attention value corresponding to each image block in the image and/or each character in the text may include: calculating, according to the visual representation features of a plurality of image blocks in the image and the language representation features of a plurality of characters in the text, a cross attention matrix, where the elements in the cross attention matrix are used to represent the contribution of an image block to a character and/or the contribution of a character to an image block; and, for any image block, adding the contributions of the image block to each character to obtain the attention value corresponding to the image block, and/or, for any character, adding the contributions of the character to each image block to obtain the attention value corresponding to the character.
Optionally, for each image block, the contribution of the image block to each character in the text may be calculated, that is, the importance degree of the image block when calculating the prediction result corresponding to each character, and then the attention value of the image block in the whole text prediction process may be calculated according to the sum of the contributions of the image block to all the characters. Similarly, for each character, the contribution of the character to each image block in the image, that is, the importance degree of the character in calculating the corresponding prediction result of each image block, may be calculated, and then the attention value of the character in the whole image prediction process may be calculated according to the sum of the contributions of the character to all the image blocks.
Optionally, a dot-product calculation may be performed on the visual representation features of the plurality of image blocks and the language representation features of the plurality of characters output by the encoding modules, so as to obtain a cross attention matrix, where each element in the matrix may be used to represent the contribution of the corresponding image block to a character, or the contribution of a character to an image block.
Fig. 4 is a schematic diagram illustrating a principle of calculating attention values of image blocks and characters according to an embodiment of the present application. As shown in fig. 4, assuming that the image includes M image blocks and the text includes N characters, the visual coding module may output visual representation features corresponding to the M image blocks, and the language coding module may output language representation features corresponding to the N characters, and perform dot product operation on the M visual representation features and the N language representation features to obtain an M × N cross attention matrix. Wherein each row in the matrix corresponds to an image block and each column corresponds to a character, more specifically, an element in the ith row and the jth column in the matrix may represent a contribution of the ith image block to the jth character, where the contribution of the image block to the character and the contribution of the character to the image block may be considered to be the same, that is, the contribution of the ith image block to the jth character may be equal to the contribution of the jth character to the ith image block.
Alternatively, each small square in the visual representation features of fig. 4 may represent the visual representation feature corresponding to one image block; the visual representation feature of each image block may be a D-dimensional vector, so the visual representation features of the entire image may be M × D-dimensional. Similarly, each small square in the language representation features may represent the language representation feature corresponding to one character; the language representation feature of each character may also be a D-dimensional vector, so the language representation features of the whole text may be N × D-dimensional.
The visual representation features of the image blocks may be operated on with the linguistic representation features of the characters to obtain elements in the cross-attention matrix. Illustratively, the visual representation feature of the ith image block and the language representation feature of the jth character are operated to obtain the elements of the ith row and the jth column in the matrix.
After the cross attention matrix is obtained, for each row in the matrix, the elements of the row may be added to obtain the contribution of the image block corresponding to the row to the whole text, that is, the attention value of the image block; for each column in the matrix, the elements of the column may be added to obtain the contribution of the character corresponding to the column to the whole image, i.e. the attention value of the character.
Optionally, before "for any image block, add the contributions of the image block to each character to obtain the attention value corresponding to the image block, and/or add the contributions of the character to each image block to obtain the attention value corresponding to the character", a normalization operation may be performed on the matrix, and after the normalization operation, the addition operation is performed to obtain the attention value, or after the addition, the normalization operation is performed to obtain the attention value after the normalization operation. The normalization operation can limit the finally obtained attention value to a preset range, and the accuracy of attention loss is improved.
Alternatively, the elements of each row or each column in the matrix may be added by a pooling operation to obtain the corresponding attention value.
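The computation described above can be sketched as follows; this is a minimal illustration in which the row and column sums of the dot-product matrix give the attention values of the image blocks and characters, and the placement of the normalization (here a softmax after the sum) is an assumption.

```python
import torch
import torch.nn.functional as F

def attention_values(visual_feats, language_feats, normalize=True):
    """visual_feats: (M, D) per-image-block features; language_feats: (N, D) per-character features.

    Returns the attention value of each image block (its summed contribution to all
    characters) and of each character (its summed contribution to all image blocks).
    """
    cross_attn = visual_feats @ language_feats.t()   # (M, N) cross attention matrix
    block_attn = cross_attn.sum(dim=1)               # (M,) row sums: contribution of each block to the text
    char_attn = cross_attn.sum(dim=0)                # (N,) column sums: contribution of each character to the image
    if normalize:                                    # normalization may be applied before or after the sum
        block_attn = F.softmax(block_attn, dim=0)
        char_attn = F.softmax(char_attn, dim=0)
    return block_attn, char_attn
```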
In alternative implementations, the cross-attention matrix may also be derived by other operations than dot product, such as linear transformation, etc.
The 'and/or' defined in the embodiment includes multiple execution modes, when the model is used for performing image prediction, the attention value of characters in a text to the image can be calculated and attention loss can be constructed, when the model is used for performing text prediction, the attention value of image blocks in the image to the text can be calculated and attention loss can be constructed, and when the model can be simultaneously used for image prediction and text prediction, a matrix obtained through calculation can simultaneously represent the contribution of the characters to the image blocks and the contribution of the image blocks to the characters, so that the attention value of the image blocks and the attention value of the characters can be determined.
In conclusion, the cross attention matrix is obtained by performing operations such as dot product on the representation features of the image blocks and the characters, and then the attention value is determined according to the matrix, so that mutual contribution can be determined directly through interaction of the visual representation features and the language representation features, and the efficiency and the accuracy are effectively improved.
In one or more embodiments of the present application, optionally, the model further comprises a visual prediction module and/or a language prediction module; determining a prediction result corresponding to the image and/or the text through a fusion module according to the visual representation feature and the language representation feature, wherein the prediction result comprises: inputting the visual representation features and the language representation features into a fusion module to obtain multi-modal representation features; and according to the multi-modal representation characteristics, obtaining a prediction result of each image block through a visual prediction module, and/or obtaining a prediction result of each character through a language prediction module.
In each embodiment of the application, the vision corresponds to an image, the language corresponds to a text, the module related to the vision can be used for processing the image to obtain vision-related features, and the module related to the language can be used for processing the text to obtain language-related features. The prediction operations for images and text may be performed by a visual prediction module and a language prediction module, respectively.
Specifically, if the model is used to implement two functions, namely text prediction and image prediction, a visual prediction module and a language prediction module may be included, and if only one of the functions is implemented, a corresponding prediction module may be included.
The information output by the fusion module can be multi-modal representation features, and is obtained by interacting the visual representation features and the language representation features. The features output to the visual prediction module and the features output to the language prediction module may be the same or different, and this embodiment is not limited. Optionally, the fusion module may be implemented based on an attention mechanism, so as to improve the effect of feature fusion.
In summary, in this embodiment the visual and language features are fused by the fusion module to obtain the multi-modal features, and the corresponding result is then determined by the prediction module of the service layer. The fusion module only needs to focus on the effect of fusing vision and language, while the prediction modules can be configured according to the actual service requirements. This improves the feature fusion effect and hence the model accuracy, conveniently decouples feature fusion from service prediction, and meets the usage requirements of different scenarios.
In one or more embodiments of the present application, optionally, determining the attention loss according to the attention value may include: determining attention loss according to the attention value and the corresponding label of each image block and/or the attention value and the corresponding label of each character; wherein the label used in determining the loss of attention matches the label used in determining the predicted loss.
Optionally, the label of the image and/or the text is used to indicate whether each image block in the image and/or each character in the text meets a preset requirement; the attention loss is used to represent the difference between the attention value corresponding to the image block and/or character and the label corresponding to the image block and/or character.
For example, a label of an image may be used to indicate whether each image block in the image belongs to the main body frame: the label element is 1 if the image block belongs to the main body frame, and 0 otherwise. The label of the image can therefore be regarded as an M-dimensional vector, where M is the number of image blocks; similarly, the label of the text can be regarded as an N-dimensional vector, where N is the number of characters. The label here specifically refers to the label corresponding to the prediction result.
Fig. 5 is a schematic diagram of a model training process according to an embodiment of the present application. As shown in fig. 5, the embedding features of the image and the embedding features of the text are respectively input to corresponding encoding modules to obtain visual representation features and language representation features, the visual representation features and the language representation features are jointly calculated to obtain an attention value of the image and an attention value of the text, the visual representation features and the language representation features are input to a fusion module, and then a prediction result of the image and a prediction result of the text are obtained through a visual prediction module and a language prediction module.
For example, in the input layer, for the image, a patch-based method may be adopted to divide the image into M image blocks, and a linear layer is used to obtain the embedded features of the image blocks; for the text, the embedded features of the N characters in the text can be obtained, and the two parts are spliced together with special tokens such as REG and CLS to form the input. The embedded feature of each image block or character may be a 256-dimensional vector.
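A minimal input-layer sketch is given below; the patch size, vocabulary size, input resolution and the zero-initialized REG/CLS tokens are illustrative assumptions rather than details of the embodiment.

```python
import torch
import torch.nn as nn

# assumed sizes for illustration
patch_size, embed_dim, vocab_size = 32, 256, 21128

image = torch.randn(3, 224, 224)                 # one commodity image
char_ids = torch.randint(0, vocab_size, (12,))   # the N characters of the title

# split the image into M patches and map each patch to a 256-dim embedded feature
patches = image.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * patch_size * patch_size)
patch_embed = nn.Linear(3 * patch_size * patch_size, embed_dim)
visual_tokens = patch_embed(patches)             # (M, 256), here M = 49

# look up the embedded feature of each character
char_embed = nn.Embedding(vocab_size, embed_dim)
text_tokens = char_embed(char_ids)               # (N, 256)

# splice the two parts together with special tokens such as REG and CLS
reg_token = torch.zeros(1, embed_dim)
cls_token = torch.zeros(1, embed_dim)
model_input = torch.cat([reg_token, visual_tokens, cls_token, text_tokens], dim=0)
```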
In the encoding layer, the visual coding module and the language coding module obtain the visual representation features of the M image blocks and the language representation features of the N characters, and M + N multi-modal features can then be obtained through the fusion module. Each visual representation feature, language representation feature or multi-modal feature may be a 768-dimensional vector.
Optionally, the visual coding module, the language coding module and the fusion module may be implemented with a self-attention mechanism. Illustratively, in the visual coding module, for each image block the influence of the remaining M-1 image blocks on it may be calculated, and the output features are determined according to the attention scores; similarly, in the language coding module, for each character the influence of the remaining N-1 characters on it can be calculated; and in the fusion module, for each image block or character, the influence of the remaining M+N-1 image blocks and characters on it is calculated. Through the self-attention mechanism, the final output features can be determined by taking the interaction between the characters and the image blocks into account, improving the accuracy of the feature representation.
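As one possible realization of the three modules with self-attention stacks, the sketch below uses standard Transformer encoders; the layer counts, head counts and input sizes are assumptions for illustration only.

```python
import torch
import torch.nn as nn

def make_encoder(dim: int = 768, heads: int = 12, layers: int = 2) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=layers)

visual_encoder = make_encoder()     # each image block attends to the other M-1 blocks
language_encoder = make_encoder()   # each character attends to the other N-1 characters
fusion_module = make_encoder()      # each token attends to the other M+N-1 tokens

vis_in = torch.randn(1, 49, 768)    # M image-block features (batch of 1)
lang_in = torch.randn(1, 12, 768)   # N character features (batch of 1)

vis_repr = visual_encoder(vis_in)                                    # visual representation features
lang_repr = language_encoder(lang_in)                                # language representation features
multimodal = fusion_module(torch.cat([vis_repr, lang_repr], dim=1))  # (1, M+N, 768) multi-modal features
```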
In the prediction layer, prediction can be performed in two parts: a regression loss is adopted as the optimization target for commodity main body selection, and a classification loss is adopted as the optimization target for central word recognition. Specifically, the multi-modal features corresponding to the M image blocks may be input to the visual prediction module to obtain the prediction result of the image, and the multi-modal features corresponding to the N characters may be input to the language prediction module to obtain the prediction result of the text.
Optionally, the prediction result of the image, the attention value of the image and the label of the image may each be regarded as M-dimensional vectors, and the prediction result of the text, the attention value of the text and the label of the text may each be regarded as N-dimensional vectors. The prediction loss of the image can be determined from the prediction result of the image and the label of the image, and the attention loss of the image from the attention value of the image and the label of the image; likewise, the prediction loss of the text can be determined from the prediction result of the text and the label of the text, and the attention loss of the text from the attention value of the text and the label of the text. In this way, the labels supervise not only the prediction results but also the attention values.
Optionally, the label used in determining the attention loss matches the label used in determining the predicted loss, i.e. the prediction result label described above.
In one example, the label used in determining the attention loss may be the same as the label used in determining the prediction loss. For any image block, if the image block belongs to the main body frame, the label corresponding to its prediction result and the label corresponding to its attention value may both be 1, so that both the prediction result and the attention value of the image block approach 1.
In another example, the label used in determining the attention loss and the label used in determining the prediction loss may have a proportional relationship; for example, the label corresponding to the attention value may be equal to the label corresponding to the prediction result divided by K, where K is the number of image blocks or characters with a positive prediction label. For example, if 4 image blocks in an image belong to the main body frame, the labels corresponding to the prediction results of these 4 image blocks may all be 1, while the labels corresponding to their attention values may be 1/4, so that their attention values approach 1/4.
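The proportional labelling rule can be written out directly; the snippet below simply reproduces the 4-block example above with assumed label values.

```python
import torch

# 4 of the 6 image blocks belong to the main body frame (example values)
pred_labels = torch.tensor([0., 1., 1., 1., 1., 0.])   # labels used for the prediction loss
K = pred_labels.sum()                                   # K = 4 positive image blocks
attn_labels = pred_labels / K                           # labels used for the attention loss: positives become 1/4
```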
In summary, by setting the label used when determining the attention loss to match the label used when determining the prediction loss, the label corresponding to the prediction result can supervise both the prediction loss and the attention loss and guide the cross-modal interaction between the image and the text, so that key information is fused more effectively. The attention values of the image blocks and characters whose prediction result is positive become larger, giving them higher weight in the prediction process, so the training objectives of the different stages of the model are consistent and the model performance is improved; in addition, no extra features need to be introduced, which simplifies the algorithm and reduces the computation cost.
In other alternative embodiments, the model may be used for classification prediction, for example predicting the category of a commodity from its image and text, where the categories may include clothing, household appliances, food, cosmetics and the like. In this case, the label used when determining the prediction result corresponding to the image and the text can represent the category of the commodity; on this basis, the label used when determining the attention loss can differ from the label used when determining the prediction result and can instead represent the key positions in the text or the image, which may be determined by manual annotation or in other manners.
In one or more embodiments of the present application, optionally, determining the attention loss according to the attention value and the corresponding label of each image block, and/or the attention value and the corresponding label of each character, may include: calculating first cross entropy loss according to the attention value and the corresponding label of each image block; calculating a second cross entropy loss according to the attention value of each character and the corresponding label; and determining corresponding attention loss according to the first cross entropy loss and the second cross entropy loss.
An example of the calculation formula at each stage of the model training process in this embodiment is given below.
Alternatively, the multi-modal features may be calculated by the following formulas (1) to (4), and the prediction result may be determined according to the multi-modal features.
X = [p_REG ; p_1^v , …, p_{N_v}^v ; p_CLS ; p_1^l , …, p_{N_l}^l]  (1)

P′_v = Visual-Transformer(P_v)  (2)

P′_l = Linguistic-Transformer(P_l)  (3)

P″_v , P″_l = Visual-Linguistic-Transformer(P′_v , P′_l)  (4)

Wherein X is the information input into the coding modules, v denotes vision and l denotes language; the visual tokens P_v = (p_1^v , …, p_{N_v}^v) are the embedded features corresponding to the image, the linguistic tokens P_l = (p_1^l , …, p_{N_l}^l) are the embedded features corresponding to the text, p_REG is the embedded feature of the REG token, p_CLS is the embedded feature of the CLS token, N_v is the number of image blocks in the image, and N_l is the number of characters in the text. P′_v denotes the visual representation features, and Visual-Transformer is the visual coding module; P′_l denotes the language representation features, and Linguistic-Transformer is the language coding module; P″_v and P″_l are the multi-modal features, and Visual-Linguistic-Transformer is the fusion module for enabling the visual-linguistic interaction. P″_v and P″_l are input into the visual prediction module and the language prediction module, respectively, to obtain the corresponding prediction results.
Further, the cross-attention matrix can be calculated by equation (5):
A = P′_v (P′_l)^T ∈ R^{N_v × N_l}  (5)
calculating the attention value corresponding to the image by formula (6):
a_i^v = Σ_{j=1}^{N_l} A_{ij} , i = 1, …, N_v  (6)
calculating the attention value corresponding to the text by formula (7):
a_j^l = Σ_{i=1}^{N_v} A_{ij} , j = 1, …, N_l  (7)
the attention loss is calculated by equation (8):
L_attention = Cross-Entropy(a^v , y_v) + Cross-Entropy(a^l , y_l)  (8)
wherein Cross-Entropy represents the cross entropy loss, and y_v and y_l are the label corresponding to the image and the label corresponding to the text, respectively.
In summary, through the above formulas, the cross entropy losses corresponding to the image and the text can be calculated and added to obtain the attention loss, and the model parameters are then adjusted according to the attention loss, so that the attention values of the image blocks approach the labels of the image blocks and the attention values of the characters approach the labels of the characters. This improves the accuracy of the representation features of the image and the text and, in turn, the accuracy of the model prediction.
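One possible implementation of formulas (5) to (8) is sketched below; applying the cross entropy as a log-softmax over positions is an assumption, since the embodiment does not fix how the attention-value vectors are normalized.

```python
import torch
import torch.nn.functional as F

def attention_loss(vis_repr, lang_repr, y_v, y_l):
    """Sketch of formulas (5)-(8).

    vis_repr:  (N_v, d) visual representation features P'_v
    lang_repr: (N_l, d) language representation features P'_l
    y_v, y_l:  label vectors of the image blocks and of the characters
    """
    A = vis_repr @ lang_repr.t()          # (5) cross-attention matrix, shape (N_v, N_l)
    a_v = A.sum(dim=1)                    # (6) attention value of each image block
    a_l = A.sum(dim=0)                    # (7) attention value of each character
    # (8) sum of the two cross-entropy terms; the log-softmax over positions is
    # one assumed way of applying a cross entropy to these vectors
    loss_v = -(y_v * F.log_softmax(a_v, dim=0)).sum()
    loss_l = -(y_l * F.log_softmax(a_l, dim=0)).sum()
    return loss_v + loss_l

# usage with assumed sizes and random labels
loss = attention_loss(torch.randn(16, 768), torch.randn(12, 768),
                      torch.rand(16), torch.rand(12))
```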
In one or more embodiments of the present application, optionally, adjusting parameters of the model according to the attention loss and the predicted loss may include: adjusting parameters of the visual coding module and the language coding module according to the attention loss; and adjusting parameters of each module in the model according to the predicted loss.
Specifically, the attention loss is used for measuring the performance of the representation features output by the coding module, so that only the visual coding module and the language coding module can be adjusted according to the attention loss, and other modules in the model are not adjusted. And the prediction loss is used for measuring the performance of a prediction result output by the model, so that the parameters of all modules in the model can be adjusted according to the prediction loss.
Alternatively, each module may be implemented by a neural network or the like.
It should be noted that, the order of calculating each loss value and adjusting the parameter is not limited in the embodiments of the present application, for example, the attention loss and the predicted loss may be calculated simultaneously, or the attention loss may be calculated first and then the predicted loss is calculated, or the predicted loss is calculated first and then the attention loss is calculated, and the order of adjusting the parameter of each module is not limited.
In summary, the attention loss constrains only the visual coding module and the language coding module and has no influence on the subsequent modules, so the attention loss can focus on adjusting the coding modules, improving the accuracy of the modal representations output by the coding modules and the overall training efficiency of the model.
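A minimal, self-contained sketch of the update step follows; all module shapes and both loss expressions are placeholders standing in for the modules and losses described above, and only the gradient flow (the attention loss reaching the two coding modules only) reflects the description.

```python
import torch
import torch.nn as nn

# placeholder modules; sizes are assumptions
visual_enc, language_enc = nn.Linear(256, 768), nn.Linear(256, 768)
fusion = nn.Linear(768, 768)
visual_head, language_head = nn.Linear(768, 1), nn.Linear(768, 1)

modules = [visual_enc, language_enc, fusion, visual_head, language_head]
optimizer = torch.optim.AdamW([p for m in modules for p in m.parameters()], lr=1e-4)

vis = visual_enc(torch.randn(49, 256))      # visual representation features
lang = language_enc(torch.randn(12, 256))   # language representation features

# the attention loss depends only on the encoder outputs, so backpropagating
# it adjusts only the visual and language coding modules
attn_loss = (vis @ lang.t()).mean()                       # placeholder for formula (8)
# the prediction loss depends on the whole pipeline, so it adjusts every module
pred_loss = visual_head(fusion(vis)).mean() + language_head(fusion(lang)).mean()

optimizer.zero_grad()
(attn_loss + pred_loss).backward()          # the order of computing the two losses is not restricted
optimizer.step()
```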
Fig. 6 is a schematic flowchart of another model training method according to an embodiment of the present application. The embodiment may be applied to the e-commerce field, and as shown in fig. 6, the method may include:
step 601, according to the commodity image and the commodity title corresponding to the commodity, determining the visual representation feature corresponding to the commodity image through a visual coding module, and determining the language representation feature corresponding to the commodity title through a language coding module.
Optionally, the e-commerce platform has a large number of commodities, and each commodity has a corresponding commodity detail page that provides an image corresponding to the commodity, such as the commodity main image, and a title corresponding to the commodity. For any commodity, the method provided by this embodiment can be adopted to extract the corresponding main body frame of the image and the central word of the title.
Specifically, in this step, the image and the title corresponding to the product may be input to the visual coding module and the language coding module, respectively, to obtain the visual representation feature and the language representation feature.
Step 602, according to the visual representation features and the language representation features, determining attention values corresponding to each image block in the commodity image and/or each character in the commodity title, and determining attention loss according to the attention values.
Wherein the attention value of the image block is used to represent the contribution of the image block to the text prediction, and the attention value of the character is used to represent the contribution of the character to the image prediction.
Step 603, determining a prediction result corresponding to the commodity image and/or the commodity title through a fusion module according to the visual representation feature and the language representation feature, and determining a prediction loss according to the prediction result.
The prediction result corresponding to the commodity image is used for positioning a commodity main body in the commodity image, and the prediction result corresponding to the commodity title is used for positioning a central word of the commodity title.
Optionally, the commodity main body may be the main body frame in the commodity image, and the main body frame may be the frame in which the target commodity is located. The image corresponding to the commodity may contain articles other than the commodity, and the target commodity in the image can be located by the method in this embodiment.
And step 604, adjusting parameters of the model according to the attention loss and the predicted loss.
The implementation process and principle of the method in this embodiment can be referred to the foregoing embodiments, and are not described herein again.
By the above method, the central word corresponding to each commodity on the e-commerce platform and the main body frame of the commodity image can be obtained.
Optionally, a keyword or an image to be queried input by the user may be acquired, and according to the keyword or the image, a commodity matched with the keyword or the image to be queried is selected from a plurality of commodities of the e-commerce platform and displayed. The matched commodities can include commodities with the same central word as the keyword or commodities with the main body frame matched with the image.
For example, if the keyword input by the user is "shorts", the commodities whose central word is "shorts" can be found and displayed to the user. The commodity the user expects to find can be determined more accurately through matching of the central word or the main body frame, and the method in this embodiment can help improve the accuracy of determining the central word or the main body frame.
Fig. 7 is a schematic diagram of a conventional image search. As shown in fig. 7, the image to be queried input by the user shows bottled drinking water. If the image to be queried is matched directly against the commodity images, some commodities that are not bottled water, but whose images contain bottled water, may also be presented to the user. For example, a beer merchant may place a standard 500ml bottle of water next to the beer in the commodity image to indicate the volume of the beer; in this case the user wants to buy bottled water but is recommended beer, so the recommendation accuracy is poor.
Fig. 8 is a schematic diagram of an image search according to an embodiment of the present application. As shown in fig. 8, after the method provided by the embodiment of the present application is used, the main body frame of each commodity image can be located. For example, for the commodity beer, combining the title of the commodity helps determine that the main body in the commodity image is beer rather than bottled water, so that when the user searches for bottled water by image, the beer is not shown to the user. Conversely, for bottled water sold by some merchants, the corresponding image may also contain decorations such as bouquets; with the help of the commodity title, the main body in the image can be determined to be bottled water, so the commodity meets the user's demand, can be pushed to the user, and the user experience is improved.
This embodiment can provide core multi-modal capabilities for e-commerce, and has other applications besides image search and text search.
In one example, after the main body frame of the commodity image is determined, the main body frame can be input into an image recognition model, so that the core product in the main body frame can be recognized more accurately.
In another example, the category of the commodity may be predicted according to the main body frame of the commodity image or the central word of the title, so as to classify the commodity; the categories may include clothing, food, home appliances, and the like.
In summary, the model training method provided in this embodiment trains the model with both the loss corresponding to the prediction result and the intermediate attention loss, so that not only the final prediction result but also the intermediate representation features are supervised. The extracted representation features therefore reflect the importance of the image blocks or characters more accurately, fusion in the multi-modal fusion module is performed better, more attention is given to the important image blocks or characters, and the accuracy of the final prediction result is improved; the central word of the commodity title and the main body frame of the commodity image can be determined more quickly and accurately, improving the accuracy of text search and image search as well as the user experience.
Fig. 9 is a schematic flowchart of another model training method according to an embodiment of the present application. The model comprises a first modal coding module, a second modal coding module and a fusion module. As shown in fig. 9, the method includes:
step 901, determining a first modality representation characteristic corresponding to the first modality information through a first modality encoding module, and determining a second modality representation characteristic corresponding to the second modality information through a second modality encoding module.
The modality may refer to a form in which data exists, for example, text, image, audio, video, sensing data, and the like belong to different modalities. The first modality information and the second modality information may be first modality information and second modality information corresponding to a target object, and the target object may be a commodity or the like.
Optionally, the first modality information and the second modality information include any two of the following information: image, text, audio, video, sensory information. For example, the first modality information may include any one of image, text, audio, video, and sensing information, and the second modality information may include any one of audio, video, and sensing information and is different from the first modality information.
And 902, determining an attention value corresponding to each piece of first sub-modal information in the first modal information and/or each piece of second sub-modal information in the second modal information according to the first modal representation feature and the second modal representation feature, and determining attention loss according to the attention value.
The attention value of the first sub-modal information is used for representing the contribution of the first sub-modal information to the prediction of the second modal information, and the attention value of the second sub-modal information is used for representing the contribution of the second sub-modal information to the prediction of the first modal information.
Alternatively, the first modality information may be divided into a plurality of first sub-modality information, and the second modality information may be divided into a plurality of second sub-modality information.
Illustratively, the first modality information may be a video, and the first sub-modality information may be each frame image in the video, or a video clip. Alternatively, the first modality information may be audio, and the first sub-modality information may be an audio clip. Alternatively, the first modality information may be sensing information, and may be specifically divided into multiple groups, where each group serves as one piece of first sub-modality information. The relationship between the second modality information and the second sub-modality information is similar to that, and is not described herein again.
And 903, determining a prediction result and a corresponding prediction loss of the first modality information and/or the second modality information through a fusion module according to the first modality representation characteristics and the second modality representation characteristics.
Optionally, the first modality representation feature and the second modality representation feature may be input to the fusion module to obtain a multi-modality feature, and the multi-modality feature may be used to represent a feature corresponding to the target object in a multi-modality space, and then the multi-modality feature is input to the first modality information prediction module and the second modality information prediction module to obtain a corresponding prediction result.
Therein, a multi-modal space may refer to a common space to which data of different modalities is mapped. In the multi-modal space, the more important a certain sub-modal information is, the higher its corresponding weight is. Through the multi-modal space, high-level features of data of various modes can be abstracted so as to realize the prediction function of the model.
And 904, adjusting the parameters of the model according to the attention loss and the predicted loss.
In this embodiment, the model may be trained according to the first modality information and the second modality information, and the obtained model may be used to predict the first modality information and/or the second modality information.
The implementation process and principle of the method in this embodiment can be seen in the foregoing embodiment, and only the image in the foregoing embodiment needs to be replaced with the first modality information, the text needs to be replaced with the second modality information, the image block needs to be replaced with the first sub-modality information, the character needs to be replaced with the second sub-modality information, and the vision and the language need to be replaced with the first modality and the second modality respectively.
Illustratively, the first modality information and the second modality information may be a video and a text respectively, for example the video and the text of a commodity, specifically the video and the title on the commodity detail page, and the corresponding prediction results are used to represent the video segment in which the commodity appears and the central word of the title, respectively. Constructing the attention loss from the attention values of the segments in the video and of the characters in the title helps improve the accuracy of the model, so that the segments in which the commodity appears and the central word of the title can be located more quickly and accurately for subsequent text search, image search or video search.
In another example, the first modality information and the second modality information may be an image and an audio respectively, for example the image and the audio of a commodity, and the corresponding prediction results are used to represent the commodity main body frame in the image and the audio segment that mentions the commodity in the audio, respectively, so that the method provided by this embodiment can locate the commodity main body frame and the audio segment more quickly and accurately for subsequent image search or audio search.
In another example, the first modality information and the second modality information may be sensing information and an image, respectively, for example, the sensing information and the image may be acquired by a vehicle or a roadside device, the sensing information may be infrared sensing data, point cloud data, or the like, the sensing information and the image may be processed by the method provided by this embodiment, and the trained model may be used to determine a position where an obstacle in the sensing information or the image is located, so as to provide a basis for vehicle driving, and improve driving safety.
In summary, the model training method provided in this embodiment can train the model through the loss corresponding to the prediction result and the intermediate attention loss, and not only can supervise the final prediction result, but also can supervise the intermediate representation features, so that the extracted representation features can more accurately reflect the importance degree of each of the first sub-modal information and the second sub-modal information, thereby better performing fusion in the multi-modal fusion module, giving more attention to the important first sub-modal information or second sub-modal information, improving the accuracy of the final prediction result, and improving the model effect.
Fig. 10 is a flowchart illustrating a prediction method according to an embodiment of the present application. As illustrated in fig. 10, the method includes:
step 1001, obtaining first modality information and second modality information to be processed.
Wherein the first modality information and the second modality information include any two of the following information: image, text, audio, video, sensory information.
Step 1002, obtaining a prediction result corresponding to the first modality information and/or the second modality information through a multi-modality interaction model according to the first modality information and the second modality information to be processed.
Wherein, the multi-modal interaction model is obtained by training based on the method of any one of the above embodiments.
Optionally, when the model trained using the embodiments shown in fig. 1 to 8 is used, the first modality information and the second modality information may specifically be an image and a text.
For example, an image and a text of a commodity may be input into the model, and a prediction result corresponding to the image, that is, a main body box where the commodity is located in the image, may be obtained through interaction between the image and the text, or a prediction result corresponding to the text, that is, a central word of the text, may be obtained, or a prediction result of the image and the text may also be obtained simultaneously.
In summary, in the prediction method provided in this embodiment, the model is obtained by training the loss corresponding to the prediction result and the intermediate attention loss, and not only can monitor the final prediction result, but also can monitor the intermediate representation features, so that the extracted representation features can more accurately reflect the importance degree of each of the first sub-modal information and the second sub-modal information, thereby better performing fusion in the multi-modal fusion module, giving more attention to the important first sub-modal information or second sub-modal information, and improving the accuracy of the prediction result.
Corresponding to the model training method, the embodiment of the application also provides a model training device, wherein the model comprises a visual coding module, a language coding module and a fusion module; the device comprises:
the first input unit is used for determining visual representation characteristics corresponding to the image through a visual coding module according to the image and the text to be processed, and determining language representation characteristics corresponding to the text through a language coding module;
the first attention processing unit is used for determining an attention value corresponding to each image block in the image and/or each character in the text according to the visual representation feature and the language representation feature, and determining attention loss according to the attention value; wherein the attention value of the image block is used for representing the contribution of the image block to the text prediction, and the attention value of the character is used for representing the contribution of the character to the image prediction;
the first fusion unit is used for determining a prediction result corresponding to the image and/or the text through a fusion module according to the visual representation characteristic and the language representation characteristic and determining a prediction loss according to the prediction result;
a first adjusting unit for adjusting parameters of the model according to the attention loss and the prediction loss.
In one or more embodiments of the present application, optionally, when determining, according to the visual representation feature and the language representation feature, the attention value corresponding to each image block in the image and/or each character in the text, the first attention processing unit is specifically configured to:
calculating according to the visual representation characteristics of a plurality of image blocks in the image and the language representation characteristics of a plurality of characters in the text to obtain a cross attention matrix, wherein elements in the cross attention matrix are used for representing the contribution of the image blocks to the characters and/or the contribution of the characters to the image blocks;
for any image block, adding the contributions of the image block to each character to obtain the attention value corresponding to the image block; and/or, for any character, adding the contributions of the character to each image block to obtain the attention value corresponding to the character.
In one or more embodiments of the present application, optionally, when determining the attention loss according to the attention value, the first attention processing unit is specifically configured to:
determining attention loss according to the attention value and the corresponding label of each image block and/or the attention value and the corresponding label of each character;
wherein the label used in determining the loss of attention matches the label used in determining the predicted loss.
In one or more embodiments of the present application, optionally, when determining attention loss according to the attention value and the corresponding label of each image block and/or the attention value and the corresponding label of each character, the first attention processing unit is specifically configured to:
calculating first cross entropy loss according to the attention value and the corresponding label of each image block;
calculating a second cross entropy loss according to the attention value of each character and the corresponding label;
and determining corresponding attention loss according to the first cross entropy loss and the second cross entropy loss.
In one or more embodiments of the present application, optionally, the model further comprises a visual prediction module and/or a language prediction module; the first fusion unit is specifically configured to:
inputting the visual representation features and the language representation features into a fusion module to obtain multi-modal representation features;
and according to the multi-modal representation characteristics, obtaining a prediction result of each image block through a visual prediction module, and/or obtaining a prediction result of each character through a language prediction module.
In one or more embodiments of the present application, optionally, the first adjusting unit is specifically configured to:
adjusting parameters of the visual coding module and the language coding module according to the attention loss;
and adjusting parameters of each module in the model according to the predicted loss.
The embodiment of the application also provides another model training device, wherein the model comprises a visual coding module, a language coding module and a fusion module; the device comprises:
the second input unit is used for determining visual representation characteristics corresponding to the commodity images through a visual coding module according to the commodity images and the commodity titles corresponding to the commodities and determining language representation characteristics corresponding to the commodity titles through a language coding module;
the second attention processing unit is used for determining an attention value corresponding to each image block in the commodity image and/or each character in the commodity title according to the visual representation feature and the language representation feature, and determining attention loss according to the attention value; wherein the attention value of the image block is used for representing the contribution of the image block to the text prediction, and the attention value of the character is used for representing the contribution of the character to the image prediction;
the second fusion unit is used for determining a prediction result corresponding to the commodity image and/or the commodity title through a fusion module according to the visual representation feature and the language representation feature, and determining a prediction loss according to the prediction result; the prediction result corresponding to the commodity image is used for positioning a commodity main body in the commodity image, and the prediction result corresponding to the commodity title is used for positioning a central word of the commodity title;
and the second adjusting unit is used for adjusting the parameters of the model according to the attention loss and the prediction loss.
The embodiment of the application also provides another model training device, wherein the model comprises a first modal coding module, a second modal coding module and a fusion module; the device comprises:
the third input unit is used for determining a first modal representation characteristic corresponding to the first modal information through the first modal coding module and determining a second modal representation characteristic corresponding to the second modal information through the second modal coding module;
the third attention processing unit is used for determining an attention value corresponding to each piece of first sub-modal information in the first modal information and/or each piece of second sub-modal information in the second modal information according to the first modal representation feature and the second modal representation feature, and determining attention loss according to the attention value; the attention value of the first sub-modal information is used for representing the contribution of the first sub-modal information to the prediction of the second modal information, and the attention value of the second sub-modal information is used for representing the contribution of the second sub-modal information to the prediction of the first modal information;
the third fusion unit is used for determining a prediction result and a corresponding prediction loss of the first modality information and/or the second modality information through a fusion module according to the first modality representation characteristics and the second modality representation characteristics;
and a third adjusting unit for adjusting the parameters of the model according to the attention loss and the prediction loss.
An embodiment of the present application further provides a prediction apparatus, including:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring first modality information and second modality information to be processed; wherein the first modality information and the second modality information include any two of the following information: image, text, audio, video, sensory information;
the prediction unit is used for obtaining a prediction result corresponding to the first modal information and/or the second modal information through a multi-modal interaction model according to the first modal information and the second modal information to be processed;
wherein, the multi-modal interaction model is obtained by training based on the method of any one of the above embodiments.
The model training device and the prediction device provided in the embodiments of the present application may be used to implement the technical solutions of the embodiments shown in fig. 1 to fig. 10, which have similar implementation principles and technical effects, and are not described herein again.
Fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 11, the electronic device of the present embodiment may include:
at least one processor 1101; and
a memory 1102 communicatively coupled to the at least one processor;
wherein the memory 1102 stores instructions executable by the at least one processor 1101 to cause the electronic device to perform a method according to any one of the embodiments described above.
Alternatively, the memory 1102 may be separate or integrated with the processor 1101.
For the implementation principle and the technical effect of the electronic device provided by this embodiment, reference may be made to the foregoing embodiments, which are not described herein again.
The embodiment of the present application further provides a computer-readable storage medium, in which computer-executable instructions are stored, and when a processor executes the computer-executable instructions, the method described in any one of the foregoing embodiments is implemented.
The present application further provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements the method described in any of the foregoing embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the modules is only one logical division, and other divisions may be realized in practice, for example, a plurality of modules may be combined or integrated into another system, or some features may be omitted, or not executed.
The integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some steps of the methods described in the embodiments of the present application.
It should be understood that the Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in connection with the present application may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in the processor. The memory may comprise a high-speed RAM memory, and may further comprise a non-volatile storage NVM, such as at least one disk memory, and may also be a USB disk, a removable hard disk, a read-only memory, a magnetic or optical disk, etc.
The storage medium may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an Application Specific Integrated Circuit (ASIC). Of course, the processor and the storage medium may reside as discrete components in an electronic device or host device.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
In the technical scheme of the application, the collection, storage, use, processing, transmission, provision, publication and other processing of the related user data and other information all accord with the regulations of related laws and regulations, and do not violate the common customs of public order.
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are included in the scope of the present application.

Claims (12)

1. A model training method is characterized in that the model comprises a visual coding module, a language coding module and a fusion module; the method comprises the following steps:
according to the image and the text to be processed, determining the visual representation characteristics corresponding to the image through a visual coding module, and determining the language representation characteristics corresponding to the text through a language coding module;
according to the visual representation features and the language representation features, determining attention values corresponding to all image blocks in the image and/or all characters in the text, and determining attention loss according to the attention values; wherein the attention value of the image block is used for representing the contribution of the image block to the text prediction, and the attention value of the character is used for representing the contribution of the character to the image prediction;
according to the visual representation features and the language representation features, determining a prediction result corresponding to the image and/or the text through a fusion module, and determining prediction loss according to the prediction result;
adjusting parameters of the model based on the attention loss and the predicted loss.
2. The method of claim 1, wherein determining attention values corresponding to respective image blocks in the image and/or respective characters in the text according to the visual representation features and the linguistic representation features comprises:
calculating according to the visual representation characteristics of a plurality of image blocks in the image and the language representation characteristics of a plurality of characters in the text to obtain a cross attention matrix, wherein elements in the cross attention matrix are used for representing the contribution of the image blocks to the characters and/or the contribution of the characters to the image blocks;
for any image block, adding the contributions of the image block to each character to obtain the attention value corresponding to the image block; and/or, for any character, adding the contributions of the character to each image block to obtain the attention value corresponding to the character.
3. The method of claim 1, wherein determining attention loss from the attention value comprises:
determining attention loss according to the attention value and the corresponding label of each image block and/or the attention value and the corresponding label of each character;
wherein the label used in determining the loss of attention matches the label used in determining the predicted loss.
4. The method of claim 3, wherein determining the attention loss according to the attention value and the corresponding label of each image block and/or the attention value and the corresponding label of each character comprises:
calculating first cross entropy loss according to the attention value and the corresponding label of each image block;
calculating a second cross entropy loss according to the attention value of each character and the corresponding label;
and determining corresponding attention loss according to the first cross entropy loss and the second cross entropy loss.
5. The method according to any one of claims 1-4, wherein the model further comprises a visual prediction module and/or a language prediction module; determining a prediction result corresponding to the image and/or the text through a fusion module according to the visual representation feature and the language representation feature, wherein the prediction result comprises:
inputting the visual representation features and the language representation features into a fusion module to obtain multi-modal representation features;
and according to the multi-modal representation characteristics, obtaining a prediction result of each image block through a visual prediction module, and/or obtaining a prediction result of each character through a language prediction module.
6. The method according to any one of claims 1-4, wherein adjusting parameters of the model based on the attention loss and the predicted loss comprises:
adjusting parameters of the visual coding module and the language coding module according to the attention loss;
and adjusting parameters of each module in the model according to the predicted loss.
7. A model training method is characterized in that the model comprises a visual coding module, a language coding module and a fusion module; the method comprises the following steps:
according to a commodity image and a commodity title corresponding to a commodity, determining a visual representation characteristic corresponding to the commodity image through a visual coding module, and determining a language representation characteristic corresponding to the commodity title through a language coding module;
according to the visual representation features and the language representation features, determining attention values corresponding to all image blocks in the commodity image and/or all characters in the commodity title, and determining attention loss according to the attention values; wherein the attention value of the image block is used for representing the contribution of the image block to the text prediction, and the attention value of the character is used for representing the contribution of the character to the image prediction;
according to the visual representation features and the language representation features, determining a prediction result corresponding to the commodity image and/or the commodity title through a fusion module, and determining prediction loss according to the prediction result; the prediction result corresponding to the commodity image is used for positioning a commodity main body in the commodity image, and the prediction result corresponding to the commodity title is used for positioning a central word of the commodity title;
adjusting parameters of the model based on the attention loss and the predicted loss.
8. A model training method is characterized in that the model comprises a first modal coding module, a second modal coding module and a fusion module; the method comprises the following steps:
determining a first modal representation characteristic corresponding to the first modal information through a first modal coding module, and determining a second modal representation characteristic corresponding to the second modal information through a second modal coding module;
according to the first modal representation features and the second modal representation features, determining attention values corresponding to each first sub-modal information in the first modal information and/or each second sub-modal information in the second modal information, and determining attention loss according to the attention values; the attention value of the first sub-modal information is used for representing the contribution of the first sub-modal information to the prediction of the second modal information, and the attention value of the second sub-modal information is used for representing the contribution of the second sub-modal information to the prediction of the first modal information;
determining a prediction result and a corresponding prediction loss of the first modality information and/or the second modality information through a fusion module according to the first modality representation characteristics and the second modality representation characteristics;
adjusting parameters of the model based on the attention loss and the predicted loss.
9. A prediction method, comprising:
acquiring first modality information and second modality information to be processed; wherein the first modality information and the second modality information include any two of the following information: image, text, audio, video, sensory information;
obtaining a prediction result corresponding to the first modal information and/or the second modal information through a multi-modal interaction model according to the first modal information and the second modal information to be processed;
wherein the multi-modal interaction model is trained based on the method of any one of claims 1-8.
10. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor to cause the electronic device to perform the method of any of claims 1-9.
11. A computer-readable storage medium having computer-executable instructions stored thereon which, when executed by a processor, implement the method of any one of claims 1-9.
12. A computer program product comprising a computer program, characterized in that the computer program realizes the method according to any of claims 1-9 when executed by a processor.
CN202210602521.9A 2022-05-30 2022-05-30 Model training method, prediction method, device, storage medium, and program product Pending CN114898192A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210602521.9A CN114898192A (en) 2022-05-30 2022-05-30 Model training method, prediction method, device, storage medium, and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210602521.9A CN114898192A (en) 2022-05-30 2022-05-30 Model training method, prediction method, device, storage medium, and program product

Publications (1)

Publication Number Publication Date
CN114898192A true CN114898192A (en) 2022-08-12

Family

ID=82725313

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210602521.9A Pending CN114898192A (en) 2022-05-30 2022-05-30 Model training method, prediction method, device, storage medium, and program product

Country Status (1)

Country Link
CN (1) CN114898192A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117196546A (en) * 2023-11-08 2023-12-08 杭州实在智能科技有限公司 RPA flow executing system and method based on page state understanding and large model driving
CN118035427A (en) * 2024-04-15 2024-05-14 之江实验室 Method and device for enhancing multi-mode image-text retrieval through 3D contrast learning
CN117196546B (en) * 2023-11-08 2024-07-09 杭州实在智能科技有限公司 RPA flow executing system and method based on page state understanding and large model driving

Similar Documents

Publication Publication Date Title
EP3267368B1 (en) Machine learning image processing
CN111898031B (en) Method and device for obtaining user portrait
US9037600B1 (en) Any-image labeling engine
KR20230087622A (en) Methods and apparatus for detecting, filtering, and identifying objects in streaming video
CN110827129A (en) Commodity recommendation method and device
CN103377262B (en) The method and apparatus being grouped to user
CN104715023A (en) Commodity recommendation method and system based on video content
KR101835333B1 (en) Method for providing face recognition service in order to find out aging point
CN114332680A (en) Image processing method, video searching method, image processing device, video searching device, computer equipment and storage medium
CN101950400B (en) Picture retrieving method of network shopping guiding method
CN108829847B (en) Multi-modal modeling method based on translation and application thereof in commodity retrieval
CN113722583A (en) Recommendation method, recommendation model training method and related products
CN113762309B (en) Object matching method, device and equipment
CN110321473B (en) Multi-modal attention-based diversity preference information pushing method, system, medium and device
CN110991464A (en) Commodity click rate prediction method based on deep multi-mode data fusion
EP4224339A1 (en) Intelligent systems and methods for visual search queries
CN114840705A (en) Combined commodity retrieval method and system based on multi-mode pre-training model
JP6975312B2 (en) Fraud estimation system, fraud estimation method, and program
KR102444498B1 (en) System and method for providing image-based service to sell and buy product
CN114898192A (en) Model training method, prediction method, device, storage medium, and program product
CN111695971B (en) Article recommendation method, apparatus and device, and computer storage medium
JP2012194691A (en) Re-learning method and program of discriminator, image recognition device
CN117314561A (en) Meta-universe product recommendation method and device and computer-readable storage medium
CN108596646A (en) A kind of garment coordination recommendation method of fusion face character analysis
CN117251622A (en) Method, device, computer equipment and storage medium for recommending objects

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination