CN116842384A - Multi-mode model training method and device, electronic equipment and readable storage medium - Google Patents

Multi-mode model training method and device, electronic equipment and readable storage medium

Info

Publication number
CN116842384A
Authority
CN
China
Prior art keywords
image
information
text
sample
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310798376.0A
Other languages
Chinese (zh)
Inventor
赵哲一
于非
贺颖
孙喜龙
施斯
陈加壹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Provincial Laboratory Of Artificial Intelligence And Digital Economy Shenzhen
Original Assignee
Guangdong Provincial Laboratory Of Artificial Intelligence And Digital Economy Shenzhen
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Provincial Laboratory Of Artificial Intelligence And Digital Economy Shenzhen filed Critical Guangdong Provincial Laboratory Of Artificial Intelligence And Digital Economy Shenzhen
Priority to CN202310798376.0A priority Critical patent/CN116842384A/en
Publication of CN116842384A publication Critical patent/CN116842384A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217Validation; Performance evaluation; Active pattern learning techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/254Fusion techniques of classification results, e.g. of results related to same input data
    • G06F18/256Fusion techniques of classification results, e.g. of results related to same input data of results relating to different input data, e.g. multimodal recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a multi-modal model training method and device, an electronic device, and a readable storage medium. The method comprises the following steps: inputting a first image sample into an image processing model to obtain first image information output by the image processing model; inputting a first instruction into the multimodal model to train the multimodal model, the first instruction comprising the first image information and a corresponding text description; determining first text information from the text description in the multimodal model; and, after aligning the first image information and the first text information, determining a first text answer corresponding to the first image sample according to the aligned first image information and first text information. By adding the first image information and the first text information, the multimodal model aligns them based on contrastive learning, which speeds up the alignment of visual and language information and reduces the multimodal model's need for a huge model capacity and data volume.

Description

Multi-mode model training method and device, electronic equipment and readable storage medium
Technical Field
The application belongs to the technical field of deep learning, and particularly relates to a multi-modal model training method and device, electronic equipment and a readable storage medium.
Background
Currently, multimodal models are machine learning models that are trained and perform inference on a variety of different data types (e.g., text, images, speech, video, etc.). By fusing information from different modalities, they better simulate human perception and cognition.
A common way for a multimodal model to fuse visual and language information is to align the visual information and the language information in the hidden state space. However, this fusion method causes a reasoning loss when the visual information is aligned. To compensate for the reasoning loss generated when visual information is aligned, the multimodal model requires a huge model capacity and data volume.
Disclosure of Invention
The embodiment of the application provides a multi-mode model training method, a device, electronic equipment, a readable storage medium and a computer program product, which can solve the problem that a multi-mode model needs huge model capacity and data volume in order to make up for reasoning loss generated by aligning visual information.
In a first aspect, an embodiment of the present application provides a multi-modal model training method, including:
Acquiring an image description sample, wherein the image description sample comprises a first image sample and a corresponding text description;
inputting the first image sample into an image processing model to obtain first image information output by the image processing model;
inputting a first instruction into a multimodal model, training the multimodal model, the first instruction comprising the first image information and the corresponding text description;
determining, in the multimodal model, first text information from the text description;
after aligning the first image information and the first text information, determining a first text answer corresponding to the first image sample according to the aligned first image information and the first text information;
and when the global loss value of the multi-modal model is smaller than a preset threshold value, obtaining a trained multi-modal model, wherein the global loss value comprises an alignment loss value between the first image information and the first text information and a loss value between the first text answer and the text description.
In one embodiment, the inputting the first image sample into an image processing model includes:
dividing the first image sample into a plurality of image blocks;
And inputting the image block of the first image sample and a token corresponding to the first image information into the image processing model.
In one embodiment, the determining the first text information according to the text description includes:
in the multimodal model, acquiring the text description from the first instruction based on a preset mask;
and processing the text description to obtain the first text information.
In one embodiment, said aligning said first image information and said first text information comprises:
according to the first image information and the first text information, a sample matrix is constructed, elements on a diagonal line in the sample matrix are positive samples, other elements in the sample matrix are negative samples, the positive samples comprise the first image information and the first text information, and the negative samples comprise the first image information and the tampered first text information;
and aligning the first image information with the first text information according to the sample matrix.
In one embodiment, the alignment loss value between the first image information and the first text information includes a loss value of the positive sample and a loss value of the negative sample.
In one embodiment, after determining the first text answer corresponding to the first image sample according to the aligned first image information and the first text information, the method further includes:
acquiring an image question-answer sample, wherein the image question-answer sample comprises a second image sample and corresponding question-answer information;
inputting the second image sample into the image processing model to obtain second image information output by the image processing model;
inputting a second instruction to the multi-modal model, and training the multi-modal model, wherein the second instruction comprises the second image information and corresponding question-answer information;
determining second text information according to the question-answer information in the multi-modal model;
determining a second text answer corresponding to the second image sample according to the second image information and the second text information;
the global loss value further comprises a loss value between the second text answer and an answer sample, and the question-answer information comprises a question sample and the answer sample.
In a second aspect, an embodiment of the present application provides an agent control method, including:
acquiring an image to be processed and a corresponding text description to be processed;
Inputting the image to be processed and the text description to be processed into a trained multimodal model to obtain a third text answer output by the trained multimodal model, wherein the trained multimodal model is obtained by training by the method of any one of the first aspects;
using the third text answer to instruct the LLM to select an action to be executed from all primitive actions in the primitive action set;
and sending the action to be executed to the intelligent agent so that the intelligent agent executes the action to be executed.
In a third aspect, an embodiment of the present application provides a multi-modal model training apparatus, including:
the system comprises an acquisition module, a display module and a display module, wherein the acquisition module is used for acquiring an image description sample, and the image description sample comprises a first image sample and a corresponding text description;
the image processing module is used for inputting the first image sample into an image processing model to obtain first image information output by the image processing model;
a training module for inputting a first instruction into a multimodal model, training the multimodal model, the first instruction comprising the first image information and the corresponding text description;
and further configured to determine, in the multimodal model, first text information from the text description;
and further configured to, after the first image information and the first text information are aligned, determine a first text answer corresponding to the first image sample according to the aligned first image information and the first text information;
and the method is further used for obtaining the trained multi-modal model when the global loss value of the multi-modal model is smaller than a preset threshold, wherein the global loss value comprises an alignment loss value between the first image information and the first text information and a loss value between the first text answer and the text description.
In a fourth aspect, an embodiment of the present application provides an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the method according to any one of the first or second aspects when executing the computer program.
In a fifth aspect, embodiments of the present application provide a computer readable storage medium storing a computer program which, when executed by a processor, implements a method as in any of the first or second aspects above.
In a sixth aspect, embodiments of the present application provide a computer program product for, when run on an electronic device, causing the electronic device to perform the method of any one of the first or second aspects above.
Compared with the prior art, the embodiment of the application has the beneficial effects that:
the embodiment of the application comprises: inputting the first image sample into an image processing model to obtain first image information output by the image processing model; inputting a first instruction into the multimodal model to train the multimodal model, the first instruction comprising the first image information and a corresponding text description; determining first text information from the text description in the multimodal model; and, after aligning the first image information and the first text information, determining a first text answer corresponding to the first image sample according to the aligned first image information and first text information. By adding the first image information and the first text information, the multimodal model aligns them based on contrastive learning, which speeds up the alignment of visual and language information and reduces the multimodal model's need for a huge model capacity and data volume.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments or the description of the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a multi-modal model training method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a second flow chart of a multi-modal model training method according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of an agent control method according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a multi-modal training apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in the present description and the appended claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if [a described condition or event] is detected" may be interpreted, depending on the context, as "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
Furthermore, the terms "first," "second," "third," and the like in the description of the present specification and in the appended claims, are used for distinguishing between descriptions and not necessarily for indicating or implying a relative importance.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
Fig. 1 is a schematic flow chart of a multi-modal model training method according to an embodiment of the application.
As shown in fig. 1, the method includes:
s11: an image description sample is acquired.
The image description sample comprises a first image sample and a corresponding text description.
In an application, an image description sample database is pre-built, and each image description sample comprises a first image sample and a corresponding text description. Illustratively, the first image sample contains a cat and a dog, and the corresponding text description is "a cat and a dog". An image description sample is obtained from the image description sample database.
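As a purely illustrative sketch (the Python structure and field names below are assumptions, not part of the application), one entry of such a database could be represented as:

```python
from dataclasses import dataclass

@dataclass
class ImageDescriptionSample:
    # One entry of the pre-built image description sample database (illustrative sketch)
    image_path: str        # path to the first image sample
    text_description: str  # corresponding text description

# Hypothetical entry matching the cat-and-dog illustration above
sample = ImageDescriptionSample("samples/cat_and_dog.jpg", "A cat and a dog.")
```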
S12: and inputting the first image sample into the image processing model to obtain first image information output by the image processing model.
In an application, inputting a first image sample into an image processing model, comprising:
s121: the first image sample is divided into a plurality of image blocks.
S122: the image block of the first image sample and the token corresponding to the first image information are input to an image processing model.
Wherein the first image information is used to characterize global information or an overall description of the first image sample. By setting the token of the first image information, the image processing model outputs the first image information after processing the image blocks. Because the image processing model integrates the information of the first image sample multiple times while processing the image blocks, the first image information can characterize global information or an overall description of the first image sample.
Illustratively, the image processing model is an image ViT (Vision Transformer) with 12 layers in total. The first image sample is divided into 256 image blocks, and a token of the first image information, token <cls_img>, is set. The position code of this token is set to position 0, and the position codes of the image blocks are set to positions 1 to 256. Information from the 12 layers is extracted to form an embedding sequence, and the embedding at position 0 is output. The embedding at position 0 is the content of <cls_img>, i.e., the first image information.
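A minimal PyTorch sketch of this idea follows: a <cls_img> token is prepended at position 0 to the 256 image-block embeddings, and its position-0 output is read out as the first image information. The module structure, dimensions, and names are illustrative assumptions rather than the exact image ViT of the application.

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Minimal ViT-style sketch: 256 patch tokens plus a <cls_img> token at position 0."""
    def __init__(self, patch_dim=768, depth=12, heads=12, num_patches=256):
        super().__init__()
        self.cls_img = nn.Parameter(torch.zeros(1, 1, patch_dim))                   # token <cls_img>
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, patch_dim))   # positions 0..256
        layer = nn.TransformerEncoderLayer(d_model=patch_dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)               # 12 layers in total

    def forward(self, patch_embeddings):                  # (B, 256, patch_dim)
        b = patch_embeddings.size(0)
        cls = self.cls_img.expand(b, -1, -1)              # place <cls_img> at position 0
        x = torch.cat([cls, patch_embeddings], dim=1) + self.pos_embed
        x = self.encoder(x)                               # embedding sequence over all positions
        return x[:, 0]                                    # position-0 output = first image information

# Hypothetical usage: one image split into 256 patch embeddings of dimension 768
patches = torch.randn(1, 256, 768)
first_image_information = ImageEncoder()(patches)         # shape (1, 768)
```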
S13: the first instruction is input to the multimodal model to train the multimodal model.
Wherein the first instruction includes first image information and a corresponding text description.
In one possible implementation, the first instruction may be expressed as "please describe the image <|cls_img|> <IMG> <answer> <|end_answer|> <|cls_answer|>". The content of <|cls_img|> is the first image information, the content of <IMG> is the first image sample, the content of <answer> is the text description corresponding to the first image information, <|end_answer|> represents the end of the answer, and <|cls_answer|> is the token of the first text information and is not used for feature fusion.
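The token layout of the first instruction can be sketched as a template string, for illustration only; at training time the placeholders are filled with embeddings and text rather than literal strings.

```python
# Illustrative template of the first instruction's token layout (not the exact implementation)
FIRST_INSTRUCTION_TEMPLATE = (
    "please describe the image <|cls_img|> <IMG> "
    "<answer> <|end_answer|> <|cls_answer|>"
)
# <|cls_img|> is filled with the first image information (an embedding),
# <IMG> with the first image sample, and <answer> with the corresponding text description.
```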
S14: in the multimodal model, first text information is determined from the text description.
In one possible implementation, the first text information is obtained from the text description based on the token <|cls_answer|> and a corresponding mask.
S15: after the first image information and the first text information are aligned, a first text answer corresponding to the first image sample is determined according to the aligned first image information and the first text information.
In the application, after the first image information and the first text information are obtained, based on contrast learning, the first image information and the first text information are aligned, so that visual language alignment is realized, and the distance difference between the aligned first image information and the aligned first text information in a vector space is smaller than a preset distance. The aligned first image information has language properties and can be understood by the model.
And predicting according to the aligned first image information and the first text information to obtain a first text answer corresponding to the first image sample.
S16: and when the global loss value of the multi-modal model is smaller than a preset threshold value, obtaining the trained multi-modal model.
Wherein the global penalty value comprises an alignment penalty value between the first image information and the first text information and a penalty value between the first text answer and the text description.
In an application, the alignment loss value is obtained by calculating an inner product between the first image information and the first text information. And obtaining a loss value between the first text answer and the text description by calculating the cross entropy between the first text answer and the text description.
The preset threshold value can be set according to an actual application scene.
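A hedged PyTorch sketch of the two loss terms named above follows; the sign convention for the inner-product alignment loss and the token-level reduction of the cross entropy are assumptions about one possible realization.

```python
import torch
import torch.nn.functional as F

def global_loss(first_image_info, first_text_info, answer_logits, description_token_ids):
    """Illustrative sketch of the global loss: an inner-product alignment loss between the
    first image information and the first text information, plus a cross-entropy loss
    between the predicted first text answer and the reference text description."""
    # Alignment loss: a larger inner product for matching image/text information gives a smaller loss
    alignment_loss = -torch.sum(first_image_info * first_text_info, dim=-1).mean()

    # Answer loss: cross entropy between predicted answer logits and the text description tokens
    answer_loss = F.cross_entropy(
        answer_logits.view(-1, answer_logits.size(-1)),   # (B*T, vocab)
        description_token_ids.view(-1),                    # (B*T,)
    )
    return alignment_loss + answer_loss
```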
The embodiment of the application comprises: inputting the first image sample into an image processing model to obtain first image information output by the image processing model; inputting a first instruction into the multimodal model to train the multimodal model, the first instruction comprising the first image information and a corresponding text description; determining first text information from the text description in the multimodal model; and, after aligning the first image information and the first text information, determining a first text answer corresponding to the first image sample according to the aligned first image information and first text information. By adding the first image information and the first text information, the multimodal model aligns them based on contrastive learning, which speeds up the alignment of visual and language information and reduces the multimodal model's need for a huge model capacity and data volume.
In one embodiment, step S14 includes:
s141: in the multimodal model, a textual description is obtained from a first instruction based on a preset mask.
In the application, the token <|cls_answer|> is located at the end of the sequence <|cls_img|> <IMG> <answer> <|end_answer|> <|cls_answer|>, so when it performs the attention calculation it would attend to <|cls_img|>, <IMG> and <answer>. Therefore a mask for the token <|cls_answer|> needs to be set in advance so that, in the attention calculation, <|cls_answer|> attends only to <answer> and exchanges information only with that content. The length of the mask is the same as the length of <answer>.
S142: and processing the text description to obtain the first text information.
In an application, after performing the attention calculation, the content of the token <|cls_answer|> can represent the information of <answer>, and the first text information is obtained.
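A minimal sketch of such a mask follows, assuming a boolean convention in which True marks a position that <|cls_answer|> may attend to; the helper and the token layout are illustrative only.

```python
import torch

def cls_answer_mask(seq_token_names, answer_token_name="<answer>"):
    """Build an attention mask so that the <|cls_answer|> token (last position)
    attends only to the positions holding <answer> content. Illustrative sketch."""
    mask = torch.zeros(len(seq_token_names), dtype=torch.bool)  # True = may be attended to
    for i, name in enumerate(seq_token_names):
        if name == answer_token_name:      # keep only the <answer> span
            mask[i] = True
    return mask

# Hypothetical token layout of the first instruction
tokens = ["<|cls_img|>", "<IMG>", "<answer>", "<answer>", "<|end_answer|>", "<|cls_answer|>"]
print(cls_answer_mask(tokens))   # tensor([False, False,  True,  True, False, False])
```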
According to the embodiment of the application, the text description is acquired from the first instruction based on the preset mask, the text description is processed to obtain the first text information, and the text description can be accurately acquired from the first instruction, so that the first text information can be determined according to the text description.
In one embodiment, aligning the first image information and the first text information includes:
S21: and constructing a sample matrix according to the first image information and the first text information, wherein elements on diagonal lines in the sample matrix are positive samples, and the rest elements in the sample matrix are negative samples.
The positive sample comprises first image information and first text information, and the negative sample comprises the first image information and tampered first text information.
In an application, based on contrast learning, a matrix comprising positive and negative samples is constructed such that the multimodal model learns and aligns the first image information and the first text information.
For example, the first text information corresponding to the first image information is a sentence; after the sentence is tampered with, the tampered first text information is obtained. The rows of the sample matrix correspond to images and the columns to texts; the elements on the diagonal of the sample matrix are positive samples, and the other positions are negative samples.
S21: the first image information is aligned with the first text information according to the sample matrix.
In application, according to the sample matrix, the multi-modal model learns so that the first image information in the positive sample is close to the first text information, and the first image information in the negative sample is far away from the tampered first text information, namely the difference between the aligned first image information and the first text information in the vector space is smaller than a preset distance.
In one possible implementation, the visual language alignment step is performed in an intermediate layer of the multimodal model.
Correspondingly, the alignment loss value between the first image information and the first text information includes a loss value of the positive sample and a loss value of the negative sample.
And taking the inner product between the first image information and the first text information in the positive sample as a loss value of the positive sample, wherein the larger the inner product is, the smaller the loss value is. And taking the inner product between the first image information in the negative sample and the tampered first text information as a loss value of the negative sample, wherein the larger the inner product is, the larger the loss value is.
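A minimal contrastive-learning sketch of the sample matrix described above follows, in the spirit of CLIP-style alignment; turning the diagonal (positive) and off-diagonal (negative) inner products into a loss via row-wise and column-wise cross entropy is an assumption about one possible realization.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_info, text_info):
    """image_info, text_info: (B, D) batches of first image / first text information.
    Entry (i, j) of the sample matrix is the inner product of image i with text j;
    diagonal entries are positive samples, all other entries are negative samples."""
    image_info = F.normalize(image_info, dim=-1)
    text_info = F.normalize(text_info, dim=-1)
    sample_matrix = image_info @ text_info.t()                       # (B, B) inner products
    targets = torch.arange(sample_matrix.size(0), device=sample_matrix.device)  # positives on the diagonal
    loss_rows = F.cross_entropy(sample_matrix, targets)              # images -> texts
    loss_cols = F.cross_entropy(sample_matrix.t(), targets)          # texts -> images
    return (loss_rows + loss_cols) / 2
```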
According to the embodiment of the application, the sample matrix is constructed according to the first image information and the first text information, the elements on the diagonal of the sample matrix are positive samples and the other elements are negative samples, and the first image information and the first text information are aligned according to the sample matrix, so that the multi-modal model can better learn the first image information and the first text information and thus better align them.
Fig. 2 is a schematic diagram of a second flow chart of a multi-modal model training method according to an embodiment of the application.
As shown in fig. 2, after step S14, the method further includes:
S31: and acquiring an image question-answer sample.
The image question and answer samples comprise a second image sample and corresponding question and answer information. The question and answer information includes a question sample and an answer sample.
In the application, an image question-answer sample database is preset. It can be used as a supplement to the multi-modal model training library, and the multi-modal model can be fine-tuned with it so that the multi-modal model can better understand images. Each image question-answer sample comprises a second image sample and a corresponding question-answer sample. Illustratively, the second image sample includes a cat and a dog, and the question-answer sample is the question "What is in the picture?" with the answer "There is a cat and a dog." An image question-answer sample is obtained from the image question-answer sample database.
Wherein the second image sample and the first image sample may be the same or different.
S32: and inputting the second image sample into the image processing model to obtain second image information output by the image processing model.
In an application, inputting the second image sample into the image processing model includes:
s321: the second image sample is divided into a plurality of image blocks.
S322: the image block of the second image sample and the token corresponding to the second image information are input to the image processing model.
Wherein the second image information is used to characterize global information or an overall description of the second image sample. By setting the token of the second image information, the image processing model outputs the second image information after processing the image blocks. Because the image processing model integrates the information of the second image sample multiple times while processing the image blocks, the second image information can characterize global information or an overall description of the second image sample.
S33: the second instruction is input to the multimodal model to train the multimodal model.
Wherein the second instruction includes second image information and corresponding question-answer information.
In one possible implementation, the second instruction may be expressed as "please answer <Visual Question> according to picture <|cls_img|> <IMG>: <answer> <|end_answer|> <|cls_answer|>". The content of <|cls_img|> is the second image information, the content of <IMG> is the second image sample, the content of <Visual Question> <answer> is the question-answer information corresponding to the second image information, <|end_answer|> indicates that the answer has ended and is not used for feature fusion, and <|cls_answer|> is the token of the second text information.
S34: in the multimodal model, second text information is determined based on the question-answer information.
In application, step S34 includes:
s341: in the multimodal model, an answer sample is obtained from the second instruction based on a preset mask.
In the application, the token <|cls_answer|> is located at the end of the sequence <|cls_img|> <IMG> <Visual Question> <answer> <|end_answer|> <|cls_answer|>, so its attention calculation would attend to all of the preceding tokens. Therefore a mask for the token <|cls_answer|> needs to be set in advance so that, in the attention calculation, <|cls_answer|> attends only to <answer> and exchanges information only with that content. The length of the mask is the same as the length of <answer>.
S342: and processing the answer sample to obtain second text information.
In an application, after performing the attention calculation, the content of the token <|cls_answer|> can represent the information of <answer>, and the second text information is obtained.
S35: and determining a second text answer corresponding to the second image sample according to the second image information and the second text information.
In the application, the multi-modal model predicts according to the second image information and the second text information, and obtains a second text answer corresponding to the second image sample.
Correspondingly, the global loss value further includes a loss value between the second text answer and the answer sample.
In the application, the cross entropy between the second text answer and the answer sample is calculated to obtain a loss value between the second text answer and the answer sample. That is, the global loss value includes an alignment loss value between the first image information and the first text information, a loss value between the first text answer and the text description, and a loss value between the second text answer and the answer sample.
According to the embodiment of the application, the image question-answer sample is obtained, comprising a second image sample and a corresponding question-answer sample; the second image sample is input into the image processing model to obtain second image information output by the image processing model; a second instruction comprising the second image information and the corresponding question-answer information is input into the multimodal model to train the multimodal model; second text information is determined from the question-answer information in the multi-modal model; and a second text answer corresponding to the second image sample is determined according to the second image information and the second text information. The multi-modal model is thus trained on the two databases combined, so that it can better understand images.
Fig. 3 is a flow chart of an agent control method according to an embodiment of the application. As shown in fig. 3, the method includes:
s41: and acquiring the image to be processed and the corresponding text description to be processed.
Wherein the image to be processed is an image of an external environment acquired by a sensor or an image pickup apparatus. The external environment is the environment in which the object operated by the agent is located.
The image to be processed is an image, and the text description to be processed includes at least one sentence.
S42: and inputting the image to be processed and the text description to be processed into the trained multi-modal model to obtain a third text answer output by the trained multi-modal model.
Wherein the trained multi-modal model is obtained by training the multi-modal model training method described in the above embodiments.
In the application, the trained multi-mode model predicts according to the image to be processed and the text description to be processed, and a third text answer corresponding to the image to be processed is obtained.
S43: and using the third text answer to instruct the LLM to select the action to be executed from all primitive actions in the primitive action set.
In the application, a high-level instruction is constructed based on the third text answer, expressed as "user command + please select one action to execute according to the third text answer and the following primitive actions <action1> <action2> <action3> ...". The high-level instruction is input into the LLM to obtain the action to be executed output by the LLM.
After the intelligent agent executes the action to be executed, if the user command has not been completed, the high-level instruction is reconstructed to obtain a reconstructed high-level instruction, expressed as "user command + please select one action to execute according to the third text answer and the following primitive actions <action1> <action2> <action3> ...". The reconstructed high-level instruction is input into the LLM to obtain the next action to be executed output by the LLM, and so on until the user command is completed.
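A sketch of this control loop follows; call_llm and execute_on_agent are hypothetical helpers standing in for the LLM and agent interfaces, and the prompt wording only mirrors the template quoted above.

```python
def control_agent(user_command, third_text_answer, primitive_actions, call_llm, execute_on_agent):
    """Illustrative loop: keep asking the LLM to pick one primitive action until
    the user command is completed. call_llm / execute_on_agent are assumed interfaces."""
    action_list = " ".join(f"<{a}>" for a in primitive_actions)
    while True:
        high_level_instruction = (
            f"{user_command} Please select one action to execute according to the "
            f"following answer and primitive actions. Answer: {third_text_answer}. "
            f"Primitive actions: {action_list}."
        )
        action_to_execute = call_llm(high_level_instruction)   # LLM picks an action to be executed
        done = execute_on_agent(action_to_execute)             # agent executes it
        if done:                                               # stop once the user command is completed
            break
```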
S44: and sending the action to be executed to the intelligent agent so that the intelligent agent executes the action to be executed.
In an application, an agent is an intelligent device, such as a robotic arm, robot, etc., that can perform an action to be performed.
The embodiment of the application obtains the image to be processed and the corresponding text description to be processed; inputting the image to be processed and the text description to be processed into the trained multi-modal model to obtain a third text answer output by the trained multi-modal model, and indicating the LLM to select actions to be executed from all primitive actions in the primitive action set by using the third text answer; and sending the action to be executed to the intelligent agent so that the intelligent agent executes the action to be executed, and enabling the LLM to interact with the environment to enable the environment to participate in LLM reasoning, so that the LLM can select the most favorable decision for the intelligent agent in the environment.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an execution order; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
Corresponding to the methods described in the above embodiments, the following provides the corresponding apparatus. For convenience of explanation, only the parts relevant to the embodiments of the present application are shown.
Fig. 4 is a schematic structural diagram of a multi-modal training apparatus according to an embodiment of the application. As shown in fig. 4, the apparatus includes:
the obtaining module 10 is configured to obtain an image description sample, where the image description sample includes a first image sample and a corresponding text description.
The image processing module 11 is configured to input the first image sample to the image processing model, and obtain first image information output by the image processing model.
A training module 12 for inputting a first instruction into the multimodal model, training the multimodal model, the first instruction comprising first image information and a corresponding textual description;
the method is also used for determining first text information according to the text description in the multi-modal model;
the method is also used for determining, after the first image information and the first text information are aligned, a first text answer corresponding to the first image sample according to the aligned first image information and the first text information;
And the method is also used for obtaining the trained multi-modal model when the global loss value of the multi-modal model is smaller than a preset threshold value, wherein the global loss value comprises an alignment loss value between the first image information and the first text information and a loss value between the first text answer and the text description.
In one embodiment, the image processing module is specifically configured to divide the first image sample into a plurality of image blocks; the image block of the first image sample and the token corresponding to the first image information are input to an image processing model.
In one embodiment, the training module is specifically configured to obtain, in the multimodal model, a text description from the first instruction based on a preset mask; and processing the text description to obtain the first text information.
In one embodiment, the training module is specifically configured to construct a sample matrix according to the first image information and the first text information, where elements on a diagonal line in the sample matrix are positive samples, and other elements in the sample matrix are negative samples, where the positive samples include the first image information and the first text information, and the negative samples include the first image information and the tampered first text information; the first image information is aligned with the first text information according to the sample matrix.
In one embodiment, the training module is further configured to obtain an image question-answer sample, where the image question-answer sample includes a second image sample and corresponding question-answer information; inputting the second image sample into the image processing model to obtain second image information output by the image processing model; inputting a second instruction into the multimodal model, and training the multimodal model, wherein the second instruction comprises second image information and corresponding question-answer information; determining second text information according to the question-answer information in the multi-modal model; determining a second text answer corresponding to the second image sample according to the second image information and the second text information; the global loss value further comprises a loss value between the second text answer and the answer sample, and the question-answer information comprises a question sample and an answer sample.
In one embodiment, an agent control device includes:
the acquisition module is used for acquiring the image to be processed and the corresponding text description to be processed.
The answer prediction module is used for inputting the image to be processed and the text description to be processed into a trained multi-modal model to obtain a third text answer output by the trained multi-modal model, wherein the trained multi-modal model is obtained through training by the multi-modal training method of the embodiment.
And the execution module is used for indicating the LLM to select the action to be executed from all primitive actions in the primitive action set by utilizing the third text answer.
And the method is also used for sending the action to be executed to the intelligent agent so that the intelligent agent can execute the action to be executed.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 5, the electronic apparatus 2 of this embodiment includes: at least one processor 20 (only one is shown in fig. 5), a memory 21 and a computer program 22 stored in the memory 21 and executable on the at least one processor 20, the processor 20 implementing the steps in any of the various method embodiments described above when executing the computer program 22.
The electronic device 2 may be a computing device such as a desktop computer or a cloud server. The electronic device 2 may include, but is not limited to, a processor 20, a memory 21. It will be appreciated by those skilled in the art that fig. 5 is merely an example of the electronic device 2 and is not meant to be limiting of the electronic device 2, and may include more or fewer components than shown, or may combine certain components, or different components, such as may also include input-output devices, network access devices, etc.
The processor 20 may be a central processing unit (Central Processing Unit, CPU), and the processor 20 may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), off-the-shelf programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 21 may in some embodiments be an internal storage unit of the electronic device 2, such as a hard disk or a memory of the electronic device 2. The memory 21 may in other embodiments also be an external storage device of the electronic device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the electronic device 2. Further, the memory 21 may also include both an internal storage unit and an external storage device of the electronic device 2. The memory 21 is used for storing an operating system, application programs, boot loader (BootLoader), data, other programs, etc., such as program codes of the computer program. The memory 21 may also be used for temporarily storing data that has been output or is to be output.
It should be noted that, because the content of information interaction and execution process between the above devices/units is based on the same concept as the method embodiment of the present application, specific functions and technical effects thereof may be referred to in the method embodiment section, and will not be described herein.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
Embodiments of the present application also provide a computer readable storage medium storing a computer program which, when executed by a processor, performs the steps of the respective method embodiments described above.
Embodiments of the present application provide a computer program product which, when run on an electronic device, causes the electronic device to perform the steps of the method embodiments described above.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the present application may implement all or part of the flow of the method of the above embodiments, and may be implemented by a computer program to instruct related hardware, where the computer program may be stored in a computer readable storage medium, and when the computer program is executed by a processor, the computer program may implement the steps of each of the method embodiments described above. Wherein the computer program comprises computer program code which may be in source code form, object code form, executable file or some intermediate form etc. The computer readable medium may include at least: any entity or device capable of carrying computer program code to a photographing device/terminal apparatus, recording medium, computer Memory, read-Only Memory (ROM), random access Memory (RAM, random Access Memory), electrical carrier signals, telecommunications signals, and software distribution media. Such as a U-disk, removable hard disk, magnetic or optical disk, etc. In some jurisdictions, computer readable media may not be electrical carrier signals and telecommunications signals in accordance with legislation and patent practice.
In the foregoing embodiments, the description of each embodiment has its own emphasis. For parts that are not described or detailed in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other manners. For example, the apparatus/network device embodiments described above are merely illustrative, e.g., the division of the modules or units is merely a logical functional division, and there may be additional divisions in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims (10)

1. A method of multimodal model training comprising:
acquiring an image description sample, wherein the image description sample comprises a first image sample and a corresponding text description;
inputting the first image sample into an image processing model to obtain first image information output by the image processing model;
Inputting a first instruction into a multimodal model, training the multimodal model, the first instruction comprising the first image information and the corresponding text description;
determining, in the multimodal model, first text information from the text description;
after aligning the first image information and the first text information, determining a first text answer corresponding to the first image sample according to the aligned first image information and the first text information;
and when the global loss value of the multi-modal model is smaller than a preset threshold value, obtaining a trained multi-modal model, wherein the global loss value comprises an alignment loss value between the first image information and the first text information and a loss value between the first text answer and the text description.
2. The method of claim 1, wherein the inputting the first image sample into an image processing model comprises:
dividing the first image sample into a plurality of image blocks;
and inputting the image block of the first image sample and a token corresponding to the first image information into the image processing model.
3. The method of claim 1, wherein said determining the first text information from the text description comprises:
In the multimodal model, acquiring the text description from the first instruction based on a preset mask;
and processing the text description to obtain the first text information.
4. The method of claim 1, wherein said aligning said first image information and said first text information comprises:
according to the first image information and the first text information, a sample matrix is constructed, elements on a diagonal line in the sample matrix are positive samples, other elements in the sample matrix are negative samples, the positive samples comprise the first image information and the first text information, and the negative samples comprise the first image information and the tampered first text information;
and aligning the first image information with the first text information according to the sample matrix.
5. The method according to claim 4, wherein: the alignment loss value between the first image information and the first text information includes a loss value of the positive sample and a loss value of the negative sample.
6. The method of any one of claims 1 to 5, wherein after determining a first text answer corresponding to the first image sample from the aligned first image information and the first text information, further comprising:
Acquiring an image question-answer sample, wherein the image question-answer sample comprises a second image sample and corresponding question-answer information;
inputting the second image sample into the image processing model to obtain second image information output by the image processing model;
inputting a second instruction to the multi-modal model, and training the multi-modal model, wherein the second instruction comprises the second image information and corresponding question-answer information;
determining second text information according to the question-answer information in the multi-modal model;
determining a second text answer corresponding to the second image sample according to the second image information and the second text information;
the global loss value further comprises a loss value between the second text answer and an answer sample, and the question-answer information comprises a question sample and the answer sample.
7. An agent control method, comprising:
acquiring an image to be processed and a corresponding text description to be processed;
inputting the image to be processed and the text description to be processed into a trained multimodal model, obtaining a third text answer output by the trained multimodal model, the trained multimodal model being obtained by training the method of any of claims 1 to 6;
Using the third text answer to instruct the LLM to select an action to be executed from all primitive actions in the primitive action set;
and sending the action to be executed to the intelligent agent so that the intelligent agent executes the action to be executed.
8. A multimodal model training apparatus, comprising:
the system comprises an acquisition module, a display module and a display module, wherein the acquisition module is used for acquiring an image description sample, and the image description sample comprises a first image sample and a corresponding text description;
the image processing module is used for inputting the first image sample into an image processing model to obtain first image information output by the image processing model;
a training module for inputting a first instruction into a multimodal model, training the multimodal model, the first instruction comprising the first image information and the corresponding text description;
and further configured to determine, in the multimodal model, first text information from the text description;
the first text answer corresponding to the first image sample is determined according to the aligned first image information and the first text information after the first image information and the first text information are aligned;
and the method is further used for obtaining the trained multi-modal model when the global loss value of the multi-modal model is smaller than a preset threshold, wherein the global loss value comprises an alignment loss value between the first image information and the first text information and a loss value between the first text answer and the text description.
9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 7 when the computer program is executed.
10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the method according to any one of claims 1 to 7.
CN202310798376.0A 2023-06-29 2023-06-29 Multi-mode model training method and device, electronic equipment and readable storage medium Pending CN116842384A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310798376.0A CN116842384A (en) 2023-06-29 2023-06-29 Multi-mode model training method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310798376.0A CN116842384A (en) 2023-06-29 2023-06-29 Multi-mode model training method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN116842384A true CN116842384A (en) 2023-10-03

Family

ID=88159392

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310798376.0A Pending CN116842384A (en) 2023-06-29 2023-06-29 Multi-mode model training method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN116842384A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117763174A (en) * 2024-01-18 2024-03-26 泰德网聚(北京)科技股份有限公司 Multi-modal retrieval method, device and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination