CN117094419B - Multi-modal content output-oriented large language model training method, device and medium - Google Patents

Multi-modal content output-oriented large language model training method, device and medium

Info

Publication number
CN117094419B
CN117094419B (application CN202311333184.9A)
Authority
CN
China
Prior art keywords
language model
sound
large language
text
picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311333184.9A
Other languages
Chinese (zh)
Other versions
CN117094419A (en)
Inventor
Tan Mingkui (谭明奎)
Sun Xinyu (孙鑫宇)
Deng Zeshuai (邓泽帅)
Du Qing (杜卿)
Chen Jian (陈健)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202311333184.9A priority Critical patent/CN117094419B/en
Publication of CN117094419A publication Critical patent/CN117094419A/en
Application granted granted Critical
Publication of CN117094419B publication Critical patent/CN117094419B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a large language model training method, device and medium for multi-modal content output, belonging to the technical field of artificial intelligence. The method comprises the following steps: constructing a picture-sound-text triplet dataset for training a large language model; constructing a multi-modal large language model, embedding a plurality of parallel LoRA plugins in the output layer of the large language model, and initializing the LoRA plugins and a gating selector; reconstructing the picture and sound based on the text description, and training the multi-modal large language model on the reconstructed data; and fine-tuning the multi-modal large language model. The invention aligns the large model across modalities at the output end, and realizes end-to-end pre-training and fine-tuning by adding a plurality of LoRA plugins and gating selectors to the output layer of the model, so that the large language model acquires native multi-modal generation capability; the reasoning result is finally presented as multi-modal output, improving the efficiency of interaction between the large language model and humans.

Description

Multi-modal content output-oriented large language model training method, device and medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a large language model training method, device and medium for multi-modal content output.
Background
In recent years, large language models have achieved great success in various fields. Large language models typically acquire their text understanding capability by training on a large corpus crawled from the Internet: text in the corpus is masked at a random proportion, and the output text is obtained by predicting, for each token vector, the probability of the word it corresponds to. This makes it difficult for large language models to process and generate data of modalities other than text. Some existing work treats large models of different modalities as mutually independent functions, and lets the large language model complete tasks such as multi-modal data processing, analysis and generation by generating calls to these functions. However, because the different models can only interact through text or function-call interfaces, the cost of circulating and processing multi-modal data increases greatly, and joint reasoning over information of different modalities cannot be realized.
To solve the above problems, multi-modal large models attempt to embed multi-modal data content into the understanding process of the large language model, so that it has cross-modal perception and reasoning capabilities. Existing methods typically utilize additional data from other modalities, such as the picture modality. However, these multi-modal large models only have the ability to understand and reason over multiple modalities at the input; they do not have the ability to output multiple modalities at the output. Large language models are thus still limited to text output, and it is difficult for them to interact with humans more richly and vividly through data such as pictures or sounds.
Disclosure of Invention
In order to solve at least one of the technical problems existing in the prior art to a certain extent, the invention aims to provide a large language model training method, device and medium for multi-modal content output.
The technical scheme adopted by the invention is as follows:
a large language model training method for multi-modal content output comprises the following steps:
constructing a picture-sound-text triplet dataset for training a large language model;
constructing a multi-modal large language model, wherein the multi-modal large language model comprises a pre-trained large language model, a cross-attention model, a visual model and a sound model; embedding a plurality of parallel LoRA plugins in an output layer of the large language model, and initializing the LoRA plugins and a gating selector;
reconstructing the picture and sound based on the text description, and training the multi-modal large language model on the reconstructed data, so that the multi-modal large language model acquires the capability of generating picture-modality and sound-modality data;
fine-tuning the multi-modal large language model, so that the multi-modal large language model generates multi-modal content conforming to the context description according to the instructions.
Further, the constructing a picture-sound-text triplet dataset for training a large language model includes:
acquiring picture-text data pairs, and generating corresponding sounds for the pictures in the picture-text data pairs through a visually guided sound synthesis tool, so as to obtain picture-sound-text triplets; and/or,
extracting a plurality of picture-sound-text triples from a preset video data set; randomly extracting a key frame and audio corresponding to the key frame from each video in a video data set to serve as picture-sound pairing, and carrying out text description on picture content by utilizing a visual description model to obtain a picture-sound-text triplet;
and constructing a picture-sound-text triple data set according to the obtained picture-sound-text triple.
Further, the constructing a picture-sound-text triplet data set from the obtained picture-sound-text triplet includes:
converting text description in the picture-sound-text triples into an instruction dialogue form based on scenes according to a preset instruction template by using a natural language processing model;
the preset instruction templates comprise an image-text-sound chat robot template and a language-instruction-based multi-modal content editing template.
Further, in the training process of the multi-modal large language model, the parameters of the large language model, the cross-attention model, the visual model and the sound model are fixed, so as to avoid catastrophic forgetting of the models and expensive training overhead.
Further, embedding a plurality of parallel LoRA plugins in an output layer of the large language model, and initializing the LoRA plugins and a gating selector, wherein the method comprises the following steps:
embedding a plurality of parallel LoRA plugins in the output layer of the large language model, and splitting the parameters of the large language model by matrix low-rank decomposition into a fixed weight $W_0$ and learnable parameters $A$ and $B$; each LoRA plugin $i$ has corresponding parameters $A_i$ and $B_i$; the parameter $A_i$ is initialized to a random Gaussian distribution, and the parameter $B_i$ is initialized to all zeros;
randomly initializing the gating selector $G$;
initializing an output decoder, wherein the output decoder comprises a picture decoder $D_{img}$ and an audio decoder $D_{audio}$.
Further, in the $n$-th layer of the large language model, the gating selector $G_n$ is modeled as a single-layer MLP whose input is the output of the $(n-1)$-th layer; the gating selector $G_n$ predicts the weights $w$ of the LoRA plugins, expressed as follows:

$w = G_n(h_{n-1})$

where $h_{n-1}$ denotes the output of the $(n-1)$-th layer of the large language model;

in the training process, the LoRA plugins are updated as follows:

$W = W_0 + \sum_i w_i B_i A_i$

where $W_0$ denotes the frozen large language model parameters, $B_i A_i$ denotes the LoRA plugin parameters to be updated, and $w_i$ denotes the weight predicted by the gating selector for the $i$-th LoRA plugin.
Further, the reconstructing the picture and sound based on the text description and training the multi-modal large language model on the reconstructed data comprises:
in the pre-training stage of the multi-modal large language model, the input of the model is a prompt $X$ of a picture-sound-text triplet, wherein the text comprises descriptions of the picture and the sound together with the instruction for the multi-modal large language model; the text is processed by a tokenizer into word blocks $T$, and the picture and the sound are encoded into multi-modal word blocks $T_v$ and $T_a$ via the cross-attention mechanism;
in the training process, the output of the multi-modal large language model is expected to simultaneously contain the tag pairs of the picture and sound modality contents and the corresponding discrete codes; wherein the prediction results of the large model are supervised in the discrete code dimension, rather than in the pixel dimensions of the picture and the spectrogram.
Further, the expression of the loss function in the training process is:

$\mathcal{L} = -\sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1})$

where $u_i$ is the $i$-th word block and $k$ is the context window length;

the probability $P(u_i \mid u_{i-k}, \ldots, u_{i-1})$ is written as:

$P(u_i \mid u_{i-k}, \ldots, u_{i-1}) = \mathrm{softmax}\big(\mathrm{SelfAttn}(X W_e + W_p)\, W_e^{T}\big)$

where $W_e$ is the word coding matrix and $W_p$ is the position coding matrix; $X$ is the prompt of the picture-sound-text triplet; $\mathrm{SelfAttn}$ denotes the self-attention mechanism module of the large language model, and $\mathrm{softmax}$ is the normalized exponential function.
The invention adopts another technical scheme that:
a large language model training device for multimodal content output, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method as described above.
The invention adopts another technical scheme that:
a computer readable storage medium, in which a processor executable program is stored, which when executed by a processor is adapted to carry out the method as described above.
The beneficial effects of the invention are as follows: the invention aligns the large model across modalities at the output end, and realizes end-to-end pre-training and fine-tuning by adding a plurality of LoRA plugins and gating selectors to the output layer of the model, so that the large language model acquires native multi-modal generation capability; the reasoning result is finally presented as multi-modal output, improving the efficiency of interaction between the large language model and humans.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. It should be understood that the drawings in the following description show only some embodiments of the present invention, and other drawings may be obtained from these drawings by those skilled in the art without inventive effort.
FIG. 1 is a flow chart of steps of a method for training a large language model for multimodal content output in accordance with an embodiment of the present invention;
FIG. 2 is a schematic diagram of a multi-modal large language model in accordance with an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention. The step numbers in the following embodiments are set for convenience of illustration only, and the order between the steps is not limited in any way, and the execution order of the steps in the embodiments may be adaptively adjusted according to the understanding of those skilled in the art.
In the description of the present invention, it should be understood that references to orientation descriptions such as upper, lower, front, rear, left, right, etc. are based on the orientation or positional relationship shown in the drawings, are merely for convenience of description of the present invention and to simplify the description, and do not indicate or imply that the apparatus or elements referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus should not be construed as limiting the present invention.
In the description of the present invention, "several" means one or more, "a plurality of" means two or more, and greater than, less than, exceeding, etc. are understood as excluding the stated number, while above, below, within, etc. are understood as including the stated number. The terms "first" and "second" are used only to distinguish technical features and should not be construed as indicating or implying relative importance, implicitly indicating the number of the indicated technical features, or implicitly indicating the precedence of the indicated technical features.
Furthermore, in the description of the present invention, unless otherwise indicated, "a plurality" means two or more. "and/or", describes an association relationship of an association object, and indicates that there may be three relationships, for example, a and/or B, and may indicate: a exists alone, A and B exist together, and B exists alone. The character "/" generally indicates that the context-dependent object is an "or" relationship.
In the description of the present invention, unless explicitly defined otherwise, terms such as arrangement, installation, connection, etc. should be construed broadly and the specific meaning of the terms in the present invention can be reasonably determined by a person skilled in the art in combination with the specific contents of the technical scheme.
Term interpretation:
GPT-4: a large language model released by OpenAI.
as shown in fig. 1, the present embodiment provides a large language model training method for multi-modal content output, which includes the following steps:
a1, constructing a picture-sound-text triple data set for training a large language model;
a2, constructing a multi-mode large language model, wherein the multi-mode large language model comprises a pre-trained large language model, a cross attention model, a visual model and a sound model; embedding a plurality of parallel LoRA plugins in an output layer of the large language model, and initializing the LoRA plugins and a gating selector;
a3, reconstructing a film and sound based on the text description, and training a multi-mode large language model according to the reconstructed data so that the multi-mode large language model has the generation capacity of picture mode and sound mode data;
a4, fine-tuning the multi-modal large language model so that the multi-modal large language model generates multi-modal content conforming to the context description according to the instruction.
In this embodiment, the first step is to build a picture-sound-text triplet dataset for training the large language model. As one alternative embodiment, we use the public dataset CC3M as picture-text data pairs, and obtain picture-sound-text triplets by generating corresponding sounds for these pictures with a visually guided sound synthesis tool. As another alternative embodiment, about 1M picture-sound-text triplets are extracted from the video dataset Kinetics-600; because the visual and sound modalities in these videos are naturally aligned, a key frame and its corresponding audio are randomly extracted from each video in the dataset as a picture-sound pair, and the visual description model BLIP-2 is then used to produce a text description of the picture content, finally yielding picture-sound-text triplets usable for training.
Since human speech can easily be generated from captions by existing audio generation tools, in this embodiment the generated sound-modality content contains only sounds in natural environments. Subsequently, in order to preserve the original text-modality output capability of the model in the initial training stage, we initialize the LoRA plugins of the different modalities and the gating selectors in the output layer of the model to be trained. The model is then trained in two stages. In the first stage, by reconstructing the input picture and the corresponding audio according to the text description, the model preliminarily acquires the capability of generating picture- and audio-modality data. In the second stage, we use a natural language processing model (such as GPT-4) to further organize the pre-training data of the first stage, generating about 5k instructions to fine-tune the large language model so that it can generate specific multi-modal content according to text instructions and context.
The above method will be described in detail with reference to the accompanying drawings and specific examples.
The embodiment provides a large language model training method for multi-modal content output, which comprises the following steps:
s1: a picture-sound-text triplet dataset is collected for training a large language model.
S1-1: Based on the open-source picture-text dataset CC3M, the visually guided audio generation model SpecVQGAN is used to generate natural audio that matches the picture content, yielding a picture-sound-text triplet dataset.
S1-2: Additionally, we add about 1M picture-sound-text triplets extracted from the video dataset Kinetics-600. Because the visual and sound modalities in these videos are naturally aligned, a key frame and its corresponding audio are randomly extracted from each video in the dataset as a picture-sound pair, and the visual description model BLIP-2 is then used to produce a text description of the picture content, finally yielding picture-sound-text triplets usable for training; a sketch of this procedure is given below. These triplets will be used for the subsequent pre-training of the large language model.
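The following is a minimal, non-authoritative sketch of the S1-2 pipeline for one video: sample a random key frame, cut the surrounding audio with ffmpeg, and caption the frame with BLIP-2. The checkpoint name, clip length and sampling policy are illustrative assumptions rather than values specified by the patent.

```python
# Hedged sketch: building one picture-sound-text triplet from a video (assumptions noted above).
import random
import subprocess
import cv2
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")  # assumed checkpoint
captioner = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

def make_triplet(video_path: str, clip_seconds: float = 5.0):
    # 1) randomly pick a key frame
    cap = cv2.VideoCapture(video_path)
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    idx = random.randrange(n_frames)
    cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
    _, frame = cap.read()
    cap.release()
    image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

    # 2) cut the audio segment around that frame (requires ffmpeg on the PATH)
    t0 = max(idx / fps - clip_seconds / 2, 0.0)
    audio_path = video_path + ".wav"
    subprocess.run(["ffmpeg", "-y", "-ss", str(t0), "-t", str(clip_seconds),
                    "-i", video_path, "-vn", "-ac", "1", audio_path], check=True)

    # 3) caption the key frame with BLIP-2 to obtain the text description
    inputs = processor(images=image, return_tensors="pt")
    ids = captioner.generate(**inputs, max_new_tokens=40)[0]
    caption = processor.decode(ids, skip_special_tokens=True)
    return image, audio_path, caption
```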
S1-3: Based on the triplet datasets obtained in steps S1-1 and S1-2, we use GPT-4 for further processing, converting the text descriptions in the triplets into scenario-based instruction dialogues according to the instruction templates.
Specifically, we designed two different instruction templates, namely "image-text-sound chat robot" and "language-instruction-based multi-modal content editing". The "image-text-sound chat robot" template extracts a specific scene from the text of the original triplet data, and GPT-4 generates a dialogue based on that scene and the specific picture and sound content. The "language-instruction-based multi-modal content editing" template defines four picture editing tools (cropping, overlaying, background replacement and color modification) and four sound editing tools (clipping, mixing, pitch shifting and speed changing); GPT-4 selects a specific tool to edit the picture and the sound of the triplet, thereby generating the corresponding instruction data. These two kinds of instruction data will be used for the subsequent fine-tuning of the large language model.
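A hedged illustration of what the two templates might look like as prompts for GPT-4 is shown below; the exact wording is not disclosed in the patent, so these strings, the {caption} placeholder and the tool name spellings are assumptions.

```python
# Hypothetical prompt templates for step S1-3; wording and field names are assumptions.
CHAT_TEMPLATE = (
    "Scene: {caption}\n"
    "Write a short user/assistant dialogue about this scene. The assistant answers in text "
    "and, where helpful, returns the referenced picture inside <image>...</image> and the "
    "referenced sound inside <audio>...</audio>."
)

EDIT_TEMPLATE = (
    "Scene: {caption}\n"
    "Choose one picture edit from [crop, overlay, replace-background, recolor] and one sound "
    "edit from [clip, mix, pitch-shift, speed-change]. Write the user's editing instruction and "
    "the assistant's reply that returns the edited <image>...</image> and <audio>...</audio>."
)
```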
S2: Initialize the LoRA plugins of the different modalities in the output layer and the gating selector.
S2-1: As shown in fig. 2, the multi-modal large language model includes: the large language model Vicuna with 60B parameters pre-trained on a large-scale corpus, the cross-attention model Q-Former pre-trained on the picture-text datasets LAION 115M, CC3M, CC12M and SBU, and a visual model and a sound model trained on picture and sound data, respectively. During the subsequent training process, the parameters of these models are kept completely fixed, so as to avoid catastrophic forgetting of the models and expensive training overhead.
S2-2: A plurality of parallel LoRA plugins are embedded in the output layer of the large language model Vicuna, and the parameters of the large language model are split by matrix low-rank decomposition into a fixed weight $W_0$ and learnable parameters $A$ and $B$, where the learnable parameters $A$ and $B$ act as a bypass of the fixed weight $W_0$. Each LoRA plugin $i$ has corresponding parameters $A_i$ and $B_i$. The weight update during training can be written as:

$W = W_0 + \sum_i w_i B_i A_i$
where $w_i$ is the weight predicted by the gating selector for the $i$-th LoRA plugin. In order to preserve the text generation capability of the large language model in the initial phase of training, the parameter $A_i$ is initialized to a random Gaussian distribution and $B_i$ is initialized to all zeros.
S2-3: The gating selector $G$ is randomly initialized; it predicts appropriate weights for the LoRA plugins of the different modalities according to the context. In the $n$-th layer of the large language model, the gating selector $G_n$ is modeled as a single-layer MLP whose input is the output $h_{n-1}$ of the $(n-1)$-th layer, and it predicts the weights $w$ of the different LoRA plugins. The process can be written as:

$w = G_n(h_{n-1})$
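A minimal PyTorch sketch of the structure described in S2-2 and S2-3 follows, assuming two modality plugins (picture and sound), a LoRA rank of 8 and a softmax-normalized gate; these hyperparameters, the class name and the normalization choice are illustrative assumptions, not details taken from the patent.

```python
# Hedged sketch of an output-layer linear module with parallel LoRA plugins and a gating selector.
import torch
import torch.nn as nn

class GatedMultiLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, num_plugins: int = 2, rank: int = 8):
        super().__init__()
        self.base = base                                  # frozen W0
        for p in self.base.parameters():
            p.requires_grad = False
        d_in, d_out = base.in_features, base.out_features
        # One (A_i, B_i) pair per modality plugin: A_i is small random Gaussian, B_i is zero,
        # so at initialization the layer reproduces the frozen text-only output.
        self.A = nn.ParameterList([nn.Parameter(torch.randn(rank, d_in) * 0.02)
                                   for _ in range(num_plugins)])
        self.B = nn.ParameterList([nn.Parameter(torch.zeros(d_out, rank))
                                   for _ in range(num_plugins)])
        # Gating selector G_n: a single-layer MLP over the previous layer's output.
        self.gate = nn.Linear(d_in, num_plugins)

    def forward(self, h_prev: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.gate(h_prev), dim=-1)      # per-token plugin weights w_i
        out = self.base(h_prev)                           # W0 h
        for i, (A, B) in enumerate(zip(self.A, self.B)):
            out = out + w[..., i:i + 1] * (h_prev @ A.t() @ B.t())  # + w_i * B_i A_i h
        return out
```

Because every $B_i$ starts at zero, the layer initially reproduces the frozen text-only output, which matches the initialization goal stated in S2-2.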
s2-4: initializing an output decoder, the output decoder including a picture decoderAnd an audio decoder->
The picture decoder $D_{img}$ is initialized as the decoder of a pre-trained discrete variational picture autoencoder (VQ-GAN), which decodes the 256 discrete code vectors output by the large language model into a 3-channel picture, each discrete code taking a value in the VQ-GAN codebook. The audio decoder $D_{audio}$ is initialized as the codebook decoder of a spectrogram codebook model (Spectrogram Codebook), which decodes the 212 discrete code vectors output by the large language model into a spectrogram that is then converted into an audio signal by the MelGAN model, each discrete code likewise taking a value in the corresponding codebook. Only the word blocks inside the <image></image> or <audio></audio> tag pairs in the output of the large language model are mapped into discrete codes by a simple linear mapping and decoded by the corresponding output decoder.
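The routing just described can be sketched as follows; the tag token ids, the linear code heads and the decoder call signatures are assumptions used for illustration and do not come from the patent or from a specific VQ-GAN/MelGAN implementation.

```python
# Hedged sketch of routing tagged output tokens to the modality decoders (S2-4).
import torch
import torch.nn as nn

class ModalityRouter(nn.Module):
    def __init__(self, hidden: int, img_codebook: int, audio_codebook: int):
        super().__init__()
        # "simple linear mapping" from hidden states to discrete-code logits
        self.to_img_code = nn.Linear(hidden, img_codebook)      # VQ-GAN codebook size (assumed)
        self.to_audio_code = nn.Linear(hidden, audio_codebook)  # spectrogram codebook size (assumed)

    def codes_between(self, token_ids, hidden_states, open_id, close_id, head):
        """Collect hidden states between an <image>/<audio> tag pair and map them to codes."""
        ids = token_ids.tolist()
        start, end = ids.index(open_id) + 1, ids.index(close_id)
        logits = head(hidden_states[start:end])                 # (span_len, codebook)
        return logits.argmax(dim=-1)                            # discrete code indices

# usage (decoders below are placeholders for the pre-trained VQ-GAN / spectrogram decoders):
#   img_codes = router.codes_between(out_ids, out_hidden, IMG_OPEN, IMG_CLOSE, router.to_img_code)
#   picture   = vqgan_decoder(img_codes)              # 256 codes -> 3-channel picture
#   audio     = melgan(spec_decoder(audio_codes))     # 212 codes -> spectrogram -> waveform
```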
S3: based on the text description reconstruction sheet and the sound input, training is performed on the multimodal large language model.
S3-1: In the pre-training stage of the multi-modal large language model, the input $X$ of the model is a prompt containing a picture-sound-text triplet from the dataset, where the text comprises detailed descriptions of the picture and the sound together with the instruction for the large language model, which for this pre-training task is specifically "please redraw the input picture and audio from the detailed text description". The text is processed by a tokenizer into word blocks $T$; the picture and the sound are encoded into multi-modal word blocks $T_v$ and $T_a$ via the cross-attention mechanism. The prompt input to the multi-modal large language model can be written as $X = \mathrm{concat}(T_v, T_a, T)$, where $\mathrm{concat}$ denotes the splicing operation.
S3-2: During training, the output of the large language model is expected to contain both the tag pairs of the picture and sound modality contents and the corresponding discrete codes. To reduce computation, we supervise the predictions of the large model in the discrete code dimension rather than in the pixel dimensions of the picture and the spectrogram.
The training loss function can be written as:

$\mathcal{L} = -\sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1})$

where $u_i$ is the $i$-th word block and $k$ is the context window length. The probability $P(u_i \mid u_{i-k}, \ldots, u_{i-1})$ can be written as:

$P(u_i \mid u_{i-k}, \ldots, u_{i-1}) = \mathrm{softmax}\big(\mathrm{SelfAttn}(X W_e + W_p)\, W_e^{T}\big)$

where $W_e$ is the word coding matrix and $W_p$ is the position coding matrix.
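In practice this reduces to an ordinary next-token cross-entropy over a target sequence whose vocabulary is assumed to contain the text tokens, the modality tags and the discrete picture/audio codes; the sketch below shows that supervision, with the masking value as an assumption.

```python
# Hedged sketch of the S3-2 supervision: next-token cross-entropy over text tokens,
# modality tags and discrete codes (vocabulary layout and masking value are assumptions).
import torch
import torch.nn.functional as F

def pretrain_loss(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """logits: (B, L, vocab + codes); target_ids: (B, L) with text, tags and discrete codes."""
    # shift so that position i predicts token i+1 within the context window
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))
    gold = target_ids[:, 1:].reshape(-1)
    return F.cross_entropy(pred, gold, ignore_index=-100)  # -100 masks prompt positions
```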
Through this stage of pre-training, the large language model will have the ability to generate picture, sound modality data from the text description.
S4: the large language model is fine-tuned to generate multi-modal content that conforms to the context description based on the instructions.
In step S1-3 above we generated two instruction datasets, "image-text-sound chat robot" and "language-instruction-based multi-modal content editing". In the fine-tuning stage, we fine-tune on these two instruction datasets so that the large language model predicts the output required by the instruction. The ground-truth discrete codes, obtained by passing the cropped, overlaid and otherwise edited pictures and spectrograms through the encoders, are embedded into the instruction datasets; a hypothetical sample layout is sketched below. With this stage of fine-tuning, the large language model acquires the ability to generate picture- and sound-modality data conforming to the description, according to the text instructions and context.
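Purely for illustration, one fine-tuning sample might be laid out as below; the field names, the instruction wording and the numeric code placeholders are all assumptions about the data format rather than content from the patent.

```python
# Hypothetical fine-tuning sample layout for the editing template (all values are placeholders).
sample = {
    "instruction": "Crop the dog out of the photo and slow the barking sound down.",
    "context": "A dog barks in a park. <image> ...input picture word blocks... </image> "
               "<audio> ...input audio word blocks... </audio>",
    "target": "Here is the edited result: "
              "<image> ... </image> "   # ground-truth VQ-GAN codes of the edited picture
              "<audio> ... </audio>",   # ground-truth spectrogram codes of the edited audio
}
```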
In summary, existing multi-modal large language models usually align the modalities at the input end: they encode the input content of different modalities with different encoders, perform cross-modal reasoning based on a cross self-attention mechanism, and finally align all modality data to the same encoding space as the text modality, so that it can be treated as a special kind of text encoding fed into the large language model; they therefore have no multi-modal output capability. Furthermore, language-chain (LangChain) based methods treat the generation models of different modalities as tools that can be invoked by language instructions, which often causes the generated content to deviate from the context because of the ambiguity of the text description and the different text-understanding capabilities of different models. Unlike these existing methods, the present method aligns the large model across modalities at the output end, and realizes end-to-end pre-training and fine-tuning by adding a plurality of LoRA plugins and a gating selector to the output layer of the model, so that the large language model acquires native multi-modal generation capability. The reasoning result is finally presented as multi-modal output, improving the efficiency of interaction between the large language model and humans.
The embodiment also provides a large language model training device for multi-modal content output, which comprises:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method as shown in fig. 1.
The large language model training device for multi-mode content output can execute the large language model training method for multi-mode content output provided by the embodiment of the method, can execute any combination implementation steps of the embodiment of the method, and has corresponding functions and beneficial effects.
The present application also discloses a computer program product or a computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions may be read from a computer-readable storage medium by a processor of a computer device, and executed by the processor, to cause the computer device to perform the method shown in fig. 1.
The embodiment also provides a storage medium which stores instructions or programs for executing the large language model training method for multi-mode content output, and when the instructions or programs are run, the instructions or programs can execute the steps in any combination of the embodiments of the method, and the method has the corresponding functions and beneficial effects.
In some alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of a larger operation are performed independently.
Furthermore, while the invention is described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the described functions and/or features may be integrated in a single physical device and/or software module or one or more functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be apparent to those skilled in the art from consideration of their attributes, functions and internal relationships. Accordingly, one of ordinary skill in the art can implement the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to be limiting upon the scope of the invention, which is to be defined in the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The logic and/or steps represented in the flowcharts or otherwise described herein, for example an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus, or device and execute them. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, may be implemented using any one or combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable Gate Arrays (PGAs), field Programmable Gate Arrays (FPGAs), and the like.
In the foregoing description of this specification, reference to the terms "one embodiment/example", "another embodiment/example", "certain embodiments/examples", and the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiment of the present invention has been described in detail, the present invention is not limited to the above embodiments, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the present invention, and these equivalent modifications and substitutions are intended to be included in the scope of the present invention as defined in the appended claims.

Claims (8)

1. A large language model training method for multi-modal content output, characterized by comprising the following steps:
constructing a picture-sound-text triplet dataset for training a large language model;
constructing a multi-modal large language model, wherein the multi-modal large language model comprises a pre-trained large language model, a cross-attention model, a visual model and a sound model; embedding a plurality of parallel LoRA plugins in an output layer of the large language model, and initializing the LoRA plugins and a gating selector;
reconstructing the picture and sound based on the text description, and training the multi-modal large language model on the reconstructed data, so that the multi-modal large language model acquires the capability of generating picture-modality and sound-modality data;
fine-tuning the multi-modal large language model, so that the multi-modal large language model generates multi-modal content conforming to the context description according to the instructions;
embedding a plurality of parallel LoRA plugins in an output layer of the large language model, initializing the LoRA plugins and a gating selector, and comprising:
embedding a plurality of parallel LoRA plugins in the output layer of the large language model, and splitting the parameters of the large language model by matrix low-rank decomposition into a fixed weight $W_0$ and learnable parameters $A$ and $B$; each LoRA plugin $i$ has corresponding parameters $A_i$ and $B_i$; the parameter $A_i$ is initialized to a random Gaussian distribution, and the parameter $B_i$ is initialized to all zeros;
randomly initializing the gating selector $G$;
initializing an output decoder, wherein the output decoder comprises a picture decoder $D_{img}$ and an audio decoder $D_{audio}$;
in the $n$-th layer of the large language model, the gating selector $G_n$ is modeled as a single-layer MLP whose input is the output of the $(n-1)$-th layer; the gating selector $G_n$ predicts the weights $w$ of the LoRA plugins, expressed as follows:

$w = G_n(h_{n-1})$

where $h_{n-1}$ denotes the output of the $(n-1)$-th layer of the large language model;

in the training process, the LoRA plugins are updated as follows:

$W = W_0 + \sum_i w_i B_i A_i$

where $W_0$ denotes the frozen large language model parameters, $B_i A_i$ denotes the LoRA plugin parameters to be updated, and $w_i$ denotes the weight predicted by the gating selector for the $i$-th LoRA plugin.
2. The method for training a large language model for multi-modal content output according to claim 1, wherein said constructing a picture-sound-text triplet dataset for training a large language model comprises:
acquiring picture-text data pairs, and generating corresponding sounds for the pictures in the picture-text data pairs through a visually guided sound synthesis tool, so as to obtain picture-sound-text triplets; and/or,
extracting a plurality of picture-sound-text triples from a preset video data set; randomly extracting a key frame and audio corresponding to the key frame from each video in a video data set to serve as picture-sound pairing, and carrying out text description on picture content by utilizing a visual description model to obtain a picture-sound-text triplet;
and constructing a picture-sound-text triple data set according to the obtained picture-sound-text triple.
3. The method for training a large language model for multimodal content output according to claim 2, wherein said constructing a picture-sound-text triplet dataset from the obtained picture-sound-text triplet comprises:
converting text description in the picture-sound-text triples into an instruction dialogue form based on scenes according to a preset instruction template by using a natural language processing model;
the preset instruction templates comprise an image-text-sound chat robot template and a language-instruction-based multi-modal content editing template.
4. The method for training a large language model for multi-modal content output according to claim 1, wherein the parameters of the large language model, the cross-attention model, the visual model and the sound model are fixed during the training of the multi-modal large language model.
5. The method for training a large language model for multi-modal content output according to claim 1, wherein the reconstructing the picture and sound based on the text description and training the multi-modal large language model on the reconstructed data comprises:
in the pre-training stage of the multi-modal large language model, the input of the model is a prompt $X$ of a picture-sound-text triplet, wherein the text comprises descriptions of the picture and the sound together with the instruction for the multi-modal large language model; the text is processed by a tokenizer into word blocks $T$, and the picture and the sound are encoded into multi-modal word blocks $T_v$ and $T_a$ via the cross-attention mechanism;
in the training process, the output of the multi-modal large language model is expected to simultaneously contain the tag pairs of the picture and sound modality contents and the corresponding discrete codes; wherein the prediction results of the large model are supervised in the discrete code dimension.
6. The method for training a large language model for multi-modal content output according to claim 1 or 5, wherein the expression of the loss function in the training process is:

$\mathcal{L} = -\sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1})$

where $u_i$ is the $i$-th word block and $k$ is the context window length;

the probability $P(u_i \mid u_{i-k}, \ldots, u_{i-1})$ is written as:

$P(u_i \mid u_{i-k}, \ldots, u_{i-1}) = \mathrm{softmax}\big(\mathrm{SelfAttn}(X W_e + W_p)\, W_e^{T}\big)$

where $W_e$ is the word coding matrix and $W_p$ is the position coding matrix; $X$ is the prompt of the picture-sound-text triplet; $\mathrm{SelfAttn}$ denotes the self-attention mechanism module of the large language model, and $\mathrm{softmax}$ is the normalized exponential function.
7. A large language model training device for multimodal content output, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method of any one of claims 1-6.
8. A computer readable storage medium, in which a processor executable program is stored, characterized in that the processor executable program is for performing the method according to any of claims 1-6 when being executed by a processor.
CN202311333184.9A 2023-10-16 2023-10-16 Multi-modal content output-oriented large language model training method, device and medium Active CN117094419B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311333184.9A CN117094419B (en) 2023-10-16 2023-10-16 Multi-modal content output-oriented large language model training method, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311333184.9A CN117094419B (en) 2023-10-16 2023-10-16 Multi-modal content output-oriented large language model training method, device and medium

Publications (2)

Publication Number Publication Date
CN117094419A CN117094419A (en) 2023-11-21
CN117094419B true CN117094419B (en) 2024-01-30

Family

ID=88783613

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311333184.9A Active CN117094419B (en) 2023-10-16 2023-10-16 Multi-modal content output-oriented large language model training method, device and medium

Country Status (1)

Country Link
CN (1) CN117094419B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117332791B (en) * 2023-11-30 2024-03-01 税友软件集团股份有限公司 Large language model training method, device, equipment and storage medium
CN117633707A (en) * 2023-12-01 2024-03-01 深圳若愚科技有限公司 Fine-grained multi-mode Chinese large language model construction method and computer storage medium
CN117475037A (en) * 2023-12-13 2024-01-30 北京智源人工智能研究院 Instruction chain-based multi-attribute image editing method and device and electronic equipment
CN117709483A (en) * 2023-12-15 2024-03-15 成都考拉悠然科技有限公司 Iterative optimization method and system for multi-mode large language model
CN117669737B (en) * 2023-12-20 2024-04-26 中科星图数字地球合肥有限公司 Method for constructing and using large language model in end-to-end geographic industry
CN117494693B (en) * 2023-12-25 2024-03-15 广东省科技基础条件平台中心 Evaluation document generation method, device and equipment
CN117577120B (en) * 2024-01-17 2024-04-05 清华大学 Deep synthesis audio detection method, system and product combining large language model

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113539244A (en) * 2021-07-22 2021-10-22 广州虎牙科技有限公司 End-to-end speech recognition model training method, speech recognition method and related device
CN114298158A (en) * 2021-12-06 2022-04-08 湖南工业大学 Multi-mode pre-training method based on image-text linear combination
CN115309950A (en) * 2022-06-28 2022-11-08 昆明理工大学 Anti-noise multi-modal characterization method based on coarse-to-fine progressive cross-modal attention
CN115795009A (en) * 2022-11-24 2023-03-14 北京智谱华章科技有限公司 Cross-language question-answering system construction method and device based on generating type multi-language model
CN116843995A (en) * 2023-06-29 2023-10-03 江苏运动健康研究院 Method and device for constructing cytographic pre-training model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220114349A1 (en) * 2020-10-09 2022-04-14 Salesforce.Com, Inc. Systems and methods of natural language generation for electronic catalog descriptions

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113539244A (en) * 2021-07-22 2021-10-22 广州虎牙科技有限公司 End-to-end speech recognition model training method, speech recognition method and related device
CN114298158A (en) * 2021-12-06 2022-04-08 湖南工业大学 Multi-mode pre-training method based on image-text linear combination
CN115309950A (en) * 2022-06-28 2022-11-08 昆明理工大学 Anti-noise multi-modal characterization method based on coarse-to-fine progressive cross-modal attention
CN115795009A (en) * 2022-11-24 2023-03-14 北京智谱华章科技有限公司 Cross-language question-answering system construction method and device based on generating type multi-language model
CN116843995A (en) * 2023-06-29 2023-10-03 江苏运动健康研究院 Method and device for constructing cytographic pre-training model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Question answering system based on social relationships and best-answerer recommendation technology; Du Qing et al.; Journal of South China University of Technology (Natural Science Edition); Vol. 43, No. 1; pp. 132-139 *

Also Published As

Publication number Publication date
CN117094419A (en) 2023-11-21

Similar Documents

Publication Publication Date Title
CN117094419B (en) Multi-modal content output-oriented large language model training method, device and medium
JP7395686B2 (en) Image processing method, image processing model training method, device and storage medium
CN109996073B (en) Image compression method, system, readable storage medium and computer equipment
CN113987269A (en) Digital human video generation method and device, electronic equipment and storage medium
JP2022525897A (en) Methods and equipment for compression / decompression of neural network models
CN113886643A (en) Digital human video generation method and device, electronic equipment and storage medium
EP4388451A1 (en) Attention-based method for deep point cloud compression
Iliyasu Roadmap to talking quantum movies: a contingent inquiry
CN115019237A (en) Multi-modal emotion analysis method and device, electronic equipment and storage medium
CN116206314A (en) Model training method, formula identification method, device, medium and equipment
US20240171737A1 (en) System for training and deploying filters for encoding and decoding
CN113409803B (en) Voice signal processing method, device, storage medium and equipment
Zhou et al. A survey on generative ai and llm for video generation, understanding, and streaming
Zhu et al. Video snapshot: Single image motion expansion via invertible motion embedding
CN117152285A (en) Virtual person generating method, device, equipment and medium based on audio control
US20220301523A1 (en) Method and apparatus for efficient application screen compression
CN116528017A (en) Digital human video generation method and device, electronic equipment and storage medium
CN116975357A (en) Video generation method, device, electronic equipment, storage medium and program product
CN114333069B (en) Object posture processing method, device, equipment and storage medium
CN116168108A (en) Method and device for generating image through text, storage medium and electronic equipment
CN114677569A (en) Character-image pair generation method and device based on feature decoupling
CN112132915B (en) Diversified dynamic time-delay video generation method based on generation countermeasure mechanism
CN117980915A (en) Contrast learning and masking modeling for end-to-end self-supervised pre-training
US20220377342A1 (en) Video encoding and video decoding
CN111654706A (en) Video compression method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant