CN117633707A

CN117633707A - Fine-grained multi-mode Chinese large language model construction method and computer storage medium

Info

Publication number: CN117633707A
Application number: CN202311630540.3A
Authority: CN
Inventors: 孙腾
Original assignee: Shenzhen Royole Technologies Co Ltd
Current assignee: Shenzhen Royole Technologies Co Ltd
Priority date: 2023-12-01
Filing date: 2023-12-01
Publication date: 2024-03-01
Anticipated expiration: 2043-12-01
Also published as: CN117633707B

Abstract

The application discloses a fine-grained multi-mode Chinese large language model construction method and a computer storage medium, belonging to the field of computers, wherein the method comprises the following steps: planning the architecture of a fine-grained multi-mode Chinese large language model, wherein the fine-grained multi-mode Chinese large language model comprises a multi-mode information extraction and fusion module, a core large language model and a multi-mode content generation module; the method and the device have the advantages that the fine-grained multi-mode Chinese large language model comprises a multi-mode information extraction and fusion module, a core large language model and a multi-mode content generation module, so that the Chinese large language model is used as a central connection understanding and generation module of a model system, a series of multi-mode content understanding and content generation tasks can be executed according to user instructions, and compared with the current multi-mode large model technology, the method and the device have the advantages of few pseudoscopy problems, multiple extensible functions, low training cost, deep understanding of complex multi-mode scenes and the like.

Description

Fine-grained multi-mode Chinese large language model construction method and computer storage medium

Technical Field

The application relates to the field of computers, in particular to a fine-granularity multi-mode Chinese large language model construction method and a computer storage medium thereof.

Background

With the explosive growth of digital media, we have generated a large amount of multimodal data in our daily lives. Multimodal content is emerging in the fields of social media, online video, virtual reality, and augmented reality applications, which contain a variety of information such as text, images, sound, and video. Therefore, it becomes critical to develop techniques that can understand and analyze these multimodal data. At present, a large language model for Chinese has become a research hotspot, but due to the problems of data resources and the like, the prior art has certain defects in multi-modal integration and model generalization, and the unavoidable pseudoscopic problem of the large model restricts the application of the model, and the pseudoscopic phenomenon of the large model occurs due to the fact that the granularity of the extracted multi-modal information is not fine enough and the multi-modal information is not well aligned.

Disclosure of Invention

The method for constructing the fine-grained multi-mode Chinese large language model and the computer storage medium thereof solve the problem that the large language model still has certain defects in multi-mode integration and model generalization in the prior art, and the unavoidable pseudoscopic problem restricts the application of the model; the capability of providing a large language model on multi-modal integration and model generalization is realized; reducing pseudoscopic problems and improving applicability. The application provides a fine-grained multi-mode Chinese large language model construction method, which comprises the following steps: planning the architecture of a fine-grained multi-mode Chinese large language model, wherein the fine-grained multi-mode Chinese large language model comprises a multi-mode information extraction and fusion module, a core large language model and a multi-mode content generation module; constructing a multi-mode information extraction and fusion module; constructing a core large language model; constructing a multi-mode content generation module; training and tuning.

In another aspect of the present application, there is also provided a computer storage medium storing a program for executing the above fine-grained multi-modal chinese large language model construction method.

The technical scheme provided by the application has at least the following technical effects or advantages:

the method and the device have the advantages that the fine-grained multi-mode Chinese large language model comprises a multi-mode information extraction and fusion module, a core large language model and a multi-mode content generation module, so that the Chinese large language model is used as a central connection understanding and generation module of a model system, a series of multi-mode content understanding and content generation tasks can be executed according to user instructions, and compared with the current multi-mode large model technology, the method and the device have the advantages of few pseudoscopy problems, multiple extensible functions, low training cost, deep understanding of complex multi-mode scenes and the like.

Drawings

FIG. 1 is a flowchart of a method for building a fine-grained multi-modal Chinese large language model in accordance with an embodiment of the present application;

FIG. 2 is a flowchart of a multi-modal information extraction and fusion module constructed in the fine-grained multi-modal Chinese large language model construction method according to an embodiment of the application;

FIG. 3 is a flowchart of a method for constructing a core large language model in a fine-grained multi-modal Chinese large language model construction method according to an embodiment of the application;

FIG. 4 is a flowchart of a multi-modal content generation module constructed in the fine-grained multi-modal Chinese large language model construction method according to an embodiment of the application.

Detailed Description

In order to better understand the above technical solutions, the following detailed description will refer to the accompanying drawings and specific embodiments.

In order to solve the technical problems involved in the background technology, three innovations are respectively made in the traditional multi-mode large model architecture in the embodiment of the application: 1. in the aspect of feature extraction, we propose a multi-modal information extraction and fusion method, which extracts fine-grained multi-modal high-level semantic features by means of SIA-CNN models, and simultaneously adopts a multi-modal high-level semantic extractor MMSSE model to extract and align multi-modal high-level semantics; 2. in the aspect of a text large model, a voice-meaning integrated large language model training method is adopted to train the model, so that the text large model has better compatibility when reasoning aligned multi-modal information; 3. in the aspect of multi-modal output, we propose a decoupled multi-modal content generation method, so that a model can accurately convert high-level multi-modal semantics after reasoning into appointed multi-modal output;

referring to fig. 1-4, an embodiment of the present application provides a method for constructing a fine-grained multi-mode chinese large language model, including S1, planning a framework of a fine-grained multi-mode chinese large language model, where the fine-grained multi-mode chinese large language model includes a multi-mode information extraction and fusion module, a core large language model, and a multi-mode content generation module; s2, constructing a multi-mode information extraction and fusion module; s3, constructing a core large language model; s4, constructing a multi-mode content generation module; s5, training and optimizing;

the method in an exemplary embodiment of the present application is composed of three major parts, namely a multi-modal information extraction and fusion method, a phonetic sense integrated large language model training method and a decoupled multi-modal content generation method:

aiming at the defect that the large voice model has certain defects in multi-modal integration and model generalization, unavoidable pseudoscopic problems restrict the applicable problem of the model, in an exemplary embodiment, S1, the architecture of a fine-grained multi-modal Chinese large language model is planned, and the fine-grained multi-modal Chinese large language model comprises a multi-modal information extraction and fusion module, a core large language model and a multi-modal content generation module;

s2: the multi-mode information extraction and fusion module is composed of a fine-grain visual feature extractor and a multi-mode high-level semantic extractor. The fine-grained visual feature extractor consists of Vision Transformer (ViT) and an autonomously developed spatial information adapter SIA-CNN based on a convolutional neural network; and ViT, fully interacting and fusing the global visual features extracted by ViT with the local visual features extracted by SIA-CNN to obtain fine-grained visual representation of the visual input signals. The MMSSE model of the multi-mode high-level semantic extractor extracts and aligns semantic information on the fine-grain visual representation according to a user instruction to obtain fine-grain multi-mode high-level semantic features, the multi-mode information can be represented on a plurality of fine-grain layers, and the understanding capability of the multi-mode large model on input is improved.

In an exemplary embodiment, the step S2 multi-modal information extraction and fusion method includes: s11: a fine-grained visual feature extractor was first constructed, using Vision Transformer (ViT) as a visual modality extraction base.

S22: the spatial information adapter SIA-CNN is added in ViT, each SIA-CNN layer consists of a 3-segment residual convolutional neural network, and each SIA-CNN layer is inserted in each layer of ViT transformers. The reasoning result of SIA-CNN is projected downwards to a smaller dimension through a multi-layer convolution neural network, then through a layer of nonlinear activation function, and then projected upwards to the original dimension through a multi-layer transposition convolution neural network. In addition, a residual connection is arranged between the input and the output of the whole SIA-CNN, so as to strengthen the robustness of the model structure. The newly added parameters of the module account for about 17% of the original model, so that ViT can fully extract depth and space information of the picture, and the representation capability of the output vector of the ViT model on fine granularity information is enhanced.

S23: the MMSSE model of the multi-mode high-level semantic extractor is a double-tower model formed by two 7-layer transformers, wherein one 7-layer transformer single-tower input is instruction information input by a user, and the other 7-layer transformer single-tower input is a learned random instruction vector and multi-mode information input; through the image text comparison learning task, the image text generation task, the image text matching task and the difficult negative sample mining strategy joint training, the multi-mode high-level semantic extractor can extract semantic information of the fine-grain visual representation according to user instructions, and fine-grain multi-mode high-level semantic features guided by the user instructions are obtained.

S24: the multi-mode high-level semantic features of the multi-mode high-level semantic extractor are subjected to linear transformation through a linear layer and mapped into the semantic space of the text, so that a large model can understand multi-mode information and perform reasoning.

S3: the large language model provides reasoning and understanding capability, is the core of a multi-mode large model architecture, has the integral parameter reaching 11B, and enhances the understanding capability on central and multi-mode information by adopting unified representation of sound, shape and meaning as input. The ultra-large scale data set used by the large language model is subjected to seven rounds of multi-dimensional and multi-level deep cleaning, and has the total character number of 1500B.

In an exemplary embodiment, the step S3 of the method for training a large-scale phonetic meaning model includes: s31: constructing a pre-training basic data set, collecting and arranging 40GB small red book data, 120G headline news data, 50G WeChat public number data, 110G known data, 50G encyclopedia data, 70G webpage data, 420G MNBVC Chinese data, 200G code data and 900G English data, constructing an automatic cleaning script, and cleaning the data according to specified rules.

S32: constructing a large language model architecture, adopting a decoder-only structure, using 28 layers of transformers as a model body, using rotary position coding as a coding scheme for maintaining the result of the model in a long context, assuming that the inner product operation between the query vector and the key vector can be represented by a function g', having

<f _q (x _m ,m),f _k (x _n ,n)≥g′(x _m ,x _n M-n), where f _q (x _m M) is a query vector; f (f) _k (x _n N) is a key vector; function g ^′ () Is word embedding vector x _m ，x _n And the relative position m-n between them, an equivalent position coding method satisfying the condition can be constructed so that the above formula can still be established, with

f _q (x _m ,m)＝(W _q x _m )e ^imθ

f _k (x _n ,n)＝(W _k x _n )e ^inθ

g′(x _m ,x _n ,m-n)＝Re[(W _q x _m )(W _k x _n )e ^i(m-n)θ ]

Based on the geometric meaning of complex multiplication, the transformation actually corresponds to the rotation of the vector, and is therefore called rotary position coding. Meanwhile, in the stage of embading, the input of the large language model uses the vector representation of three in one of sound, shape and meaning, the three features of sound, shape and meaning are deeply fused, and then unified training is carried out, so that the large language model can understand multi-mode information more deeply.

S33: the multi-card cluster integrated training is realized by using a deep training framework to accelerate model training, and in order to solve the problem of display card faults in a long-time training scene, a communication node hierarchical management mode is adopted to divide a computational power scheduling scheme: and setting part of server nodes as communication nodes, and collecting logs and values of subordinate computing forces of the server nodes in the calculation process, wherein the logs and values are responsible for unifying and collecting the subordinate computing forces of the server nodes, when part of the subordinate computing forces of a certain communication node have faults, the communication node can try to wake up a fault display card, and if the communication node fails to try for many times, the result is directly saved and the training is stopped, so that the subordinate computing forces of other communication nodes are not influenced.

S4: the multi-mode content generating module consists of a plurality of downstream sub-models, comprises an image generating model based on the StableDiffuse model framework and an audio generating model based on the EnCodec model framework, and the downstream models can directly use hidden vectors of the large language model to generate appointed content in a joint training mode, so that the expandability of the multi-mode large model framework to downstream tasks is enhanced.

In an exemplary embodiment, the step S4 may be a process of the method for generating multi-modal content, which includes:

s41: based on the large language model combined training StableDiffuse model, freezing the parameters of the large predictive model in the training process, using an alignment model of a 5-layer transformation structure to align the hidden vector in the large language model to the condition vector space in the StableDiffuse model, and mutually decoupling the StableDiffuse model and the large language model, so that the large language model can regulate and control the generation result of the StableDiffuse model at any time.

S42: based on the large language model joint training EnCodec audio generation model, the parameters of the large predictive model are frozen in the training process, a 3-layer perceptron structure is used for carrying out nonlinear transformation on hidden vectors of the large language model, and the output result is superimposed on the hidden space characteristics of the EnCodec audio generation model according to weight, so that the conversion from text output to audio output is realized.

It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.

It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

The foregoing is merely a preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art, within the scope of the present application, should apply to the present application, and all equivalents and modifications as fall within the scope of the present application.

Claims

1. A fine-grained multi-mode Chinese large language model construction method is characterized by comprising the following steps:

s1, planning a framework of a fine-grained multi-mode Chinese large language model, wherein the fine-grained multi-mode Chinese large language model comprises a multi-mode information extraction and fusion module, a core large language model and a multi-mode content generation module;

s2, constructing a multi-mode information extraction and fusion module;

s3, constructing a core large language model;

s4, constructing a multi-mode content generation module;

s5, training and optimizing.

2. The method for constructing a fine-grained multi-modal chinese large language model according to claim 1, wherein S2 comprises the steps of:

s21: constructing a fine-grained visual feature extractor, and using Vision Transformer as a visual mode extraction base;

s22: adding a spatial information adapter based on a convolutional neural network into Vision Transformer, and inserting a layer of spatial information adapter into a transducer layer of Vision Transformer;

s23: the multimode high-level semantic extractor is a double-tower model formed by two 7-layer transformers, wherein one 7-layer transformer single tower input is instruction information input by a user, and the other 7-layer transformer single tower input is a learned random instruction vector and multimode information input; through the image text comparison learning task, the image text generation task, the image text matching task and the difficult negative sample mining strategy joint training, the multi-mode high-level semantic extractor can extract semantic information of the fine-grain visual representation according to user instructions, and fine-grain multi-mode high-level semantic features guided by the user instructions are obtained;

3. The method for constructing a fine-grained multi-modal chinese large language model according to claim 2, wherein the spatial information adapter result in S22 projects features downward to a smaller dimension through a multi-layer convolutional neural network, then through a layer of nonlinear activation function, and then upward to an original dimension through a multi-layer transposed convolutional neural network.

4. The method for constructing a fine-grained multi-modal chinese large language model according to claim 2, wherein a residual connection is further provided between the input and the output of the entire spatial information adapter in S22, so as to enhance the robustness of the model structure.

5. The method for constructing a fine-grained multi-modal chinese large language model according to claim 1, wherein S3 comprises the steps of:

s31: constructing a pre-training basic data set, collecting and arranging a plurality of internet data, constructing an automatic cleaning script, and cleaning the data according to a specified rule;

s32: constructing a core large language model architecture, adopting a decoder-only structure, and using 28 transformer layers as a model main body;

s33: multi-card cluster integration training, model training is accelerated using a deep training framework.

6. The method for constructing a fine-grained multi-modal chinese large language model according to claim 5, wherein in S33, the power-calculation scheduling scheme is divided by adopting a hierarchical management manner of communication nodes: and setting part of server nodes as communication nodes, and collecting logs and values of subordinate computing forces of the communication nodes in the calculation process, wherein the logs and the values are responsible for unifying and collecting the subordinate computing forces of the communication nodes, when part of the subordinate computing forces of the communication nodes have faults, the communication nodes try to wake up the fault display card, and if the communication nodes fail to try for many times, the result is directly saved and the training is stopped.

7. The fine-grained multi-modal chinese large language model construction method of claim 5, wherein the S32 uses rotary position coding as the coding scheme.

8. The method for constructing a fine-grained multi-modal chinese large language model according to claim 5, wherein in step S32, the input of the core large language model uses three-in-one vector representation of sound, shape and meaning, the three features of sound, shape and meaning are deeply fused, and then unified training is performed, so that the large language model can understand multi-modal information more deeply.

9. The method for constructing a fine-grained multi-modal chinese large language model according to claim 1, wherein S4 comprises the steps of: s41: based on the large language model combined training Stable Diffusion model, freezing parameters of the large prediction model in the training process, and using an alignment model of a 5-layer transformation structure to align hidden vectors in the large language model into a condition vector space in the Stable Diffusion model, so that the large language model can regulate and control the generation result of the Stable Diffusion model at any time; s42: based on the large language model joint training EnCodec audio generation model, the parameters of the large predictive model are frozen in the training process, a 3-layer perceptron structure is used for carrying out nonlinear transformation on hidden vectors of the large language model, and the output result is superimposed on the hidden space characteristics of the EnCodec audio generation model according to weight, so that the conversion from text output to audio output is realized.

10. A computer storage medium storing a program capable of executing the fine-grained multi-modal chinese large language model construction method according to any one of claims 1 to 9.