CN117409431B - Multi-mode large language model training method, electronic equipment and storage medium

Multi-mode large language model training method, electronic equipment and storage medium

Info

Publication number
CN117409431B
CN117409431B CN202311412797.1A
Authority
CN
China
Prior art keywords
image
feature vector
sample
text
vector set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311412797.1A
Other languages
Chinese (zh)
Other versions
CN117409431A (en)
Inventor
罗引
郝艳妮
陈博
马先钦
徐楠
曹家
王磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhongke Wenge Technology Co ltd
Original Assignee
Beijing Zhongke Wenge Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongke Wenge Technology Co., Ltd.
Priority to CN202311412797.1A
Publication of CN117409431A
Application granted
Publication of CN117409431B
Legal status: Active
Anticipated expiration

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/146 Aligning or centring of the image pick-up or image-field
    • G06V30/147 Determination of region of interest
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/1918 Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a multi-modal large language model training method, an electronic device and a storage medium, relating to the field of computer technology. The method comprises: training an image-text alignment model with a first training sample set to obtain a trained image-text alignment model, and training a large language model with a second training sample set. Each first training sample pair comprises a first image sample and a corresponding original text, and the first image sample contains only natural images. The second training sample set comprises a plurality of second training sample pairs, each comprising a second image sample and a corresponding question-answer pair text; a target detection frame is annotated in the second image sample, and the second image samples at least comprise documents, tables, charts and natural images. The trained model can understand different kinds of charts and document data, can accurately locate regions in an image, and unlocks more diverse multi-modal capabilities.

Description

Multi-mode large language model training method, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer technology, and in particular, to a method for training a multi-modal large language model, an electronic device, and a storage medium.
Background
Large language models such as ChatGPT, BLOOM and LLaMA have attracted wide attention for their strong ability to generate and understand text. These models can be further tuned by instructions to follow user intent, showing strong interaction capability and the potential to improve productivity as intelligent assistants. However, large language models handle only plain text and lack the ability to process image, speech and video modalities, which greatly limits their scope of application. To break this limitation, multi-modal large language models (MLLMs) such as MiniGPT-4, LLaVA, mPLUG-Owl and Qwen-VL use large language models (LLMs) as language decoders, aiming to equip large language models with the ability to perceive and understand visual signals; they exhibit significant zero-shot ability on various open vision-and-language tasks and remarkable ability in many fields. These multi-modal large language models are typically trained in two stages: the first stage aligns text and images, and the second stage improves generalization through instruction tuning. Because MLLMs are not explicitly trained on visual-text understanding datasets, they still struggle to understand the complex relationships between visual text and objects in different types of images, such as charts and documents.
To explore intelligent understanding of charts and documents by multi-modal large models, OCR-dependent methods and OCR-free methods are mainly used. The OCR-dependent methods include LLaVAR, which enhances the ability of the visual instruction model LLaVA (Large Language and Vision Assistant) to read text in images by using rich-text images from the LAION dataset together with collected noisy instruction data, and augments both the pretraining and fine-tuning stages of LLaVA with these two sets of data. Since LLaVAR is annotated with publicly available OCR tools, the model can only recognize picture data containing characters and performs poorly on OCR-free documents. The OCR-free methods mainly include Donut, which trains an OCR-free Transformer in an end-to-end manner and can easily be extended to multiple languages by using the SynthDoG generator. To enable the model to understand diverse chart data, the multi-modal dialogue generation model mPLUG-DocOwl was proposed for OCR-free document understanding; it jointly trains the model through a unified instruction-tuning strategy on pure-language, general vision-and-language, and self-constructed document instruction datasets, thereby enhancing OCR-free document understanding capability. However, this model cannot accurately locate the OCR region in a picture while understanding chart data. For this reason, UReader, which accomplishes a wide range of visually grounded language understanding tasks through one unified instruction format, is used to accurately locate OCR regions in a picture.
While significant improvements have been achieved by designing various model architectures to encode more powerful multi-modal features or to obtain more accurate alignment strategies, these methods usually employ frozen language or vision models, and model performance is limited by the restricted number of trainable parameters or by flaws in the training strategy. Current multi-modal large models therefore find it difficult to understand different kinds of charts and document data and lack the ability to accurately locate regions in pictures, so more diverse multi-modal capabilities still cannot be unlocked.
Disclosure of Invention
Aiming at the technical problems, the invention adopts the following technical scheme:
The embodiment of the invention provides a multi-modal large language model training method, wherein the multi-modal large language model at least comprises a large language model and an image-text alignment model, and the method comprises the following steps:
S100, acquiring a first training sample set and a second training sample set; the first training sample set is an image-text pair data set and comprises a plurality of first training sample pairs, and each first training sample pair comprises a first image sample and a corresponding original text; the first image sample includes only natural images; the second training sample set comprises a plurality of second training sample pairs, each second training sample pair comprises a second image sample and a corresponding question-answer pair text, a target detection frame is arranged in the second image sample, and the second image sample at least comprises a document, a table, a chart and a natural image.
S200, preprocessing the first training sample pair to obtain a corresponding first image feature vector set and a first text feature vector set, and preprocessing the second training sample pair to obtain a corresponding second image feature vector set and a corresponding second text feature vector set.
S300, respectively compressing the first image feature vector set and the second image feature vector set to obtain a corresponding first image compression feature vector set and second image compression feature vector set.
And S400, training the image-text alignment model based on the first image compression feature vector set and the first text feature vector set to obtain a trained image-text alignment model.
S500, inputting the second image compression feature vector set and the second text feature vector set into the trained image-text alignment model to obtain corresponding image-text pairing information.
And S600, training the large language model based on the second image compression feature vector set and the image-text pairing information to obtain a trained large language model.
S700, obtaining a trained multi-modal large language model based on the trained image-text alignment model and the trained large language model.
Embodiments of the present invention also provide a non-transitory computer readable storage medium having stored therein at least one instruction or at least one program loaded and executed by a processor to implement the foregoing method.
The embodiment of the invention also provides an electronic device comprising a processor and the non-transitory computer readable storage medium.
The invention has at least the following beneficial effects:
According to the embodiments of the invention, the image-text alignment model is trained with image-text alignment data, and after that training is finished the large language model is trained with the document, chart, table and natural image datasets, so that the trained model can understand different kinds of charts and document data, can accurately locate regions in an image, and unlocks more diverse multi-modal capabilities.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a multi-modal large language model training method provided by an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
The embodiment of the invention provides a multi-modal large language model training method, as shown in fig. 1, which can comprise the following steps:
S100, acquiring a first training sample set and a second training sample set; the first training sample set is an image-text pair data set and comprises a plurality of first training sample pairs, and each first training sample pair comprises a first image sample and a corresponding original text, namely an image-text pair; the first image sample includes only natural images; the second training sample set comprises a plurality of second training sample pairs, each second training sample pair comprises a second image sample and a corresponding question-answer pair text, a target detection frame is arranged in the second image sample, and the second image sample at least comprises a document, a table, a chart and a natural image.
In an embodiment of the present invention, the first training sample set may be obtained based on the LAION-400M, COYO-700M and Google Conceptual Captions (CC3M) datasets, wherein the CC3M dataset contains 3.3M image-text pairs, the LAION-400M dataset contains 400M image-text pairs, and the COYO-700M dataset contains 747M image-text pairs.
Those skilled in the art will appreciate that all of these datasets can be used to train the initial multi-modal large language model; alternatively, 80% of the data can be extracted for training the model, 10% for testing the model and 10% for validating the model.
In an embodiment of the present invention, the second training sample set includes a document dataset, a table dataset, a chart dataset, and a natural image dataset. The document dataset uses the DocVQA, InfographicsVQA and DeepForm datasets, where the DocVQA dataset contains 50k question-answer pairs from the UCSF Industry Documents Library, the InfographicsVQA dataset contains 30k question-answer pairs generated from 5k infographics collected on the internet, and the DeepForm dataset contains 1.1k election-related documents. The table dataset uses the WikiTableQuestions and TabFact datasets, where the WikiTableQuestions dataset contains 2.1k table pictures from Wikipedia with 23k generated question-answer pairs. The chart dataset uses the ChartQA dataset, which contains 21k chart images and 32k question-answer pairs. The natural image dataset uses the TextVQA dataset, which uses 28k text-filtered natural images from OpenImages V3 with 45k annotated question-answer pairs.
In an embodiment of the invention, the question-answer pair text has a uniform data format, for example, defined as the "Human: { question } Chatbot: { answer }" format.
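A minimal sketch of this serialization is given below, assuming a question-answer pair stored with illustrative field names "question" and "answer"; only the "Human: ... Chatbot: ..." template itself comes from the text above.

```python
# Minimal sketch: serialize a question-answer pair into the unified
# "Human: {question} Chatbot: {answer}" text format described above.
def format_qa_pair(question: str, answer: str) -> str:
    return f"Human: {question} Chatbot: {answer}"

# Example usage (hypothetical sample):
sample = {"question": "What is the total revenue in the table?",
          "answer": "1,245 thousand USD"}
print(format_qa_pair(sample["question"], sample["answer"]))
# -> Human: What is the total revenue in the table? Chatbot: 1,245 thousand USD
```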
In this embodiment of the present invention, the target detection frame in the second image sample may be placed on the corresponding target object based on actual needs. For the natural image dataset, the corresponding target object may be an entity in the image, for example an animal, a building, an object, or the like. For the document, chart and table datasets, the corresponding target object may be a selected region of letters or numbers, or the like.
Because the target detection frame is marked on the image sample, the trained model has the capability of accurately locating regions in the image.
S200, preprocessing the first training sample pair to obtain a corresponding first image feature vector set and a first text feature vector set, and preprocessing the second training sample pair to obtain a corresponding second image feature vector set and a corresponding second text feature vector set.
In the embodiment of the invention, feature extraction may be performed on the text data in the training samples by the large language model. The large language model first performs word segmentation on each received text to obtain a corresponding token set, then encodes each token to obtain a corresponding feature vector, and finally combines the feature vectors of the tokens corresponding to each text to obtain the text feature vector set of that text.
In an exemplary embodiment of the present invention, the large language model may employ the Qwen-7B model, which has strong extensibility to multiple languages and high training and inference efficiency.
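A hedged sketch of this text preprocessing is shown below: the text is tokenized with the language model's tokenizer and the token embeddings are taken as the text feature vector set. The Hugging Face checkpoint id "Qwen/Qwen-7B" and the use of the input-embedding table are assumptions of this sketch; the text above only states that the large language model segments the text and encodes each segment.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint id; Qwen-7B requires trust_remote_code on the Hugging Face hub.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)
llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-7B", trust_remote_code=True)

def text_feature_vectors(text: str) -> torch.Tensor:
    token_ids = tokenizer(text, return_tensors="pt").input_ids   # word segmentation into tokens
    with torch.no_grad():
        embeddings = llm.get_input_embeddings()(token_ids)       # one feature vector per token
    return embeddings.squeeze(0)                                 # shape [num_tokens, hidden_dim]
```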
In the embodiment of the invention, the image feature vector set of any image sample is obtained by the following steps:
S10, cropping the image sample to obtain at least one corresponding sub-image.
Because images containing text have different aspect ratios and resolutions, simply resizing an image can cause the text in the picture to become blurred, distorted and unrecognizable. Therefore, to avoid the text-pixel distortion caused by resizing, the invention provides a shape-adaptive image cropping method. Specifically, grid shapes and sizes are preset first, then a suitable grid is selected according to the size of the image, and finally the grid is overlaid on the original image to crop it into several sub-images.
In the embodiment of the present invention, the preset grid shapes may include first to eighth grid shapes, wherein the first grid shape comprises one unit grid; the second grid shape comprises two unit grids connected in sequence in the vertical direction; the third grid shape comprises two unit grids connected in sequence in the horizontal direction; the fourth grid shape comprises three unit grids connected in sequence in the vertical direction; the fifth grid shape comprises two grid groups symmetrically arranged and connected in the vertical direction, each grid group of the fifth grid shape comprising two unit grids connected in sequence in the horizontal direction; the sixth grid shape comprises two grid groups symmetrically arranged and connected in the vertical direction, each grid group of the sixth grid shape comprising three unit grids connected in sequence in the horizontal direction; the seventh grid shape comprises two grid groups symmetrically arranged and connected in the horizontal direction, each grid group of the seventh grid shape comprising three unit grids connected in sequence in the vertical direction; and the eighth grid shape comprises three grid groups connected in sequence in the vertical direction, each grid group of the eighth grid shape comprising three unit grids connected in sequence in the horizontal direction.
The size of each unit grid may be set based on actual needs, and the present invention is not particularly limited in this respect.
Specifically, S10 may include the following steps (a code sketch of the grid-selection logic is given after this list):
If |k-1| ≤ k0, i.e. the aspect ratio of the image sample is approximately equal to 1, and W < W1, the image sample is cropped using the first grid shape, i.e. the image sample is not cropped; where k is the aspect ratio, k = W/H, W and H are the width and height of the image sample respectively, and k0 is a set value, which may be 0 or a positive number very close to 0. W1 is a first set width threshold, which may be a custom value, preferably 224 pixels.
If |k-2| ≤ k0, i.e. the aspect ratio of the image sample is approximately equal to 2, the image sample is cropped using the third grid shape, i.e. the currently processed image is cut along a vertical line (running in the up-down direction of the image) into two identical sub-images.
If |k-0.5| ≤ k0, i.e. the aspect ratio of the image sample is approximately equal to 0.5, the image sample is cropped using the second grid shape, i.e. the currently processed image is cut along a horizontal line (running in the left-right direction of the image) into two identical sub-images.
If 0.5 < k < 1, i.e. the aspect ratio of the image sample is between 0.5 and 1, and W < W2, the image sample is cropped using the sixth grid shape, i.e. the image sample is cropped into 6 sub-images arranged 3 x 2; W2 is a second set width threshold, W2 > W1; W2 may be a custom value, for example 2000 pixels.
If 0.5 < k < 1 and W2 < W, the image sample is cropped using the seventh grid shape, i.e. the image sample is cropped into 6 sub-images arranged 2 x 3.
If |k-1| ≤ k0 and W3 < W ≤ W2, the image sample is cropped using the fifth grid shape, i.e. the image sample is cropped into four sub-images whose common corner is the centre point of the image; W3 is a third set width threshold, W3 > W1, preferably 1000 pixels.
If |k-1| ≤ k0 and W > W2, the image sample is cropped using the eighth grid shape, i.e. the image sample is cropped into 9 sub-images.
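The sketch below expresses the grid shapes as (rows, columns) and implements the selection rules above, using the thresholds suggested in the text (k0 close to 0, W1 = 224, W2 = 2000, W3 = 1000 pixels). The fallback behaviour for cases the rules do not cover is an assumption of this sketch, not part of the method as stated.

```python
from PIL import Image

K0, W1, W2, W3 = 1e-3, 224, 2000, 1000   # k0 very close to 0; width thresholds from the text

def select_grid(width: int, height: int) -> tuple:
    k = width / height                           # aspect ratio k = W / H
    if abs(k - 1) <= K0:
        if width > W2:
            return (3, 3)                        # eighth grid shape: 9 sub-images
        if width > W3:
            return (2, 2)                        # fifth grid shape: 4 sub-images
        return (1, 1)                            # first grid shape for W < W1; (1, 1) fallback otherwise is an assumption
    if abs(k - 2) <= K0:
        return (1, 2)                            # third grid shape: left/right halves
    if abs(k - 0.5) <= K0:
        return (2, 1)                            # second grid shape: top/bottom halves
    if 0.5 < k < 1:
        return (2, 3) if width < W2 else (3, 2)  # sixth / seventh grid shapes
    return (1, 1)                                # assumption: leave uncovered cases uncropped

def crop_to_subimages(image: Image.Image) -> list:
    rows, cols = select_grid(*image.size)
    w, h = image.size
    cell_w, cell_h = w // cols, h // rows
    return [image.crop((c * cell_w, r * cell_h, (c + 1) * cell_w, (r + 1) * cell_h))
            for r in range(rows) for c in range(cols)]
```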
S12, performing block processing on each sub-image to obtain a plurality of corresponding image blocks.
S14, carrying out image feature coding on each image block to obtain a corresponding image feature vector; and obtaining an image characteristic vector set corresponding to the image sample.
In S14, each image block may be image-feature encoded using a frozen image encoder to improve the training efficiency of the model. In one exemplary embodiment, the frozen image encoder may be the CLIP ViT-L/14-336 model.
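A hedged sketch of S12/S14 is shown below: sub-images are encoded with a frozen CLIP vision encoder. The Hugging Face checkpoint id "openai/clip-vit-large-patch14-336" is an assumption standing in for the CLIP ViT-L/14-336 encoder named above; in ViT-style encoders the patch embedding itself performs the block partition of S12, so the per-patch outputs serve as the image feature vectors.

```python
import torch
from transformers import CLIPImageProcessor, CLIPVisionModel

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
encoder.requires_grad_(False)   # frozen image encoder: no gradient updates
encoder.eval()

def encode_subimages(subimages) -> torch.Tensor:
    """subimages: a list of PIL images produced by the cropping step."""
    pixels = processor(images=subimages, return_tensors="pt").pixel_values
    with torch.no_grad():
        patch_features = encoder(pixels).last_hidden_state   # [num_subimages, num_patches + 1, dim]
    return patch_features
```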
S300, respectively compressing the first image feature vector set and the second image feature vector set to obtain a corresponding first image compression feature vector set and second image compression feature vector set.
In order to realize more efficient image-text alignment and to solve the efficiency problem caused by long image feature sequences, the invention compresses the image features. The compression adopts a point quantization method: points in the feature space are mapped to a number of cluster centers by the k-means algorithm, and each point is then represented by its cluster center, thereby compressing the vectors (a code sketch of the full compression procedure is given after step S27).
Specifically, in S300, the image compression feature vector set corresponding to any one image feature vector set is obtained by the following steps:
s20, preprocessing each image feature vector set to obtain a corresponding preprocessed image feature vector set.
In embodiments of the invention, the preprocessing may include feature rotation and principal component analysis, where the rotation reduces the information loss caused by quantization and the principal component analysis reduces the data dimensionality.
S21, uniformly segmenting each preprocessed image feature vector set into x feature segments, and carrying out quantization processing on each feature segment to obtain a corresponding feature quantization result.
In the embodiment of the present invention, x may be a custom value and may be an even number that evenly divides the dimension of the image feature vector, such as 2, 4, 6, 8 or 10. It follows that the dimension of each feature segment is 1/x of the dimension of the image feature vector. In embodiments of the present invention, a quantizer may be used to quantize each feature segment. Those skilled in the art will recognize that any method of quantizing each feature segment with a quantizer to obtain a corresponding feature quantization result falls within the scope of the present invention.
S22, clustering the feature quantization results corresponding to the ith feature segment of the image feature vector set corresponding to all the image samples by using a KMeans clustering method to obtain corresponding k clustering centers; i has a value of 1 to x.
In the embodiment of the present invention, the value of k may be an empirical value, for example, 3 to 5. Those skilled in the art know that any method for clustering feature quantization results corresponding to the ith feature segment of all currently received image feature vectors by using KMeans clustering method to obtain corresponding k cluster centers belongs to the protection scope of the present invention.
S23, r=1; i=1.
S24, obtaining the distances between the feature quantization result corresponding to the i-th feature segment F_ri in the r-th image feature vector set and the corresponding k cluster centers, to obtain the corresponding k distances D_ri^1 to D_ri^k; the value of r is 1 to m, and m is the number of image feature vector sets corresponding to all the image samples.
It is known to those skilled in the art that any method of obtaining the distance between the feature quantification result and the cluster center falls within the scope of the present invention.
S25, obtaining the cluster center corresponding to the minimum distance min(D_ri^1, …, D_ri^k) as the quantized code of F_ri and using the quantized code as the image compression feature vector of F_ri; min() denotes taking the minimum value.
In an embodiment of the present invention, the quantized code of each feature segment may be 128 uint8 numbers.
S26, setting i=i+1, and if i is less than or equal to x, executing S24; otherwise, S27 is performed.
S27, setting r=r+1; if r is less than or equal to m, executing S24; otherwise, exiting the current control program and obtaining the image compression feature vector set corresponding to the r-th image feature vector set based on the image compression feature vectors of F_r1 to F_rx, the image compression feature vector set being a feature library of x × 128.
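The sketch below illustrates the compression procedure of S20 to S27 under stated assumptions: scikit-learn is used for PCA and k-means, the segment count x and cluster count k are illustrative, and each segment is reduced to a single cluster index per sample, which is a simplification relative to the 128-number codes mentioned above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def compress_feature_sets(features: np.ndarray, x: int = 8, k: int = 4) -> np.ndarray:
    """features: [m, d] array, one pooled image feature vector per image sample (assumes min(m, d) >= x)."""
    n_comp = (min(features.shape) // x) * x              # S20: keep a dimensionality divisible by x
    reduced = PCA(n_components=n_comp).fit_transform(features)
    segments = np.split(reduced, x, axis=1)              # S21: x equal feature segments per sample
    codes = np.zeros((features.shape[0], x), dtype=np.uint8)
    for i, seg in enumerate(segments):                   # S22: k-means over the i-th segment of all samples
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(seg)
        codes[:, i] = km.labels_                         # S24/S25: index of the nearest cluster center as the code
    return codes                                         # one compressed code vector per image sample

# Example usage with random data standing in for pooled image features:
codes = compress_feature_sets(np.random.randn(100, 256).astype(np.float32))
```

Representing each point by its nearest cluster center is what shortens the image feature sequence handed to the alignment model, which is the stated motivation for this step.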
And S400, training the image-text alignment model based on the first image compression feature vector set and the first text feature vector set to obtain a trained image-text alignment model.
In the embodiment of the invention, the image-text alignment model is used to bridge the gap between text features and image features, so that multi-modal training of the model is realized while the vision model and large language model parameters are frozen. The image-text alignment model can extract from the image feature vectors the visual representation information most relevant to the text, namely the image-text pairing information.
In order to realize fusion between images and text, a trainable Q-Former module is used to bridge the gap between image and text features, and vision-to-language generative learning is realized by connecting the output of the Q-Former to the language model; the goal is to have the language model interpret the visual representation information output by the Q-Former.
In particular, the image-text alignment model (Q-Former) may comprise an image processing sub-model and a text processing sub-model, which share a self-attention layer. The image processing sub-model processes the image feature vector set and extracts visual representation information based on the image feature vectors; the text processing sub-model processes the text feature vector set and performs both text encoding and text decoding. The two sub-models interact through the shared self-attention layer, which helps the image-text alignment model learn the visual representation information related to the input text.
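A very simplified sketch of the Q-Former idea is given below: a set of learnable query vectors attends to the (compressed) image features via cross-attention, and the resulting queries are the visual representation information handed to the language model. The shared self-attention between the image and text sub-models described above is not reproduced here, and all dimensions are illustrative assumptions.

```python
import torch
from torch import nn

class TinyQFormer(nn.Module):
    def __init__(self, num_queries: int = 32, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)   # learnable query tokens
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: [batch, num_image_tokens, dim]
        q = self.queries.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        q = q + self.self_attn(q, q, q, need_weights=False)[0]                       # queries interact with each other
        q = q + self.cross_attn(q, image_feats, image_feats, need_weights=False)[0]  # queries attend to image features
        return q + self.ffn(q)   # visual representation information passed on to the language model
```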
Specifically, S400 may include:
S401, inputting the first training sample data of the current batch into a current image-text alignment model to obtain a corresponding image-text pairing prediction result; the first training sample data includes a first set of image compression feature vectors and a first set of text feature vectors;
S402, calculating the current loss function value of the current model from the current image-text pairing prediction result and the corresponding ground-truth image-text pairing result, and judging whether the current loss function value meets the preset model training end condition; if so, obtaining the trained image-text alignment model and exiting the current control program; otherwise, executing S403.
In embodiments of the present invention, the loss function value may be calculated based on an existing loss function, such as a cross entropy loss function. The preset model training ending condition can be set based on actual needs. For example, the loss is less than or equal to a set loss threshold and remains unchanged for a set period of time.
S403, updating parameters of the current image-text alignment model based on the current loss function value, and taking the first image compression feature vector set and the first text feature vector set of the next batch as first training sample data of the current batch, and executing S401.
It can be appreciated that the parameters of the large language model are not updated during the training of the image-text alignment model.
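The sketch below illustrates the S401 to S403 loop under stated assumptions: only the alignment model is optimized while the large language model stays frozen, and training stops once the loss reaches a preset threshold. The matching_head module, the DataLoader yielding (image features, text features, pairing labels), and the cross-entropy pairing loss are illustrative assumptions; the text above does not fix a specific loss.

```python
import torch
from torch import nn

def train_alignment(q_former, llm, matching_head, loader, loss_threshold=0.05, epochs=10):
    for p in llm.parameters():                 # large language model parameters are not updated in this stage
        p.requires_grad_(False)
    params = list(q_former.parameters()) + list(matching_head.parameters())
    optimizer = torch.optim.AdamW(params, lr=1e-4)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for image_feats, text_feats, pair_labels in loader:      # S401: current batch of first training sample data
            logits = matching_head(q_former(image_feats), text_feats)
            loss = criterion(logits, pair_labels)                 # S402: current loss function value
            if loss.item() <= loss_threshold:                     # preset training-end condition (simplified)
                return q_former
            optimizer.zero_grad()
            loss.backward()                                       # S403: update only the alignment model parameters
            optimizer.step()
    return q_former
```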
S500, inputting the second image compression feature vector set and the second text feature vector set into the trained image-text alignment model to obtain corresponding image-text pairing information.
And S600, training the large language model based on the second image compression feature vector set and the image-text pairing information to obtain a trained large language model.
The initial multi-modal large language model obtained after the first training stage retains a large amount of knowledge and can provide reasonable answers to people's queries. However, to develop a more general visually grounded language understanding model that can process various types of images and perform different understanding tasks, low-cost instruction tuning of the multi-modal large model is required. Without introducing any large-scale pre-training dataset, multiple downstream datasets are directly integrated and jointly trained to train the large language model.
Further, S600 may specifically include:
S601, inputting the second training sample data of the current batch into the current large language model to obtain a corresponding prediction result; the second training sample data comprises the second image compression feature vector set and the corresponding image-text pairing information. The prediction result comprises the position coordinates of the localization region corresponding to the target detection frame and the answer to the question corresponding to the image sample (a sketch of one possible serialization of such a training target is given after S603).
S602, calculating the current loss function value of the current model from the currently obtained prediction result and the corresponding ground-truth result, and judging whether the current loss function value meets the preset model training end condition; if so, obtaining the trained large language model and exiting the current control program; otherwise, executing S603.
In embodiments of the present invention, the loss function value may be calculated based on an existing loss function, such as a cross entropy loss function. The preset model training ending condition can be set based on actual needs. For example, the loss is less than or equal to a set loss threshold and remains unchanged for a set period of time.
And S603, updating parameters of the current large language model based on the current loss function value, and taking the second image compression feature vector set and the image-text pairing information of the next batch as second training sample data of the current batch to execute S601.
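As referenced in S601, one way such a training target could be serialized is sketched below: the ground-truth string combines the answer text with the coordinates of the target detection frame, so the language model learns both to answer the question and to localize the region. The "<box>...</box>" tag format and the normalized-coordinate convention are assumptions for illustration; the text above only states that the prediction contains the box position and the answer.

```python
# Hypothetical target serialization combining the answer with the detection-frame coordinates.
def build_target(answer: str, box: tuple) -> str:
    x1, y1, x2, y2 = box     # normalized (x1, y1, x2, y2) corners of the target detection frame
    return f"{answer} <box>({x1:.3f},{y1:.3f}),({x2:.3f},{y2:.3f})</box>"

print(build_target("The 2020 revenue is 1,245.", (0.12, 0.30, 0.58, 0.41)))
# -> The 2020 revenue is 1,245. <box>(0.120,0.300),(0.580,0.410)</box>
```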
S700, obtaining a trained multi-modal large language model based on the trained image-text alignment model and the trained large language model, and taking the trained multi-modal large language model as a final multi-modal large language model.
According to the multi-modal large language model training method provided by the embodiment of the invention, the image-text alignment model is trained with image-text alignment data, and after that training is finished the large language model is trained with the document, chart, table and natural image datasets, so that in practical application scenarios the trained model can understand different kinds of charts and document data, can accurately locate regions in an image, and can unlock more diverse multi-modal capabilities.
According to embodiments of the present invention, the present invention also provides an electronic device, a readable storage medium and a computer program product.
In an exemplary embodiment, an electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in the above embodiments.
In an exemplary embodiment, the readable storage medium may be a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method according to the above embodiment.
In an exemplary embodiment, the computer program product comprises a computer program which, when executed by a processor, implements the method according to the above embodiments.
Electronic devices are intended to represent various forms of user terminals, various forms of digital computers, such as desktop computers, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
In one exemplary embodiment, the electronic device may include a computing unit that may perform various suitable actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) or a computer program loaded from a storage unit into a Random Access Memory (RAM). In the RAM, various programs and data required for the operation of the device may also be stored. The computing unit and the RAM are connected to each other by a bus. An input/output (I/O) interface is also connected to the bus.
Further, a plurality of components in the electronic device are connected to the I/O interface, including: an input unit such as a keyboard, a mouse, etc.; an output unit such as various types of displays, speakers, and the like; a storage unit such as a magnetic disk, an optical disk, or the like; and communication units such as network cards, modems, wireless communication transceivers, and the like. The communication unit allows the device to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing units include, but are not limited to, Central Processing Units (CPUs), Graphics Processing Units (GPUs), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, Digital Signal Processors (DSPs), and any suitable processors, controllers, microcontrollers, and the like. The computing unit performs the respective methods and processes described above, such as the multi-modal large language model training method. For example, in some embodiments, the multi-modal large language model training method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device via the ROM and/or the communication unit. When the computer program is loaded into the RAM and executed by the computing unit, one or more steps of the multi-modal large language model training method described above may be performed. Alternatively, in other embodiments, the computing unit may be configured to perform the multi-modal large language model training method by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present invention may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution disclosed in the present invention can be achieved, and are not limited herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (9)

1. A multi-modal large language model training method is characterized in that the multi-modal large language model at least comprises a large language model and an image-text alignment model, and the method comprises the following steps:
s100, acquiring a first training sample set and a second training sample set; the first training sample set is an image-text pair data set and comprises a plurality of first training sample pairs, and each first training sample pair comprises a first image sample and a corresponding original text; the first image sample includes only natural images; the second training sample set comprises a plurality of second training sample pairs, each second training sample pair comprises a second image sample and a corresponding question-answer pair text, wherein a target detection frame is arranged in the second image sample, and the second image sample at least comprises a document, a table, a chart and a natural image;
S200, preprocessing the first training sample pair to obtain a corresponding first image feature vector set and a first text feature vector set, and preprocessing the second training sample pair to obtain a corresponding second image feature vector set and a corresponding second text feature vector set;
s300, respectively compressing the first image feature vector set and the second image feature vector set to obtain a corresponding first image compression feature vector set and second image compression feature vector set;
S400, training the image-text alignment model based on the first image compression feature vector set and the first text feature vector set to obtain a trained image-text alignment model;
S500, inputting the second image compression feature vector set and the second text feature vector set into the trained image-text alignment model to obtain corresponding image-text pairing information;
S600, training a large language model based on the second image compression feature vector set and the image-text pairing information to obtain a trained large language model;
S700, obtaining a trained multi-mode large language model based on the trained image-text alignment model and the trained large language model;
in S300, the image compression feature vector set corresponding to any one image feature vector set is obtained by:
S20, preprocessing each image feature vector set to obtain a corresponding preprocessed image feature vector set;
S21, uniformly segmenting each preprocessed image feature vector set into x feature fragments, and carrying out quantization processing on each feature fragment to obtain a corresponding feature quantization result;
S22, clustering the feature quantization results corresponding to the ith feature segment of the image feature vector set corresponding to all the image samples by using a KMeans clustering method to obtain corresponding k clustering centers; i has a value of 1 to x;
s23, r=1; i=1;
S24, obtaining the distances between the feature quantization result corresponding to the i-th feature segment F_ri in the r-th image feature vector set and the corresponding k cluster centers, to obtain the corresponding k distances D_ri^1 to D_ri^k; the value of r is 1 to m, and m is the number of image feature vector sets corresponding to all the image samples;
S25, obtaining the cluster center corresponding to the minimum distance min(D_ri^1, …, D_ri^k) as the quantized code of F_ri and using the quantized code as the image compression feature vector of F_ri;
S26, setting i=i+1, and if i is less than or equal to x, executing S24; otherwise, S27 is performed;
S27, setting r=r+1; if r is less than or equal to m, executing S24; otherwise, exiting the current control program and obtaining the image compression feature vector set corresponding to the r-th image feature vector set based on the image compression feature vectors of F_r1 to F_rx.
2. The method according to claim 1, wherein the set of image feature vectors for any one image sample is obtained by:
S10, cutting the image sample to obtain at least one corresponding sub-image;
S12, performing block processing on each sub-image to obtain a plurality of corresponding image blocks;
S14, carrying out image feature coding on each image block to obtain a corresponding image feature vector; and obtaining an image characteristic vector set corresponding to the image sample.
3. The method according to claim 2, wherein S10 specifically comprises: if |k-1| ≤ k0 and W < W1, cropping the image sample using the first grid shape; where k is the aspect ratio, k = W/H, W and H are the width and height of the image sample respectively, k0 is a set value, and W1 is a first set width threshold;
if |k-2| ≤ k0, cropping the image sample using the third grid shape;
if |k-0.5| ≤ k0, cropping the image sample using the second grid shape;
if 0.5 < k < 1 and W < W2, cropping the image sample using the sixth grid shape; W2 is a second set width threshold; W2 > W1;
if 0.5 < k < 1 and W2 < W, cropping the image sample using the seventh grid shape;
if |k-1| ≤ k0 and W3 < W ≤ W2, cropping the image sample using the fifth grid shape; W3 is a third set width threshold, W3 > W1;
if |k-1| ≤ k0 and W > W2, cropping the image sample using the eighth grid shape;
The first grid shape comprises one unit grid, and the second grid shape comprises two unit grids connected in sequence in the vertical direction; the third grid shape comprises two unit grids connected in sequence in the horizontal direction; the fourth grid shape comprises three unit grids connected in sequence in the vertical direction; the fifth grid shape comprises two grid groups symmetrically arranged and connected in the vertical direction, and each grid group of the fifth grid shape comprises two unit grids connected in sequence in the horizontal direction; the sixth grid shape comprises two grid groups symmetrically arranged and connected in the vertical direction, and each grid group of the sixth grid shape comprises three unit grids connected in sequence in the horizontal direction; the seventh grid shape comprises two grid groups symmetrically arranged and connected in the horizontal direction, and each grid group of the seventh grid shape comprises three unit grids connected in sequence in the vertical direction; the eighth grid shape comprises three grid groups connected in sequence in the vertical direction, and each grid group of the eighth grid shape comprises three unit grids connected in sequence in the horizontal direction.
4. The method of claim 1, wherein the image-text alignment model is a Q-Former.
5. The method according to claim 2, wherein in S14, each image block is image feature encoded using a frozen image encoder.
6. The method of claim 5, wherein the frozen image encoder is a CLIP ViT-L/14-336 model.
7. The method of claim 1, wherein the large language model is a Qwen-7B model.
8. A non-transitory computer readable storage medium having stored therein at least one instruction or at least one program, wherein the at least one instruction or the at least one program is loaded and executed by a processor to implement the method of any one of claims 1-7.
9. An electronic device comprising a processor and the non-transitory computer readable storage medium of claim 8.
CN202311412797.1A 2023-10-27 2023-10-27 Multi-mode large language model training method, electronic equipment and storage medium Active CN117409431B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311412797.1A CN117409431B (en) 2023-10-27 2023-10-27 Multi-mode large language model training method, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311412797.1A CN117409431B (en) 2023-10-27 2023-10-27 Multi-mode large language model training method, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN117409431A CN117409431A (en) 2024-01-16
CN117409431B true CN117409431B (en) 2024-04-26

Family

ID=89490456

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311412797.1A Active CN117409431B (en) 2023-10-27 2023-10-27 Multi-mode large language model training method, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117409431B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114970721A (en) * 2022-05-26 2022-08-30 北京有竹居网络技术有限公司 Training method and device of multi-language multi-mode pre-training model and electronic equipment
CN116484224A (en) * 2023-04-23 2023-07-25 平安科技(深圳)有限公司 Training method, device, medium and equipment for multi-mode pre-training model
CN116861995A (en) * 2023-07-10 2023-10-10 京东科技信息技术有限公司 Training of multi-mode pre-training model and multi-mode data processing method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113656613A (en) * 2021-08-20 2021-11-16 北京百度网讯科技有限公司 Method for training image-text retrieval model, multi-mode image retrieval method and device


Also Published As

Publication number Publication date
CN117409431A (en) 2024-01-16


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant