CN117631833A - Interactive perception method suitable for large language model and computer storage medium - Google Patents
Interactive perception method suitable for large language model and computer storage medium Download PDFInfo
- Publication number
- CN117631833A CN117631833A CN202311583817.1A CN202311583817A CN117631833A CN 117631833 A CN117631833 A CN 117631833A CN 202311583817 A CN202311583817 A CN 202311583817A CN 117631833 A CN117631833 A CN 117631833A
- Authority
- CN
- China
- Prior art keywords
- language model
- interactive
- image
- interactive perception
- large language
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 230000002452 interceptive effect Effects 0.000 title claims abstract description 67
- 230000008447 perception Effects 0.000 title claims abstract description 53
- 238000000034 method Methods 0.000 title claims abstract description 39
- 230000000007 visual effect Effects 0.000 claims abstract description 30
- 230000003993 interaction Effects 0.000 claims abstract description 17
- 230000004044 response Effects 0.000 claims abstract description 7
- 238000012549 training Methods 0.000 claims description 16
- 230000008846 dynamic interplay Effects 0.000 claims description 11
- 238000013507 mapping Methods 0.000 claims description 10
- 230000008569 process Effects 0.000 claims description 9
- 230000001953 sensory effect Effects 0.000 claims description 6
- 230000006870 function Effects 0.000 claims description 4
- 238000010276 construction Methods 0.000 abstract description 2
- 238000012986 modification Methods 0.000 description 5
- 230000004048 modification Effects 0.000 description 5
- 238000010586 diagram Methods 0.000 description 4
- 238000004590 computer program Methods 0.000 description 3
- 230000004075 alteration Effects 0.000 description 2
- 230000003044 adaptive effect Effects 0.000 description 1
- 238000013473 artificial intelligence Methods 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 239000003550 marker Substances 0.000 description 1
- 230000007246 mechanism Effects 0.000 description 1
- 238000003058 natural language processing Methods 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 230000001737 promoting effect Effects 0.000 description 1
Classifications
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application discloses an interactive perception method suitable for a large language model, and a computer storage medium, belonging to the field of computers. The method comprises the following steps: constructing an interactive perception network, wherein the interactive perception network is connected with large language models (LLMs) and its architecture comprises a plurality of modality encoding units and a plurality of linear projection layers. By constructing an interactive perception network suitable for large language models, the application effectively solves the problem that the prior art lacks such a network, realizes dynamic interactive perception, and enables the large language model to better execute human instructions; in particular, it enables large language models to integrate the visual information required by different queries. The application uses the interactive perception network to understand a human query, passes the corresponding request to a visual information interaction module, and generates a response based on interleaved multimodal information.
Description
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to an interactive perception method suitable for a large language model and a computer storage medium.
Background
The importance of large visual language models lies in their powerful multimodal (image and text) understanding and generation capabilities in the field of deep learning. Such models can not only understand text content, but also parse and generate information related to images, thereby enabling a higher level of intelligence and language understanding. This has great potential for applications such as natural language processing, computer vision, and multimodal dialog systems, allowing machines to understand and process a variety of information sources more fully and further promoting the development of the artificial intelligence field. Previous methods incorporate visual information into LLMs using a simple visual mapping network, in which image features are projected into the embedding space of the LLM through a linear layer. Such a mapping network projects image features only once and does not consider the interaction between the image and the human input query. The resulting visual information is therefore unrelated to the person's intention and may be insufficient for the LLM to produce a response that follows that intention, so it cannot be ensured that the LLM truly understands and reasonably interacts with the human instruction and the image input.
Disclosure of Invention
The interactive perception method and the computer storage medium suitable for a large language model provided by the application solve the problem that an interactive perception network suitable for large language models is lacking in the prior art, realize dynamic interactive perception, and enable the large language model to better execute human instructions. The application provides an interactive perception method suitable for a large language model, which comprises the following steps: constructing an interactive perception network, wherein the interactive perception network is connected with large language models LLMs and its architecture comprises a plurality of modality encoding units and a plurality of linear projection layers; obtaining an image input at one side of the interactive perception network; encoding the image with the first modality encoding unit to obtain a global image feature; the first linear projection layer is configured to map the global image feature into the language embedding space of the large language model LLMs; the large language model LLMs obtains a query sequence, outputs semantic information containing the query sequence, and maps it through the second linear projection layer into the second modality encoding unit to obtain an introspection output; the third linear projection layer is configured to decompose the global code corresponding to the global image feature of the image into fine-grained codes, obtaining fine-grained image features that are projected into the second modality encoding unit; the fine-grained image features and the introspection output interact dynamically to obtain dynamic interaction information, which is mapped through the fourth linear projection layer into the language embedding space of the large language model LLMs; and training the interactive perception network, so that the large language model LLMs performs dynamic interactive perception according to the dynamic interaction information, the query sequence and the global image feature.
In another aspect of the present application, there is also provided a computer storage medium storing a program for executing the above method, and a plug-and-play module can be implemented using the storage medium.
The technical scheme provided by the application has at least the following technical effects or advantages:
By constructing an interactive perception network suitable for large language models, the application effectively solves the problem that the prior art lacks such a network, realizes dynamic interactive perception, and enables the large language model to better execute human instructions; in particular, it enables large language models to integrate the visual information required by different queries. The application uses the interactive perception network to understand a human query, passes the corresponding request to a visual information interaction module, and generates a response based on interleaved multimodal information. The present application thereby avoids training a large-scale visual language model (LVLM) from scratch, which requires a significant amount of resources.
Drawings
FIG. 1 is a flow chart of an interactive perception method applicable to a large language model according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an interactive perception network applicable to a large language model according to an embodiment of the present application;
FIG. 3 is a diagram showing data types in an embodiment of the present application;
FIG. 4 is a schematic diagram of data types with descriptions in an embodiment of the present application.
Detailed Description
In order to better understand the above technical solutions, the following detailed description will refer to the accompanying drawings and specific embodiments.
The embodiment of the application provides an architecture of an interactive perception network suitable for a large language model:
the architecture of the interactive perception network 10 comprises a plurality of modal coding units 11, a plurality of linear projection layers 12 and a large language model LLMs20; in an exemplary embodiment, the interactive sensory network 10 includes the first modality encoding unit 111, the second modality encoding unit 112, and 4 linear projection layers 12;
the first modality encoding unit 111 is a visual modality encoder 1111, and the second modality encoding unit 112 is a text modality encoder 1121; the 4 linear projection layers 12 are respectively the first linear projection layer 121 to the fourth linear projection layer 124;
Based on the architecture of the interactive perception network 10, the embodiments of the present application provide an interactive perception method applicable to a large language model, including:
S1, constructing the interactive perception network 10, wherein the interactive perception network 10 is connected with the large language model LLMs 20. S2, one side of the interactive perception network 10 obtains an input image I; the image I is encoded by the first modality encoding unit to obtain the global image feature h_I; the first linear projection layer 121 is configured to map the global image feature h_I into the language embedding space of the large language model LLMs. S3, the large language model LLMs obtains a query sequence X and outputs semantic information h_r containing the query sequence, which is mapped through the second linear projection layer 122 into the second modality encoding unit to obtain the introspection output h_R. S4, the third linear projection layer 123 is configured to decompose the global code corresponding to the global image feature h_I of the image into fine-grained codes, obtaining the fine-grained image features ĥ_I, which are projected into the second modality encoding unit. S5, the fine-grained image features ĥ_I and the introspection output h_R interact dynamically to obtain the dynamic interaction information h_d; the dynamic interaction information h_d is mapped through the fourth linear projection layer 124 into the language embedding space of the large language model LLMs. S6, the interactive perception network 10 is trained, and the large language model LLMs 20 performs dynamic interactive perception according to the dynamic interaction information h_d, the query sequence X and the global image feature h_I.
in S2, the first mode encoding unit 111 is a pre-trained visual mode encoder 1111, and the image input is processed by the pre-trained visual mode encoder 1111 to obtain a global image feature h I At the same time, a special mark capable of learning is added in the word embedding list of LLMs<img>As an input position tag of image characteristics, another special mark is added at the tail of the image and query sequence<img-d>To capture the entire encoded information of the image and query. The first linear projection layer 121 in S2 projects the global image feature h I In the process of mapping and projecting the language embedding space of the large language model LLMs, the characteristic alignment mode is as follows:
f(h_I) = W·h_I + b

where f(·) is the feature-alignment function, h_I is the global image feature, and W and b are parameters. The semantic information h_r of the query sequence in S3 is mapped by the second linear projection layer into the second modality encoding unit to obtain the introspection output h_R; h_R can guide the modality encoding unit and the linear projection layers to attend to detailed information in the input. In S4, the global code corresponding to the global image feature of the image is decomposed into fine-grained codes to obtain the fine-grained image features ĥ_I, whose length is set to 5. The fine-grained image features ĥ_I and the introspection output h_R are fed into a frozen text encoder of CLIP connected to the second modality encoding unit to perform the interaction. The input sequence in S5 is (h_g, ĥ_I, h_R), where h_g is a global vector for capturing the entire multimodal interaction, initialized by the word vector of [SEP] in the text encoder's embedding table; for h_g, the output of the last Transformer layer is regarded as the resulting interactive visual information and is passed through a linear layer to the LLMs. The final output is denoted h_d. Thereafter, the new representation sequence (h_img, h_X, h_d) is fed into the LLMs to generate the final response. In S6, the interactive perception network is trained with a number of image-text pairs; the training target for a given image is Y = (y_1, y_2, …, y_N), where y_i denotes the i-th token and N is the total length. The loss is:

L = -∑_{i=1}^{N} log P(y_i | y_{<i}, I, X)

Multimodal instruction-following tuning is adopted during training.
In an exemplary embodiment, as shown in FIG. 2, given a picture I and a query sequence X = (x_1, x_2, …, x_m), where x_i denotes the i-th token of the text query entered into the LLMs, the global image feature h_I ∈ R^{d_v} is obtained by a pre-trained visual encoder (CLIP-ViT-L/14), where d_v denotes the hidden state size of the visual encoder. At the same time, a learnable special token <img> is added to the word embedding table of the LLMs as the input position tag of the image feature; it is used in the subsequent feature-alignment process.
First linear projection layer: in an exemplary embodiment, a linear projection layer is employed as the visual mapping network to project the global image features into the language embedding space.
The manner in which the features are aligned is as follows:
f(h_I) = W·h_I + b
where W ∈ R^{d_L×d_v} and b are learnable parameters, and d_L is the hidden state size of the LLMs. In this way, the LLMs obtain basic/global perceptual information of the image, which is added together with the representation of the token <img>. In addition, another special token <img-d> is added at the end of the image and query sequence to capture the entire encoded information of the image and query. The first input sequence of the LLMs can then be expressed as

(<img>, f(h_I), X, <img-d>) → (h_img, h_X, h_img-d)

where h_img refers to the superposition of f(h_I) and the representation of the token <img>, and h_X and h_img-d are the corresponding word-encoded representations of X and <img-d>. At the last layer of the LLMs, the final output of <img-d> contains the semantic information of the query sequence, denoted h_r ∈ R^{d_L}, which is produced through the self-attention mechanism of the LLMs. In an exemplary embodiment, however, it is believed that a text-only LLM that has not been pre-trained on multimodal data cannot fuse visual information as well as a powerful pre-trained multimodal image-and-language model. Therefore, in an exemplary embodiment, the interaction between the human query and the visual information is performed outside the LLM. Because the structure and parameters of the language model are not changed, the LLM can still maintain its original capabilities and generalization on natural language tasks.
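As a minimal sketch of this feature-alignment step, the projection f(h_I) = W·h_I + b can be written out directly. The dimensions and parameter values below are toy assumptions for illustration, not those of the actual embodiment (which uses CLIP-ViT features and the LLM's real embedding size):

```python
# Toy hidden sizes: d_v for the visual encoder, d_L for the LLMs (assumed values).
d_v, d_L = 4, 6

# "Learnable" parameters W (d_L x d_v) and b, fixed here for illustration.
W = [[0.1 * (i + j) for j in range(d_v)] for i in range(d_L)]
b = [0.0] * d_L

def feature_align(h_I):
    """f(h_I) = W·h_I + b: project a global image feature into the LLM embedding space."""
    return [sum(W[i][j] * h_I[j] for j in range(d_v)) + b[i] for i in range(d_L)]

h_I = [1.0, 0.5, -0.5, 2.0]   # stand-in for a global image feature
f_hI = feature_align(h_I)
assert len(f_hI) == d_L       # the result lives in the LLM's d_L-dimensional space
```

In the embodiment this output is added to the representation of the <img> token before being fed to the LLMs.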
Second linear projection layer: a linear projection layer is applied to map the implicit state h_r into the space of the underlying information interaction module, denoted h_R; in an exemplary embodiment this implicit state is also called the introspection output. For example, when the current picture contains a cat and a dog and a human command asks whether the dog in the picture has spots on its legs, the visual extractor and linear layer of traditional methods tend to extract global information during feature extraction and alignment, and the detailed feature "spots on the legs" is hard to take into account. After inference, however, the LLMs can find that the input information lacks "spots on the legs" or that the corresponding features are blurry, and then produce the introspection output h_R, through which the visual extractor and the secondary-alignment linear layer are directed to pay more attention to the spot details of the picture. Thus, after obtaining the request of the LLMs, in an exemplary embodiment it is proposed to use h_R to carry out multimodal information interaction with the image features. To this end, a pre-trained text encoder of CLIP-ViT/14 is employed as the backbone of the information interaction, since the text encoder has an embedding space similar to that of the image encoder.
Third linear projection layer: the global coding of the original image is decomposed into fine-grained codes by one linear layer. The length of the fine-grained image features is set to 5.
Fourth linear projection layer: then, in an exemplary embodiment, the fine-grained image features ĥ_I and the introspection output h_R are fed into the frozen text encoder of CLIP to perform the interaction. Specifically, the input sequence is (h_g, ĥ_I, h_R), where h_g is a global vector for capturing the entire multimodal interaction, initialized by the word vector of [SEP] in the text encoder's embedding table. For h_g, the output of the last Transformer layer is regarded as the resulting interactive visual information and is passed through a linear layer to the LLMs. The final output is denoted h_d. Thereafter, the new representation sequence (h_img, h_X, h_d) is fed into the LLMs to generate the final response.
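A hedged sketch of assembling the interaction input (h_g, ĥ_I, h_R) described here, with the frozen CLIP text encoder replaced by an identity stub and all sizes assumed rather than taken from the patent:

```python
# Assumed text-encoder hidden size; k = 5 is the fine-grained length from the text.
d_t, k = 8, 5

h_g = [0.0] * d_t                            # global vector ([SEP]-initialized in the real model)
h_I_fine = [[0.1] * d_t for _ in range(k)]   # fine-grained image features (3rd projection)
h_R = [0.2] * d_t                            # introspection output (2nd projection)

seq = [h_g] + h_I_fine + [h_R]               # sequence fed to the frozen text encoder
assert len(seq) == 1 + k + 1

def frozen_text_encoder(seq):
    """Stub: a real frozen CLIP text encoder would run self-attention here."""
    return seq  # identity stand-in

# The encoder output at h_g's position is taken as the interactive visual information,
# which would then pass through the fourth linear projection to the LLMs.
h_d = frozen_text_encoder(seq)[0]
assert len(h_d) == d_t
```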
In an exemplary embodiment, the linear projection layers in the interactive perception network are trained as follows:
first, multi-modal pre-training, the goal of this stage is to train the linear projection layer in feature alignment. In an exemplary embodiment, CC3M [1 ] is employed]、LAION-400M[2]、COCO Caption[3]And Flick3K 4]Tens of millions of pairs of image text issued. The total number of image text pairs is about 6900 ten thousand. The description (training object) of a given image is y= (Y) 1 ,y 2 ,…,y N ) Where yi represents the ith marker and N is the total length, the optimization penalty at this stage is as follows:
Second, multimodal instruction-following tuning: the IPN is made effective by tuning with various multimodal instruction-following data. First, in an exemplary embodiment, two types of image-text semantic matching data are constructed based on image-text pairs in the CC3M, COCO Caption, and Flickr30K datasets. As shown in FIG. 3, the two types are "true/false inference" and four-choice selection tasks, in which the captions are randomly sampled from the corresponding training set. In this way, the whole interactive perception network IPN can be trained to help and improve the LLMs in performing image-text semantic alignment. Second, to adapt the interactive perception network IPN to various human queries, multimodal instruction-following data about conversations and complex inference released by Liu et al. is introduced in an exemplary embodiment. Finally, considering that complex images contain an unbounded amount of visual information and possibly additional external knowledge, as shown in FIG. 4, data about detailed image descriptions is introduced in an exemplary embodiment to improve the multimodal long-text generation capability of the interactive perception network IPN; it includes the corresponding data from Liu et al. and the artwork description dataset SemArt. The total amount of all instruction data is about 7.3 million samples, including 7.1 million semantic matching samples, 20 thousand artwork analysis samples, and 150 thousand additional samples.
In summary, the embodiments of the present application provide a novel interactive perception network IPN, so that LLMs can integrate visual information required by different queries. Specifically, the interactive perception network IPN provides the LLMs with basic global information of the image, i.e., feature alignment; and simultaneously acquiring the request of the LLMs, executing the visual information interaction based on the request, and transmitting the visual information after the interaction to the LLMs.
The whole training process of the interactive perception network IPN is divided into two phases. Specifically, the first phase is a multimodal pre-training phase, allowing LLMs to obtain basic global information of an image; the second stage is a multi-mode instruction following tuning stage, and mainly enables the whole information interaction flow to be effective and adaptive to various queries.
With the above improvements, the LLMs understand a human query, pass the corresponding request to the request-based visual information interaction module, and generate responses based on interleaved multimodal information. In addition, training a large-scale visual language model (LVLM) from scratch requires a large amount of resources, which the present approach avoids. The embodiment of the application provides a computer storage medium storing a program of the interactive perception method for a large language model, i.e., a plug-and-play module for large language models (LLMs).
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
The foregoing is merely a preferred embodiment of the present application, but the scope of protection of the present application is not limited thereto; any changes or substitutions that would readily occur to a person skilled in the art within the technical scope disclosed herein shall fall within the scope of protection of the present application.
Claims (10)
1. An interactive perception method suitable for a large language model, comprising:
s1, constructing an interactive perception network, wherein the interactive perception network is connected with a large language model LLMs, and the architecture of the interactive perception network comprises a plurality of modal coding units and a plurality of linear projection layers;
s2, obtaining an image input at one side of the interactive perception network; encoding the image by using a first mode encoding unit to obtain global image characteristics; the first linear projection layer is configured to project the global image feature map into a language embedding space of the large language model LLMs;
s3, a large language model LLMs obtain a query sequence, the large language model LLMs output semantic information containing the query sequence and are mapped and projected into a second modal coding unit through a second linear projection layer to obtain introspection output;
s4, the third linear projection layer is configured to decompose global codes corresponding to global image features of the image into fine-grained codes, obtain fine-grained image features and project the fine-grained image features into a second modal coding unit;
s5, carrying out dynamic interaction between the fine-granularity image features and the introspection output to obtain dynamic interaction information; mapping and projecting the dynamic interaction information to a language embedding space of a large language model LLMs through a fourth linear projection layer;
s6, training the interactive perception network, and carrying out dynamic interactive perception on the large language model LLMs according to the dynamic interaction information, the query sequence and the global image characteristics.
2. The interactive perception method according to claim 1, wherein the architecture of the interactive perception network in S1 comprises at least one visual modality encoder, a text modality encoder, and four learnable linear projection layers.
3. The interactive perception method according to claim 2, wherein in S2, the first modality encoding unit is a pre-trained visual modality encoder, the image input is processed by the pre-trained visual modality encoder to obtain global image features, a learnable special mark < img > is added in the word embedding table of LLMs as an input position tag of the image features, and another special mark < img-d > is added at the end of the image and query sequence to capture the whole encoded information of the image and query.
4. The interactive perception method according to claim 1, wherein, in the process of the first linear projection layer in S2 mapping the global image feature into the language embedding space of the large language model LLMs, the features are aligned as follows:

f(h_I) = W·h_I + b

where f(·) is the feature-alignment function, h_I is the global image feature, and W and b are parameters.
5. The interactive perception method for a large language model according to claim 1, wherein the semantic information h_r of the query sequence in S3 is mapped by the second linear projection layer into the second modality encoding unit to obtain the introspection output h_R, and h_R can guide the modality encoding unit and the linear projection layers to attend to detailed information in the input.
6. The interactive perception method for a large language model according to claim 5, wherein the global code corresponding to the global image feature of the image in S4 is decomposed into fine-grained codes to obtain the fine-grained image features ĥ_I, whose length is set to 5.
7. The interactive perception method for a large language model according to claim 6, wherein the fine-grained image features ĥ_I and the introspection output h_R in S5 are fed into a frozen text encoder of CLIP connected to the second modality encoding unit to perform the interaction.
8. The interactive perception method applicable to a large language model according to claim 7, wherein the input sequence in S5 is (h_g, ĥ_I, h_R), where h_g is a global vector for capturing the entire multimodal interaction, initialized by the word vector of [SEP] in the text encoder's embedding table; for h_g, the output of the last Transformer layer is regarded as the resulting interactive visual information and is passed through a linear layer to the LLMs; the final output is denoted h_d; thereafter, a new representation sequence (h_img, h_X, h_d) is fed into the LLMs to generate the final response.
9. The interactive perception method for a large language model according to claim 1, wherein the interactive perception network is trained in S6 using several image-text pairs; the training target of a given image is Y = (y_1, y_2, …, y_N), where y_i denotes the i-th token and N is the total length; the loss is L = -∑_{i=1}^{N} log P(y_i | y_{<i}, I, X); multimodal instruction-following tuning is adopted during training.
10. A computer storage medium storing a program capable of executing the interactive perception method applicable to a large language model according to any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311583817.1A CN117631833A (en) | 2023-11-24 | 2023-11-24 | Interactive perception method suitable for large language model and computer storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311583817.1A CN117631833A (en) | 2023-11-24 | 2023-11-24 | Interactive perception method suitable for large language model and computer storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117631833A true CN117631833A (en) | 2024-03-01 |
Family
ID=90024634
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311583817.1A Pending CN117631833A (en) | 2023-11-24 | 2023-11-24 | Interactive perception method suitable for large language model and computer storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117631833A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190339769A1 (en) * | 2018-05-01 | 2019-11-07 | Dell Products, L.P. | Gaze-activated voice services for interactive workspaces |
WO2020081314A1 (en) * | 2018-10-15 | 2020-04-23 | Ancestry.Com Operations Inc. | Image captioning with weakly-supervised attention penalty |
CN116484217A (en) * | 2023-04-17 | 2023-07-25 | 云南元矩阵科技有限公司 | Intelligent decision method and system based on multi-mode pre-training large model |
- 2023-11-24: CN application CN202311583817.1A filed; publication CN117631833A; status: Pending
Non-Patent Citations (2)
Title |
---|
Y. Li et al.: "A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports", 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 13 January 2021, pages 1994-2004 * |
Abudoukelimu Abulizi et al.: "A Survey of Extension Models of Pre-trained Language Models", Computer Science, vol. 49, no. 11, 15 November 2022, pages 210800125-1 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20230177343A1 (en) | Scene understanding and generation using neural networks | |
CN113762322B (en) | Video classification method, device and equipment based on multi-modal representation and storage medium | |
CN110796111B (en) | Image processing method, device, equipment and storage medium | |
US11769018B2 (en) | System and method for temporal attention behavioral analysis of multi-modal conversations in a question and answer system | |
CN110476173B (en) | Hierarchical device placement with reinforcement learning | |
CN111699497B (en) | Fast decoding of sequence models using discrete latent variables | |
CN111897940B (en) | Visual dialogue method, training method, device and equipment for visual dialogue model | |
CN113039555B (en) | Method, system and storage medium for classifying actions in video clips | |
CN111737434A (en) | Generating automated assistant responses and/or actions directly from conversation histories and resources | |
WO2021212601A1 (en) | Image-based writing assisting method and apparatus, medium, and device | |
CN116721334B (en) | Training method, device, equipment and storage medium of image generation model | |
CN111651573B (en) | Intelligent customer service dialogue reply generation method and device and electronic equipment | |
CN110795549A (en) | Short text conversation method, device, equipment and storage medium | |
CN115129839A (en) | Visual dialogue answer generation method and device based on graph perception | |
CN116541492A (en) | Data processing method and related equipment | |
CN114328943A (en) | Question answering method, device, equipment and storage medium based on knowledge graph | |
CN116250022A (en) | Neural network for achieving attention on object embedding for object-centric visual reasoning | |
CN117634459A (en) | Target content generation and model training method, device, system, equipment and medium | |
CN117746186A (en) | Training method of low-rank adaptive model, text image generation method and system | |
CN117437317A (en) | Image generation method, apparatus, electronic device, storage medium, and program product | |
CN112597777A (en) | Multi-turn dialogue rewriting method and device | |
CN117453880A (en) | Multi-mode data processing method and device, electronic equipment and storage medium | |
CN116958738A (en) | Training method and device of picture recognition model, storage medium and electronic equipment | |
CN112132075A (en) | Method and medium for processing image-text content | |
CN117631833A (en) | Interactive perception method suitable for large language model and computer storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||