CN117631833A - Interactive perception method suitable for large language model and computer storage medium - Google Patents
Interactive perception method suitable for large language model and computer storage medium Download PDFInfo
- Publication number
- CN117631833A CN117631833A CN202311583817.1A CN202311583817A CN117631833A CN 117631833 A CN117631833 A CN 117631833A CN 202311583817 A CN202311583817 A CN 202311583817A CN 117631833 A CN117631833 A CN 117631833A
- Authority
- CN
- China
- Prior art keywords
- language model
- interactive
- image
- interactive perception
- large language
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 230000002452 interceptive effect Effects 0.000 title claims abstract description 67
- 230000008447 perception Effects 0.000 title claims abstract description 53
- 238000000034 method Methods 0.000 title claims abstract description 39
- 230000000007 visual effect Effects 0.000 claims abstract description 30
- 230000003993 interaction Effects 0.000 claims abstract description 17
- 230000004044 response Effects 0.000 claims abstract description 7
- 238000012549 training Methods 0.000 claims description 16
- 230000008846 dynamic interplay Effects 0.000 claims description 11
- 238000013507 mapping Methods 0.000 claims description 10
- 230000008569 process Effects 0.000 claims description 9
- 230000001953 sensory effect Effects 0.000 claims description 6
- 230000006870 function Effects 0.000 claims description 4
- 238000010276 construction Methods 0.000 abstract description 2
- 238000012986 modification Methods 0.000 description 5
- 230000004048 modification Effects 0.000 description 5
- 238000010586 diagram Methods 0.000 description 4
- 238000004590 computer program Methods 0.000 description 3
- 230000004075 alteration Effects 0.000 description 2
- 230000003044 adaptive effect Effects 0.000 description 1
- 238000013473 artificial intelligence Methods 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 239000003550 marker Substances 0.000 description 1
- 230000007246 mechanism Effects 0.000 description 1
- 238000003058 natural language processing Methods 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 230000001737 promoting effect Effects 0.000 description 1
Classifications
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application discloses an interactive perception method suitable for a large language model, and a computer storage medium, belonging to the field of computers. The method comprises the following steps: constructing an interactive perception network, wherein the interactive perception network is connected with large language models (LLMs) and its architecture comprises a plurality of modality encoding units and a plurality of linear projection layers. By constructing an interactive perception network suitable for large language models, the application effectively solves the problem that the prior art lacks such a network, realizes dynamic interactive perception, and enables the large language model to better execute human instructions; in particular, it enables large language models to integrate the visual information required by different queries. The application uses the interactive perception network to understand a human query, passes the corresponding request to a visual information interaction module, and generates a response based on interleaved multimodal information.
Description
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to an interactive perception method suitable for a large language model and a computer storage medium.
Background
The importance of large visual language models lies in their powerful multimodal (image and text) understanding and generation capabilities in the field of deep learning. Such models can not only understand text content, but also parse and generate information related to images, thereby enabling a higher level of intelligence and language understanding. This has great potential for applications such as natural language processing, computer vision, and multimodal dialog systems, allowing machines to understand and process a variety of information sources more fully and further promoting the development of the artificial intelligence field. Previous methods incorporate visual information into LLMs using a simple visual mapping network, in which image features are projected into the embedding space of the LLM through a linear layer. Such a mapping network projects image features only once and does not consider the interaction between the image and the human input query. The resulting visual information is therefore unrelated to the person's intention and may be insufficient for the LLM to produce a response that follows that intention, so it cannot be ensured that the LLM truly understands and reasonably interacts with the human instruction and the image input.
Disclosure of Invention
The interactive perception method and the computer storage medium suitable for a large language model provided by the application solve the problem that an interactive perception network suitable for large language models is lacking in the prior art, realize dynamic interactive perception, and enable the large language model to better execute human instructions. The application provides an interactive perception method suitable for a large language model, which comprises the following steps: constructing an interactive perception network, wherein the interactive perception network is connected with large language models LLMs and its architecture comprises a plurality of modality encoding units and a plurality of linear projection layers; obtaining an image input at one side of the interactive perception network; encoding the image with the first modality encoding unit to obtain a global image feature; the first linear projection layer is configured to map the global image feature into the language embedding space of the large language model LLMs; the large language model LLMs obtains a query sequence, outputs semantic information containing the query sequence, and maps it through the second linear projection layer into the second modality encoding unit to obtain an introspection output; the third linear projection layer is configured to decompose the global code corresponding to the global image feature of the image into fine-grained codes, obtaining fine-grained image features that are projected into the second modality encoding unit; the fine-grained image features and the introspection output interact dynamically to obtain dynamic interaction information, which is mapped through the fourth linear projection layer into the language embedding space of the large language model LLMs; and training the interactive perception network, so that the large language model LLMs performs dynamic interactive perception according to the dynamic interaction information, the query sequence and the global image feature.
In another aspect of the present application, there is also provided a computer storage medium storing a program for executing the above method, and a plug-and-play module can be implemented using the storage medium.
The technical scheme provided by the application has at least the following technical effects or advantages:
By constructing an interactive perception network suitable for large language models, the application effectively solves the problem that the prior art lacks such a network, realizes dynamic interactive perception, and enables the large language model to better execute human instructions; in particular, it enables large language models to integrate the visual information required by different queries. The application uses the interactive perception network to understand a human query, passes the corresponding request to a visual information interaction module, and generates a response based on interleaved multimodal information. The present application thereby avoids training a large-scale visual language model (LVLM) from scratch, which requires a significant amount of resources.
Drawings
FIG. 1 is a flow chart of an interactive perception method applicable to a large language model according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an interactive perception network applicable to a large language model according to an embodiment of the present application;
FIG. 3 is a diagram showing data types in an embodiment of the present application;
FIG. 4 is a schematic diagram of data types with descriptions in an embodiment of the present application.
Detailed Description
In order to better understand the above technical solutions, the following detailed description will refer to the accompanying drawings and specific embodiments.
The embodiment of the application provides an architecture of an interactive perception network suitable for a large language model:
the architecture of the interactive perception network 10 comprises a plurality of modal coding units 11, a plurality of linear projection layers 12 and a large language model LLMs20; in an exemplary embodiment, the interactive sensory network 10 includes the first modality encoding unit 111, the second modality encoding unit 112, and 4 linear projection layers 12;
the first modality encoding unit 111 is a visual modality encoder 1111, and the second modality encoding unit 112 is a text modality encoder 1121; the 4 linear projection layers 12 are respectively the first linear projection layer 121 to the fourth linear projection layer 124;
Based on the architecture of the interactive perception network 10, the embodiments of the present application provide an interactive perception method applicable to a large language model, including:
S1, constructing the interactive perception network 10, wherein the interactive perception network 10 is connected with the large language model LLMs 20. S2, one side of the interactive perception network 10 obtains an input image I; the image I is encoded by the first modality encoding unit to obtain the global image feature h_I; the first linear projection layer 121 is configured to map the global image feature h_I into the language embedding space of the large language model LLMs. S3, the large language model LLMs obtains a query sequence X and outputs semantic information h_r containing the query sequence, which is mapped through the second linear projection layer 122 into the second modality encoding unit to obtain the introspection output h_R. S4, the third linear projection layer 123 is configured to decompose the global code corresponding to the global image feature h_I of the image into fine-grained codes, obtaining the fine-grained image features ĥ_I, which are projected into the second modality encoding unit. S5, the fine-grained image features ĥ_I and the introspection output h_R interact dynamically to obtain the dynamic interaction information h_d; the dynamic interaction information h_d is mapped through the fourth linear projection layer 124 into the language embedding space of the large language model LLMs. S6, the interactive perception network 10 is trained, and the large language model LLMs 20 performs dynamic interactive perception according to the dynamic interaction information h_d, the query sequence X and the global image feature h_I.
in S2, the first mode encoding unit 111 is a pre-trained visual mode encoder 1111, and the image input is processed by the pre-trained visual mode encoder 1111 to obtain a global image feature h I At the same time, a special mark capable of learning is added in the word embedding list of LLMs<img>As an input position tag of image characteristics, another special mark is added at the tail of the image and query sequence<img-d>To capture the entire encoded information of the image and query. The first linear projection layer 121 in S2 projects the global image feature h I In the process of mapping and projecting the language embedding space of the large language model LLMs, the characteristic alignment mode is as follows:
f(h_I) = W·h_I + b

where f(·) is the feature-alignment function, h_I is the global image feature, and W and b are parameters. The semantic information h_r of the query sequence in S3 is mapped by the second linear projection layer into the second modality encoding unit to obtain the introspection output h_R; h_R can guide the modality encoding unit and the linear projection layers to attend to detailed information in the input. In S4, the global code corresponding to the global image feature of the image is decomposed into fine-grained codes to obtain the fine-grained image features ĥ_I, whose length is set to 5. The fine-grained image features ĥ_I and the introspection output h_R are fed into a frozen text encoder of CLIP connected to the second modality encoding unit to perform the interaction. The input sequence in S5 is (h_g, ĥ_I, h_R), where h_g is a global vector for capturing the entire multimodal interaction, initialized by the word vector of [SEP] in the text encoder's embedding table; for h_g, the output of the last Transformer layer is regarded as the resulting interactive visual information and is passed through a linear layer to the LLMs. The final output is denoted h_d. Thereafter, the new representation sequence (h_img, h_X, h_d) is fed into the LLMs to generate the final response. In S6, the interactive perception network is trained with a number of image-text pairs; the training target for a given image is Y = (y_1, y_2, …, y_N), where y_i denotes the i-th token and N is the total length. The loss is:

L = -∑_{i=1}^{N} log P(y_i | y_{<i}, I, X)

Multimodal instruction-following tuning is adopted during training.
In an exemplary embodiment, as shown in FIG. 2, given a picture I and a query sequence X = (x_1, x_2, …, x_m), where x_i denotes the i-th token of the text query entered into the LLMs, the global image feature h_I ∈ R^{d_v} is obtained by a pre-trained visual encoder (CLIP-ViT-L/14), where d_v denotes the hidden state size of the visual encoder. At the same time, a learnable special token <img> is added to the word embedding table of the LLMs as the input position tag of the image feature; it is used in the subsequent feature-alignment process.
First linear projection layer: in an exemplary embodiment, a linear projection layer is employed as the visual mapping network to project the global image features into the language embedding space.
The manner in which the features are aligned is as follows:
f(h_I) = W·h_I + b
where W ∈ R^{d_L×d_v} and b are learnable parameters, and d_L is the hidden state size of the LLMs. In this way, the LLMs obtain basic/global perceptual information of the image, which is added together with the representation of the token <img>. In addition, another special token <img-d> is added at the end of the image and query sequence to capture the entire encoded information of the image and query. The first input sequence of the LLMs can then be expressed as

(<img>, f(h_I), X, <img-d>) → (h_img, h_X, h_img-d)

where h_img refers to the superposition of f(h_I) and the representation of the token <img>, and h_X and h_img-d are the corresponding word-encoded representations of X and <img-d>. At the last layer of the LLMs, the final output of <img-d> contains the semantic information of the query sequence, denoted h_r ∈ R^{d_L}, which is produced through the self-attention mechanism of the LLMs. In an exemplary embodiment, however, it is believed that a text-only LLM that has not been pre-trained on multimodal data cannot fuse visual information as well as a powerful pre-trained multimodal image-and-language model. Therefore, in an exemplary embodiment, the interaction between the human query and the visual information is performed outside the LLM. Because the structure and parameters of the language model are not changed, the LLM can still maintain its original capabilities and generalization on natural language tasks.
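As a minimal sketch of this feature-alignment step, the projection f(h_I) = W·h_I + b can be written out directly. The dimensions and parameter values below are toy assumptions for illustration, not those of the actual embodiment (which uses CLIP-ViT features and the LLM's real embedding size):

```python
# Toy hidden sizes: d_v for the visual encoder, d_L for the LLMs (assumed values).
d_v, d_L = 4, 6

# "Learnable" parameters W (d_L x d_v) and b, fixed here for illustration.
W = [[0.1 * (i + j) for j in range(d_v)] for i in range(d_L)]
b = [0.0] * d_L

def feature_align(h_I):
    """f(h_I) = W·h_I + b: project a global image feature into the LLM embedding space."""
    return [sum(W[i][j] * h_I[j] for j in range(d_v)) + b[i] for i in range(d_L)]

h_I = [1.0, 0.5, -0.5, 2.0]   # stand-in for a global image feature
f_hI = feature_align(h_I)
assert len(f_hI) == d_L       # the result lives in the LLM's d_L-dimensional space
```

In the embodiment this output is added to the representation of the <img> token before being fed to the LLMs.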
Second linear projection layer: a linear projection layer is applied to map the implicit state h_r into the space of the underlying information interaction module, denoted h_R; in an exemplary embodiment this implicit state is also called the introspection output. For example, when the current picture contains a cat and a dog and a human command asks whether the dog in the picture has spots on its legs, the visual extractor and linear layer of traditional methods tend to extract global information during feature extraction and alignment, and the detailed feature "spots on the legs" is hard to take into account. After inference, however, the LLMs can find that the input information lacks "spots on the legs" or that the corresponding features are blurry, and then produce the introspection output h_R, through which the visual extractor and the secondary-alignment linear layer are directed to pay more attention to the spot details of the picture. Thus, after obtaining the request of the LLMs, in an exemplary embodiment it is proposed to use h_R to carry out multimodal information interaction with the image features. To this end, a pre-trained text encoder of CLIP-ViT/14 is employed as the backbone of the information interaction, since the text encoder has an embedding space similar to that of the image encoder.
Third linear projection layer: the global coding of the original image is decomposed into fine-grained codes by one linear layer. The length of the fine-grained image features is set to 5.
Fourth linear projection layer: then, in an exemplary embodiment, the fine-grained image features ĥ_I and the introspection output h_R are fed into the frozen text encoder of CLIP to perform the interaction. Specifically, the input sequence is (h_g, ĥ_I, h_R), where h_g is a global vector for capturing the entire multimodal interaction, initialized by the word vector of [SEP] in the text encoder's embedding table. For h_g, the output of the last Transformer layer is regarded as the resulting interactive visual information and is passed through a linear layer to the LLMs. The final output is denoted h_d. Thereafter, the new representation sequence (h_img, h_X, h_d) is fed into the LLMs to generate the final response.
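A hedged sketch of assembling the interaction input (h_g, ĥ_I, h_R) described here, with the frozen CLIP text encoder replaced by an identity stub and all sizes assumed rather than taken from the patent:

```python
# Assumed text-encoder hidden size; k = 5 is the fine-grained length from the text.
d_t, k = 8, 5

h_g = [0.0] * d_t                            # global vector ([SEP]-initialized in the real model)
h_I_fine = [[0.1] * d_t for _ in range(k)]   # fine-grained image features (3rd projection)
h_R = [0.2] * d_t                            # introspection output (2nd projection)

seq = [h_g] + h_I_fine + [h_R]               # sequence fed to the frozen text encoder
assert len(seq) == 1 + k + 1

def frozen_text_encoder(seq):
    """Stub: a real frozen CLIP text encoder would run self-attention here."""
    return seq  # identity stand-in

# The encoder output at h_g's position is taken as the interactive visual information,
# which would then pass through the fourth linear projection to the LLMs.
h_d = frozen_text_encoder(seq)[0]
assert len(h_d) == d_t
```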
In an exemplary embodiment, the linear projection layers in the interactive perception network are trained as follows:
first, multi-modal pre-training, the goal of this stage is to train the linear projection layer in feature alignment. In an exemplary embodiment, CC3M [1 ] is employed]、LAION-400M[2]、COCO Caption[3]And Flick3K 4]Tens of millions of pairs of image text issued. The total number of image text pairs is about 6900 ten thousand. The description (training object) of a given image is y= (Y) 1 ,y 2 ,…,y N ) Where yi represents the ith marker and N is the total length, the optimization penalty at this stage is as follows:
Second, multimodal instruction-following tuning: the IPN is made effective by tuning with various multimodal instruction-following data. First, in an exemplary embodiment, two types of image-text semantic matching data are constructed based on image-text pairs in the CC3M, COCO Caption, and Flickr30K datasets. As shown in FIG. 3, the two types are "true/false inference" and four-choice selection tasks, in which the captions are randomly sampled from the corresponding training set. In this way, the whole interactive perception network IPN can be trained to help and improve the LLMs in performing image-text semantic alignment. Second, to adapt the interactive perception network IPN to various human queries, multimodal instruction-following data about conversations and complex inference released by Liu et al. is introduced in an exemplary embodiment. Finally, considering that complex images contain an unbounded amount of visual information and possibly additional external knowledge, as shown in FIG. 4, data about detailed image descriptions is introduced in an exemplary embodiment to improve the multimodal long-text generation capability of the interactive perception network IPN; it includes the corresponding data from Liu et al. and the artwork description dataset SemArt. The total amount of all instruction data is about 7.3 million samples, including 7.1 million semantic matching samples, 20 thousand artwork analysis samples, and 150 thousand additional samples.
In summary, the embodiments of the present application provide a novel interactive perception network IPN, so that LLMs can integrate visual information required by different queries. Specifically, the interactive perception network IPN provides the LLMs with basic global information of the image, i.e., feature alignment; and simultaneously acquiring the request of the LLMs, executing the visual information interaction based on the request, and transmitting the visual information after the interaction to the LLMs.
The whole training process of the interactive perception network IPN is divided into two phases. Specifically, the first phase is a multimodal pre-training phase, allowing LLMs to obtain basic global information of an image; the second stage is a multi-mode instruction following tuning stage, and mainly enables the whole information interaction flow to be effective and adaptive to various queries.
With the above improvements, the LLMs understand a human query, pass the corresponding request to the request-based visual information interaction module, and generate responses based on interleaved multimodal information. In addition, training a large-scale visual language model (LVLM) from scratch requires a large amount of resources, which the present approach avoids. The embodiment of the application provides a computer storage medium storing a program of the interactive perception method for a large language model, i.e., a plug-and-play module for large language models (LLMs).
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
The foregoing is merely a preferred embodiment of the present application, but the scope of protection of the present application is not limited thereto; any changes or substitutions that would readily occur to a person skilled in the art within the technical scope disclosed herein shall fall within the scope of protection of the present application.
Claims (10)
1. An interactive perception method suitable for a large language model, comprising:
s1, constructing an interactive perception network, wherein the interactive perception network is connected with a large language model LLMs, and the architecture of the interactive perception network comprises a plurality of modal coding units and a plurality of linear projection layers;
s2, obtaining an image input at one side of the interactive perception network; encoding the image by using a first mode encoding unit to obtain global image characteristics; the first linear projection layer is configured to project the global image feature map into a language embedding space of the large language model LLMs;
s3, a large language model LLMs obtain a query sequence, the large language model LLMs output semantic information containing the query sequence and are mapped and projected into a second modal coding unit through a second linear projection layer to obtain introspection output;
s4, the third linear projection layer is configured to decompose global codes corresponding to global image features of the image into fine-grained codes, obtain fine-grained image features and project the fine-grained image features into a second modal coding unit;
s5, carrying out dynamic interaction between the fine-granularity image features and the introspection output to obtain dynamic interaction information; mapping and projecting the dynamic interaction information to a language embedding space of a large language model LLMs through a fourth linear projection layer;
s6, training the interactive perception network, and carrying out dynamic interactive perception on the large language model LLMs according to the dynamic interaction information, the query sequence and the global image characteristics.
2. The interactive perception method according to claim 1, wherein the architecture of the interactive perception network in S1 comprises at least one visual modality encoder, a text modality encoder, and four learnable linear projection layers.
3. The interactive perception method according to claim 2, wherein in S2, the first modality encoding unit is a pre-trained visual modality encoder, the image input is processed by the pre-trained visual modality encoder to obtain global image features, a learnable special mark < img > is added in the word embedding table of LLMs as an input position tag of the image features, and another special mark < img-d > is added at the end of the image and query sequence to capture the whole encoded information of the image and query.
4. The interactive perception method according to claim 1, wherein, in the process of the first linear projection layer in S2 mapping the global image feature into the language embedding space of the large language model LLMs, the features are aligned as follows:

f(h_I) = W·h_I + b

where f(·) is the feature-alignment function, h_I is the global image feature, and W and b are parameters.
5. The interactive perception method for a large language model according to claim 1, wherein the semantic information h_r of the query sequence in S3 is mapped by the second linear projection layer into the second modality encoding unit to obtain the introspection output h_R, and h_R can guide the modality encoding unit and the linear projection layers to attend to detailed information in the input.
6. The interactive perception method for a large language model according to claim 5, wherein the global code corresponding to the global image feature of the image in S4 is decomposed into fine-grained codes to obtain the fine-grained image features ĥ_I, whose length is set to 5.
7. The interactive perception method for a large language model according to claim 6, wherein the fine-grained image features ĥ_I and the introspection output h_R in S5 are fed into a frozen text encoder of CLIP connected to the second modality encoding unit to perform the interaction.
8. The interactive perception method applicable to a large language model according to claim 7, wherein the input sequence in S5 is (h_g, ĥ_I, h_R), where h_g is a global vector for capturing the entire multimodal interaction, initialized by the word vector of [SEP] in the text encoder's embedding table; for h_g, the output of the last Transformer layer is regarded as the resulting interactive visual information and is passed through a linear layer to the LLMs; the final output is denoted h_d; thereafter, a new representation sequence (h_img, h_X, h_d) is fed into the LLMs to generate the final response.
9. The interactive perception method for a large language model according to claim 1, wherein the interactive perception network is trained in S6 using several image-text pairs; the training target of a given image is Y = (y_1, y_2, …, y_N), where y_i denotes the i-th token and N is the total length; the loss is L = -∑_{i=1}^{N} log P(y_i | y_{<i}, I, X); multimodal instruction-following tuning is adopted during training.
10. A computer storage medium storing a program capable of executing the interactive perception method applicable to a large language model according to any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311583817.1A CN117631833A (en) | 2023-11-24 | 2023-11-24 | Interactive perception method suitable for large language model and computer storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311583817.1A CN117631833A (en) | 2023-11-24 | 2023-11-24 | Interactive perception method suitable for large language model and computer storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117631833A true CN117631833A (en) | 2024-03-01 |
Family
ID=90024634
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311583817.1A Pending CN117631833A (en) | 2023-11-24 | 2023-11-24 | Interactive perception method suitable for large language model and computer storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117631833A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190339769A1 (en) * | 2018-05-01 | 2019-11-07 | Dell Products, L.P. | Gaze-activated voice services for interactive workspaces |
WO2020081314A1 (en) * | 2018-10-15 | 2020-04-23 | Ancestry.Com Operations Inc. | Image captioning with weakly-supervised attention penalty |
CN116484217A (en) * | 2023-04-17 | 2023-07-25 | 云南元矩阵科技有限公司 | Intelligent decision method and system based on multi-mode pre-training large model |
- 2023-11-24: CN application CN202311583817.1A filed; publication CN117631833A; status: Pending
Non-Patent Citations (2)
Title |
---|
Y. Li et al.: "A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports", 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 13 January 2021, pages 1994-2004 * |
Abudoukelimu Abulizi et al.: "A Survey of Extension Models of Pre-trained Language Models", Computer Science, vol. 49, no. 11, 15 November 2022, pages 210800125-1 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20230177343A1 (en) | Scene understanding and generation using neural networks | |
CN113762322B (en) | Video classification method, device and equipment based on multi-modal representation and storage medium | |
CN110796111B (en) | Image processing method, device, equipment and storage medium | |
US11769018B2 (en) | System and method for temporal attention behavioral analysis of multi-modal conversations in a question and answer system | |
CN110476173B (en) | Hierarchical device placement with reinforcement learning | |
CN111699497B (en) | Fast decoding of sequence models using discrete latent variables | |
CN111897940B (en) | Visual dialogue method, training method, device and equipment for visual dialogue model | |
CN113039555B (en) | Method, system and storage medium for classifying actions in video clips | |
CN111737434A (en) | Generating automated assistant responses and/or actions directly from conversation histories and resources | |
WO2021212601A1 (en) | Image-based writing assisting method and apparatus, medium, and device | |
CN116721334B (en) | Training method, device, equipment and storage medium of image generation model | |
CN111651573B (en) | Intelligent customer service dialogue reply generation method and device and electronic equipment | |
CN110795549A (en) | Short text conversation method, device, equipment and storage medium | |
CN115129839A (en) | Visual dialogue answer generation method and device based on graph perception | |
CN116541492A (en) | Data processing method and related equipment | |
CN114328943A (en) | Question answering method, device, equipment and storage medium based on knowledge graph | |
CN116250022A (en) | Neural network for achieving attention on object embedding for object-centric visual reasoning | |
CN117634459A (en) | Target content generation and model training method, device, system, equipment and medium | |
CN117746186A (en) | Training method of low-rank adaptive model, text image generation method and system | |
CN117437317A (en) | Image generation method, apparatus, electronic device, storage medium, and program product | |
CN112597777A (en) | Multi-turn dialogue rewriting method and device | |
CN117453880A (en) | Multi-mode data processing method and device, electronic equipment and storage medium | |
CN116958738A (en) | Training method and device of picture recognition model, storage medium and electronic equipment | |
CN112132075A (en) | Method and medium for processing image-text content | |
CN117631833A (en) | Interactive perception method suitable for large language model and computer storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||