CN117631833A - Interactive perception method suitable for large language model and computer storage medium - Google Patents

Interactive perception method suitable for large language model and computer storage medium

Info

Publication number
CN117631833A
CN117631833A
Authority
CN
China
Prior art keywords
language model
interactive
image
interactive perception
large language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311583817.1A
Other languages
Chinese (zh)
Inventor
孙腾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Royole Technologies Co Ltd
Original Assignee
Shenzhen Royole Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Royole Technologies Co Ltd filed Critical Shenzhen Royole Technologies Co Ltd
Priority to CN202311583817.1A
Publication of CN117631833A
Legal status: Pending (current)

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses an interactive perception method suitable for a large language model, and a computer storage medium, belonging to the field of computers. The method comprises: constructing an interactive perception network, wherein the interactive perception network is connected with large language models (LLMs), and the architecture of the interactive perception network comprises a plurality of modality encoding units and a plurality of linear projection layers. By constructing an interactive perception network suitable for a large language model, the application effectively solves the problem that the prior art lacks such a network, realizes dynamic interactive perception, and enables the large language model to better execute human instructions; in particular, it enables large language models to integrate the visual information required by different queries. The application uses the interactive perception network suitable for a large language model to understand a human query, passes the corresponding request to a visual information interaction module, performs request-based interaction, and generates a response based on the interleaved multimodal information.

Description

Interactive perception method suitable for large language model and computer storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to an interactive perception method suitable for a large language model and a computer storage medium.
Background
Large visual language models are important because they possess powerful multimodal (image and text) understanding and generation capabilities in the field of deep learning. Such models can not only understand text content but also parse and generate information related to images, thereby enabling a higher level of intelligence and language understanding. This has great potential for various applications such as natural language processing, computer vision and multimodal dialog systems, allowing machines to understand and process diverse information sources more comprehensively and further promoting the development of the artificial intelligence field. Previous methods incorporate visual information into an LLM using a simple visual mapping network, in which image features are projected through a linear layer into the embedding space of the LLM. Such a mapping network projects the image features only once and does not take into account the interaction between the image and the human input query. As a result, the obtained visual information is unrelated to the human intention and may not be sufficient for the LLM to produce a response that follows that intention, i.e., to ensure that the LLM can truly understand the human instruction and the image input and interact with them reasonably.
Disclosure of Invention
The interactive perception method and computer storage medium suitable for a large language model solve the problem that an interactive perception network suitable for a large language model is lacking in the prior art, realize dynamic interactive perception, and enable the large language model to better execute human instructions. The application provides an interactive perception method suitable for a large language model, comprising: constructing an interactive perception network, wherein the interactive perception network is connected with large language models (LLMs), and the architecture of the interactive perception network comprises a plurality of modality encoding units and a plurality of linear projection layers; obtaining an image input at one side of the interactive perception network; encoding the image by a first modality encoding unit to obtain global image features; the first linear projection layer is configured to map and project the global image features into the language embedding space of the large language models LLMs; the large language models LLMs obtain a query sequence, output semantic information containing the query sequence, and map and project it through a second linear projection layer into a second modality encoding unit to obtain an introspection output; the third linear projection layer is configured to decompose the global code corresponding to the global image features of the image into fine-grained codes, obtain fine-grained image features, and project them into the second modality encoding unit; the fine-grained image features and the introspection output interact dynamically to obtain dynamic interaction information, which is mapped and projected into the language embedding space of the large language models LLMs through a fourth linear projection layer; and training the interactive perception network, so that the large language models LLMs perform dynamic interactive perception according to the dynamic interaction information, the query sequence and the global image features.
In another aspect of the present application, there is also provided a computer storage medium storing a program for executing the above method, and a plug-and-play module can be implemented using the storage medium.
The technical scheme provided by the application has at least the following technical effects or advantages:
By constructing an interactive perception network suitable for a large language model, the method and the device effectively solve the problem that the prior art lacks such a network, realize dynamic interactive perception, and enable the large language model to better execute human instructions; in particular, they enable large language models to integrate the visual information required by different queries. The application uses the interactive perception network suitable for a large language model to understand a human query, passes the corresponding request to a visual information interaction module, performs request-based interaction, and generates responses based on the interleaved multimodal information. The present application also effectively avoids training a large-scale visual language model (LVLM) from scratch, which would require a significant amount of resources.
Drawings
FIG. 1 is a flow chart of an interactive perception method applicable to a large language model according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an interactive perception network applicable to a large language model according to an embodiment of the present application;
FIG. 3 is a diagram showing data types in an embodiment of the present application;
FIG. 4 is a schematic diagram of data types with descriptions in an embodiment of the present application.
Detailed Description
In order to better understand the above technical solutions, the following detailed description will refer to the accompanying drawings and specific embodiments.
The embodiment of the application provides an architecture of an interactive perception network suitable for a large language model:
the architecture of the interactive perception network 10 comprises a plurality of modal coding units 11, a plurality of linear projection layers 12 and a large language model LLMs20; in an exemplary embodiment, the interactive sensory network 10 includes the first modality encoding unit 111, the second modality encoding unit 112, and 4 linear projection layers 12;
the first modality encoding unit 111 is a visual modality encoder 1111, and the second modality encoding unit 112 is a text modality encoder 1121; the 4 linear projection layers 12 are respectively the first linear projection layer 121 to the fourth linear projection layer 124;
based on the architecture of the interactive perception network 10, the embodiments of the present application provide an interactive perception method applicable to a large language model, including:
s1, constructing an interactive perception network 10, wherein the interactive perception network 10 is connected with large language models LLMs20, S2, and one side of the interactive perception network 10 obtains an input image I; encoding the image I by using a first mode encoding unit to obtain global image characteristics h I The method comprises the steps of carrying out a first treatment on the surface of the The first linear projection layer 121 is configured to map the global image feature h I Mapping and projecting the large language model LLMs into a language embedding space; s3, obtaining a query sequence X by using a large language model LLMs, and outputting semantic information h containing the query sequence by using the large language model LLMs r And mapped and projected into a second mode coding unit through a second linear projection layer 122 to obtain an introspection output h R The method comprises the steps of carrying out a first treatment on the surface of the S4, the third linear projection layer 123 is configured to characterize the global image features h of the image I Decomposing the corresponding global code into fine-grained codes to obtain fine-grained image characteristicsAnd projected into a second modality encoding unit; s5, the fine granularity image feature ∈>And introspection output h R Dynamic interaction is carried out to obtain dynamic interaction information h d The method comprises the steps of carrying out a first treatment on the surface of the The dynamic interaction information h d Mapping the projection into the language embedding space of the large language model LLMs through a fourth linear projection layer 124; s6, training the interactive perception network 10, and enabling the large language model LLMs20 to perform dynamic interaction information h d Query sequence X and the global image feature h I Performing dynamic interactive sensing;
in S2, the first mode encoding unit 111 is a pre-trained visual mode encoder 1111, and the image input is processed by the pre-trained visual mode encoder 1111 to obtain a global image feature h I At the same time, a special mark capable of learning is added in the word embedding list of LLMs<img>As an input position tag of image characteristics, another special mark is added at the tail of the image and query sequence<img-d>To capture the entire encoded information of the image and query. The first linear projection layer 121 in S2 projects the global image feature h I In the process of mapping and projecting the language embedding space of the large language model LLMs, the characteristic alignment mode is as follows:
f(h I )=W·h I +b
wherein; f () is a function of feature alignment; h is a I For global image features, W and b are parameters. Semantic information h of the query sequence in S3 r Mapping and projecting the second linear projection layer to a second modal coding unit to obtain an introspection output h R ,h R The mode coding unit and the linear projection layer can be guided to pay attention to the input detail information; the global codes corresponding to the global image features of the image in the S4 are decomposed into fine-grained codes to obtain fine-grained image featuresThe length of the steel is 5; the following formula is given:fine-grained image feature +.>And introspection output h R In a frozen text encoder of the CLIP connected to the second modality encoding unit to perform the interaction. The input sequence in S5 is Wherein h is g Is a global vector for capturing the entire multimodal interaction and is embedded by the text encoder in the table [ SEP ]]Initializing a word vector; for h g The output of the last transducer layer is considered the resulting interactive visual information and will be passed through a linear layer to the LLMs. The final output is denoted as h d The method comprises the steps of carrying out a first treatment on the surface of the Thereafter, a new representation sequence (h img ,h X ,h d ) Is fed into the LLMs to generate the final response. In the step S6, training is performed on the interactive perception network, and a plurality of image text pairs are adopted, wherein the training target of a given image is y= (Y) 1 ,y 2 ,…,y N ) Wherein y is i Represents the ith mark, N is the total length; the losses are as follows: />In the training process, multi-mode instruction following tuning is adopted.
In an exemplary embodiment, as shown in FIG. 2, a picture I and a query sequence X = (x_1, x_2, …, x_m) are given, where x_i denotes the i-th token of the text query entered into the LLMs. The global image feature h_I ∈ R^{d_v} is obtained by a pre-trained visual encoder (CLIP-ViT-L/14), where d_v denotes the hidden state size of the visual encoder. At the same time, a learnable special token <img> is added to the word embedding table of the LLMs as an input position tag for the image features; it will be used in the subsequent feature alignment process.
First linear projection layer: in an exemplary embodiment, a linear projection layer is employed as the visual mapping network to project the global image features into the language embedding space.
The manner in which the features are aligned is as follows:
f(h_I) = W·h_I + b
wherein,and b is a parameter which can be learned, d L Is the hidden state size of the LLMs. By doing so, LLMs can obtain basic/global perceptual information of an image, which information and a label<img>Is added together. In addition, another special mark is added at the end of the image and query sequence<img-d>To capture the entire encoded information of the image and query. Then the first input sequence of LLMs can be expressed as
(<img>,f(h I ),<img-d>→(h img ,h X ,h img-d ))
Wherein h is img Refers to f (h I ) And<img>is a superposition of the marked representations. h is a X And h img-d Is X and<img-d>is a corresponding word encoded representation of (c). At the last layer of the LLMs,<img-d>the final output contains semantic information of the query sequence, expressed asFurthermore, self-attention mechanism h through LLMs r But in an exemplary embodiment, it is believed that text-only LLMs that are not pre-trained on multimodal data do not fuse as well as powerful pre-trained multimodal images and language modelsVisual information. In an exemplary embodiment, interactions between human queries and visual information are performed outside of the LLM. Thus, because the structure and parameters of the language model are not changed, the LLM can still maintain the original functions and generalization capability in the natural language task.
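For illustration only and without limitation, the feature alignment f(h_I) = W·h_I + b and the assembly of the first input sequence (<img>, f(h_I), X, <img-d>) might be sketched as follows, assuming a Hugging Face-style causal LM interface; the function name, the helper objects and the "superposition" of f(h_I) onto the <img> embedding are assumptions of this sketch.

```python
import torch

def build_first_llm_inputs(llm, tokenizer, proj1_align, h_I, query_ids):
    """Assemble (<img>, f(h_I), X, <img-d>) as input embeddings for the LLM.

    Assumes the word embedding table was extended beforehand, e.g.:
        tokenizer.add_special_tokens({"additional_special_tokens": ["<img>", "<img-d>"]})
        llm.resize_token_embeddings(len(tokenizer))
    """
    embed = llm.get_input_embeddings()
    img_id = tokenizer.convert_tokens_to_ids("<img>")
    imgd_id = tokenizer.convert_tokens_to_ids("<img-d>")

    f_hI = proj1_align(h_I)                       # feature alignment: f(h_I) = W*h_I + b, shape (1, d_L)

    img_emb = embed(torch.tensor([[img_id]]))     # (1, 1, d_L)
    imgd_emb = embed(torch.tensor([[imgd_id]]))   # (1, 1, d_L)
    query_emb = embed(query_ids)                  # (1, m, d_L)

    # Per the "superposition" reading above, f(h_I) is added onto the <img> slot;
    # an alternative reading keeps them as two separate positions.
    img_slot = img_emb + f_hI.unsqueeze(1)
    return torch.cat([img_slot, query_emb, imgd_emb], dim=1)   # (1, m + 2, d_L)
```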
Second linear projection layer: a linear projection layer is applied to map the hidden state h_r into the space of the downstream information interaction module; the result is denoted h_R, and this hidden state may also be referred to as the introspection output in an exemplary embodiment. For example, suppose the current picture contains a cat and a dog, and the human instruction asks whether the dog in the picture has spots on its legs. In the traditional method, the visual extractor and the linear layer tend to extract global information during feature extraction and alignment, so the detailed feature "spots on the legs" is easily overlooked. After the LLMs reason over the input, however, they can find that the input information contains no "spots on the legs" or that the corresponding features are blurred, and then give the introspection output h_R, which directs the visual extractor and the secondary alignment linear layer to pay more attention to the spot details of the picture. Thus, after obtaining the request of the LLMs, in an exemplary embodiment it is proposed to use h_R to perform multimodal information interaction with the image features. To this end, in an exemplary embodiment, the pre-trained text encoder of CLIP-ViT-L/14 is employed as the backbone for information interaction, since the text encoder has an embedding space similar to that of the image encoder.
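For illustration only and without limitation, reading the hidden state h_r at the <img-d> position and projecting it to the introspection output h_R might be sketched as follows; the assumption that <img-d> occupies the last input position, and the Hugging Face-style output_hidden_states interface, are choices of this sketch.

```python
def introspection_output(llm, inputs_embeds, proj2_introspect):
    """Read h_r at the <img-d> position (assumed last) and project it to h_R."""
    out = llm(inputs_embeds=inputs_embeds, output_hidden_states=True)
    h_r = out.hidden_states[-1][:, -1, :]   # (1, d_L): semantic information of the query
    h_R = proj2_introspect(h_r)             # (1, d_t): introspection output
    return h_R
```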
Third linear projection layer: the global code of the original image is decomposed into fine-grained codes by one linear layer. The length of the fine-grained image features h_I^d is set to 5.
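For illustration only and without limitation, one plausible reading of this decomposition is a single linear layer whose output is reshaped into five fine-grained tokens; the reshape, the function name and the tensor shapes below are assumptions of this sketch.

```python
def fine_grained_features(proj3_fine, h_I, num_fine_tokens=5):
    """Decompose the global image code into 5 fine-grained tokens via one linear layer."""
    flat = proj3_fine(h_I)                                    # (1, 5 * d_t)
    d_t = flat.shape[-1] // num_fine_tokens
    return flat.view(h_I.shape[0], num_fine_tokens, d_t)      # (1, 5, d_t)
```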
Fourth linear projection layer: in an exemplary embodiment, the fine-grained image features h_I^d and the introspection output h_R are then fed into the frozen text encoder of CLIP to perform the interaction. Specifically, the input sequence is (h_g, h_I^d, h_R), where h_g is a global vector used to capture the entire multimodal interaction and is initialized with the word vector of [SEP] in the embedding table of the text encoder. For h_g, the output of the last Transformer layer is regarded as the resulting interactive visual information and is passed through a linear layer (the fourth linear projection layer) to the LLMs. The final output is denoted h_d. Thereafter, the new representation sequence (h_img, h_X, h_d) is fed into the LLMs to generate the final response.
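For illustration only and without limitation, the interaction inside the frozen text encoder and the final projection to h_d might be sketched as follows; the sketch assumes an encoder wrapper that accepts pre-computed token embeddings (an off-the-shelf CLIP text model may need a thin wrapper for this), and the ordering (h_g, h_I^d, h_R) follows the sequence given above.

```python
import torch

def visual_information_interaction(text_encoder, sep_embedding, h_I_d, h_R, proj4_out):
    """Interact h_I_d and h_R inside the frozen text encoder and project h_g to h_d.

    text_encoder is assumed to accept pre-computed token embeddings and to return
    per-position hidden states; sep_embedding is the [SEP] word vector taken from
    its embedding table and is used to initialize the global slot h_g.
    """
    h_g = sep_embedding.view(1, 1, -1)                         # (1, 1, d_t) global slot
    seq = torch.cat([h_g, h_I_d, h_R.unsqueeze(1)], dim=1)     # (1, 1 + 5 + 1, d_t)
    hidden = text_encoder(inputs_embeds=seq).last_hidden_state
    h_g_out = hidden[:, 0, :]                                  # last-layer output at the h_g slot
    return proj4_out(h_g_out)                                  # (1, d_L): interactive visual information h_d
```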
In an exemplary embodiment, the linear projection layers in the interactive perception network are trained as follows:
first, multi-modal pre-training, the goal of this stage is to train the linear projection layer in feature alignment. In an exemplary embodiment, CC3M [1 ] is employed]、LAION-400M[2]、COCO Caption[3]And Flick3K 4]Tens of millions of pairs of image text issued. The total number of image text pairs is about 6900 ten thousand. The description (training object) of a given image is y= (Y) 1 ,y 2 ,…,y N ) Where yi represents the ith marker and N is the total length, the optimization penalty at this stage is as follows:
Second, multimodal instruction-following tuning: the IPN is made effective by tuning with various multimodal instruction-following data. First, in an exemplary embodiment, two types of image-text semantic matching data are constructed based on the image-text pairs in the CC3M, COCO Caption and Flick3K datasets. As shown in FIG. 3, the two types are "true/false inference" and "four-choice selection" tasks, respectively, in which the captions are randomly sampled from the corresponding training set (see the sketch after this paragraph). By doing so, the overall interactive perception network IPN can be trained to help and improve the LLMs in performing image-text semantic alignment. Second, to adapt the interactive perception network IPN to various human queries, in an exemplary embodiment, multimodal instruction-following data about conversations and complex reasoning released by Liu et al. is introduced. Finally, considering that complex images contain countless levels of visual information and may require additional external knowledge, as shown in FIG. 4, in an exemplary embodiment data about detailed image descriptions is introduced to improve the multimodal long-text generation capability of the interactive perception network IPN; it includes corresponding data released by Liu et al. and the artwork description dataset SemArt. The total amount of instruction data is about 7.3 million samples, including 7.1 million semantic matching samples, 20,000 artwork analysis samples, and 150,000 additional samples.
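For illustration only and without limitation, constructing the two semantic-matching instruction formats shown in FIG. 3 might look as follows; the prompt wording, field names and sampling choices are assumptions of this sketch.

```python
import random

def make_semantic_matching_samples(image_id, true_caption, caption_pool):
    """Build one true/false sample and one four-choice sample for an image.

    caption_pool is assumed to contain at least four distinct captions from the
    corresponding training set; distractor captions are randomly sampled from it.
    """
    distractors = random.sample([c for c in caption_pool if c != true_caption], 3)

    # Type 1: true/false inference over a randomly chosen caption.
    candidate = random.choice([true_caption, distractors[0]])
    tf_sample = {
        "image": image_id,
        "instruction": f'Does the caption "{candidate}" match the image? Answer true or false.',
        "response": "true" if candidate == true_caption else "false",
    }

    # Type 2: four-choice selection with the true caption shuffled among distractors.
    options = distractors + [true_caption]
    random.shuffle(options)
    letters = ["A", "B", "C", "D"]
    mc_sample = {
        "image": image_id,
        "instruction": "Which caption best matches the image?\n"
                       + "\n".join(f"{l}. {o}" for l, o in zip(letters, options)),
        "response": letters[options.index(true_caption)],
    }
    return tf_sample, mc_sample
```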
In summary, the embodiments of the present application provide a novel interactive perception network IPN that enables LLMs to integrate the visual information required by different queries. Specifically, the interactive perception network IPN provides the LLMs with the basic global information of the image, i.e., feature alignment; it also acquires the request of the LLMs, performs the request-based visual information interaction, and transmits the interacted visual information back to the LLMs.
The whole training process of the interactive perception network IPN is divided into two phases. Specifically, the first phase is a multimodal pre-training phase, allowing LLMs to obtain basic global information of an image; the second stage is a multi-mode instruction following tuning stage, and mainly enables the whole information interaction flow to be effective and adaptive to various queries.
With the above improvements, the LLMs understand a human query, pass the corresponding request to the request-based visual information interaction module, and generate responses based on the interleaved multimodal information. In addition, training a large-scale visual language model (LVLM) from scratch requires a large amount of resources, which the present scheme avoids. The embodiment of the application also provides a computer storage medium storing a program of the interactive perception method for a large language model, i.e., a plug-and-play module for large language models (LLMs).
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
The foregoing is merely a preferred embodiment of the present application, but the scope of protection of the present application is not limited thereto; any change or substitution that can be easily conceived by a person skilled in the art within the technical scope disclosed in the present application shall fall within the scope of protection of the present application.

Claims (10)

1. An interactive perception method suitable for a large language model, comprising:
s1, constructing an interactive perception network, wherein the interactive perception network is connected with a large language model LLMs, and the architecture of the interactive perception network comprises a plurality of modal coding units and a plurality of linear projection layers;
s2, obtaining an image input at one side of the interactive perception network; encoding the image by using a first mode encoding unit to obtain global image characteristics; the first linear projection layer is configured to project the global image feature map into a language embedding space of the large language model LLMs;
s3, a large language model LLMs obtain a query sequence, the large language model LLMs output semantic information containing the query sequence and are mapped and projected into a second modal coding unit through a second linear projection layer to obtain introspection output;
s4, the third linear projection layer is configured to decompose global codes corresponding to global image features of the image into fine-grained codes, obtain fine-grained image features and project the fine-grained image features into a second modal coding unit;
s5, carrying out dynamic interaction between the fine-granularity image features and the introspection output to obtain dynamic interaction information; mapping and projecting the dynamic interaction information to a language embedding space of a large language model LLMs through a fourth linear projection layer;
s6, training the interactive perception network, and carrying out dynamic interactive perception on the large language model LLMs according to the dynamic interaction information, the query sequence and the global image characteristics.
2. The interactive sensory method according to claim 1, wherein the architecture of the interactive sensory network in S1 comprises at least one visual modality encoder, a text modality encoder, and four learnable linear projection layers.
3. The interactive perception method according to claim 2, wherein in S2, the first modality encoding unit is a pre-trained visual modality encoder, the image input is processed by the pre-trained visual modality encoder to obtain global image features, a learnable special mark < img > is added in the word embedding table of LLMs as an input position tag of the image features, and another special mark < img-d > is added at the end of the image and query sequence to capture the whole encoded information of the image and query.
4. The interactive perception method according to claim 1, wherein, in the process in which the first linear projection layer maps and projects the global image features into the language embedding space of the large language model LLMs in S2, the features are aligned as follows:
f(h_I) = W·h_I + b
wherein f(·) is the feature alignment function, h_I is the global image feature, and W and b are learnable parameters.
5. The interactive perception method for a large language model according to claim 1, wherein the semantic information h_r of the query sequence in S3 is mapped and projected through the second linear projection layer into the second modal coding unit to obtain the introspection output h_R, and h_R can guide the modal coding unit and the linear projection layers to attend to the detail information of the input.
6. The interactive perception method for a large language model according to claim 5, wherein the global code corresponding to the global image features of the image in S4 is decomposed by a learnable linear layer into fine-grained codes to obtain the fine-grained image features h_I^d, the length of which is 5.
7. The interactive perception method for a large language model according to claim 6, wherein the fine-grained image features h_I^d and the introspection output h_R in S5 are fed into a frozen text encoder of CLIP connected to the second modality encoding unit to perform the interaction.
8. The interactive perception method applicable to a large language model according to claim 7, wherein: the input sequence in S5 is (h_g, h_I^d, h_R), wherein h_g is a global vector for capturing the entire multimodal interaction and is initialized with the word vector of [SEP] in the embedding table of the text encoder; for h_g, the output of the last Transformer layer is regarded as the resulting interactive visual information and is passed through a linear layer to the LLMs; the final output is denoted h_d; thereafter, the new representation sequence (h_img, h_X, h_d) is fed into the LLMs to generate the final response.
9. The interactive perception method for a large language model according to claim 1, wherein the interactive perception network is trained in S6 using several image-text pairs, the training target of a given image being Y = (y_1, y_2, …, y_N), wherein y_i denotes the i-th token and N is the total length; the loss is as follows: L = -∑_{i=1}^{N} log p(y_i | I, y_{<i}); in the training process, multimodal instruction-following tuning is adopted.
10. A computer storage medium storing a program capable of executing the interactive perception method applicable to a large language model according to any one of claims 1 to 9.
CN202311583817.1A 2023-11-24 2023-11-24 Interactive perception method suitable for large language model and computer storage medium Pending CN117631833A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311583817.1A CN117631833A (en) 2023-11-24 2023-11-24 Interactive perception method suitable for large language model and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311583817.1A CN117631833A (en) 2023-11-24 2023-11-24 Interactive perception method suitable for large language model and computer storage medium

Publications (1)

Publication Number Publication Date
CN117631833A true CN117631833A (en) 2024-03-01

Family

ID=90024634

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311583817.1A Pending CN117631833A (en) 2023-11-24 2023-11-24 Interactive perception method suitable for large language model and computer storage medium

Country Status (1)

Country Link
CN (1) CN117631833A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190339769A1 (en) * 2018-05-01 2019-11-07 Dell Products, L.P. Gaze-activated voice services for interactive workspaces
WO2020081314A1 (en) * 2018-10-15 2020-04-23 Ancestry.Com Operations Inc. Image captioning with weakly-supervised attention penalty
CN116484217A (en) * 2023-04-17 2023-07-25 云南元矩阵科技有限公司 Intelligent decision method and system based on multi-mode pre-training large model

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190339769A1 (en) * 2018-05-01 2019-11-07 Dell Products, L.P. Gaze-activated voice services for interactive workspaces
WO2020081314A1 (en) * 2018-10-15 2020-04-23 Ancestry.Com Operations Inc. Image captioning with weakly-supervised attention penalty
CN116484217A (en) * 2023-04-17 2023-07-25 云南元矩阵科技有限公司 Intelligent decision method and system based on multi-mode pre-training large model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Y. Li et al., "A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports", 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 13 January 2021 (2021-01-13), pages 1994-2004 *
阿布都克力木·阿布力孜 et al., "A Survey of Extension Models of Pre-trained Language Models", Computer Science (《计算机科学》), vol. 49, no. 11, 15 November 2022 (2022-11-15), pages 210800125-1 *

Similar Documents

Publication Publication Date Title
US20230177343A1 (en) Scene understanding and generation using neural networks
CN113762322B (en) Video classification method, device and equipment based on multi-modal representation and storage medium
CN110796111B (en) Image processing method, device, equipment and storage medium
US11769018B2 (en) System and method for temporal attention behavioral analysis of multi-modal conversations in a question and answer system
CN110476173B (en) Hierarchical device placement with reinforcement learning
CN111699497B (en) Fast decoding of sequence models using discrete latent variables
CN111897940B (en) Visual dialogue method, training method, device and equipment for visual dialogue model
CN113039555B (en) Method, system and storage medium for classifying actions in video clips
CN111737434A (en) Generating automated assistant responses and/or actions directly from conversation histories and resources
WO2021212601A1 (en) Image-based writing assisting method and apparatus, medium, and device
CN116721334B (en) Training method, device, equipment and storage medium of image generation model
CN111651573B (en) Intelligent customer service dialogue reply generation method and device and electronic equipment
CN110795549A (en) Short text conversation method, device, equipment and storage medium
CN115129839A (en) Visual dialogue answer generation method and device based on graph perception
CN116541492A (en) Data processing method and related equipment
CN114328943A (en) Question answering method, device, equipment and storage medium based on knowledge graph
CN116250022A (en) Neural network for achieving attention on object embedding for object-centric visual reasoning
CN117634459A (en) Target content generation and model training method, device, system, equipment and medium
CN117746186A (en) Training method of low-rank adaptive model, text image generation method and system
CN117437317A (en) Image generation method, apparatus, electronic device, storage medium, and program product
CN112597777A (en) Multi-turn dialogue rewriting method and device
CN117453880A (en) Multi-mode data processing method and device, electronic equipment and storage medium
CN116958738A (en) Training method and device of picture recognition model, storage medium and electronic equipment
CN112132075A (en) Method and medium for processing image-text content
CN117631833A (en) Interactive perception method suitable for large language model and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination