CN117273107A - Training method and training device for text generation model

Info

Publication number: CN117273107A
Application number: CN202311285513.7A
Authority: CN
Inventors: 宋雨鑫; 戎康; 刘芳龙; 张琦; 李鑫
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2023-09-28
Filing date: 2023-09-28
Publication date: 2023-12-22

Abstract

The disclosure provides a training method and a training device for a text generation model, relates to the technical field of artificial intelligence, and particularly relates to the technical fields of computer vision, deep learning, large models and the like. The implementation scheme is as follows: obtaining a sample data set, wherein the sample data set comprises a sample image and sample text corresponding to the sample image, and the sample text corresponding to the sample image comprises text content generated for the sample image by using a large language model; determining sample image features of the sample image; inputting the sample image features into a large language model to obtain a predicted text corresponding to the sample image; and adjusting parameters of the text generation model based on a difference between the sample text and the predicted text.

Description

Training method and training device for text generation model

Technical Field

The disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of computer vision, deep learning, large models and the like, and can be applied to scenarios such as AIGC (AI-generated content). It particularly relates to a training method, a training device, an electronic device, a computer-readable storage medium and a computer program product for a text generation model.

Background

Artificial intelligence is the discipline of making computers simulate certain human thought processes and intelligent behaviors (e.g., learning, reasoning, thinking, planning), and it involves both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technologies and the like.

The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, the problems mentioned in this section should not be considered as having been recognized in any prior art unless otherwise indicated.

Disclosure of Invention

The present disclosure provides a training method, apparatus, electronic device, computer readable storage medium and computer program product for text generation model.

According to an aspect of the present disclosure, there is provided a training method of a text generation model, including: obtaining a sample data set, wherein the sample data set comprises a sample image and sample text corresponding to the sample image, and the sample text corresponding to the sample image comprises text content generated for the sample image by using a large language model; determining sample image features of the sample image; inputting the sample image features into a large language model to obtain a predicted text corresponding to the sample image; and adjusting parameters of the text generation model based on a difference between the sample text and the predicted text.

According to another aspect of the present disclosure, there is provided a training apparatus of a text generation model, including: a sample data acquisition unit configured to acquire a sample data set, wherein the sample data set includes a sample image and sample text corresponding to the sample image, and the sample text corresponding to the sample image includes text content generated for the sample image using a large language model; an image feature acquisition unit configured to determine sample image features of the sample image; a predicted text acquisition unit configured to input the sample image features into a large language model to obtain a predicted text corresponding to the sample image; and a parameter adjustment unit configured to adjust parameters of the text generation model based on a difference between the sample text and the predicted text.

According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.

According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method as described above.

According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method as described above.

According to one or more embodiments of the present disclosure, the ability of a large language model may be utilized to quickly and efficiently generate corresponding text content for a given image, thereby quickly acquiring a multimodal training dataset for instruction fine-tuning of the text generation model.

It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.

Drawings

The accompanying drawings illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for exemplary purposes only and do not limit the scope of the claims. Throughout the drawings, identical reference numerals designate similar, but not necessarily identical, elements.

FIG. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, in accordance with an embodiment of the present disclosure;

FIG. 2 illustrates a method for training an image-guided text generation model in accordance with an embodiment of the present disclosure;

FIG. 3 illustrates an exemplary process of acquiring a sample dataset according to an embodiment of the present disclosure;

FIG. 4 illustrates an exemplary process of image-guided text generation according to an embodiment of the present disclosure;

FIG. 5 illustrates an exemplary block diagram of an apparatus for training an image-guided text generation model in accordance with an embodiment of the present disclosure;

FIG. 6 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

In the present disclosure, the use of the terms "first," "second," and the like to describe various elements is not intended to limit the positional relationship, timing relationship, or importance relationship of the elements, unless otherwise indicated, and such terms are merely used to distinguish one element from another element. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, they may also refer to different instances based on the description of the context.

The terminology used in the description of the various illustrated examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, the elements may be one or more if the number of the elements is not specifically limited. Furthermore, the term "and/or" as used in this disclosure encompasses any and all possible combinations of the listed items.

Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.

Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented, in accordance with an embodiment of the present disclosure. Referring to fig. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120. Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.

In embodiments of the present disclosure, server 120 may run one or more services or software applications that enable execution of methods of training a text generation model according to embodiments of the present disclosure.

In some embodiments, server 120 may also provide other services or software applications, which may include non-virtual environments and virtual environments. In some embodiments, these services may be provided as web-based services or cloud services, for example, provided to users of client devices 101, 102, 103, 104, 105, and/or 106 under a software as a service (SaaS) model.

In the configuration shown in fig. 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or a combination thereof that are executable by one or more processors. A user operating client devices 101, 102, 103, 104, 105, and/or 106 may in turn utilize one or more client applications to interact with server 120 to utilize the services provided by these components. It should be appreciated that a variety of different system configurations are possible, which may differ from system 100. Accordingly, FIG. 1 is one example of a system for implementing the various methods described herein and is not intended to be limiting.

A user may use client devices 101, 102, 103, 104, 105, and/or 106 to obtain images and text information used in the methods of embodiments of the present disclosure. The client device may provide an interface that enables a user of the client device to interact with the client device. The client device may also output information to the user via the interface. Although fig. 1 depicts only six client devices, those skilled in the art will appreciate that the present disclosure may support any number of client devices.

Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computer devices, such as portable handheld devices, general purpose computers (such as personal computers and laptop computers), workstation computers, wearable devices, smart screen devices, self-service terminal devices, service robots, gaming systems, thin clients, various messaging devices, sensors or other sensing devices, and the like. These computer devices may run various types and versions of software applications and operating systems, such as MICROSOFT Windows, APPLE iOS, UNIX-like operating systems, Linux or Linux-like operating systems (e.g., GOOGLE Chrome OS); or include various mobile operating systems such as MICROSOFT Windows Mobile OS, iOS, Windows Phone, Android. Portable handheld devices may include cellular telephones, smart phones, tablet computers, Personal Digital Assistants (PDAs), and the like. Wearable devices may include head mounted displays (such as smart glasses) and other devices. The gaming system may include various handheld gaming devices, internet-enabled gaming devices, and the like. The client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (e.g., email applications), Short Message Service (SMS) applications, and may use a variety of communication protocols.

Network 110 may be any type of network known to those skilled in the art that may support data communications using any of a number of available protocols, including but not limited to TCP/IP, SNA, IPX, etc. For example only, the one or more networks 110 may be a Local Area Network (LAN), an ethernet-based network, a token ring, a Wide Area Network (WAN), the internet, a virtual network, a Virtual Private Network (VPN), an intranet, an extranet, a blockchain network, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network (e.g., bluetooth, WIFI), and/or any combination of these and/or other networks.

The server 120 may include one or more general purpose computers, special purpose server computers (e.g., PC (personal computer) servers, UNIX servers, mid-end servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. The server 120 may include one or more virtual machines running a virtual operating system, or other computing architecture that involves virtualization (e.g., one or more flexible pools of logical storage devices that may be virtualized to maintain virtual storage devices of the server). In various embodiments, server 120 may run one or more services or software applications that provide the functionality described below.

The computing units in server 120 may run one or more operating systems including any of the operating systems described above as well as any commercially available server operating systems. Server 120 may also run any of a variety of additional server applications and/or middle tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, etc.

In some implementations, server 120 may include one or more applications to analyze and consolidate data feeds and/or event updates received from users of client devices 101, 102, 103, 104, 105, and/or 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and/or 106.

In some implementations, the server 120 may be a server of a distributed system or a server that incorporates a blockchain. The server 120 may also be a cloud server, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology. A cloud server is a host product in a cloud computing service system that addresses the drawbacks of high management difficulty and weak service scalability found in traditional physical host and Virtual Private Server (VPS) services.

The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of databases 130 may be used to store information such as audio files and video files. Database 130 may reside in various locations. For example, the database used by the server 120 may be local to the server 120, or may be remote from the server 120 and may communicate with the server 120 via a network-based or dedicated connection. Database 130 may be of different types. In some embodiments, the database used by server 120 may be, for example, a relational database. One or more of these databases may store, update, and retrieve data in response to commands.

In some embodiments, one or more of databases 130 may also be used by applications to store application data. The databases used by the application may be different types of databases, such as key value stores, object stores, or conventional stores supported by the file system.

The system 100 of fig. 1 may be configured and operated in various ways to enable application of the various methods and apparatus described in accordance with the present disclosure.

In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other handling of users' personal information comply with the relevant laws and regulations and do not violate public order and good morals.

FIG. 2 illustrates a method for training an image-guided text generation model in accordance with an embodiment of the present disclosure.

In step S202, a sample data set is acquired. The sample data set includes a sample image and sample text corresponding to the sample image, and the sample text corresponding to the sample image includes text content generated for the sample image using a large language model.

In step S204, sample image features of the sample image are determined.

In step S206, the sample image features are input into a large language model to obtain a predicted text corresponding to the sample image.

In step S208, a parameter adjustment is made to the text generation model based on the difference between the sample text and the predicted text.
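Steps S202 to S208 can be read as one training pass over the sample data set. The following is a minimal sketch of that control flow only, not the disclosed implementation: the callables `extract_features`, `generate_text`, `difference` and `adjust` are hypothetical placeholders for the components described later in this section.

```python
from typing import Callable, Iterable, List, Tuple

Features = List[float]


def train_epoch(
    samples: Iterable[Tuple[object, str]],           # S202: (sample image, sample text) pairs
    extract_features: Callable[[object], Features],  # S204: image -> sample image features
    generate_text: Callable[[Features], str],        # S206: large language model call
    difference: Callable[[str, str], float],         # compares sample text and predicted text
    adjust: Callable[[float], None],                 # S208: parameter adjustment hook
) -> float:
    """Run steps S204-S208 once per sample and return the mean difference."""
    total, count = 0.0, 0
    for image, sample_text in samples:
        features = extract_features(image)
        predicted_text = generate_text(features)
        loss = difference(sample_text, predicted_text)
        adjust(loss)
        total += loss
        count += 1
    return total / count if count else 0.0
```

Passing the components in as callables keeps the sketch independent of any particular model or framework.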

With the method for training the text generation model provided by the embodiments of the present disclosure, corresponding text content can be quickly and efficiently generated for a given image by utilizing the capability of a large language model, so that a multimodal training data set for instruction fine-tuning of the text generation model can be quickly acquired.

The principles of the present disclosure will be described in detail below.

In step S202, a sample dataset may be acquired. Wherein the sample data set may include a sample image and sample text corresponding to the sample image. Wherein the sample text corresponding to the sample image may include text content generated for the sample image using a large language model.

Image-guided text generation is widely used for text generation tasks such as automatic generation of advertisements, comments, and messages. In natural language processing (NLP), instruction fine-tuning (instruction tuning) has proven to be an effective method of achieving text generation. In an instruction fine-tuning scheme, the parameters of the model are adjusted using annotated input-output pairs so that the model can learn the output capabilities for a particular task.

The instruction-tuning training paradigm is also increasingly being applied to vision-language tasks. In an image-guided text generation scheme, a frozen image feature extractor extracts image features, a learnable translation module maps the image features into a text embedding space that can be understood by a large language model (LLM), and the frozen LLM performs text generation.

However, because visual input is inherently more diverse, it is harder for models to generalize to different vision-language downstream tasks than to traditional text generation tasks. In addition, limited by the lack of multimodal (image-text pair) instruction fine-tuning data, an image-guided text generation model can only perform basic tasks such as visual scene understanding and reasoning, knowledge-based image description, and multi-round visual dialogue, and has difficulty meeting users' needs for customized text generation, such as copywriting in a designated style, advertisement generation, and comment generation for different language habits.

In the related art, general instruction fine-tuning data is collected by converting existing NLP data sets into an instruction format using prompts with fixed templates. By injecting visual information into the LLM, the instruction-tuned LLM is adapted to vision-to-language generation tasks. For example, the LLaVA model directly projects the output of the visual encoder as the input of the LLaMA/Vicuna LLM, and the LLM is fine-tuned on vision-language dialogue data generated by GPT-4.

Currently, multimodal instruction fine-tuning data is quite scarce. The labeling process for image-text pairs is very time consuming, and the definition of the correspondence within an image-text pair is ambiguous, making it difficult to produce large, high-quality multimodal instruction fine-tuning data sets.

Accordingly, the present disclosure provides a large-language-model-based method for image-guided customized text generation. In step S202, text content corresponding to the sample image is acquired by the large language model. By utilizing the text generation capability of the large language model to semi-automatically generate annotations of customized text, parameter-efficient instruction fine-tuning (parameter-efficient fine-tuning) of the large language model can be realized. Thus, the large language model can generate personalized, image-guided text in a customized writing style. In the process, the text generation capability of the large language model can be utilized to rapidly generate image-text input-output pairs, and the image-text pairs do not need to be labeled manually.

In some embodiments, step S202 may be implemented by the following process: processing the sample image by using a pre-trained second text generation model to obtain a text description of the sample image; and inputting the text description of the sample image into a large language model together with prompt information to obtain a sample text corresponding to the sample image, wherein the prompt information specifies the style of the sample text. In some examples, the same prompt information is used for all sample images so that a sample data set for a specified text generation task can be obtained.

The second text generation model may be a different model than the text generation model trained in the present disclosure. The second text generation model may have the ability to obtain descriptive text for the image content based on the input image. In some examples, the second text generation model may be a BLIP-2 model. The sample image may be input into the pre-trained second text generation model, and descriptive text of the sample image (e.g., a caption output by the BLIP-2 model) may be determined based on an output of the second text generation model. Then, the descriptive text output by the second text generation model may be input into the large language model together with the prompt information of the specified task, thereby enabling the large language model to output the sample text for the specified task. In some examples, the prompt information may specify a style of the sample text. For example, the prompt information may specify a social platform such that the large language model outputs sample text that matches the style of the specified social platform. For another example, the prompt information may constrain the text length of the sample text, the manner of paragraphing, and other format information.
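The two-stage labeling process described above (caption the image, then ask a large language model to rewrite the caption in a specified style) can be sketched as follows. This is an illustrative sketch under stated assumptions: `caption_model` and `llm` are hypothetical callables standing in for a captioning model such as BLIP-2 and for the large language model, and the prompt layout in `build_prompt` is an assumption rather than a template taken from the disclosure.

```python
from typing import Callable, Dict, Iterable, List


def build_prompt(caption: str, style_instruction: str) -> str:
    """Combine the style prompt with the image description (layout is an assumption)."""
    return f"{style_instruction}\nImage description: {caption}"


def build_sample_dataset(
    images: Iterable[object],
    caption_model: Callable[[object], str],  # hypothetical: image -> descriptive text
    llm: Callable[[str], str],               # hypothetical: prompt -> styled sample text
    style_instruction: str,                  # same prompt information for every image
) -> List[Dict[str, object]]:
    """Return image/text pairs labeled without manual annotation."""
    dataset = []
    for image in images:
        caption = caption_model(image)
        sample_text = llm(build_prompt(caption, style_instruction))
        dataset.append({"image": image, "text": sample_text})
    return dataset
```

Using one `style_instruction` for all images mirrors the case where the same prompt information is applied to every sample image to obtain a data set for one specified task.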

In step S204, sample image features of the sample image are determined.

Step S204 may include extracting sample visual features of the sample image using the feature extraction unit and mapping the sample visual features using the linear layer to obtain sample image features. Wherein the sample visual features may include at least one of high frequency image features, low frequency image features, or a combination thereof in the sample image. The specific form of the sample visual features is not limited herein. Various forms of feature extraction units may be utilized to extract image features in a sample image. In some embodiments, the extraction of image features may be performed by means of edge detection, corner detection, texture analysis, color histogram, etc. In other embodiments, extraction of image features may also be performed using a deep learning-based approach. After obtaining the sample visual features of the sample image, the sample visual features may be mapped using the linear layer to obtain sample image features. The feature extraction unit for the image can be combined with the large language model by using the linear layer, so that the result output by the feature extraction unit can be aligned with the input of the large language model, and the large language model can better understand the relationship between the image and the text.
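As a rough illustration of the mapping step, the sketch below applies one linear layer to each visual feature vector so that its dimension matches the input dimension expected by the large language model. The class `LinearLayer`, the helper `map_visual_features`, the dimensions and the random initialization are all assumptions made for illustration; a real system would use a deep learning framework.

```python
import random
from typing import List

Vector = List[float]


class LinearLayer:
    """y = W x + b, with W of shape (out_dim, in_dim). Initialization is illustrative."""

    def __init__(self, in_dim: int, out_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                       for _ in range(out_dim)]
        self.bias = [0.0] * out_dim

    def __call__(self, x: Vector) -> Vector:
        return [sum(w * v for w, v in zip(row, x)) + b
                for row, b in zip(self.weight, self.bias)]


def map_visual_features(visual_features: List[Vector], layer: LinearLayer) -> List[Vector]:
    """Project each sample visual feature into the language model's input dimension."""
    return [layer(v) for v in visual_features]
```

For example, `map_visual_features(features, LinearLayer(in_dim=3, out_dim=5))` turns a list of 3-dimensional vectors into a list of 5-dimensional vectors of the same length.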

In some embodiments, the feature extraction unit may include a visual encoder (e.g., Vision Transformer (ViT)) and a query converter (e.g., Q-Former). The visual encoder may be configured to process an input image to obtain image encoding features, and the query converter may be configured to process the image encoding features to obtain image visual features.

In some examples, extracting the sample visual features of the sample image with the feature extraction unit may include: extracting image encoding features of the sample image by using the visual encoder; and processing the image encoding features and the query token with the query converter to obtain the sample visual features. The Q-Former can extract the most relevant features of the image from the image encoding features.
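One way to picture how query tokens pull relevant information out of the image encoding features is a single cross-attention step, with the query tokens as queries and the image encoding features as both keys and values. The sketch below is a deliberate simplification and an assumption: an actual Q-Former stacks several transformer layers with learned projections, which are omitted here, and the function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query_tokens: List[Vector], image_features: List[Vector]) -> List[Vector]:
    """Each query token attends over the image features (used as keys and values)."""
    dim = len(image_features[0])
    outputs = []
    for q in query_tokens:
        # Scaled dot-product score between this query and every image feature.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in image_features]
        weights = softmax(scores)
        # Weighted sum of the image features.
        outputs.append([sum(w * v[i] for w, v in zip(weights, image_features))
                        for i in range(dim)])
    return outputs
```

The number of output vectors equals the number of query tokens, independent of how many image encoding features the visual encoder produces.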

In step S206, the sample image features can be processed using the text processing capabilities of the large language model to obtain the corresponding predicted text. In some embodiments, the sample image features output in step S204 may be input into the large language model along with prompt information to obtain predicted text associated with the prompt information. In some examples, the prompt information may indicate a style of the predicted text. For example, the prompt information used in step S206 may be the same prompt information as used above in generating the sample text.

By adjusting parameters of the text generation model, the text generation model has better text generation capability for a specified text generation task. For example, when training a text generation model using a sample data set acquired with a prompt message of a specified style, the predicted text and the sample text may be made closer by means of parameter adjustment based on the difference between the sample text and the predicted text, that is, the text generation model may learn the ability to output text of the specified style.
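The disclosure does not fix how the difference between the sample text and the predicted text is measured. A common choice for text generation, assumed here purely for illustration, is token-level cross-entropy: the mean negative log-probability that the model assigns to each token of the sample text. The function name and the dictionary-based representation of the predicted distributions are hypothetical.

```python
import math
from typing import Dict, List


def token_cross_entropy(
    predicted_probs: List[Dict[str, float]],  # one token distribution per position
    sample_tokens: List[str],                 # tokens of the sample (target) text
) -> float:
    """Mean negative log-likelihood of the sample tokens under the predictions."""
    if len(predicted_probs) != len(sample_tokens):
        raise ValueError("one predicted distribution is required per sample token")
    eps = 1e-12  # avoids log(0) when a target token receives zero probability
    losses = [-math.log(max(probs.get(token, 0.0), eps))
              for probs, token in zip(predicted_probs, sample_tokens)]
    return sum(losses) / len(losses)
```

The value is 0 when every sample token is predicted with probability 1 and grows as the predicted text drifts away from the sample text, which is the quantity the parameter adjustment would reduce.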

In a typical deep learning model training process, all parameters in the model may be adjusted, for example, by back-propagation, to allow the model to acquire the desired data processing capability. However, the solution provided in the present disclosure uses the output of a large language model to perform a text generation task based on a specified image. Large language models typically have billions of parameters or more, so adjusting the parameters of such a model can consume extremely high computational resources; moreover, adapting a large language model to image-guided text generation tasks by adjusting its parameters can cause it to forget its original language knowledge.

Accordingly, the present disclosure provides a parameter-efficient instruction fine-tuning method.

In step S208, parameters of the large language model may be frozen, and only parameters for feature extraction of the image may be adjusted.

In some embodiments, when the step S204 includes extracting the sample visual features of the sample image using the feature extraction unit and mapping the sample visual features using the linear layer to obtain the sample image features, the step S208 may include freezing the parameters of the feature extraction unit and the large language model; and performing parameter adjustment on the linear layer.

In other embodiments, when the feature extraction unit includes a visual encoder and a query converter, step S208 may include freezing parameters of the visual encoder, the query converter, and the large language model, and performing parameter adjustment on at least one of the linear layer and the query token. A token vector serving as the query token may be generated in a random manner, and parameter adjustments may be made to the token vector in step S208.
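The freezing scheme can be sketched as a flag on each parameter group that the update step respects. The sketch below is framework-free and illustrative only: the `Parameter` class, the parameter names, the plain gradient-descent update and the helper `trainable_fraction` are assumptions, not details taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Parameter:
    values: List[float]
    trainable: bool = True


def freeze(params: Dict[str, Parameter], prefixes: Iterable[str]) -> None:
    """Mark every parameter whose name starts with one of the prefixes as frozen."""
    prefix_tuple = tuple(prefixes)
    for name, param in params.items():
        if name.startswith(prefix_tuple):
            param.trainable = False


def sgd_step(params: Dict[str, Parameter], grads: Dict[str, List[float]], lr: float) -> None:
    """Update only the trainable parameters; frozen ones are left untouched."""
    for name, param in params.items():
        if param.trainable and name in grads:
            param.values = [v - lr * g for v, g in zip(param.values, grads[name])]


def trainable_fraction(params: Dict[str, Parameter]) -> float:
    """Share of scalar parameters that will actually be adjusted."""
    total = sum(len(p.values) for p in params.values())
    trainable = sum(len(p.values) for p in params.values() if p.trainable)
    return trainable / total
```

With hypothetical names such as `visual_encoder.*`, `query_converter.*` and `llm.*` frozen, only entries like `linear.*` and `query_token` change during `sgd_step`.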

By freezing the parameters of the visual encoder, the query converter and the large language model in the text generation model, the trainable parameters can be kept at 1% of the overall parameters of the model, while model training requires only on the order of tens of thousands of stylized instruction fine-tuning samples. This greatly increases the training speed of the model and reduces its training cost.

Fig. 3 illustrates an exemplary process of acquiring a sample dataset according to an embodiment of the present disclosure.

As shown in fig. 3, the input image 301 may be processed with a model 302 to obtain descriptive text 303 of the input image 301. The model 302 may be a BLIP-2 model. The descriptive text 303 may then be input into the large language model 305 along with prompt information 304, where the prompt information 304 indicates the style of the text 306 output by the large language model 305. The large language model 305 outputs text 306 of the specified style based on the descriptive text 303 and the prompt information 304. The input-output pair composed of the input image 301 and the text 306 may be used as sample data for the training process of the text generation model described in connection with fig. 2. The process 300 depicted in fig. 3 may be used to process multiple input images, so that customized text labels meeting user needs can be rapidly constructed for the images and a sufficient amount of higher-quality sample data can be obtained for the training process of the text generation model.

Fig. 4 illustrates an exemplary process of image-guided text generation according to an embodiment of the present disclosure.

As shown in fig. 4, in process 400, an input image 401 may be input to a visual encoder 402 to obtain image encoding features of the input image. The image encoding features may then be used as the key (K) and value (V) inputs of the query converter 403. The query converter 403 outputs sample visual features of the input image 401 based on the input key (K), value (V), and query vector (Q). The linear layer 404 may map the sample visual features. Before the features are input into the large language model 405, image embedding may be performed on the mapped sample visual features so that the features input into the large language model 405 conform to the input format of the large language model. The (mapped) sample visual features may be processed using the large language model to yield a text generation result for the input image 401. When the text generation result is obtained by using the large language model, prompt information of a specified style can be supplied to the large language model together with the features, so that a text generation result of the specified style is obtained.
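Process 400 amounts to a chain of four components. The sketch below only illustrates that data flow; every callable is a hypothetical placeholder for the corresponding element of fig. 4, and passing the prompt as a separate argument to the large language model is an assumption about how the prompt information is supplied.

```python
from typing import Callable, List

Vector = List[float]


def generate_from_image(
    image: object,
    visual_encoder: Callable[[object], List[Vector]],         # element 402
    query_converter: Callable[[List[Vector]], List[Vector]],  # element 403
    linear_layer: Callable[[Vector], Vector],                 # element 404
    llm: Callable[[List[Vector], str], str],                  # element 405
    prompt: str,
) -> str:
    """Image -> encoding features -> visual features -> mapped features -> text."""
    encoding_features = visual_encoder(image)
    visual_features = query_converter(encoding_features)
    image_embeddings = [linear_layer(v) for v in visual_features]
    return llm(image_embeddings, prompt)
```

Because each stage is injected, the same function can describe both training-time prediction (step S206) and inference on a new image.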

Fig. 5 illustrates an exemplary block diagram of an apparatus for training an image-guided text generation model in accordance with an embodiment of the present disclosure.

As shown in fig. 5, the apparatus 500 includes a sample data acquisition unit 510, an image feature acquisition unit 520, a predicted text acquisition unit 530, and a parameter adjustment unit 540.

The sample data acquisition unit 510 may be configured to acquire a sample data set, wherein the sample data set includes a sample image and sample text corresponding to the sample image, and the sample text corresponding to the sample image includes text content generated for the sample image using the large language model.

The image feature acquisition unit 520 may be configured to determine sample image features of the sample image.

The predicted text acquisition unit 530 may be configured to input the sample image features into a large language model to obtain the predicted text corresponding to the sample image.

The parameter adjustment unit 540 may be configured to perform parameter adjustment on the text generation model based on a difference between the sample text and the predicted text.
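The disclosure does not fix how the difference between the sample text and the predicted text is measured. One common choice, shown here purely as an assumption, is the token-level cross-entropy between the probabilities predicted by the model and the token ids of the sample text. The function name `token_cross_entropy` and the toy vocabulary size are hypothetical.

```python
import math
from typing import Sequence


def token_cross_entropy(pred_probs: Sequence[Sequence[float]],
                        target_ids: Sequence[int]) -> float:
    """Mean negative log-likelihood of the sample-text tokens.

    pred_probs[i] is the predicted distribution over the vocabulary at
    position i; target_ids[i] is the sample-text token id at position i.
    """
    if not target_ids or len(pred_probs) != len(target_ids):
        raise ValueError("pred_probs and target_ids must be non-empty and equally long")
    eps = 1e-12  # guards against log(0)
    losses = [-math.log(max(p[t], eps)) for p, t in zip(pred_probs, target_ids)]
    return sum(losses) / len(losses)


# Toy vocabulary of 4 tokens, sample text of 2 tokens.
uniform = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
confident = [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
loss_uniform = token_cross_entropy(uniform, [1, 2])
loss_confident = token_cross_entropy(confident, [1, 2])
```

The loss is lowest when the predicted text assigns all probability to the tokens of the sample text, which is the direction in which parameter adjustment would move the model under this assumption.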

In some embodiments, the image feature acquisition unit may be configured to: extract sample visual features of the sample image by using a feature extraction unit; and map the sample visual features by using a linear layer to obtain the sample image features.

In some embodiments, the parameter adjustment unit may be configured to: freeze parameters of the feature extraction unit and the large language model; and perform parameter adjustment on the linear layer.

In some embodiments, the feature extraction unit may include a visual encoder and a query converter, wherein extracting the sample visual features of the sample image by using the feature extraction unit includes: extracting image coding features of the sample image by using the visual encoder; and processing the image coding features and query tokens by using the query converter to obtain the sample visual features.

In some embodiments, the parameter adjustment unit may be configured to: freeze parameters of the visual encoder, the query converter, and the large language model; and perform parameter adjustment on at least one of the linear layer and the query tokens.
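The freezing scheme can be sketched without any deep learning framework as follows. This is an illustrative assumption only: `ParamGroup`, `freeze`, and `sgd_step` are hypothetical names, the parameter values and gradients are made up, and plain gradient descent stands in for whatever optimizer is actually used.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class ParamGroup:
    """A named bundle of parameters with a trainable flag."""
    values: List[float]
    trainable: bool = True


def freeze(groups: Dict[str, ParamGroup], names: Iterable[str]) -> None:
    """Mark the named groups as frozen so later updates skip them."""
    for name in names:
        groups[name].trainable = False


def sgd_step(groups: Dict[str, ParamGroup], grads: Dict[str, List[float]],
             lr: float = 0.1) -> None:
    """Apply one gradient-descent update to trainable groups only."""
    for name, group in groups.items():
        if not group.trainable:
            continue
        group.values = [v - lr * g for v, g in zip(group.values, grads[name])]


model = {
    "visual_encoder": ParamGroup([1.0, 2.0]),
    "query_converter": ParamGroup([0.5]),
    "query_tokens": ParamGroup([0.1, 0.2]),
    "linear_layer": ParamGroup([0.3, 0.4]),
    "large_language_model": ParamGroup([9.0]),
}
freeze(model, ["visual_encoder", "query_converter", "large_language_model"])

# Made-up gradients of 1.0 for every parameter.
grads = {name: [1.0] * len(group.values) for name, group in model.items()}
sgd_step(model, grads, lr=0.1)
```

After the step, only the mapping layer and the query tokens have changed; the visual encoder, the query converter, and the large language model keep their pre-trained values, so the number of parameters updated during training stays small.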

In some embodiments, the predicted text acquisition unit may be configured to: input the sample image features and prompt information together into a large language model to obtain predicted text related to the prompt information.

In some embodiments, the prompt information may indicate a style of the predicted text.

In some embodiments, the sample data acquisition unit may be configured to: process the sample image by using a pre-trained second text generation model to obtain a text description of the sample image; and input the text description of the sample image and prompt information into a large language model to obtain the sample text corresponding to the sample image, wherein the prompt information specifies the style of the sample text.

In some embodiments, the second text generation model may be a BLIP-2 model.

By using the apparatus for training the text generation model provided by the embodiments of the present disclosure, corresponding text content can be generated quickly and efficiently for a given image by using the capability of a large language model, so that a multi-modal training data set for instruction fine-tuning of the text generation model can be acquired quickly.

There is also provided, in accordance with an embodiment of the present disclosure, an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method according to embodiments of the present disclosure.

According to an embodiment of the present disclosure, there is also provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method according to an embodiment of the present disclosure.

According to an embodiment of the present disclosure, there is also provided a computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements a method according to an embodiment of the present disclosure.

Referring to fig. 6, a block diagram of an electronic device 600 that may serve as a server or a client of the present disclosure will now be described; the electronic device 600 is an example of a hardware device that may be applied to aspects of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 6, the electronic device 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the electronic device 600 can also be stored. The computing unit 601, ROM 602, and RAM 603 are connected to each other by a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.

A number of components in the electronic device 600 are connected to the I/O interface 605, including: an input unit 606, an output unit 607, a storage unit 608, and a communication unit 609. The input unit 606 may be any type of device capable of inputting information to the electronic device 600. The input unit 606 may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a trackpad, a trackball, a joystick, a microphone, and/or a remote control. The output unit 607 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers. The storage unit 608 may include, but is not limited to, magnetic disks and optical disks. The communication unit 609 allows the electronic device 600 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication transceiver, and/or a chipset, such as Bluetooth™ devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.

The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the various methods and processes described above, such as method 200. For example, in some embodiments, the method 200 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 600 via the ROM 602 and/or the communication unit 609. One or more of the steps of the method 200 described above may be performed when a computer program is loaded into RAM 603 and executed by the computing unit 601. Alternatively, in other embodiments, computing unit 601 may be configured to perform method 200 by any other suitable means (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read Only Memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.

The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.

It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved, which is not limited herein.

Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the foregoing methods, systems, and apparatus are merely exemplary embodiments or examples, and that the scope of the present disclosure is not limited by these embodiments or examples but only by the granted claims and their equivalents. Various elements of the embodiments or examples may be omitted or replaced with equivalent elements thereof. Furthermore, the steps may be performed in a different order than described in the present disclosure. Further, various elements of the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after the present disclosure.

Claims

1. A training method of a text generation model, comprising:

obtaining a sample data set, wherein the sample data set comprises a sample image and sample text corresponding to the sample image, and the sample text corresponding to the sample image comprises text content generated for the sample image by using a large language model;

determining sample image features of the sample image;

inputting the sample image features into a large language model to obtain a predicted text corresponding to the sample image; and

performing parameter adjustment on the text generation model based on a difference between the sample text and the predicted text.

2. The method of claim 1, wherein determining sample image features of the sample image comprises:

extracting sample visual features of the sample image by using a feature extraction unit; and

mapping the sample visual features by using a linear layer to obtain the sample image features.

3. The method of claim 2, wherein parameter tuning the text generation model comprises:

freezing parameters of the feature extraction unit and the large language model; and

performing parameter adjustment on the linear layer.

4. The method of claim 2, wherein the feature extraction unit comprises a visual encoder and a query converter,

wherein extracting the sample visual features of the sample image by using the feature extraction unit comprises:

extracting image coding features of the sample image with the visual encoder; and

processing the image coding features and a query token with the query converter to obtain the sample visual features.

5. The method of claim 4, wherein parameter tuning the text generation model comprises:

freezing parameters of the visual encoder, the query converter, and the large language model; and

performing parameter adjustment on at least one of the linear layer and the query token.

6. The method of any of claims 1-5, wherein inputting the sample image features into a large language model to obtain predicted text corresponding to the sample image comprises:

inputting the sample image features and prompt information into the large language model to obtain a predicted text related to the prompt information.

7. The method of claim 6, wherein the prompt information indicates a style of the predicted text.

8. The method of any of claims 1-5, wherein obtaining a sample dataset comprises:

processing the sample image by using a pre-trained second text generation model to obtain a text description of the sample image; and

inputting the text description of the sample image and prompt information into a large language model to obtain the sample text corresponding to the sample image, wherein the prompt information specifies the style of the sample text.

9. The method of claim 8, wherein the second text generation model is a BLIP-2 model.

10. A training device for a text generation model, comprising:

a sample data acquisition unit configured to acquire a sample data set, wherein the sample data set includes a sample image and sample text corresponding to the sample image, and the sample text corresponding to the sample image includes text content generated for the sample image using a large language model;

an image feature acquisition unit configured to determine sample image features of the sample image;

a predicted text acquisition unit configured to input the sample image features into a large language model to obtain a predicted text corresponding to the sample image; and

a parameter adjustment unit configured to perform parameter adjustment on the text generation model based on a difference between the sample text and the predicted text.

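11. The apparatus of claim 10, wherein the image feature acquisition unit is configured to:

extract sample visual features of the sample image by using a feature extraction unit; and

map the sample visual features by using a linear layer to obtain the sample image features.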

12. The apparatus of claim 11, wherein the parameter adjustment unit is configured to:

freeze parameters of the feature extraction unit and the large language model; and

perform parameter adjustment on the linear layer.

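13. The apparatus of claim 11, wherein the feature extraction unit comprises a visual encoder and a query converter,

wherein extracting the sample visual features of the sample image by using the feature extraction unit comprises:

extracting image coding features of the sample image with the visual encoder; and

processing the image coding features and a query token with the query converter to obtain the sample visual features.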

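14. The apparatus of claim 13, wherein the parameter adjustment unit is configured to:

freeze parameters of the visual encoder, the query converter, and the large language model; and

perform parameter adjustment on at least one of the linear layer and the query token.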

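15. The apparatus of any of claims 10-14, wherein the predicted text acquisition unit is configured to:

input the sample image features and prompt information into the large language model to obtain a predicted text related to the prompt information.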

16. The apparatus of claim 15, wherein the prompt information indicates a style of the predicted text.

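17. The apparatus of any of claims 10-14, wherein the sample data acquisition unit is configured to:

process the sample image by using a pre-trained second text generation model to obtain a text description of the sample image; and

input the text description of the sample image and prompt information into a large language model to obtain the sample text corresponding to the sample image, wherein the prompt information specifies the style of the sample text.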

18. The apparatus of claim 17, wherein the second text generation model is a BLIP-2 model.

19. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.

20. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-9.

21. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method of any of claims 1-9.