WO2024099144A1 - Method and Device for Downstream Task Model Generation and Task Execution - Google Patents

Method and Device for Downstream Task Model Generation and Task Execution

Info

Publication number
WO2024099144A1
Authority
WO
WIPO (PCT)
Prior art keywords
task
downstream
model
input
text
Application number
PCT/CN2023/127845
Other languages
English (en)
French (fr)
Inventor
杨浩
林俊旸
杨安
王鹏
周畅
杨红霞
Original Assignee
阿里巴巴达摩院(杭州)科技有限公司
Application filed by 阿里巴巴达摩院(杭州)科技有限公司 (Alibaba DAMO Academy (Hangzhou) Technology Co., Ltd.)
Publication of WO2024099144A1

Classifications

    • G06F 16/345: Information retrieval of unstructured textual data; browsing/visualisation; summarisation for human users
    • G06F 16/3329: Information retrieval of unstructured textual data; querying; natural language query formulation or dialogue systems
    • G06F 16/35: Information retrieval of unstructured textual data; clustering; classification
    • G06F 40/289: Handling natural language data; natural language analysis; recognition of textual entities; phrasal analysis, e.g. finite state techniques or chunking
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to computer technology, and more particularly to a method and device for downstream task model generation and task execution.
  • Pre-trained language models can be pre-trained on large-scale unlabeled corpora and can learn general language representations. These representations can be used for other downstream tasks, such as Visual Question Answering (VQA), Image Caption (IC), Visual Entailment (VE), Referring Expression Comprehension (REC), and other tasks in the intersection of NLP and computer vision, as well as tasks in the field of natural language processing such as text-based sentiment classification and text summarization tasks. They can be applied to various application fields such as visual assistants, intelligent robots, and online education.
  • VQA: Visual Question Answering
  • IC: Image Caption
  • VE: Visual Entailment
  • REC: Referring Expression Comprehension
  • When applied to a downstream task, the pre-trained model needs to be fine-tuned based on the dataset of the downstream task, so that the fine-tuned model is more suitable for that task.
  • The rich and diverse downstream tasks make the objective design of the pre-trained model in the fine-tuning stage cumbersome and complicated. Because the objectives of the pre-trained model and the downstream task are inconsistent, there is often a "gap" between them, as well as a structural bias between input and output. Therefore, the pre-trained model usually cannot be directly adapted to a downstream task, and its parameters need to be fine-tuned using the downstream task dataset.
  • the hardware requirements and downstream data requirements for fine-tuning pre-trained models are constantly increasing, and the efficiency of generating downstream task models by fine-tuning the parameters of pre-trained models is low.
  • the present application provides a method and device for downstream task model generation and task execution, so as to solve the problem of low efficiency in generating downstream task models by fine-tuning the parameters of a pre-trained model.
  • the present application provides a method for generating a downstream task model, comprising: obtaining a training data set in a downstream task scenario; adding downstream task execution parameters to the original parameters of a pre-trained model; using the training data set to adjust the downstream task execution parameters in the pre-trained model to generate a task model for the downstream task, wherein the task model of the downstream task is used to execute the downstream task.
  • the present application provides a task execution method, comprising: obtaining input data in response to a downstream task execution instruction; generating input information corresponding to the input data according to the input format information of the task model in the downstream task scenario; inputting the input information into a trained task model for processing to obtain a task processing result, wherein the task model is obtained by adding downstream task execution parameters to the original parameters of the pre-trained model, and adjusting the downstream task execution parameters in the pre-trained model based on the training data set in the downstream task scenario; and outputting the task processing result.
  • the present application provides a method for executing a visual question-answering task, comprising: obtaining an input image and a question text; generating input information of the visual question-answering task model according to the input format information of the visual question-answering task model in the visual question-answering task scenario, the image, and the question text; inputting the input information into the visual question-answering task model for processing to obtain an answer text corresponding to the question text, wherein the visual question-answering task model is obtained by adding downstream task execution parameters to the original parameters of a pre-trained model and adjusting those downstream task execution parameters based on a training data set in the visual question-answering task scenario; and outputting the answer text corresponding to the question text.
  • the present application provides an electronic device, comprising: a processor, and a memory communicatively connected to the processor; the memory stores computer-executable instructions; the processor executes the computer-executable instructions stored in the memory to implement the method described in any of the above aspects.
  • the present application provides a computer-readable storage medium, wherein the computer-readable storage medium stores computer-executable instructions, and when the computer-executable instructions are executed by a processor, they are used to implement the method described in any of the above aspects.
  • FIG. 1 is a schematic diagram of an example network architecture applicable to the present application.
  • FIG. 2 is a flow chart of a method for generating a downstream task model provided by an exemplary embodiment of the present application.
  • FIG. 3 is a schematic diagram of the effect of the downstream task model generation method provided in the present application on improving fine-tuning training efficiency.
  • FIG. 4 is a flow chart of a method for generating a multimodal task model provided by an exemplary embodiment of the present application.
  • FIG. 5 is a flow chart of a task execution method provided by an exemplary embodiment of the present application.
  • FIG. 6 is a flow chart of a method for executing a visual question answering task provided by an exemplary embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a downstream task model generation device provided by an exemplary embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a task execution device provided by an exemplary embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a visual question-answering task execution device provided by an exemplary embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of an electronic device provided by an exemplary embodiment of the present application.
  • Visual question answering task: given an input image and a question, determine the answer to the question from the visual information of the input image.
  • Image description task: generate description text for an input image.
  • Visual entailment task: predict the semantic relation between an input image and text, i.e., entailment, neutral, or contradiction.
  • Referring expression comprehension task: locate the image region in the input image that corresponds to the input text.
  • Image generation task: generate an image based on the input description text.
  • Text-based sentiment classification task: predict the sentiment classification information of the input text.
  • Text summarization task: generate summary information of the input text.
  • Multimodal task: a downstream task whose input and output data involve multiple modalities such as image and text, e.g., visual question answering tasks, image description tasks, visual entailment tasks, referring expression comprehension tasks, and image generation tasks.
  • Multimodal pre-trained model: a pre-trained model whose input and output data involve multiple modalities such as image and text; after fine-tuning training, it can be applied to multimodal task processing.
  • CLIP: Contrastive Language-Image Pre-training.
  • Prompt tuning strategy: a method used to assist fine-tuning training of large-scale pre-trained models.
  • Downstream task execution parameters: also known as prompt parameters; trainable parameters added to a pre-trained model in the fast fine-tuning (prompt tuning) strategy.
  • the method of fine-tuning pre-trained models has the following technical defects: as the scale of pre-trained models continues to increase, the hardware requirements and downstream data requirements for pre-trained model fine-tuning training are constantly increasing, and the efficiency of pre-trained model fine-tuning training is low.
  • an existing prompt-based pre-trained model fine-tuning method is: add a set of prompts in front of the category vector, match the prompt-augmented category vector against the sample vector, and select the best-matching category as the category of the current sample.
  • However, this requires designing appropriate prompts for each downstream task and data set, which is time-consuming and labor-intensive.
  • manually designed prompts are not very stable and may be very sensitive to a certain word (or words), resulting in poor fine-tuning training results.
  • the present application provides a method for generating a downstream task model.
  • a training data set in a downstream task scenario is obtained; downstream task execution parameters are added to the original parameters of the pre-trained model; the downstream task execution parameters in the pre-trained model are adjusted based on the training data set in the downstream task scenario, and the original parameters of the pre-trained model are fixed.
  • a task model of the downstream task is obtained. Since the original parameters of the pre-trained model are fixed, the knowledge learned by the pre-trained model from the pre-training corpus can be retained.
  • only the downstream task execution parameters need to be fine-tuned, so a downstream task model that meets the requirements can be obtained using less downstream task training data.
  • the fine-tuning training effect of the pre-trained model is improved. As the scale of the pre-trained model increases, the efficiency of fine-tuning training is improved more and more significantly.
  • Fig. 1 is a schematic diagram of an example network architecture applicable to the present application.
  • the network architecture includes a server responsible for generating a downstream task model, and an electronic device responsible for executing downstream tasks based on the downstream task model.
  • the server can be a server cluster deployed in the cloud, or a local device with computing power.
  • the server stores a pre-trained model obtained through large-scale pre-training.
  • the server can obtain a training data set in a downstream task scenario.
  • the server performs fine-tuning training of the pre-trained model to generate a downstream task model
  • the original parameters of the pre-trained model are fixed, and the downstream task execution parameters are added to the original parameters of the pre-trained model.
  • the downstream task execution parameters in the pre-trained model are trained based on the training data set in the downstream task scenario.
  • the task model of the downstream task can be obtained.
  • the server can send the obtained task model of the downstream task to a designated electronic device.
  • the electronic device can be a client device that requests the server to generate a task model for a downstream task, or it can be another electronic device specified by the service platform to which the user/server that requests to generate the task model for the downstream task belongs. Specifically, it can be a computing device deployed locally by the customer, or it can be a server deployed in the cloud, etc.
  • the electronic device can be used to execute downstream tasks and provide corresponding services to the outside.
  • the electronic device obtains the input data of the downstream tasks; generates the input information of the task model according to the input format information of the task model in the downstream task scenario; inputs the input information into the task model of the downstream task for processing, obtains the downstream task processing result, and outputs the downstream task processing result.
  • the server adds downstream task execution parameters to the original parameters of the pre-trained language model according to the training data set in the visual question answering task scenario, and trains the downstream task execution parameters in the pre-trained language model based on the training data of the visual question answering task.
  • the visual question answering model can be obtained.
  • the visual question answering model can be deployed locally/to another cloud server to provide visual question answering functions to the outside world.
  • the device deployed with the visual question answering model obtains the input image and question text in the visual question answering task scenario, and generates the input information of the visual question answering task model according to the input format information, image and question text of the visual question answering task model in the visual question answering task scenario; inputs the input information into the task model of the downstream task for processing, obtains the downstream task processing result, and outputs the downstream task processing result.
  • Fig. 2 is a flow chart of a method for generating a downstream task model provided by an exemplary embodiment of the present application.
  • the execution subject of this embodiment can be the server responsible for generating the downstream task model.
  • the method specifically includes the following steps S201 to S203.
  • Step S201 Obtain a training data set in a downstream task scenario.
  • the task objectives differ across downstream task scenarios; that is, the inputs and outputs of the task models are not exactly the same, and different downstream task functions are implemented.
  • the training data sets used in different downstream task scenarios are different.
  • the input format information and output format information of the downstream task model to be generated can be designed according to the task objectives of the downstream task scenario.
  • the training data set is generated based on the input format information and output format information of the downstream task model.
  • For example, for a visual question answering task, the input of the task model includes an input image and a question, and the output is the answer text of the question.
  • For an image description task, the input of the task model includes an input image and a specific prompt text, and the output is the description text of the input image.
  • Step S202 Add downstream task execution parameters to the original parameters of the pre-trained model.
  • downstream task execution parameters are added to the original parameters of the pre-trained model, and the amount of the added downstream task execution parameters is much smaller than the amount of the original parameters.
  • the downstream task execution parameters can be a set of parameters generated by random initialization, or a set of parameters generated according to data in the downstream task scenario.
  • the downstream task execution parameters can be adjusted to obtain a task model that is more suitable for a specific downstream task scenario.
  • the downstream task execution parameters may be spliced onto the original parameters of the pre-trained model.
  • the downstream task execution parameters may be spliced onto one or more layers of parameters of the pre-trained model.
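  • For illustration only, the following minimal sketch (PyTorch; the class, argument names, and encoder interface are hypothetical, not taken from the patent) shows one way trainable downstream task execution parameters could be spliced in front of a frozen pre-trained encoder's embedded input:

```python
import torch
import torch.nn as nn

class PromptedEncoder(nn.Module):
    """Frozen pre-trained encoder with trainable downstream task execution
    parameters (prompt vectors) spliced in front of its embedded input."""

    def __init__(self, pretrained_encoder: nn.Module, hidden_dim: int, prompt_len: int = 20):
        super().__init__()
        self.encoder = pretrained_encoder
        for p in self.encoder.parameters():
            p.requires_grad = False  # the original parameters stay fixed
        # A small number of added, trainable prompt parameters.
        self.prompt = nn.Parameter(torch.randn(prompt_len, hidden_dim) * 0.02)

    def forward(self, embedded_input: torch.Tensor) -> torch.Tensor:
        # embedded_input: (batch, seq_len, hidden_dim)
        prompt = self.prompt.unsqueeze(0).expand(embedded_input.size(0), -1, -1)
        # Splice the prompt at the head of the input sequence.
        return self.encoder(torch.cat([prompt, embedded_input], dim=1))
```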
  • Step S203 Use the training data set to adjust the downstream task execution parameters in the pre-training model to generate a task model for the downstream task, and the task model for the downstream task is used to execute the downstream task.
  • the original parameters of the pre-trained model are fixed, and the training data set in the downstream task scenario is used to train the downstream task execution parameters in the pre-trained model. That is, the original parameters of the pre-trained model are not updated, and only the added downstream task execution parameters are updated.
  • the task model of the downstream task is obtained.
  • the original parameters of the pre-trained models in the task models of multiple downstream tasks generated based on the same pre-trained model are the same, and the difference lies in the added downstream task execution parameters.
  • the input information corresponding to the sample data in the training data set is input into a pre-trained model that has added downstream task execution parameters, and the processing results are output; the loss value is calculated based on the processing results and the labeled data corresponding to the sample data, and the preset loss function, and the downstream task execution parameters added in the pre-trained model are updated according to the loss value.
  • the settings of the loss function, the conditions for the end of training, etc. are similar to those in the existing pre-training model fine-tuning training method and will not be repeated here.
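  • A schematic training step consistent with this description is sketched below; it assumes a model wrapped as in the earlier sketch, and the data loader and loss function are placeholders rather than the patent's specifics:

```python
import torch

def tune_prompt_parameters(model, train_loader, loss_fn, epochs=3, lr=1e-3):
    """Updates only the added downstream task execution parameters;
    the frozen original parameters of the pre-trained model are untouched."""
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=lr)
    for _ in range(epochs):
        for input_info, labels in train_loader:
            loss = loss_fn(model(input_info), labels)  # preset loss function
            optimizer.zero_grad()
            loss.backward()   # gradients reach only the prompt parameters
            optimizer.step()
    return model
```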
  • the downstream task model can be applied to the field of natural language processing or the field of computer vision.
  • the pre-trained model may be a pre-trained language model
  • the downstream tasks may include at least one of the following: visual question answering tasks, image description tasks, visual entailment tasks, referring expression comprehension tasks, image generation tasks, text-based sentiment classification tasks, and text summarization tasks.
  • visual question answering tasks, image description tasks, visual entailment tasks, referring expression comprehension tasks, and image generation tasks belong to the intersection of natural language processing and computer vision
  • text-based sentiment classification tasks and text summarization tasks belong to the field of natural language processing.
  • the downstream task execution parameters in the pre-trained language model with added downstream task execution parameters are fine-tuned based on the training data set in any downstream task scenario, so as to obtain a task model for executing the downstream task.
  • the method provided in this application can be used to realize visual assistant functions, such as helping visually impaired and blind users understand online pictures, identify objects, and understand their surroundings; chat robot functions in VR programs that help users communicate with virtual partners; and scenarios such as online education, where it helps answer students' questions.
  • the method provided in this embodiment can be used for fine-tuning training of other pre-trained models besides language models, which may be generative pre-trained models, pre-trained classification models, multimodal pre-trained models, etc., which are not limited here.
  • In this method, the original parameters of the pre-trained model are fixed and only a small number of downstream task execution parameters are optimized. This reduces the number of parameters that need to be optimized, achieves a good fine-tuning effect with only a small number of samples, and retains the knowledge learned by the pre-trained model from the pre-training corpus, thereby avoiding the computational overhead of fine-tuning all model parameters, improving the efficiency of fine-tuning the pre-trained model, enabling fast fine-tuning training, and improving the efficiency of generating the downstream task model.
  • After the server fine-tunes the pre-trained model to obtain the task model of the downstream task, it can also send the task model to the electronic device used to execute the downstream task.
  • the electronic device uses the task model of the downstream task to execute the downstream task, which can realize the corresponding function of the downstream task or provide the corresponding service of the downstream task to the outside.
  • the electronic device can be a client device that requests the server to generate a task model for a downstream task, or it can be another electronic device specified by the service platform to which the user/server that requests to generate the task model for the downstream task belongs. It can be a computing device deployed locally by the customer, or it can be a server deployed in the cloud, etc.
  • the server can provide external functions for generating task models for downstream tasks such as visual question-answering tasks, image description tasks, visual entailment tasks, referring expression comprehension tasks, image generation tasks, text-based sentiment classification tasks, and text summarization tasks.
  • the server can train the downstream task execution parameters added to the pre-trained language model using the training data set in the downstream task scenario specified by the user, generate a task model for the specified downstream task, and provide the task model for the downstream task to the electronic device specified by the user.
  • the user can specify any one of the downstream tasks, such as a visual question-answering task, an image description task, a visual entailment task, a referring expression comprehension task, an image generation task, a text-based sentiment classification task, or a text summarization task.
  • the server may also store the generated task models of one or more downstream tasks so as to directly obtain them when needed next time.
  • when needed, the complete parameters of the task model of the downstream task can be obtained by adding the downstream task execution parameters trained in the downstream task scenario to the original parameters of the pre-trained model, thereby generating the task model for that downstream task scenario; storing only the added parameters saves storage space and makes the management and maintenance of task models more flexible.
  • the server can also directly store the task model of the downstream task locally. Furthermore, the server can also be used to execute the downstream task, realize the corresponding function of the downstream task, or provide the corresponding service of the downstream task to the outside.
  • the server receives an execution instruction of a downstream task.
  • the server obtains input data in the downstream task scenario; generates input information of the task model according to the input format information of the task model in the downstream task scenario; inputs the input information into the task model of the downstream task for processing to obtain the downstream task processing result; and outputs the downstream task processing result.
  • obtaining the training data set in the downstream task scenario can be implemented in the following manner: obtaining sample data in the downstream task scenario, generating input information corresponding to the sample data according to the input format information of the task model in the downstream task scenario; and obtaining the labeled data corresponding to the input information, the input information and the labeled data constitute the training data set.
  • the labeled data of the training sample includes the correct task processing result corresponding to the training sample.
  • input information that meets the input format requirements of the task model can be generated, and the labeled data of the input information can be obtained.
  • the input information and the labeled data of the input information constitute the training data set for the downstream task.
  • the labeled data of the input information can be obtained through manual labeling.
  • only a small amount of labeled training data is needed to train a well-performing task model for the downstream task; a sketch of assembling such a data set follows.
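  • A minimal sketch of assembling such a training data set (PyTorch; build_input is a hypothetical stand-in for the template-based formatting described above):

```python
from torch.utils.data import Dataset

class DownstreamTaskDataset(Dataset):
    """Pairs template-formatted input information with its labeled data."""

    def __init__(self, samples, labels, build_input):
        # build_input formats each raw sample per the task's input template.
        self.inputs = [build_input(s) for s in samples]
        self.labels = labels

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, i):
        return self.inputs[i], self.labels[i]
```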
  • a corresponding input prompt template may be set for each downstream task scenario, and the input prompt template is used to indicate the format information of the input of the downstream task model.
  • the input prompt templates corresponding to different downstream tasks may be different. In addition, it is not excluded that there are two different downstream tasks corresponding to the same input prompt template.
  • the input of the task model includes an input image and a question
  • the output is the answer text of the question.
  • the input prompt template of the visual question-answering task may include an input image and a question, which can be expressed as follows: "[BOS′] input image [BOS′][BOS] question [BOS]".
  • [BOS′] input image [BOS′] represents the data of the input image (such as an image vector generated by encoding the image), which can refer to different images
  • [BOS] question [BOS] refers to the data of the input question (such as a text vector generated by encoding the question), which can specify different questions.
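  • For concreteness, a hypothetical sketch of filling this template (the encoder interfaces and the handling of the [BOS′]/[BOS] markers are assumptions, not the patent's implementation):

```python
import torch

def build_vqa_input(image, question, image_encoder, text_encoder):
    """Fills "[BOS'] input image [BOS'] [BOS] question [BOS]" by
    concatenating the encoded image and question segments."""
    image_vec = image_encoder(image)        # (n_image_tokens, hidden_dim)
    question_vec = text_encoder(question)   # (n_text_tokens, hidden_dim)
    # The [BOS']/[BOS] delimiters are assumed to be emitted by the encoders.
    return torch.cat([image_vec, question_vec], dim=0)
```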
  • the input of the task model includes an input image and specific prompt text.
  • the input prompt template of the generative image description task may include an input image and a specific prompt text, which can be expressed as follows: "[input image] What does the image describe". Here, "[input image]" is the data of the input image and can refer to different images, and "What does the image describe" is a preset specific prompt text; other prompt texts can be used, but once set, the prompt text remains fixed.
  • the output of the generative image description task is the description text.
  • the input prompt template of the classification-based image description task can be the same as the input prompt template of the generative image description task, expressed as follows: "[input image] What does the image describe".
  • the output of the classification-based image description task is a classification label of the description text, and different classification labels refer to one of the preset multiple categories of description text classifications.
  • an input prompt template of an example visual entailment task may include an input image, input text 1, and input text 2.
  • the input prompt template may be expressed as: "Does [input image] and text 1 {input text 1} mean text 2 {input text 2}?"
  • the output of the visual entailment task is a judgment result, including yes, no, maybe, etc.
  • input information that meets the format requirements of the input prompt template can be generated based on the input prompt template in the downstream task scenario and on sample data, such as image and text data, in the downstream task scenario.
  • the server can obtain a training data set in a visual question and answer task scenario based on the input prompt template of the visual question and answer task, and train the downstream task execution parameters added in the pre-trained model to obtain a visual question and answer task model, and store the obtained (i.e., generated) visual question and answer task model.
  • the server provides a visual question and answer function to the outside.
  • other functional modules need to use the visual question and answer function, they call the visual question and answer function module to issue an execution instruction of the visual question and answer task to the server.
  • the server responds to the execution instruction of the visual question and answer task, obtains the input image and question text, generates input information corresponding to the input image and question text according to the input prompt template of the visual question and answer task, inputs the input information into the visual question answering task model to obtain the answer text, and returns the answer text to the other functional modules.
  • after training is completed, the complete parameters of the downstream task model can be obtained by adding the trained downstream task execution parameters to the original parameters of the pre-trained model, thereby generating the task model of the downstream task; this saves storage space and makes the management and maintenance of the task model more flexible.
  • the training data set in the downstream task scenario can be automatically generated. Based on the training data set in the downstream task scenario, a small amount of downstream task execution parameters added to the pre-training model can be adjusted to generate a task model for the downstream task, which can improve the generation efficiency and performance of the downstream task model.
  • the horizontal axis in Figure 3 is the number of parameters of the pre-trained model
  • the vertical axis is the time required for fine-tuning the pre-trained model; the longer the time required for fine-tuning, the lower the efficiency of fine-tuning training.
  • the time required for fine-tuning by the method of fine-tuning the downstream task execution parameters added in the pre-trained model in this application is shorter, saving the time required for fine-tuning the pre-trained model.
  • the method of fine-tuning the downstream task execution parameters added in the pre-trained model in this application can save more fine-tuning time, and the improvement of training efficiency is more obvious.
  • the downstream task execution parameters in the above step S202 can be generated in the following manner: select multiple words from a set vocabulary and generate word vectors for the multiple words; based on the word vectors for the multiple words, generate the downstream task execution parameters corresponding to each layer in the pre-trained model, and the downstream task execution parameters corresponding to each layer include at least one word vector.
  • corresponding downstream task execution parameters are added to the original parameters of each layer in the pre-trained model, thereby adding some trainable downstream task execution parameters to the layers used for encoding (encoder) and decoding (decoder) of the pre-trained model.
  • the original parameters of the pre-trained model are fixed, and only the added downstream task execution parameters are adjusted, so that the adjusted downstream task execution parameters are different in different downstream task scenarios, thereby generating a task model suitable for a specific downstream task scenario.
  • different layers correspond to different downstream task execution parameters.
  • the fine-tuning effect of the pre-trained model can be improved, and the performance of the generated downstream task model can be improved.
  • the downstream task execution parameters corresponding to different layers may be the same, that is, the downstream task execution parameters initially added to different layers of the pre-trained model are the same. Since the downstream task execution parameters are adjusted during the fine-tuning training of the pre-trained model, the downstream task execution parameters corresponding to different layers are different after the training is completed.
  • the vocabulary set in each downstream task scenario may be a randomly generated vocabulary, or may be a uniformly set preset vocabulary.
  • the vocabulary set in each downstream task scenario can also be set according to the training data set in the downstream task scenario, and different vocabulary is used in different downstream task scenarios.
  • the preset vocabulary can be a pre-configured vocabulary.
  • according to the preset number of word vectors contained in each downstream task execution parameter, that number of word vectors can be selected from the obtained word vectors and combined into one downstream task execution parameter; according to the number of layers of the pre-trained model, one downstream task execution parameter is generated for each layer.
  • the preset number can be set and adjusted according to the actual application scenario/field, and is not limited here. For example, the preset number can be set between 10 and 100, or can take other values.
  • when generating the downstream task execution parameters corresponding to each layer in the pre-trained model based on the word vectors of the multiple words, different word vectors can be selected for each layer according to the preset number of word vectors contained in each downstream task execution parameter, and then concatenated to generate that layer's downstream task execution parameters. In this way, the word vectors selected for different layers are not exactly the same, ensuring that the downstream task execution parameters added to different layers are different.
  • when adding corresponding downstream task execution parameters to the original parameters of each layer in the pre-trained model, they can be spliced in front of (at the head of) the original parameters of each layer, so that the downstream task execution parameters can be easily obtained and updated; splicing multiple consecutive word vectors to generate the downstream task execution parameters (prompt) for fast fine-tuning also alleviates the prompt word-sensitivity problem.
  • the corresponding downstream task execution parameters can also be spliced behind (at the tail of) the original parameters of each layer in the pre-trained model, as illustrated in the sketch below.
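  • A sketch of generating per-layer downstream task execution parameters from a set vocabulary, as described above (the embedding table, sizes, and sampling scheme are illustrative assumptions):

```python
import torch
import torch.nn as nn

def make_layer_prompts(embedding_table: torch.Tensor, num_layers: int,
                       prompt_len: int = 20, seed: int = 0) -> nn.ParameterList:
    """For each layer, selects words from the vocabulary's embedding table
    and concatenates their vectors into one trainable prompt block, so that
    different layers receive different downstream task execution parameters."""
    g = torch.Generator().manual_seed(seed)
    vocab_size = embedding_table.size(0)
    prompts = []
    for _ in range(num_layers):
        idx = torch.randint(vocab_size, (prompt_len,), generator=g)
        prompts.append(nn.Parameter(embedding_table[idx].clone()))
    return nn.ParameterList(prompts)

# Each block is then spliced at the head (or tail) of its layer's input, e.g.
# torch.cat([prompts[i].expand(batch, -1, -1), hidden_states], dim=1).
```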
  • Fig. 4 is a flow chart of a method for generating a multimodal task model provided by an exemplary embodiment of the present application.
  • the method provided in this embodiment can be used to perform fine-tuning training on a multimodal pre-trained model to generate a multimodal task model.
  • the method specifically includes the following steps S401 to S404 .
  • Step S401 In response to a task model generation instruction of a multimodal task, an input prompt template corresponding to the multimodal task is obtained.
  • the multimodal task may include at least one of the following: a visual question-answering task, an image description task, a visual entailment task, a referring expression comprehension task, and an image generation task.
  • the input and output data involved in multimodal tasks include data from multiple modalities such as images and text.
  • the input of a visual question answering task includes image data and text data, and the output is text data.
  • the input of an image description task is image data, and the output is text data.
  • the input of a visual entailment task is an image and text, and the output is their semantic relation; the input of a referring expression comprehension task is an image and text, and the output is the location information of the corresponding image region.
  • the input of an image generation task is text data, and the output is image data.
  • a corresponding input prompt template may be set for each downstream task scenario, and the input prompt template is used to indicate the input format information of the downstream task model.
  • the input prompt template of the multimodal task is used to indicate the format information of the input information of the pre-trained model.
  • the input prompt templates corresponding to different multimodal tasks may be different. In addition, it is not excluded that there are two different multimodal tasks corresponding to the same input prompt template.
  • the input of the task model includes an input image and a question
  • the output is the answer text of the question.
  • the input prompt template of the visual question-answering task may include an input image and a question, which can be expressed as follows: "[BOS′] input image [BOS′][BOS] question [BOS]".
  • [BOS′] input image [BOS′] represents the data of the input image (such as an image vector generated by encoding the image), which can refer to different images
  • [BOS] question [BOS] refers to the data of the input question (such as a text vector generated by encoding the question), which can specify different questions.
  • the input of the task model includes an input image and specific prompt text.
  • the input prompt template of the generative image description task may include an input image and a specific prompt text, which can be expressed as follows: "[input image] What does the image describe". Here, "[input image]" is the data of the input image and can refer to different images, and "What does the image describe" is a preset specific prompt text; other prompt texts can be used, but once set, the prompt text remains fixed.
  • the output of the generative image description task is the description text.
  • Step S402 According to the input prompt template corresponding to the multimodal task, a training data set of the multimodal task is obtained, where the multimodal task data set includes image data and text data.
  • the labeled data of the input information can be obtained by manual labeling.
  • only a small amount of labeled training data is needed to train a well-performing task model for the downstream task.
  • the training data set of the multimodal task includes data of multiple modalities such as images and texts.
  • Step S403 Add downstream task execution parameters to the original parameters of the pre-trained model.
  • step S403 is implemented in the same manner as the above-mentioned step S202.
  • For details, please refer to the relevant content of step S202, which will not be repeated here.
  • Step S404 Adjust the downstream task execution parameters in the pre-trained language model based on the training data set of the multimodal task to generate a multimodal task model.
  • the original parameters of the pre-trained language model are kept fixed, and the downstream task execution parameters in the pre-trained language model are adjusted based on the training data set of the multimodal task to obtain a model that is more suitable for a specific multimodal task.
  • the multimodal task model is used to perform the multimodal task according to the input information generated based on the input data and the input prompt template to obtain the task processing result.
  • step S404 is implemented in the same manner as the above-mentioned step S203.
  • For details, please refer to the relevant content of step S203, which will not be repeated here.
  • prompt learning is introduced into the multimodal task.
  • downstream task execution parameters are added to the pre-trained language model, and a training data set of the multimodal task is generated based on the input prompt template of the multimodal task.
  • the downstream task execution parameters added to the pre-trained language model are trained based on the training data set of the multimodal task to generate a task model of the multimodal task.
  • the training effect of only fine-tuning a small amount of downstream task execution parameters is comparable to the effect of training all the parameters of the pre-trained model in some cases, which improves the training efficiency of the pre-trained language model and can quickly train and generate a multimodal task model.
  • FIG. 5 is a flow chart of a task execution method provided by an exemplary embodiment of the present application.
  • the task execution method provided by this embodiment can utilize the task model of the downstream task generated based on the aforementioned downstream task model generation method embodiment to implement the execution of the downstream task, thereby implementing the corresponding functions of the downstream task scenario and providing corresponding services.
  • the execution subject of the method provided by this embodiment is an electronic device responsible for executing the downstream task based on the downstream task model.
  • the method specifically includes the following steps S501 to S504 .
  • Step S501 Responding to a downstream task execution instruction, obtaining input data.
  • the electronic device stores a task model of the downstream task obtained by adjusting the downstream task execution parameters added in the pre-trained model, and can provide the function of executing the downstream task or provide external services corresponding to the downstream task based on the task model of the downstream task.
  • the electronic device provides a function of executing a downstream task, and when other functional modules need to use the function of executing a downstream task, the functional module of the downstream task is called to issue an execution instruction of the downstream task to the electronic device.
  • the electronic device obtains input data in response to the execution instruction of the downstream task.
  • the electronic device provides services corresponding to the downstream tasks.
  • the user needs to use the services corresponding to the downstream tasks provided by the electronic device, the user sends a downstream task execution instruction to the electronic device through the client.
  • the electronic device obtains input data in response to the execution instruction of the downstream task.
  • the input data may include data in at least one of the following modalities: image data and text data.
  • Step S502 Generate input information corresponding to the input data according to the input format information of the task model in the downstream task scenario.
  • the electronic device encodes the data of each modality in the input data to generate a corresponding vector, and generates input information of the task model according to the input format information of the task model in the current task scenario.
  • a corresponding input prompt template may be set for each downstream task scenario, and the input prompt template is used to indicate the input format information of the downstream task model.
  • an input prompt template for an example visual question answering task may include an input image and a question, which may be expressed as follows: "[BOS′] input image [BOS′][BOS] question [BOS]".
  • [BOS′] input image [BOS′] represents the data of the input image (such as an image vector generated by encoding the image), which may refer to different images
  • "[BOS] question [BOS]" refers to the data of the input question (such as a text vector generated by encoding the question), which may specify different questions.
  • two "[BOS′]" markers are used to mark image data
  • two "[BOS]" markers are used to mark text data. Two other distinct symbols may also be used to distinguish the marked image data from the text data.
  • the output of the visual question answering task is the answer text to the question.
  • the input data acquired by the electronic device includes an input image and a question text.
  • the image is encoded to generate an image vector
  • the question text is encoded to generate a text vector.
  • the input prompt template of the visual question answering task "[BOS′] input image [BOS′][BOS] question [BOS]"
  • the image vector corresponding to the image and the text vector corresponding to the question text are concatenated to obtain the corresponding input information.
  • the input prompt template of the generative image description task may include an input image and a specific prompt text, which may be expressed in the following form: "[input image] What the image describes".
  • "[input image]" is the data of the input image and may refer to different images.
  • "What the image describes" is a preset specific prompt text; other prompt texts may be used, but once set, the prompt text remains fixed.
  • the output of the generative image description task is a description text.
  • the input data acquired by the electronic device includes an input image, and the input image is encoded to generate an image vector.
  • the specific prompt text "What the image describes" is encoded to generate a text vector, and the image vector of the input image is concatenated with the text vector of the prompt text to obtain the corresponding input information.
  • Step S503 input the input information into the trained task model for processing to obtain the task processing result.
  • the task model is obtained by adding downstream task execution parameters to the original parameters of the pre-trained model and training the downstream task execution parameters in the pre-trained model based on the task training data set.
  • after the input information corresponding to the input data is generated based on the input prompt template in the current task scenario, the input information is input into the trained task model for processing to obtain the task processing result.
  • the task model is based on a pre-trained model, and utilizes the downstream task model generation method provided by any of the above embodiments to obtain a training data set for the current task based on an input prompt template in the current task scenario; the downstream task execution parameters are added to the original parameters of the pre-trained model, and the downstream task execution parameters in the pre-trained model are trained (fine-tuned) based on the training data set of the current task.
  • Step S504 output the task processing result.
  • the electronic device provides a function of executing a downstream task.
  • other functional modules need to use the function of executing a downstream task, they call the functional module of the downstream task to issue an execution instruction of the downstream task to the electronic device.
  • after obtaining the task processing result, the electronic device returns it to the other functional modules.
  • the electronic device provides services corresponding to downstream tasks.
  • the user needs to use the services corresponding to downstream tasks provided by the electronic device, the user sends a downstream task execution instruction to the electronic device through the client. After obtaining the task processing result, the electronic device outputs the task processing result to the user's client device.
  • the current task may include at least one of the following: a visual question answering task, an image description task, a visual entailment task, a referring expression comprehension task, an image generation task, a text-based sentiment classification task, and a text summarization task.
  • the visual question answering task, the image description task, the visual entailment task, the referring expression comprehension task, and the image generation task belong to the intersection of natural language processing and computer vision
  • the text-based sentiment classification task and the text summarization task belong to the field of natural language processing.
  • the task models used to execute these downstream tasks are obtained using the downstream task model generation method provided in any of the above embodiments: a training data set for the current task is obtained according to the input prompt template corresponding to the current task; downstream task execution parameters are added to the original parameters of the pre-trained model; and the downstream task execution parameters in the pre-trained model are trained based on the training data set of the current task. This enables rapid training of the pre-trained model and quickly yields the task model of the downstream task; under the same training data set scale, the training effect is improved, thereby improving the execution effect of the downstream tasks.
  • FIG. 6 is a flow chart of a method for executing a visual question answering task provided by an exemplary embodiment of the present application.
  • the execution subject of the method provided by this embodiment is an electronic device responsible for executing the visual question answering task based on the visual question answering task model.
  • the method specifically includes the following steps S601 to S604 .
  • Step S601 Obtain input image and question text.
  • the electronic device stores a visual question answering task model obtained by adding downstream task execution parameters to the pre-trained language model, and can provide a function of executing a visual question answering task or provide external services corresponding to the visual question answering task based on the visual question answering task model.
  • the visual question answering task determines the answer to the question from the visual information of the input image based on the input image and question.
  • the electronic device provides a function of executing a visual question-answering task.
  • the functional module of the visual question-answering task is called to issue an execution instruction of the visual question-answering task to the electronic device.
  • the electronic device obtains the image and question text in the input parameters.
  • the electronic device provides a visual question-answering service to the outside world.
  • the user needs to use the visual question-answering service provided by the electronic device, the user sends a visual question-answering request to the electronic device through the client.
  • the electronic device obtains the image and question text input by the user.
  • Step S602 Generate input information of the visual question answering task model according to the input format information of the visual question answering task model in the visual question answering task scenario, the image, and the question text.
  • a corresponding input prompt template may be set for each downstream task scenario, and the input prompt template is used to indicate the input format information of the downstream task model.
  • the input information corresponding to the input image and question text is generated according to the input prompt template in the current task scenario.
  • the input prompt template corresponding to the visual question answering task may include an input image and a question, which may be expressed in the following form: “[BOS′] input image [BOS′] [BOS] question [BOS]”.
  • [BOS′] input image [BOS′] represents the data of the input image (such as the image vector generated by encoding the image), which can refer to different images
  • "[BOS] question [BOS]" refers to the data of the input question (such as the text vector generated by encoding the question), which can specify different questions.
  • the image is encoded to generate a corresponding image vector
  • the question text is encoded to generate a corresponding text vector
  • the image vector and the text vector are concatenated to obtain the corresponding input information.
  • Step S603 input the input information into the visual question answering task model for processing to obtain the answer text corresponding to the question text.
  • the visual question answering task model is obtained by adding downstream task execution parameters to the original parameters of the pre-trained model, and training the downstream task execution parameters in the pre-trained model based on the training data set of the visual question answering task. It can be specifically obtained by the downstream task model generation method provided in any of the above embodiments, which will not be repeated in this embodiment.
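  • Putting steps S601 to S603 together, a hypothetical inference sketch (the model, encoder, and tokenizer interfaces are assumptions, not the patent's implementation):

```python
import torch

@torch.no_grad()
def answer_question(vqa_model, image, question, image_encoder, text_encoder, tokenizer):
    """Encodes the image and question text, concatenates the two vectors into
    the input information, and decodes the model output into answer text."""
    input_info = torch.cat([image_encoder(image), text_encoder(question)], dim=0)
    output_ids = vqa_model(input_info.unsqueeze(0)).argmax(dim=-1)
    return tokenizer.decode(output_ids.squeeze(0).tolist())
```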
  • Step S604 output the answer text corresponding to the question text.
  • the electronic device provides a function of performing a visual question-answering task.
  • the functional module of the visual question-answering task is called to issue an execution instruction of the visual question-answering task to the electronic device.
  • after obtaining the answer text, the electronic device returns it to the other functional modules.
  • the electronic device provides a visual question-answering service to the outside world.
  • the user needs to use the visual question-answering service provided by the electronic device, the user sends a visual question-answering request to the electronic device through the client. After obtaining the answer text, the electronic device outputs the answer text to the user's client device.
  • the visual question answering model is obtained by utilizing the downstream task model generation method provided by any of the above embodiments, and obtaining a training data set for the visual question answering task according to an input prompt template corresponding to the visual question answering task; adding downstream task execution parameters to the original parameters of the pre-trained language model, and training the downstream task execution parameters in the pre-trained language model based on the training data set of the visual question answering task, which can achieve rapid fine-tuning of the pre-trained language model and quickly obtain the visual question answering task model. Under the same training data set scale, the fine-tuning effect is improved, thereby improving the execution effect of the visual question answering task.
  • FIG. 7 is a schematic diagram of the structure of a downstream task model generation device provided by an exemplary embodiment of the present application.
  • the device provided in this embodiment is used to perform a downstream task model generation method.
  • the downstream task model generation device 70 includes: a training set generation module 71, a parameter addition module 72, and a parameter adjustment module 73.
  • the training set generation module 71 is used to obtain the training data set in the downstream task scenario.
  • the parameter adding module 72 is used to add downstream task execution parameters to the original parameters of the pre-trained model.
  • the parameter adjustment module 73 is used to use the training data set to adjust the downstream task execution parameters in the pre-training model to generate a task model of the downstream task, and the task model of the downstream task is used to execute the downstream task.
  • the parameter adding module 72 when adding downstream task execution parameters to the original parameters of the pre-trained model, is also used to: select multiple words from a set vocabulary to generate word vectors for the multiple words; generate downstream task execution parameters corresponding to each layer in the pre-trained model based on the word vectors for the multiple words, and the downstream task execution parameters corresponding to each layer include at least one word vector; and add the corresponding downstream task execution parameters to the original parameters of each layer in the pre-trained model.
  • the parameter adding module 72 is further used to set a vocabulary used in the downstream task scenario according to the training data set in the downstream task scenario.
  • the parameter adding module 72, when adding corresponding downstream task execution parameters to the original parameters of each layer in the pre-trained model, is also used to: concatenate the corresponding downstream task execution parameters in front of the original parameters of each layer in the pre-trained model.
  • the training set generation module 71 when obtaining a training data set in a downstream task scenario, is also used to: obtain sample data in the downstream task scenario; generate input information corresponding to the sample data based on the input format information of the task model in the downstream task scenario; obtain labeled data corresponding to the input information, and the input information and labeled data constitute the training data set.
  • the training set generation module 71 when generating input information corresponding to sample data based on the input format information of the task model in the downstream task scenario, is also used to: obtain an input prompt template in the downstream task scenario, the input prompt template is determined based on the input format information of the task model in the downstream task scenario; generate input information corresponding to the sample data based on the sample data and the input prompt template in the downstream task scenario.
  • the parameter adjustment module 73 is further used to: send the task model of the downstream task to an electronic device used to execute the downstream task.
  • the parameter adjustment module 73 is also used to: store the task model of the downstream task; obtain the input data of the downstream task in response to the execution instruction of the downstream task; generate the input information of the task model according to the input format information of the task model in the downstream task scenario; input the input information into the task model of the downstream task for processing to obtain the downstream task processing result; and output the downstream task processing result.
  • the pre-trained model is a pre-trained language model
  • the downstream tasks include at least one of the following: a visual question answering task, which is used to determine the answer to the question from the visual information of the input image based on the input image and the question; an image description task, which is used to generate a description text of the input image; a visual entailment task, which is used to predict the semantic relevance between the input image and the text; a referring expression comprehension task, which is used to locate the image area in the input image corresponding to the input text based on the input text; an image generation task, used to generate images based on the input description text; a text-based sentiment classification task, used to predict the sentiment classification information of the input text; and a text summary task, used to generate summary information of the input text.
  • the device provided in this embodiment can be used to execute the downstream task model generation method provided based on any of the above embodiments, and the specific functions and technical effects that can be achieved will not be repeated here.
  • FIG. 8 is a schematic diagram of the structure of a task execution device provided by an exemplary embodiment of the present application.
  • the device provided by this embodiment is used to execute the above-mentioned task execution method.
  • the task execution device 80 includes: a data input module 81, an input information generation module 82, a task execution module 83 and a result output module 84.
  • the data input module 81 is used to obtain input data in response to downstream task execution instructions.
  • the input information generation module 82 is used to generate input information corresponding to the input data according to the input format information of the task model in the downstream task scenario.
  • the task execution module 83 is used to input input information into the trained task model for processing to obtain the task processing result.
  • the task model is obtained by adding downstream task execution parameters to the original parameters of the pre-trained model and adjusting the downstream task execution parameters in the pre-trained model based on the training data set in the downstream task scenario.
  • the result output module 84 is used to output the task processing results.
  • the downstream task is any one of the following: a visual question answering task, which is used to determine the answer to the question from the visual information of the input image based on the input image and the question; an image description task, which is used to generate a description text of the input image; a visual entailment task, which is used to predict the semantic relevance between the input image and the text; a referring expression comprehension task, which is used to locate the image area in the input image corresponding to the input text based on the input text; an image generation task, which is used to generate an image based on the input description text; a text-based sentiment classification task, which is used to predict the sentiment classification information of the input text; or a text summary task, which is used to generate summary information of the input text.
  • the device provided in this embodiment can be used to execute the task execution method provided based on any of the above embodiments, and the specific functions and technical effects that can be achieved will not be repeated here.
  • FIG. 9 is a schematic diagram of the structure of a visual question-answering task execution device provided in an example embodiment of the present application.
  • the device provided in this embodiment is used to execute a visual question-answering task execution method.
  • a visual question-answering task execution device 90 includes: a data input module 91, an input information generation module 92, a visual question-answering module 93, and an answer output module 94.
  • the data input module 91 is used to obtain input images and question texts.
  • the input information generating module 92 is used to generate the input information of the visual question answering task model according to the input format information of the visual question answering task model in the visual question answering task scenario, the image, and the question text.
  • the visual question answering module 93 is used to input the input information into the visual question answering task model for processing to obtain the answer text corresponding to the question text.
  • the visual question answering task model is obtained by adding downstream task execution parameters to the original parameters of the pre-trained model and adjusting the downstream task execution parameters in the pre-trained model based on the training data set in the visual question answering task scenario.
  • the answer output module 94 is used to output the answer text corresponding to the question text.
  • the input information generation module 92 when generating input information of a visual question answering task model based on an input prompt template, an image and a question text in a visual question answering task scenario, is further used to: encode the image to generate a corresponding image vector, and encode the question text to generate a corresponding text vector; and concatenate the image vector and the text vector according to the input prompt template in the visual question answering task scenario to obtain the input information of the visual question answering task model.
  • the device provided in this embodiment can be used to execute the visual question answering task execution method provided based on any of the above embodiments, and the specific functions and technical effects that can be achieved will not be repeated here.
  • FIG. 10 is a schematic diagram of the structure of an electronic device provided by an exemplary embodiment of the present application.
  • the electronic device 100 includes: a processor 1001, and a memory 1002 communicatively connected to the processor 1001, and the memory 1002 stores computer-executable instructions.
  • the processor executes the computer-executable instructions stored in the memory to implement the solution provided by any of the above method embodiments, and the specific functions and technical effects that can be achieved are not repeated here.
  • the method and device for downstream task model generation and task execution provided by the present application obtain a training data set in the downstream task scenario and add downstream task execution parameters to the original parameters of the pre-trained model.
  • the original parameters of the pre-trained model are fixed, and only the small number of added downstream task execution parameters are optimized based on the training data set in the downstream task scenario. This retains the knowledge learned by the pre-trained model from the pre-training corpus and reduces the number of parameters that need to be optimized.
  • An embodiment of the present application also provides a computer-readable storage medium, in which computer execution instructions are stored.
  • when the computer execution instructions are executed by a processor, they are used to implement the solution provided by any of the above method embodiments. The specific functions and technical effects that can be achieved are not repeated here.
  • An embodiment of the present application also provides a computer program product, which includes: a computer program, which is stored in a readable storage medium. At least one processor of an electronic device can read the computer program from the readable storage medium, and at least one processor executes the computer program so that the electronic device executes the solution provided by any of the above method embodiments. The specific functions and technical effects that can be achieved are not repeated here.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The present application provides a method and device for downstream task model generation and task execution. In the method of the present application, a training data set in a downstream task scenario is obtained, and downstream task execution parameters are added to the original parameters of a pre-trained model. During fine-tuning of the pre-trained model, the original parameters of the pre-trained model are fixed, and only the small number of added downstream task execution parameters are optimized based on the training data set in the downstream task scenario. The knowledge learned by the pre-trained model from the pre-training corpus can thus be retained, the number of parameters to be optimized is reduced, and a good fine-tuning effect is obtained with only a small number of samples. This reduces the computational overhead of fine-tuning all model parameters, improves the efficiency of fine-tuning the pre-trained model, enables rapid fine-tuning of the pre-trained model, and improves the generation efficiency and performance of the downstream task model.

Description

Method and device for downstream task model generation and task execution
This application claims priority to Chinese Patent Application No. 202211387996.7, filed with the Chinese Patent Office on November 8, 2022 and entitled "Method and Device for Downstream Task Model Generation and Task Execution", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to computer technology, and in particular to a method and device for downstream task model generation and task execution.
Background
With the rapid development of computer technology, natural language processing has also flourished. In fields such as natural language processing (NLP) and computer vision, pre-trained language models have received extensive attention and use. A pre-trained language model can be pre-trained on a large-scale unlabeled corpus and can learn general language representations. These representations can be used for other downstream tasks, such as Visual Question Answering (VQA), Image Caption (IC), Visual Entailment (VE), and Referring Expression Comprehension (REC) at the intersection of NLP and computer vision, as well as natural language processing tasks such as text-based sentiment classification and text summarization, and can be applied in fields such as visual assistants, intelligent robots, and online education.
When applied to a downstream task, the pre-trained model needs to be fine-tuned on the data set of the downstream task so that the fine-tuned model is better suited to the downstream task. The rich variety of downstream tasks makes the objective design of the fine-tuning stage very cumbersome and complex. Because the objectives of the pre-trained model and the downstream task are inconsistent, a "gap" often exists between them, and a structure bias exists between input and output, so the pre-trained model usually cannot be directly adapted to the downstream task, and its parameters must be fine-tuned using the downstream task data set. However, as the scale of pre-trained models keeps growing, the hardware requirements and downstream data requirements for fine-tuning keep rising, and generating a downstream task model by fine-tuning the parameters of the pre-trained model is inefficient.
Summary
The present application provides a method and device for downstream task model generation and task execution, to solve the problem of low efficiency in generating a downstream task model by fine-tuning the parameters of a pre-trained model.
In a first aspect, the present application provides a downstream task model generation method, including: obtaining a training data set in a downstream task scenario; adding downstream task execution parameters to the original parameters of a pre-trained model; and adjusting the downstream task execution parameters in the pre-trained model using the training data set to generate a task model of the downstream task, the task model of the downstream task being used to execute the downstream task.
In a second aspect, the present application provides a task execution method, including: obtaining input data in response to a downstream task execution instruction; generating input information corresponding to the input data according to the input format information of the task model in the downstream task scenario; inputting the input information into a trained task model for processing to obtain a task processing result, the task model being obtained by adding downstream task execution parameters to the original parameters of a pre-trained model and adjusting the downstream task execution parameters in the pre-trained model based on the training data set in the downstream task scenario; and outputting the task processing result.
In a third aspect, the present application provides a visual question answering task execution method, including: obtaining an input image and question text; generating input information of a visual question answering task model according to the input format information of the visual question answering task model in the visual question answering task scenario, the image, and the question text; inputting the input information into the visual question answering task model for processing to obtain answer text corresponding to the question text, the visual question answering task model being obtained by adding downstream task execution parameters to the original parameters of a pre-trained model and adjusting the downstream task execution parameters in the pre-trained model based on the training data set in the visual question answering task scenario; and outputting the answer text corresponding to the question text.
In a fourth aspect, the present application provides an electronic device, including: a processor, and a memory communicatively connected to the processor; the memory stores computer-executable instructions; and the processor executes the computer-executable instructions stored in the memory to implement the method described in any of the above aspects.
In a fifth aspect, the present application provides a computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, are used to implement the method described in any of the above aspects.
Brief Description of the Drawings
In the drawings, unless otherwise specified, the same reference numerals throughout the figures denote the same or similar components or elements. The drawings are not necessarily drawn to scale. It should be understood that the drawings depict only some embodiments disclosed in the present application and should not be regarded as limiting the scope of the present application.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and, together with the description, serve to explain the principles of the present application.
FIG. 1 shows a schematic diagram of an example network architecture to which the present application applies.
FIG. 2 shows a flowchart of a downstream task model generation method provided by an exemplary embodiment of the present application.
FIG. 3 shows a schematic diagram of the fine-tuning efficiency improvement achieved by the downstream task model generation method provided by the present application.
FIG. 4 shows a flowchart of a multimodal task model generation method provided by an exemplary embodiment of the present application.
FIG. 5 shows a flowchart of a task execution method provided by an exemplary embodiment of the present application.
FIG. 6 shows a flowchart of a visual question answering task execution method provided by an exemplary embodiment of the present application.
FIG. 7 shows a schematic structural diagram of a downstream task model generation device provided by an example embodiment of the present application.
FIG. 8 shows a schematic structural diagram of a task execution device provided by an example embodiment of the present application.
FIG. 9 shows a schematic structural diagram of a visual question answering task execution device provided by an example embodiment of the present application.
FIG. 10 shows a schematic structural diagram of an electronic device provided by an example embodiment of the present application.
The above drawings show specific embodiments of the present application, which are described in more detail below. The drawings and textual description are not intended to limit the scope of the inventive concept in any way, but to explain the concept of the present application to those skilled in the art by reference to specific embodiments.
Detailed Description
To make the objectives, technical solutions, and advantages of the present application clearer, the technical solutions of the present application are described clearly and completely below in combination with specific embodiments of the present application and the corresponding drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. Based on the embodiments in the present application, all other embodiments obtained by a person of ordinary skill in the art without creative work fall within the protection scope of the present application.
Exemplary embodiments are described in detail here, examples of which are shown in the drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application; rather, they are merely examples of devices and methods consistent with some aspects of the present application as detailed in the appended claims.
First, the terms involved in the present application are explained.
Visual question answering task: determining the answer to a question from the visual information of an input image, according to the input image and the question.
Image description task: generating description text of an input image.
Visual entailment task: predicting the semantic relevance of an input image and text, i.e., entailment, neutral, or contradiction.
Referring expression comprehension task: locating, according to input text, the image region in an input image that corresponds to the input text.
Image generation task: generating an image based on input description text.
Text-based sentiment classification task: predicting sentiment classification information of input text.
Text summarization task: generating summary information of input text.
Multimodal task: a downstream task whose input and output data involve data of multiple modalities such as images and text, for example the visual question answering task, image description task, visual entailment task, referring expression comprehension task, and image generation task.
Multimodal pre-trained model: a pre-trained model whose input and output data involve data of multiple modalities such as images and text, which can be applied to multimodal task processing after fine-tuning. For example, the CLIP (Contrastive Language-Image Pre-training) model is a multimodal pre-trained model.
Prompt tuning: a method for assisting the fine-tuning of large-scale pre-trained models.
Downstream task execution parameters (prompt): also called prompt parameters, the trainable parameters added in the prompt-tuning strategy for a pre-trained model.
The method of fine-tuning a pre-trained model has the following technical defect: as the scale of pre-trained models keeps growing, the hardware requirements and downstream data requirements of fine-tuning keep rising, and the fine-tuning of pre-trained models is inefficient.
To solve the above technical problem, one existing prompt-based fine-tuning method for pre-trained models works as follows: a group of prompts is added in front of the class vectors, the class vectors with added prompts are matched against the sample vectors, and the class with the highest matching score is selected as the class of the current sample. However, when used for different downstream tasks, a suitable prompt must be designed for each downstream task and data set, which is very time-consuming and labor-intensive when there are many data sets or downstream tasks. In addition, manually designed prompts are not very stable and may be sensitive to a particular word, leading to poor fine-tuning results.
The present application provides a downstream task model generation method. When a pre-trained model is fine-tuned to generate a downstream task model, a training data set in the downstream task scenario is obtained; downstream task execution parameters are added to the original parameters of the pre-trained model; the downstream task execution parameters in the pre-trained model are adjusted based on the training data set in the downstream task scenario while the original parameters of the pre-trained model are fixed; and after the adjustment of the downstream task execution parameters is completed, the task model of the downstream task is obtained. Because the original parameters of the pre-trained model remain fixed, the knowledge learned by the pre-trained model from the pre-training corpus can be retained. Moreover, only a small number of downstream task execution parameters are fine-tuned, so a downstream task model that meets the requirements can be obtained with relatively little downstream training data. At the same training data set scale, the fine-tuning effect of the pre-trained model is improved, and as the scale of the pre-trained model grows, the efficiency gain of fine-tuning becomes more and more significant.
FIG. 1 is a schematic diagram of an example network architecture to which the present application applies. As shown in FIG. 1, the network architecture includes a server responsible for generating the downstream task model and an electronic device responsible for executing the downstream task based on the downstream task model.
The server may be a server cluster deployed in the cloud or a local device with computing capability. The server stores a pre-trained model that has been pre-trained on a large-scale training corpus. The server can obtain a training data set in the downstream task scenario. When fine-tuning the pre-trained model to generate the downstream task model, the server fixes the original parameters of the pre-trained model, adds downstream task execution parameters to the original parameters of the pre-trained model, and trains the downstream task execution parameters in the pre-trained model based on the training data set in the downstream task scenario; after training is completed, the task model of the downstream task is obtained. Further, the server may send the obtained task model of the downstream task to a designated electronic device.
The electronic device may be a client device that requests the server to generate the task model of the downstream task, or another electronic device designated by the requesting user or by the service platform to which the server belongs; specifically, it may be a computing device deployed locally at the customer or a server deployed in the cloud.
The electronic device may be used to execute the downstream task and provide corresponding services externally. In response to an execution instruction of the downstream task, the electronic device obtains the input data of the downstream task, generates the input information of the task model according to the input format information of the task model in the downstream task scenario, inputs the input information into the task model of the downstream task for processing to obtain the downstream task processing result, and outputs the downstream task processing result.
Illustratively, taking the visual question answering task as the downstream task, the server adds downstream task execution parameters to the original parameters of a pre-trained language model according to the training data set in the visual question answering task scenario, and trains the downstream task execution parameters in the pre-trained language model based on the training data of the visual question answering task; after training is completed, the visual question answering model is obtained. The visual question answering model may be deployed to a local or another cloud server to provide a visual question answering function externally. When a visual question answering task needs to be executed, the device on which the visual question answering model is deployed obtains the image and question text input in the visual question answering task scenario, generates the input information of the visual question answering task model according to the input format information of the visual question answering task model in the visual question answering task scenario, the image, and the question text, inputs the input information into the task model of the downstream task for processing to obtain the downstream task processing result, and outputs the downstream task processing result.
The technical solution of the present application, and how it solves the above technical problem, are described in detail below with specific embodiments. The following specific embodiments may be combined with one another, and the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application are described below with reference to the drawings.
FIG. 2 is a flowchart of a downstream task model generation method provided by an exemplary embodiment of the present application. The execution subject of this embodiment may be the above server responsible for generating the downstream task model. As shown in FIG. 2, the method includes the following steps S201 to S203.
Step S201: obtain a training data set in a downstream task scenario.
In this embodiment, the task objectives differ across downstream task scenarios; that is, the inputs and outputs of the task models are not exactly the same, so as to implement different downstream task functions.
When the pre-trained model is fine-tuned to generate the task model of a downstream task, different training data sets are used in different downstream task scenarios. The input format information and output format information of the downstream task model to be generated may be designed according to the task objective of the downstream task scenario, and the training data set in the downstream task scenario is generated based on the input format information and output format information of the downstream task model.
Illustratively, in an example visual question answering task scenario, the input of the task model includes an input image and a question, and the output is the answer text of the question. In a generative image description task scenario, the input of the task model includes an input image and specific prompt text, and the output is the description text of the input image.
Step S202: add downstream task execution parameters to the original parameters of the pre-trained model.
In this embodiment, downstream task execution parameters are added to the original parameters of the pre-trained model, and the amount of added downstream task execution parameters is far smaller than the amount of original parameters.
The downstream task execution parameters may be a group of randomly initialized parameters, or a group of parameters generated from data in the downstream task scenario. When the pre-trained model is fine-tuned to generate the downstream task model, only the downstream task execution parameters in the pre-trained model are adjusted, so as to obtain a task model better suited to the specific downstream task scenario.
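Illustratively, the following Python sketch shows how such a group of randomly initialized downstream task execution parameters might be created and how small it is relative to the original parameters; all dimensions here are illustrative assumptions, not values given in the present application.

```python
import torch

# Illustrative, assumed dimensions -- not taken from the present application.
hidden_size = 1024     # width of the pre-trained model
num_layers = 24        # number of layers that receive prompts
prompt_length = 64     # number of prompt vectors added per layer

# A group of randomly initialized downstream task execution parameters
# (prompts), one block per layer; only these will be trained later.
prompts = [torch.randn(prompt_length, hidden_size, requires_grad=True)
           for _ in range(num_layers)]

trainable = sum(p.numel() for p in prompts)
print(f"trainable prompt parameters: {trainable:,}")  # 1,572,864
# Against a pre-trained model of roughly 9.3e8 original parameters,
# this is only about 0.17% of the total.
```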
Illustratively, the downstream task execution parameters may be concatenated onto the original parameters of the pre-trained model. For example, downstream task execution parameters may be concatenated onto one or more layers of parameters of the pre-trained model, respectively.
Step S203: adjust the downstream task execution parameters in the pre-trained model using the training data set to generate the task model of the downstream task, the task model of the downstream task being used to execute the downstream task.
In this step, in the process of fine-tuning the pre-trained model to generate the downstream task model, the original parameters of the pre-trained model are fixed, and the downstream task execution parameters in the pre-trained model are trained using the training data set in the downstream task scenario; that is, the original parameters of the pre-trained model are not updated and only the added downstream task execution parameters are updated. After training is completed, the task model of the downstream task is obtained.
In this embodiment, among the task models of multiple downstream tasks generated based on the same pre-trained model, the original parameters of the pre-trained model are the same; what differs are the added downstream task execution parameters.
For example, the input information corresponding to the sample data in the training data set is input into the pre-trained model with added downstream task execution parameters, which outputs a processing result; a loss value is computed from the processing result, the labeled data corresponding to the sample data, and a preset loss function; and the added downstream task execution parameters in the pre-trained model are updated according to the loss value.
The settings of the loss function, the training termination condition, and so on are similar to those in existing pre-trained model fine-tuning methods and are not repeated here.
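A schematic training step consistent with this description is sketched below, assuming a model object that exposes its frozen original parameters and its added prompts through the illustrative accessors `base_parameters()` and `prompt_parameters()`; this is a sketch, not the implementation of the present application.

```python
import torch

def tune_prompts(model, train_loader, loss_fn, lr=1e-3, epochs=3):
    # Fix the original parameters of the pre-trained model...
    for p in model.base_parameters():
        p.requires_grad_(False)
    # ...and optimize only the added downstream task execution parameters.
    optimizer = torch.optim.AdamW(model.prompt_parameters(), lr=lr)
    for _ in range(epochs):
        for input_info, labels in train_loader:
            outputs = model(input_info)
            loss = loss_fn(outputs, labels)  # preset loss function
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                 # updates the prompts only
```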
In this embodiment, the downstream task model can be applied in the field of natural language processing or computer vision.
Illustratively, the pre-trained model may be a pre-trained language model, and the downstream task may include at least one of the following: a visual question answering task, an image description task, a visual entailment task, a referring expression comprehension task, an image generation task, a text-based sentiment classification task, and a text summarization task. Among them, the visual question answering task, image description task, visual entailment task, referring expression comprehension task, and image generation task belong to the intersection of natural language processing and computer vision, while the text-based sentiment classification task and text summarization task belong to the field of natural language processing.
With the method provided in this embodiment, by fine-tuning the downstream task execution parameters in the pre-trained language model with added downstream task execution parameters based on the training data set of any one of these downstream task scenarios, a task model for executing that downstream task can be obtained.
The method provided by the present application can be applied to visual assistant functions that help visually impaired and blind users understand pictures on the Internet, identify objects, and understand their surroundings; to chatbot functions in VR programs that help users communicate with virtual partners; and to scenarios such as online education to help answer students' questions.
In addition, the method provided in this embodiment can be used to fine-tune pre-trained models other than language models, such as generative pre-trained models, pre-trained classification models, multimodal pre-trained models, and so on, which is not limited here.
In this embodiment, in the process of fine-tuning the pre-trained model to generate the downstream task model, the original parameters of the pre-trained model are fixed and only the small number of added downstream task execution parameters are optimized, which reduces the number of parameters to be optimized, achieves a good fine-tuning effect with only a small number of samples, and retains the knowledge learned by the pre-trained model from the pre-training corpus, thereby avoiding the computational overhead of fine-tuning all model parameters, improving the efficiency of fine-tuning, enabling rapid fine-tuning of the pre-trained model, and improving the efficiency of generating the downstream task model.
In one implementation, based on the network architecture shown in FIG. 1, after fine-tuning the pre-trained model to obtain the task model of the downstream task, the server may further send the task model of the downstream task to the electronic device used to execute the downstream task. The electronic device uses the task model of the downstream task to execute the downstream task, thereby implementing the function corresponding to the downstream task or externally providing the service corresponding to the downstream task.
The electronic device may be a client device that requests the server to generate the task model of the downstream task, or another electronic device designated by the requesting user or by the service platform to which the server belongs; it may be a computing device deployed locally at the customer or a server deployed in the cloud.
Illustratively, the server may externally provide the function of generating task models for downstream tasks such as the visual question answering task, image description task, visual entailment task, referring expression comprehension task, image generation task, text-based sentiment classification task, and text summarization task. Based on the downstream task specified by a user, the server may train, starting from a pre-trained language model, the downstream task execution parameters added to the pre-trained language model using the training data set of the user-specified downstream task scenario, generate the task model of the specified downstream task, and provide the task model to the electronic device specified by the user. The user may specify any one of the visual question answering task, image description task, visual entailment task, referring expression comprehension task, image generation task, text-based sentiment classification task, and text summarization task.
In one embodiment, the server may also store the generated task models of one or more downstream tasks so that they can be obtained directly the next time they are needed.
In one embodiment, when storing the task models of the various downstream tasks, since among task models trained from the same pre-trained model only the downstream task execution parameters differ while the original parameters of the pre-trained model are identical, one copy of the original parameters of the pre-trained model may be stored together with the trained downstream task execution parameters of each downstream model. When the task model of a certain downstream task scenario is needed, the complete parameters of the task model can be obtained by adding the trained downstream task execution parameters of that scenario to the original parameters of the pre-trained model, thereby generating the task model of that downstream task scenario. This saves storage space and makes the management and maintenance of task models more flexible.
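A minimal sketch of this storage scheme follows; the file names, the `build_model` factory, and the `prompts` attribute are illustrative assumptions rather than interfaces defined by the present application.

```python
import torch

def save_task_prompts(prompts, task_name):
    # Only the task-specific downstream task execution parameters are stored;
    # the original pre-trained parameters are kept once and shared by all tasks.
    torch.save(prompts, f"{task_name}_prompts.pt")

def load_task_model(build_model, base_weights_path, task_name):
    # Reassemble the complete task model from the shared copy of the original
    # parameters plus the trained prompts of the requested downstream task.
    model = build_model()
    model.load_state_dict(torch.load(base_weights_path))
    model.prompts = torch.load(f"{task_name}_prompts.pt")
    return model
```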
In another implementation, after adjusting the downstream task execution parameters in the pre-trained model to obtain the task model of the downstream task, the server may also directly store the task model of the downstream task locally. Further, the server may itself be used to execute the downstream task, implement the function corresponding to the downstream task, or externally provide the service corresponding to the downstream task.
For example, the server receives an execution instruction of the downstream task; in response to receiving the execution instruction, the server obtains the input data in the downstream task scenario, generates the input information of the task model according to the input format information of the task model in the downstream task scenario, inputs the input information into the task model of the downstream task for processing to obtain the downstream task processing result, and outputs the downstream task processing result.
In one embodiment, in step S201 above, obtaining a training data set in a downstream task scenario may be implemented as follows: obtain sample data in the downstream task scenario; generate input information corresponding to the sample data according to the input format information of the task model in the downstream task scenario; and obtain labeled data corresponding to the input information, the input information and labeled data constituting the training data set. The labeled data of a training sample includes the correct task processing result corresponding to that sample.
For example, input information conforming to the input format requirements of the task model may be generated based on the input format information of the task model in the downstream task scenario and the image-text data in the downstream task scenario, and the labeled data of the input information is obtained; the input information and its labeled data constitute the training data set of the downstream task.
The labeled data of the input information may be obtained through manual labeling. In this embodiment, only a small training data set with labeled data is needed to train a downstream task model with good results.
In some implementations, a corresponding input prompt template may be set for each downstream task scenario, the input prompt template being used to indicate the input format information of the downstream task model. The input prompt templates corresponding to different downstream tasks may differ; it is also possible for two different downstream tasks to correspond to the same input prompt template.
Illustratively, in an example visual question answering task scenario, the input of the task model includes an input image and a question, and the output is the answer text of the question. The input prompt template of the visual question answering task may include the input image and the question and may be expressed in the following form: "[BOS′] input image [BOS′] [BOS] question [BOS]". Here, "[BOS′] input image [BOS′]" represents the data of the input image (such as the image vector generated by encoding the image) and can refer to different images, and "[BOS] question [BOS]" refers to the data of the input question (such as the text vector generated by encoding the question) and can specify different questions. The two "[BOS′]" tokens are used to mark image data and the two "[BOS]" tokens to mark text data; two other different symbols may also be used to distinguish image data from text data, for example "[]" for image data and "{}" for text data in later examples. The output of the visual question answering task is the answer text of the question.
Illustratively, in a generative image description task scenario, the input of the task model includes an input image and specific prompt text. The input prompt template of the generative image description task may include the input image and the specific prompt text and may be expressed in the following form: "[input image] what does the image describe". Here, "[input image]" is the data of the input image and can refer to different images, and "what does the image describe" is preset specific prompt text; other prompt text may be used, but once set it remains fixed. The output of the generative image description task is the description text.
Illustratively, the input prompt template of a classification-style image description task may be the same as that of the generative image description task, expressed as: "[input image] what does the image describe". The output of the classification-style image description task is the classification label of the description text, where different classification labels refer to one of multiple preset description-text classes.
Illustratively, the input prompt template of an example visual entailment task may include an input image, input text 1, and input text 2, and may be expressed as: "[input image] and text 1 {input text 1}, does this imply text 2 {input text 2}". The output of the visual entailment task is the judgment result, including yes, no, possibly, and so on.
When generating the input information corresponding to the sample data according to the input format information of the task model in the downstream task scenario, input information conforming to the format required by the input prompt template may be generated based on the input prompt template of the downstream task scenario and sample data such as image-text data in the downstream task scenario.
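As an illustration only, constructing one training pair under the visual question answering input prompt template could be sketched as follows, with `encode_image` and `encode_text` standing in for whatever encoders the task model actually uses.

```python
def build_vqa_training_pair(image, question, answer, encode_image, encode_text):
    """Build one (input information, labeled data) pair for the template
    "[BOS'] input image [BOS'] [BOS] question [BOS]"."""
    image_vectors = encode_image(image)    # image data between the [BOS'] marks
    text_vectors = encode_text(question)   # question data between the [BOS] marks
    # Concatenate the blocks in the order prescribed by the template
    # (both encoders are assumed to return lists of vectors).
    input_information = image_vectors + text_vectors
    labeled_data = answer                  # the manually labeled answer text
    return input_information, labeled_data
```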
Illustratively, the server may obtain the training data set of the visual question answering task scenario based on the input prompt template of the visual question answering task, train the downstream task execution parameters added to the pre-trained model to obtain the visual question answering task model, and store the obtained (i.e., generated) visual question answering task model. The server provides a visual question answering function externally. When another functional module needs to use the visual question answering function, it calls the visual question answering functional module to issue an execution instruction of the visual question answering task to the server. In response to the execution instruction, the server obtains the input image and question text, generates the input information corresponding to the input image and question text according to the input prompt template of the visual question answering task, inputs the input information into the visual question answering task model, outputs the answer text through the visual question answering task model, and returns the answer text to the other functional module.
In one embodiment, when storing the task models of the various downstream tasks, since among task models trained from the same pre-trained model only the downstream task execution parameters differ while the original parameters of the pre-trained model are identical, one copy of the original parameters of the pre-trained model may be stored together with the trained downstream task execution parameters of each downstream model. When a downstream model is needed, the complete parameters of the downstream task model can be obtained by adding its trained downstream task execution parameters to the original parameters of the pre-trained model, thereby generating the task model of the downstream task, which saves storage space and makes the management and maintenance of task models more flexible.
In this embodiment, it is only necessary to set a corresponding input prompt template for each downstream task scenario according to the input format information of the task model in that scenario; based on the input prompt template of the downstream task scenario, the training data set of the downstream task scenario can be generated automatically. By adjusting the small number of downstream task execution parameters added to the pre-trained model based on the training data set of the downstream task scenario, the task model of the downstream task can be generated, improving the generation efficiency and performance of the downstream task model.
Referring to FIG. 3, the horizontal axis is the parameter count of the pre-trained model and the vertical axis is the time required to fine-tune the pre-trained model; the longer the fine-tuning time, the lower the fine-tuning efficiency. As shown in FIG. 3, for pre-trained models of the same parameter scale, the method of the present application, which fine-tunes the downstream task execution parameters added to the pre-trained model, requires less fine-tuning time, saving the time needed to fine-tune the pre-trained model. As the parameter scale of the pre-trained model increases (taking the fine-tuning time as the parameter count grows from 93M to 180M, 470M, and 930M as an example), compared with existing fine-tuning methods for pre-trained models, the fine-tuning time saved by the method of the present application grows, and the improvement in training efficiency becomes more pronounced.
In some implementations, the downstream task execution parameters in step S202 above may be generated as follows: select multiple words from a set vocabulary and generate word vectors of the multiple words; generate, from the word vectors of the multiple words, the downstream task execution parameters corresponding to each layer of the pre-trained model, where the downstream task execution parameters corresponding to each layer contain at least one word vector.
In one implementation, in step S202 above, corresponding downstream task execution parameters are added to the original parameters of each layer of the pre-trained model, so that trainable downstream task execution parameters are added to both the encoder and decoder layers of the pre-trained model. In the process of fine-tuning the pre-trained model to generate the task model of the downstream task, the original parameters of the pre-trained model are fixed and only the added downstream task execution parameters are adjusted, so that the adjusted downstream task execution parameters differ across downstream task scenarios, generating task models suited to specific downstream task scenarios.
In one embodiment, the downstream task execution parameters corresponding to different layers are different. By adding different downstream task execution parameters to different layers of the pre-trained model, the fine-tuning effect of the pre-trained model and the performance of the generated downstream task model can be improved.
In one embodiment, the downstream task execution parameters corresponding to different layers may be the same, that is, the downstream task execution parameters initially added to different layers of the pre-trained model are the same. Since the downstream task execution parameters are adjusted during the fine-tuning of the pre-trained model, the parameters corresponding to different layers differ after training is completed.
In one embodiment, the vocabulary set for each downstream task scenario may be a randomly generated vocabulary, or a uniformly set preset vocabulary.
In one embodiment, the vocabulary set for each downstream task scenario may also be set according to the training data set in the downstream task scenario, with different vocabularies used in different downstream task scenarios. The preset vocabulary may be a pre-configured vocabulary.
In one implementation, when generating the downstream task execution parameters corresponding to each layer of the pre-trained model from the word vectors of the multiple words, a preset number of word vectors may be selected from the obtained word vectors according to the preset number of word vectors contained in each downstream task execution parameter and concatenated to obtain one downstream task execution parameter; according to the number of layers of the pre-trained model, one downstream task execution parameter is generated for each layer. The preset number can be set and adjusted according to the actual application scenario or field and is not limited here; for example, it may take a value between 10 and 100, or another value.
In one embodiment, to make the downstream task execution parameters corresponding to different layers different, when generating the per-layer downstream task execution parameters from the word vectors of the multiple words, different word vectors may be selected and concatenated for each layer, according to the preset number of word vectors contained in each downstream task execution parameter, so that the word vectors selected for different layers are not exactly the same, thereby ensuring that the downstream task execution parameters added to different layers differ.
In one implementation, when adding the corresponding downstream task execution parameters to the original parameters of each layer of the pre-trained model, the corresponding downstream task execution parameters may be concatenated in front of (at the head of) the original parameters of each layer, which makes the downstream task execution parameters easy to obtain and convenient to update; moreover, generating the downstream task execution parameter (prompt) by concatenating multiple continuous word vectors for rapid fine-tuning alleviates the prompt-word sensitivity problem.
In addition, when adding the corresponding downstream task execution parameters to the original parameters of each layer of the pre-trained model, the corresponding downstream task execution parameters may also be concatenated behind (at the tail of) the original parameters of each layer.
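The per-layer construction and prepending described above can be sketched as follows; the embedding table, the dimensions, and the point at which the prompt is spliced in are assumptions made for illustration.

```python
import random
import torch

def make_layer_prompts(embedding, vocab_ids, num_layers, prompt_length):
    """Select words from the set vocabulary and turn their word vectors into
    one trainable prompt block per layer (different words per layer)."""
    prompts = []
    for _ in range(num_layers):
        ids = torch.tensor(random.sample(vocab_ids, prompt_length))
        block = embedding(ids).detach().clone().requires_grad_(True)
        prompts.append(block)   # shape: (prompt_length, hidden_size)
    return prompts

def prepend_prompt(layer_input, prompt):
    """Concatenate the prompt in front of a layer's original input sequence."""
    # layer_input: (seq_len, hidden_size); prompt: (prompt_length, hidden_size)
    return torch.cat([prompt, layer_input], dim=0)
```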
FIG. 4 is a flowchart of a multimodal task model generation method provided by an exemplary embodiment of the present application. The method provided in this embodiment can be used to fine-tune a multimodal pre-trained model to generate a multimodal task model.
As shown in FIG. 4, the method includes the following steps S401 to S404.
Step S401: in response to a task model generation instruction for a multimodal task, obtain the input prompt template corresponding to the multimodal task.
The multimodal task may include at least one of the following: a visual question answering task, an image description task, a visual entailment task, a referring expression comprehension task, and an image generation task.
The input and output data involved in a multimodal task include data of multiple modalities such as images and text. For example, the input of the visual question answering task includes image data and text data, and the output is text data. The input of the image description task is image data and the output is text data. The inputs of the visual entailment task are an image and text, and the output is their semantic relevance; the inputs of the referring expression comprehension task are an image and text, and the output is the position information of an image region. The input of the image generation task is text data and the output is image data.
In this embodiment, a corresponding input prompt template may be set for each downstream task scenario, the input prompt template being used to indicate the input format information of the downstream task model.
The input prompt template of a multimodal task is used to indicate the format information of the input information of the pre-trained model. The input prompt templates corresponding to different multimodal tasks may differ; it is also possible for two different multimodal tasks to correspond to the same input prompt template.
Illustratively, in an example visual question answering task scenario, the input of the task model includes an input image and a question, and the output is the answer text of the question. The input prompt template of the visual question answering task may include the input image and the question and may be expressed as: "[BOS′] input image [BOS′] [BOS] question [BOS]", where "[BOS′] input image [BOS′]" represents the data of the input image (such as the image vector generated by encoding the image) and can refer to different images, and "[BOS] question [BOS]" refers to the data of the input question (such as the text vector generated by encoding the question) and can specify different questions. The two "[BOS′]" tokens mark image data and the two "[BOS]" tokens mark text data; two other different symbols may also be used, for example "[]" for image data and "{}" for text data in later examples. The output of the visual question answering task is the answer text of the question.
Illustratively, in a generative image description task scenario, the input of the task model includes an input image and specific prompt text. The input prompt template of the generative image description task may include the input image and the specific prompt text, expressed as: "[input image] what does the image describe", where "[input image]" is the data of the input image and can refer to different images, and "what does the image describe" is preset specific prompt text; other prompt text may be used, but once set it remains fixed. The output of the generative image description task is the description text.
Step S402: obtain the training data set of the multimodal task according to the input prompt template corresponding to the multimodal task, the multimodal task data set including image data and text data.
In this step, based on the input prompt template corresponding to the downstream multimodal task and sample data such as image-text data in the scenario or field to which the multimodal task belongs, input information conforming to the format requirements of the input prompt template may be generated, and the labeled data of the input information obtained; the input information and its labeled data constitute the training data set of the downstream task.
The labeled data of the input information may be obtained through manual labeling; in this embodiment, only a small training data set with labeled data is needed to train a downstream task model with good results. In this embodiment, the training data set of the multimodal task includes data of multiple modalities such as images and text.
Step S403: add downstream task execution parameters to the original parameters of the pre-trained model.
This step is implemented in the same way as step S202 above; for the specific implementation, see the relevant content of step S202, which is not repeated here.
Step S404: adjust the downstream task execution parameters in the pre-trained language model based on the training data set of the multimodal task to generate the multimodal task model.
In this step, in the process of fine-tuning the pre-trained language model to generate the multimodal task model, the original parameters of the pre-trained language model are kept fixed, and the downstream task execution parameters in the pre-trained language model are adjusted based on the training data set of the multimodal task, so as to obtain a model better suited to the specific multimodal task.
The multimodal task model is used to execute the multimodal task according to the input information generated from the input data and the input prompt template, obtaining the task processing result.
This step is implemented in the same way as step S203 above; for the specific implementation, see the relevant content of step S203, which is not repeated here.
In this embodiment, prompt learning is introduced into multimodal tasks. Based on prompt learning, downstream task execution parameters are added to the pre-trained language model, the training data set of the multimodal task is generated based on the input prompt template of the multimodal task, and the downstream task execution parameters added to the pre-trained language model are trained based on that training data set to generate the task model of the multimodal task. The training effect of fine-tuning only a small number of downstream task execution parameters is, in some situations, comparable to training all parameters of the pre-trained model, which improves the training efficiency of the pre-trained language model and enables rapid training of the multimodal task model.
FIG. 5 is a flowchart of a task execution method provided by an exemplary embodiment of the present application. The task execution method provided in this embodiment can use the task model of a downstream task generated by the foregoing downstream task model generation method embodiments to execute the downstream task, thereby implementing the corresponding function of the downstream task scenario and providing the corresponding service. The execution subject of the method provided in this embodiment is the electronic device responsible for executing the downstream task based on the downstream task model.
As shown in FIG. 5, the method includes the following steps S501 to S504.
Step S501: in response to a downstream task execution instruction, obtain input data.
In this embodiment, the electronic device stores the task model of the downstream task obtained by adjusting the downstream task execution parameters added to the pre-trained model, and can, based on the task model of the downstream task, provide the function of executing the downstream task or externally provide the service corresponding to the downstream task.
Illustratively, the electronic device provides the function of executing the downstream task. When another functional module needs this function, it calls the functional module of the downstream task to issue an execution instruction of the downstream task to the electronic device. In response to the execution instruction, the electronic device obtains the input data.
Illustratively, the electronic device externally provides the service corresponding to the downstream task. When a user needs to use the service corresponding to the downstream task provided by the electronic device, the user sends a downstream task execution instruction to the electronic device through a client. In response to the execution instruction, the electronic device obtains the input data.
The input data may include data of at least one of the following modalities: image data and text data.
Step S502: generate the input information corresponding to the input data according to the input format information of the task model in the downstream task scenario.
In this step, the electronic device encodes the data of each modality in the input data to generate corresponding vectors, and generates the input information of the task model according to the input format information of the task model in the current task scenario.
In one embodiment, a corresponding input prompt template may be set for each downstream task scenario, the input prompt template being used to indicate the input format information of the downstream task model.
In this step, the input information corresponding to the input data is generated according to the input prompt template of the current task scenario.
Illustratively, the input prompt template of an example visual question answering task may include the input image and the question, expressed as: "[BOS′] input image [BOS′] [BOS] question [BOS]", where "[BOS′] input image [BOS′]" represents the data of the input image (such as the image vector generated by encoding the image) and can refer to different images, and "[BOS] question [BOS]" refers to the data of the input question (such as the text vector generated by encoding the question) and can specify different questions; the two "[BOS′]" tokens mark image data and the two "[BOS]" tokens mark text data, and two other different symbols such as "[]" and "{}" may also be used. The output of the visual question answering task is the answer text of the question. The input data obtained by the electronic device includes the input image and the question text; the image is encoded to generate an image vector and the question text is encoded to generate a text vector, and according to the input prompt template "[BOS′] input image [BOS′] [BOS] question [BOS]" of the visual question answering task, the image vector corresponding to the image and the text vector corresponding to the question text are concatenated to obtain the corresponding input information.
Illustratively, the input prompt template of the generative image description task may include the input image and specific prompt text, expressed as: "[input image] what does the image describe", where "[input image]" is the data of the input image and can refer to different images, and "what does the image describe" is preset specific prompt text (other prompt text may be used, but once set it remains fixed); the output of the generative image description task is the description text. The input data obtained by the electronic device includes the input image, which is encoded to generate an image vector. According to the input prompt template "[input image] what does the image describe" of the generative image description task, the specific prompt text "what does the image describe" is encoded to generate a text vector, and the image vector of the input image and the text vector of the prompt text are concatenated to obtain the corresponding input information.
Step S503: input the input information into the trained task model for processing to obtain the task processing result, where the task model is obtained by adding downstream task execution parameters to the original parameters of the pre-trained model and training the downstream task execution parameters in the pre-trained model based on the training data set of the task.
After the input information corresponding to the input data is generated based on the input prompt template of the current task scenario, the input information is input into the trained task model for processing to obtain the task processing result.
In this embodiment, the task model is obtained, starting from the pre-trained model and using the downstream task model generation method provided by any of the above embodiments, by obtaining the training data set of the current task according to the input prompt template of the current task scenario, adding downstream task execution parameters to the original parameters of the pre-trained model, and training (fine-tuning) the downstream task execution parameters in the pre-trained model based on the training data set of the current task; for the specific training (fine-tuning) process, see the relevant description in the above method embodiments, which is not repeated here.
Step S504: output the task processing result.
Illustratively, the electronic device provides the function of executing the downstream task; when another functional module needs this function, it calls the functional module of the downstream task to issue an execution instruction to the electronic device. After obtaining the task processing result, the electronic device returns the task processing result to the other functional module.
Illustratively, the electronic device externally provides the service corresponding to the downstream task; when a user needs the service, the user sends a downstream task execution instruction to the electronic device through a client. After obtaining the task processing result, the electronic device outputs the task processing result to the user's client device.
In this embodiment, the current task may include at least one of the following: a visual question answering task, an image description task, a visual entailment task, a referring expression comprehension task, an image generation task, a text-based sentiment classification task, and a text summarization task. Among them, the visual question answering task, image description task, visual entailment task, referring expression comprehension task, and image generation task belong to the intersection of natural language processing and computer vision, while the text-based sentiment classification task and text summarization task belong to the field of natural language processing.
In this embodiment, the task models used to execute the various downstream tasks are obtained by using the downstream task model generation method provided by any of the above embodiments: the training data set of the current task is obtained according to the input prompt template corresponding to the current task, downstream task execution parameters are added to the original parameters of the pre-trained model, and the downstream task execution parameters in the pre-trained model are trained based on the training data set of the current task. This enables rapid training of the pre-trained model and rapid acquisition of the task model of the downstream task, improves the training effect at the same training data set scale, and thereby improves the execution effect of the downstream task.
Referring to FIG. 6, FIG. 6 is a flowchart of a visual question answering task execution method provided by an exemplary embodiment of the present application. The execution subject of the method provided in this embodiment is the electronic device responsible for executing the visual question answering task based on the visual question answering task model.
As shown in FIG. 6, the method includes the following steps S601 to S604.
Step S601: obtain an input image and question text.
The electronic device stores the visual question answering task model obtained by training the downstream task execution parameters added to the pre-trained language model, and can, based on the visual question answering task model, provide the function of executing the visual question answering task or externally provide the service corresponding to the visual question answering task.
The visual question answering task determines the answer to a question from the visual information of the input image, according to the input image and the question.
Illustratively, the electronic device provides the function of executing the visual question answering task; when another functional module needs this function, it calls the functional module of the visual question answering task to issue an execution instruction of the visual question answering task to the electronic device. In response to the execution instruction, the electronic device obtains the image and question text from the input parameters.
Illustratively, the electronic device externally provides a visual question answering service; when a user needs to use the visual question answering service provided by the electronic device, the user sends a visual question answering request to the electronic device through a client. In response to the visual question answering request, the electronic device obtains the image and question text input by the user.
Step S602: generate the input information of the visual question answering task model according to the input format information of the visual question answering task model in the visual question answering task scenario, the image, and the question text.
For example, a corresponding input prompt template may be set for each downstream task scenario, the input prompt template being used to indicate the input format information of the downstream task model.
In this step, the input information corresponding to the input image and question text is generated according to the input prompt template of the current task scenario.
Illustratively, the input prompt template corresponding to the visual question answering task may include the input image and the question, and may be expressed in the following form: "[BOS′] input image [BOS′] [BOS] question [BOS]".
Here, "[BOS′] input image [BOS′]" represents the data of the input image (such as the image vector generated by encoding the image) and can refer to different images, and "[BOS] question [BOS]" refers to the data of the input question (such as the text vector generated by encoding the question) and can specify different questions.
The two "[BOS′]" tokens are used to mark image data and the two "[BOS]" tokens to mark text data; two other different symbols may also be used to distinguish image data from text data, for example "[]" for image data and "{}" for text data in later examples.
For example, the image is encoded to generate a corresponding image vector, and the question text is encoded to generate a corresponding text vector; according to the input prompt template corresponding to the visual question answering task, the image vector and the text vector are concatenated to obtain the corresponding input information.
Step S603: input the input information into the visual question answering task model for processing to obtain the answer text corresponding to the question text.
The visual question answering task model is obtained by adding downstream task execution parameters to the original parameters of the pre-trained model and training the downstream task execution parameters in the pre-trained model based on the training data set of the visual question answering task; specifically, it may be obtained by the downstream task model generation method provided in any of the above embodiments, which is not repeated in this embodiment.
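Steps S602 and S603 can be summarized in the following sketch, where `encode_image`, `encode_text`, and the model's `generate` method are illustrative stand-ins rather than interfaces defined by the present application.

```python
import torch

def answer_question(image, question_text, vqa_model, encode_image, encode_text):
    # Step S602: encode each modality and concatenate the vectors in the order
    # given by the template "[BOS'] input image [BOS'] [BOS] question [BOS]".
    image_vector = encode_image(image)         # (image_len, hidden_size)
    text_vector = encode_text(question_text)   # (text_len, hidden_size)
    input_information = torch.cat([image_vector, text_vector], dim=0)
    # Step S603: the fine-tuned visual question answering task model processes
    # the input information and produces the answer text.
    return vqa_model.generate(input_information)
```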
Step S604: output the answer text corresponding to the question text.
Illustratively, the electronic device provides the function of executing the visual question answering task; when another functional module needs this function, it calls the functional module of the visual question answering task to issue an execution instruction to the electronic device. After obtaining the answer text, the electronic device returns the answer text to the other functional module.
Illustratively, the electronic device externally provides a visual question answering service; when a user needs to use the visual question answering service provided by the electronic device, the user sends a visual question answering request to the electronic device through a client. After obtaining the answer text, the electronic device outputs the answer text to the user's client device.
In this embodiment, the visual question answering model is obtained by using the downstream task model generation method provided by any of the above embodiments: the training data set of the visual question answering task is obtained according to the input prompt template corresponding to the visual question answering task, downstream task execution parameters are added to the original parameters of the pre-trained language model, and the downstream task execution parameters in the pre-trained language model are trained based on the training data set of the visual question answering task. This enables rapid fine-tuning of the pre-trained language model and rapid acquisition of the visual question answering task model, improves the fine-tuning effect at the same training data set scale, and thereby improves the execution effect of the visual question answering task.
FIG. 7 is a schematic structural diagram of a downstream task model generation device provided by an example embodiment of the present application. The device provided in this embodiment is used to perform the downstream task model generation method. As shown in FIG. 7, the downstream task model generation device 70 includes: a training set generation module 71, a parameter addition module 72, and a parameter adjustment module 73.
The training set generation module 71 is used to obtain a training data set in a downstream task scenario.
The parameter addition module 72 is used to add downstream task execution parameters to the original parameters of the pre-trained model.
The parameter adjustment module 73 is used to adjust the downstream task execution parameters in the pre-trained model using the training data set to generate the task model of the downstream task, the task model of the downstream task being used to execute the downstream task.
In one embodiment, when adding downstream task execution parameters to the original parameters of the pre-trained model, the parameter addition module 72 is further used to: select multiple words from a set vocabulary and generate word vectors of the multiple words; generate, from the word vectors of the multiple words, the downstream task execution parameters corresponding to each layer of the pre-trained model, the downstream task execution parameters corresponding to each layer containing at least one word vector; and add the corresponding downstream task execution parameters to the original parameters of each layer of the pre-trained model.
In one embodiment, the parameter addition module 72 is further used to set the vocabulary used in the downstream task scenario according to the training data set in the downstream task scenario.
In one embodiment, when adding the corresponding downstream task execution parameters to the original parameters of each layer of the pre-trained model, the parameter addition module 72 is further used to concatenate the corresponding downstream task execution parameters in front of the original parameters of each layer of the pre-trained model.
In one embodiment, when obtaining a training data set in a downstream task scenario, the training set generation module 71 is further used to: obtain sample data in the downstream task scenario; generate input information corresponding to the sample data according to the input format information of the task model in the downstream task scenario; and obtain labeled data corresponding to the input information, the input information and labeled data constituting the training data set.
In one embodiment, when generating the input information corresponding to the sample data according to the input format information of the task model in the downstream task scenario, the training set generation module 71 is further used to: obtain the input prompt template of the downstream task scenario, the input prompt template being determined according to the input format information of the task model in the downstream task scenario; and generate the input information corresponding to the sample data according to the sample data and the input prompt template of the downstream task scenario.
In one embodiment, after the task model of the downstream task is generated, the parameter adjustment module 73 is further used to send the task model of the downstream task to the electronic device used to execute the downstream task.
In one embodiment, after the task model used to execute the downstream task is obtained upon completion of training, the parameter adjustment module 73 is further used to: store the task model of the downstream task; obtain the input data of the downstream task in response to an execution instruction of the downstream task; generate the input information of the task model according to the input format information of the task model in the downstream task scenario; input the input information into the task model of the downstream task for processing to obtain the downstream task processing result; and output the downstream task processing result.
In one embodiment, the pre-trained model is a pre-trained language model, and the downstream task includes at least one of the following: a visual question answering task, used to determine the answer to a question from the visual information of an input image according to the input image and the question; an image description task, used to generate description text of an input image; a visual entailment task, used to predict the semantic relevance of an input image and text; a referring expression comprehension task, used to locate, according to input text, the image region in the input image corresponding to the input text; an image generation task, used to generate an image based on input description text; a text-based sentiment classification task, used to predict sentiment classification information of input text; and a text summarization task, used to generate summary information of input text.
The device provided in this embodiment can be used to execute the downstream task model generation method provided by any of the above embodiments; its specific functions and achievable technical effects are not repeated here.
FIG. 8 is a schematic structural diagram of a task execution device provided by an example embodiment of the present application. The device provided in this embodiment is used to perform the above task execution method. As shown in FIG. 8, the task execution device 80 includes: a data input module 81, an input information generation module 82, a task execution module 83, and a result output module 84.
The data input module 81 is used to obtain input data in response to a downstream task execution instruction.
The input information generation module 82 is used to generate the input information corresponding to the input data according to the input format information of the task model in the downstream task scenario.
The task execution module 83 is used to input the input information into the trained task model for processing to obtain the task processing result, the task model being obtained by adding downstream task execution parameters to the original parameters of the pre-trained model and adjusting the downstream task execution parameters in the pre-trained model based on the training data set in the downstream task scenario.
The result output module 84 is used to output the task processing result.
The downstream task is any one of the following: a visual question answering task, used to determine the answer to a question from the visual information of an input image according to the input image and the question; an image description task, used to generate description text of an input image; a visual entailment task, used to predict the semantic relevance of an input image and text; a referring expression comprehension task, used to locate, according to input text, the image region in the input image corresponding to the input text; an image generation task, used to generate an image based on input description text; a text-based sentiment classification task, used to predict sentiment classification information of input text; or a text summarization task, used to generate summary information of input text.
The device provided in this embodiment can be used to execute the task execution method provided by any of the above embodiments; its specific functions and achievable technical effects are not repeated here.
FIG. 9 is a schematic structural diagram of a visual question answering task execution device provided by an example embodiment of the present application. The device provided in this embodiment is used to perform the visual question answering task execution method. As shown in FIG. 9, the visual question answering task execution device 90 includes: a data input module 91, an input information generation module 92, a visual question answering module 93, and an answer output module 94.
The data input module 91 is used to obtain an input image and question text.
The input information generation module 92 is used to generate the input information of the visual question answering task model according to the input format information of the visual question answering task model in the visual question answering task scenario, the image, and the question text.
The visual question answering module 93 is used to input the input information into the visual question answering task model for processing to obtain the answer text corresponding to the question text, the visual question answering task model being obtained by adding downstream task execution parameters to the original parameters of the pre-trained model and adjusting the downstream task execution parameters in the pre-trained model based on the training data set in the visual question answering task scenario.
The answer output module 94 is used to output the answer text corresponding to the question text.
In one embodiment, when generating the input information of the visual question answering task model according to the input prompt template in the visual question answering task scenario, the image, and the question text, the input information generation module 92 is further used to: encode the image to generate a corresponding image vector and encode the question text to generate a corresponding text vector; and concatenate the image vector and the text vector according to the input prompt template in the visual question answering task scenario to obtain the input information of the visual question answering task model.
The device provided in this embodiment can be used to execute the visual question answering task execution method provided by any of the above embodiments; its specific functions and achievable technical effects are not repeated here.
FIG. 10 is a schematic structural diagram of an electronic device provided by an example embodiment of the present application. As shown in FIG. 10, the electronic device 100 includes: a processor 1001, and a memory 1002 communicatively connected to the processor 1001, the memory 1002 storing computer-executable instructions.
The processor executes the computer-executable instructions stored in the memory to implement the solution provided by any of the above method embodiments; the specific functions and achievable technical effects are not repeated here.
With the method and device for downstream task model generation and task execution provided by the present application, a training data set in the downstream task scenario is obtained and downstream task execution parameters are added to the original parameters of the pre-trained model. During fine-tuning of the pre-trained model, the original parameters of the pre-trained model are fixed and only the small number of added downstream task execution parameters are optimized based on the training data set in the downstream task scenario, so that the knowledge learned by the pre-trained model from the pre-training corpus can be retained, the number of parameters to be optimized is reduced, a good fine-tuning effect is obtained with only a small number of samples, the computational overhead of fine-tuning all model parameters is reduced, the efficiency of fine-tuning the pre-trained model is improved, rapid fine-tuning of the pre-trained model is achieved, and the generation efficiency and performance of the downstream task model are improved.
An embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions; when executed by a processor, the computer-executable instructions are used to implement the solution provided by any of the above method embodiments; the specific functions and achievable technical effects are not repeated here.
An embodiment of the present application further provides a computer program product, including a computer program stored in a readable storage medium. At least one processor of an electronic device can read the computer program from the readable storage medium, and the at least one processor executes the computer program so that the electronic device executes the solution provided by any of the above method embodiments; the specific functions and achievable technical effects are not repeated here.
In addition, some of the flows described in the above embodiments and drawings contain multiple operations that appear in a specific order, but it should be clearly understood that these operations may be executed out of the order in which they appear herein or in parallel; the sequence numbers are only used to distinguish the different operations and do not themselves represent any execution order. These flows may include more or fewer operations, which may be executed sequentially or in parallel. It should be noted that descriptions such as "first" and "second" herein are used to distinguish different messages, devices, modules, etc.; they do not represent a sequence, nor do they limit "first" and "second" to different types. "Multiple" means two or more, unless otherwise expressly and specifically limited.
Other embodiments of the present application will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed herein. The present application is intended to cover any variations, uses, or adaptations of the present application that follow its general principles and include common knowledge or customary technical means in the art not disclosed in the present application. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present application indicated by the following claims.
It should be understood that the present application is not limited to the precise structures described above and shown in the drawings, and various modifications and changes may be made without departing from its scope. The scope of the present application is limited only by the appended claims.

Claims (14)

  1. A downstream task model generation method, wherein the downstream task model is applied in the field of natural language processing or the field of computer vision, the method comprising:
    obtaining a training data set in a downstream task scenario;
    adding downstream task execution parameters to original parameters of a pre-trained model; and
    adjusting the downstream task execution parameters in the pre-trained model using the training data set to generate a task model of the downstream task, wherein the task model of the downstream task is used to execute the downstream task.
  2. The method according to claim 1, wherein the adding downstream task execution parameters to the original parameters of the pre-trained model comprises:
    selecting multiple words from a set vocabulary and generating word vectors of the multiple words;
    generating, according to the word vectors of the multiple words, downstream task execution parameters corresponding to each layer of the pre-trained model, wherein the downstream task execution parameters corresponding to each layer comprise at least one word vector; and
    adding the corresponding downstream task execution parameters to the original parameters of each layer of the pre-trained model.
  3. The method according to claim 2, wherein the method further comprises:
    setting, according to the training data set in the downstream task scenario, the vocabulary used in the downstream task scenario.
  4. The method according to claim 2, wherein the adding the corresponding downstream task execution parameters to the original parameters of each layer of the pre-trained model comprises:
    concatenating the corresponding downstream task execution parameters in front of the original parameters of each layer of the pre-trained model.
  5. The method according to claim 1, wherein the obtaining a training data set in a downstream task scenario comprises:
    obtaining sample data in the downstream task scenario;
    generating, according to input format information of the task model in the downstream task scenario, input information corresponding to the sample data; and
    obtaining labeled data corresponding to the input information, wherein the input information and the labeled data constitute the training data set.
  6. The method according to claim 5, wherein the generating, according to the input format information of the task model in the downstream task scenario, the input information corresponding to the sample data comprises:
    obtaining an input prompt template in the downstream task scenario, wherein the input prompt template is determined according to the input format information of the task model in the downstream task scenario; and
    generating, according to the sample data and the input prompt template in the downstream task scenario, the input information corresponding to the sample data.
  7. The method according to claim 1, wherein after the task model of the downstream task is generated, the method further comprises:
    sending the task model of the downstream task to an electronic device used to execute the downstream task.
  8. The method according to claim 1, wherein after the task model of the downstream task is generated, the method further comprises:
    storing the task model of the downstream task;
    obtaining input data of the downstream task in response to an execution instruction of the downstream task;
    generating input information of the task model according to the input format information of the task model in the downstream task scenario;
    inputting the input information into the task model of the downstream task for processing to obtain a downstream task processing result; and
    outputting the downstream task processing result.
  9. The method according to any one of claims 1 to 8, wherein the pre-trained model is a pre-trained language model,
    and the downstream task comprises at least one of the following:
    a visual question answering task, used to determine the answer to a question from visual information of an input image according to the input image and the question;
    an image description task, used to generate description text of an input image;
    a visual entailment task, used to predict the semantic relevance of an input image and text;
    a referring expression comprehension task, used to locate, according to input text, an image region in an input image corresponding to the input text;
    an image generation task, used to generate an image based on input description text;
    a text-based sentiment classification task, used to predict sentiment classification information of input text; and
    a text summarization task, used to generate summary information of input text.
  10. A task execution method, applied in the field of natural language processing or the field of computer vision, the method comprising:
    obtaining input data in response to a downstream task execution instruction;
    generating, according to input format information of a task model in the downstream task scenario, input information corresponding to the input data;
    inputting the input information into a trained task model for processing to obtain a task processing result, wherein the task model is obtained by adding downstream task execution parameters to original parameters of a pre-trained model and adjusting the downstream task execution parameters in the pre-trained model based on a training data set in the downstream task scenario; and
    outputting the task processing result.
  11. A visual question answering task execution method, the method comprising:
    obtaining an input image and question text;
    generating input information of a visual question answering task model according to input format information of the visual question answering task model in a visual question answering task scenario, the image, and the question text;
    inputting the input information into the visual question answering task model for processing to obtain answer text corresponding to the question text, wherein the visual question answering task model is obtained by adding downstream task execution parameters to original parameters of a pre-trained model and adjusting the downstream task execution parameters in the pre-trained model based on a training data set in the visual question answering task scenario; and
    outputting the answer text corresponding to the question text.
  12. The method according to claim 11, wherein the generating the input information of the visual question answering task model according to an input prompt template in the visual question answering task scenario, the image, and the question text comprises:
    encoding the image to generate a corresponding image vector, and encoding the question text to generate a corresponding text vector; and
    concatenating the image vector and the text vector according to the input prompt template in the visual question answering task scenario to obtain the input information of the visual question answering task model.
  13. An electronic device, comprising: a processor, and a memory communicatively connected to the processor;
    wherein the memory stores computer-executable instructions;
    and the processor executes the computer-executable instructions stored in the memory to implement the method according to any one of claims 1 to 12.
  14. A computer-readable storage medium, wherein the computer-readable storage medium stores computer-executable instructions, and the computer-executable instructions, when executed by a processor, are used to implement the method according to any one of claims 1 to 12.
PCT/CN2023/127845 2022-11-08 2023-10-30 Method and device for downstream task model generation and task execution WO2024099144A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211387996.7A CN115438176B (zh) 2022-11-08 2022-11-08 Method and device for downstream task model generation and task execution
CN202211387996.7 2022-11-08

Publications (1)

Publication Number Publication Date
WO2024099144A1 true WO2024099144A1 (zh) 2024-05-16

Family

ID=84252390

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/127845 WO2024099144A1 (zh) 2022-11-08 2023-10-30 Method and device for downstream task model generation and task execution

Country Status (2)

Country Link
CN (1) CN115438176B (zh)
WO (1) WO2024099144A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115438176B (zh) * 2022-11-08 2023-04-07 阿里巴巴达摩院(杭州)科技有限公司 下游任务模型生成及任务执行的方法和设备
CN116363452B (zh) * 2023-03-07 2024-01-09 阿里巴巴(中国)有限公司 任务模型训练方法以及装置
CN116306917B (zh) * 2023-05-17 2023-09-08 卡奥斯工业智能研究院(青岛)有限公司 任务处理方法、装置、设备和计算机存储介质
CN117994397A (zh) * 2024-03-29 2024-05-07 苏州元脑智能科技有限公司 数字人文本动作生成方法、装置、计算机设备和存储介质


Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114723064A (zh) 2020-12-22 2022-07-08 株式会社理光 Fine-tuning method and apparatus for a pre-trained language model, and computer-readable storage medium
CN112668320B (zh) 2020-12-25 2024-02-02 平安科技(深圳)有限公司 Word-embedding-based model training method and apparatus, electronic device, and storage medium
CN113486162A (zh) 2021-06-04 2021-10-08 北京大学 Method and apparatus for fine-tuning a large-scale pre-trained model
CN113569011B (zh) 2021-07-27 2023-03-24 马上消费金融股份有限公司 Training method, apparatus, device, and storage medium for a text matching model
CN114398899A (zh) 2021-11-29 2022-04-26 阿里巴巴达摩院(杭州)科技有限公司 Training method and apparatus for a pre-trained language model, computer device, and medium
CN114625840A (zh) 2022-03-18 2022-06-14 鼎富智能科技有限公司 Training method and apparatus for a natural language processing model
CN115080736A (zh) 2022-05-23 2022-09-20 清华大学 Model adjustment method and apparatus for a discriminative language model
CN114995903B (zh) 2022-05-30 2023-06-27 中电金信软件有限公司 Category label recognition method and apparatus based on a pre-trained language model

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021217935A1 (zh) 2020-04-29 2021-11-04 深圳壹账通智能科技有限公司 Training method for a question generation model, question generation method, and related devices
CN114840651A (zh) 2022-04-20 2022-08-02 南方科技大学 Visual question answering training method and system, and computer-readable storage medium
CN115114439A (zh) 2022-08-30 2022-09-27 北京百度网讯科技有限公司 Method and apparatus for multi-task model inference and multi-task information processing
CN115438176A (zh) 2022-11-08 2022-12-06 阿里巴巴达摩院(杭州)科技有限公司 Method and device for downstream task model generation and task execution

Also Published As

Publication number Publication date
CN115438176B (zh) 2023-04-07
CN115438176A (zh) 2022-12-06
