CN116756564A - Training method and using method of a task-solution-oriented generative large language model - Google Patents

Training method and using method of a task-solution-oriented generative large language model

Info

Publication number
CN116756564A
Authority
CN
China
Prior art keywords
model
training
sample
execution result
user input
Prior art date
Legal status
Pending
Application number
CN202310622577.5A
Other languages
Chinese (zh)
Inventor
黄际洲
王少磊
孙一博
Current Assignee
Apollo Intelligent Connectivity Beijing Technology Co Ltd
Original Assignee
Apollo Intelligent Connectivity Beijing Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Apollo Intelligent Connectivity Beijing Technology Co Ltd
Priority to CN202310622577.5A
Publication of CN116756564A

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The disclosure provides a training method and a using method for a task-solution-oriented generative large language model, and relates to artificial intelligence fields such as generative large language models and unsupervised training. The method comprises the following steps: training a general-purpose generative large language model on first training samples constructed from different tool applications and their corresponding function description documents, to obtain a first model; in a controllable running environment integrating the tool applications, controlling the first model to automatically explore and invoke multiple tool applications in an attempt to fulfill, for a preset user input, the user requirement that the input represents, and obtaining the actual execution result output by the first model; updating the first model through reinforcement learning according to the similarity between the standard execution result and the actual execution result, to obtain a second model; and training the second model through reinforcement learning from human feedback, thereby obtaining a target generative large language model capable of accurately invoking tool applications to solve user tasks.

Description

Training method and using method of a task-solution-oriented generative large language model
Technical Field
The present disclosure relates to the field of task processing, in particular to artificial intelligence technologies such as generative large language models and unsupervised training, and more specifically to a training method for a task-solution-oriented generative large language model, a task processing method based on the generative large language model, and corresponding apparatuses, electronic devices, computer-readable storage media and computer program products.
Background
Large language models (LLMs), which are essentially generative models and are therefore also called generative large language models, have demonstrated powerful natural language processing (NLP) understanding and generation capabilities.
To fully exploit the application potential of LLMs, an effective "communication channel" must be established between the LLM and the physical and real world, enabling the LLM to mobilize various tools (the word is used here to refer broadly to the various physical or virtual tools that a model can invoke, such as tool-like applications or the control of physical devices through tool-like applications) to fulfill various needs.
Since the various tools were designed without considering the requirements of working in combination with an LLM, a natural isolation exists between the LLM and the various tools.
Disclosure of Invention
The embodiments of the present disclosure provide a training method for a task-solution-oriented generative large language model, a task processing method based on the generative large language model, and matching apparatuses, electronic devices, computer-readable storage media and computer program products.
In a first aspect, an embodiment of the present disclosure provides a training method for a task-solution-oriented generative large language model, including: training a general-purpose generative large language model based on first training samples constructed from different tool applications and corresponding function description documents, to obtain a first model, the general-purpose generative large language model being a generative large language model trained on general-purpose training samples; in a controllable running environment integrating the tool applications, controlling the first model to automatically explore and invoke multiple tool applications in an attempt to fulfill, for a preset user input, the user requirement that the input represents, and acquiring the actual execution result finally output by the first model after multi-step invocation; updating the first model through reinforcement learning according to the similarity between the standard execution result and the actual execution result corresponding to the same preset user input, to obtain a second model; and training the second model through reinforcement learning from human feedback, to obtain the target generative large language model.
In a second aspect, an embodiment of the present disclosure provides a training apparatus for a task-solution-oriented generative large language model, including: a first model training unit configured to train a general-purpose generative large language model based on first training samples constructed from different tool applications and corresponding function description documents, to obtain a first model, the general-purpose generative large language model being a generative large language model trained on general-purpose training samples; an automatic exploration unit configured to control, in a controllable running environment integrating the tool applications, the first model to automatically explore and invoke multiple tool applications in an attempt to fulfill, for a preset user input, the user requirement that the input represents, and to obtain the actual execution result finally output by the first model after multi-step invocation; a second model training unit configured to update the first model through reinforcement learning according to the similarity between the standard execution result and the actual execution result corresponding to the same preset user input, to obtain a second model; and a target generative large language model training unit configured to train the second model through reinforcement learning from human feedback, to obtain the target generative large language model.
In a third aspect, an embodiment of the present disclosure provides a task processing method based on a generative large language model, including: acquiring a task processing request described by a user in natural language; invoking a preset target generative large language model to process the task processing request, to obtain the returned task processing result, wherein the target generative large language model is obtained by the training method for the task-solution-oriented generative large language model described in the first aspect; and returning the task processing result to the user who initiated the task processing request.
In a fourth aspect, an embodiment of the present disclosure provides a task processing apparatus based on a generative large language model, including: a task processing request acquisition unit configured to acquire a task processing request described by a user in natural language; a model invoking and processing unit configured to invoke a preset target generative large language model to process the task processing request and obtain the returned task processing result, wherein the target generative large language model is obtained by the training apparatus for the task-solution-oriented generative large language model described in the second aspect; and a processing result returning unit configured to return the task processing result to the user who initiated the task processing request.
In a fifth aspect, embodiments of the present disclosure provide an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions, when executed, enabling the at least one processor to implement the training method for the task-solution-oriented generative large language model described in the first aspect or the task processing method based on the generative large language model described in the third aspect.
In a sixth aspect, embodiments of the present disclosure provide a non-transitory computer-readable storage medium storing computer instructions which, when executed, enable a computer to implement the training method for the task-solution-oriented generative large language model described in the first aspect or the task processing method based on the generative large language model described in the third aspect.
In a seventh aspect, embodiments of the present disclosure provide a computer program product comprising a computer program which, when executed by a processor, implements the steps of the training method for the task-solution-oriented generative large language model described in the first aspect or the steps of the task processing method based on the generative large language model described in the third aspect.
According to the training method for the task-solution-oriented generative large language model and the task processing method based on the generative large language model provided by the embodiments of the present disclosure, the model is first trained on the first training samples constructed from different tool applications and their corresponding function description documents, so that the first model preliminarily learns what capabilities each tool application has and how those capabilities are realized. The first model is then controlled, in a controllable running environment, to solve the task requirements represented by preset user inputs by exploring and attempting to invoke various tool applications, and is updated through reinforcement learning according to the similarity between the actual execution results and the standard execution results, so that the resulting second model learns how to accurately invoke tool applications to fulfill user requirements. Finally, the second model is trained through reinforcement learning from human feedback, yielding the task-solution-oriented target generative large language model. Because no dedicated model-invocation interfaces or interface standards need to be designed for a large number of different tool applications in the process of obtaining the second model, and the model instead learns, without supervision, how to invoke tool applications to fulfill user requirements by exploring and attempting in the controllable running environment, the cost and time of constructing high-quality training samples are greatly reduced, and the efficiency of training the finally usable target generative large language model is improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
Other features, objects and advantages of the present disclosure will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings:
FIG. 1 is an exemplary system architecture in which the present disclosure may be applied;
FIG. 2 is a flowchart of a training method of a task solution-oriented generative large language model provided in an embodiment of the present disclosure;
FIG. 3 is a flow chart of a method of constructing a first training sample provided by an embodiment of the present disclosure;
FIG. 4 is a flowchart of a method for constructing a second training sample and training to obtain a third model according to an embodiment of the present disclosure;
FIG. 5 is a flowchart of a method for constructing a third training sample and training to obtain a fourth model according to an embodiment of the present disclosure;
FIG. 6 is a flow chart of a method for task processing based on a generative large language model provided by an embodiment of the present disclosure;
FIG. 7 is a block diagram of a training device for generating a large language model for task solution according to an embodiment of the present disclosure;
FIG. 8 is a block diagram of a task processing device based on a generative large language model according to an embodiment of the present disclosure;
FIG. 9 is a schematic structural diagram of an electronic device adapted to perform the training method for the task-solution-oriented generative large language model and/or the task processing method based on the generative large language model according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness. It should be noted that, without conflict, the embodiments of the present disclosure and features of the embodiments may be combined with each other.
In the technical solutions of the present disclosure, the collection, storage, use, processing, transmission, provision and disclosure of users' personal information comply with the relevant laws and regulations and do not violate public order and good customs.
FIG. 1 illustrates an exemplary system architecture 100 to which embodiments of the training method, the task processing method, the apparatuses, the electronic device and the computer-readable storage medium of the present application may be applied.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various applications for implementing information communication between the terminal devices 101, 102, 103 and the server 105, such as a model training class application, a training sample construction class application, a task processing class application, and the like, may be installed on the terminal devices.
The terminal devices 101, 102, 103 and the server 105 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices with display screens, including but not limited to smartphones, tablets, laptop and desktop computers, etc.; when the terminal devices 101, 102, 103 are software, they may be installed in the above-listed electronic devices, which may be implemented as a plurality of software or software modules, or may be implemented as a single software or software module, which is not particularly limited herein. When the server 105 is hardware, it may be implemented as a distributed server cluster formed by a plurality of servers, or may be implemented as a single server; when the server is software, the server may be implemented as a plurality of software or software modules, or may be implemented as a single software or software module, which is not particularly limited herein.
The server 105 can provide various services through various built-in applications. Taking a task processing application that provides users with processing services for various tasks as an example, the server 105 can achieve the following effects when running the task processing application: first, it receives a task processing request described by a user in natural language and transmitted by the terminal devices 101, 102, 103 through the network 104; then, it invokes the pre-generated target generative large language model to process the task processing request and obtains the returned task processing result; finally, it returns the task processing result through the network 104 to the terminal device 101, 102, 103 that initiated the task processing request.
The target generative large language model may be obtained through training by a model training application built into the server 105, as follows: first, a general-purpose generative large language model is trained on first training samples constructed from different tool applications and their corresponding function description documents, to obtain a first model, the general-purpose generative large language model being a generative large language model trained on general-purpose training samples; then, in a controllable running environment integrating the tool applications, the first model is controlled to automatically explore and invoke multiple tool applications in an attempt to fulfill, for a preset user input, the user requirement that the input represents, and the actual execution result finally output by the first model after multi-step invocation is acquired; next, the first model is updated through reinforcement learning according to the similarity between the standard execution result and the actual execution result corresponding to the same preset user input, to obtain a second model; finally, the second model is trained through reinforcement learning from human feedback, to obtain the target generative large language model.
Because training the target generative large language model requires considerable computing resources and computing power, the training method for the task-solution-oriented generative large language model provided by the subsequent embodiments of the present application is generally executed by the server 105, which has stronger computing power and more computing resources; correspondingly, the training apparatus for the task-solution-oriented generative large language model is also generally arranged in the server 105. However, when the terminal devices 101, 102, 103 also possess the required computing power and computing resources, they may complete, through the training application installed on them, the operations otherwise performed by the server 105 and output the same results. Correspondingly, the training apparatus for the task-solution-oriented generative large language model may also be provided in the terminal devices 101, 102, 103; in this case, the exemplary system architecture 100 may omit the server 105 and the network 104.
Of course, the server used to train the target generative large language model may differ from the server used to invoke the trained model. In particular, the target generative large language model trained on the server 105 may also be turned, through model distillation, into a lightweight target generative large language model suitable for the terminal devices 101, 102, 103; depending on the accuracy actually required, one may flexibly choose between using the lightweight model on the terminal devices 101, 102, 103 and using the more complex model on the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring to FIG. 2, FIG. 2 is a flowchart of a training method for a task-solution-oriented generative large language model provided by an embodiment of the present disclosure, where the flow 200 comprises the following steps:
Step 201: training a general-purpose generative large language model based on first training samples constructed from different tool applications and corresponding function description documents, to obtain a first model;
This step aims at training, by the execution body of the training method for the task-solution-oriented generative large language model (e.g., the server 105 shown in FIG. 1), the general-purpose generative large language model using the first training samples constructed from the different tool applications and their corresponding function description documents, to obtain the first model.
The function description document of a tool application describes the functions and functional characteristics of the tool application and explains how to use it and how its functions are realized; the description may take various forms such as text, code, images and videos, so as to describe the functions of the corresponding tool application as accurately and comprehensively as possible. In particular, the content of the function description document may be aggregated from multiple channels, such as official instruction manuals, official introductions, and usage guides and usage tips published by professionals on forums, as well as online tutorials, operation manuals, questions and answers from developer communities, case studies, blog articles and off-the-shelf API (Application Programming Interface) introductions obtained from other channels. By integrating these data and constructing a comprehensive knowledge base, the LLM can learn from them how the tool applications are used; the knowledge base enables the LLM to explore and understand the functions, limitations and application scenarios of the various tool applications through unsupervised learning, thereby significantly improving the versatility, adaptability and accuracy of the LLM in practical applications.
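As a purely illustrative sketch of such multi-channel aggregation (the disclosure prescribes no concrete data layout, and every name below is hypothetical), the knowledge-base entries might be organized as follows, in Python:

```python
from dataclasses import dataclass, field

@dataclass
class ToolKnowledgeEntry:
    """One tool application's aggregated function description (hypothetical layout)."""
    tool_name: str
    descriptions: list[str] = field(default_factory=list)  # snippets from all channels
    sources: list[str] = field(default_factory=list)       # e.g. "official_manual", "forum_guide"

def aggregate_descriptions(tool_name: str, channel_docs: dict[str, list[str]]) -> ToolKnowledgeEntry:
    """Merge snippets from channels such as official manuals, forum usage guides
    and developer-community Q&A into a single knowledge-base entry."""
    entry = ToolKnowledgeEntry(tool_name)
    for channel, snippets in channel_docs.items():
        for snippet in snippets:
            entry.descriptions.append(snippet.strip())
            entry.sources.append(channel)
    return entry
```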
In addition, the general-purpose generative large language model refers to a generative large language model trained on general-purpose training samples (i.e., the most conventional, basic training corpora). The first training samples are used to train this general-purpose model rather than an untrained initial model, so that the preliminary capabilities already encoded in the parameters of the existing general-purpose model shorten training time and improve training efficiency.
That is, this step trains the model with the first training samples constructed from the different tool applications and their corresponding function description documents, so that the first model preliminarily learns what capabilities each tool application has and how those capabilities are realized.
Step 202: in a controllable running environment integrating the tool applications, controlling the first model to automatically explore and invoke multiple tool applications in an attempt to fulfill, for a preset user input, the user requirement that the input represents, and acquiring the actual execution result finally output by the first model after multi-step invocation;
Based on step 201, this step aims at controlling, by the execution body, the first model in the controllable running environment integrating the tool applications to explore and attempt to invoke multiple tool applications to fulfill the actual user requirement represented by the preset user input, thereby obtaining the final actual execution result output by the first model after multi-step invocation.
It should be noted that, before step 202 is executed, a controllable running environment that hosts and supports each tool application and can be conveniently invoked by the first model may be created in advance. The controllable running environment restricts the range of objects each tool application can control and the types of operations it can execute (to prevent an application or device that should not be controlled from being erroneously driven to execute a wrong operation), and is isolated from the normal running environment so as to avoid unwanted effects on the data in the normal running environment.
Preferably, the controllable running environment may be embodied as a sandbox environment created in the form of a sandbox, which mainly guarantees safe operation. Within the sandbox, adapter scripts and an interaction protocol can be developed in advance so that the LLM can interact with the various tool applications: the scripts receive LLM instructions and automatically convert them into invocation commands of the tool applications, and automatically convert the results returned by the tool applications into a format the LLM can understand.
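A minimal sketch of such an adapter follows, assuming, purely for illustration, that the LLM emits shell-style command lines like those in the examples later in this description, and that a registry maps tool names to sandboxed callables; none of this is mandated by the disclosure:

```python
import shlex

# Hypothetical whitelist: only tools registered here may run in the sandbox.
TOOL_REGISTRY = {
    "image_editor_pro_add_text": lambda *args: f"edited: {args}",
    "image_to_pdf_converter": lambda *args: f"pdf: {args}",
}

def run_llm_instruction(instruction: str) -> str:
    """Convert an LLM-emitted command line into a tool invocation and convert
    the tool's return value back into plain text the LLM can understand."""
    parts = shlex.split(instruction)
    if not parts:
        return "ERROR: empty instruction"
    tool_name, args = parts[0], parts[1:]
    if tool_name not in TOOL_REGISTRY:  # enforce the controllable environment's limits
        return f"ERROR: tool '{tool_name}' is not available in this environment"
    return str(TOOL_REGISTRY[tool_name](*args))
```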
It should also be noted that, through the training of step 201, the first model has learned what capabilities each tool application has and how those capabilities are realized, but the accuracy and depth of this knowledge are still poor. Therefore, when step 202 controls the first model to respond to the actual user requirement represented by a preset user input by exploring and attempting, many actual execution results will inevitably fail to truly satisfy the user requirement.
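The multi-step exploration itself could then be driven by a loop of the following shape, where `propose_action` is a hypothetical interface of the first model that, given the user input and the interaction history, emits either another tool command or a final answer:

```python
def explore(first_model, user_input: str, max_steps: int = 8) -> str:
    """Let the model attempt tool invocations step by step until it declares a
    final result or the step budget runs out; returns the actual execution result."""
    history = []
    for _ in range(max_steps):
        action = first_model.propose_action(user_input, history)  # hypothetical API
        if action.is_final:
            return action.content  # actual execution result after multi-step invocation
        observation = run_llm_instruction(action.content)  # adapter from the sketch above
        history.append((action.content, observation))
    return history[-1][1] if history else ""  # fall back to the last observation
```

Many of these rollouts will fail, which is precisely the signal exploited by step 203.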
Step 203: updating the first model through reinforcement learning according to the similarity between the standard execution result and the actual execution result corresponding to the same preset user input, to obtain a second model;
Based on step 202, this step aims at updating, by the execution body, the first model through reinforcement learning (RL) according to the similarity between the standard execution result and the actual execution result corresponding to the same preset user input, to obtain the second model.
The standard execution result refers to a result that has been confirmed to correctly respond to the user requirement represented by the corresponding preset user input; it may be, for example, a text processing result, a speech processing result, an image processing result, a video processing result or a website development result.
Specifically, the execution body may calculate the similarity between the standard execution result and the actual execution result corresponding to the same preset user input, and then, through reinforcement learning, steer the first model's parameters in the direction that tends to produce actual execution results of greater similarity; that is, an actual execution result with greater similarity receives a higher score during the reinforcement learning training, encouraging the model to adjust toward outputs of greater similarity. Besides adjusting scores in this way, the same effect can be achieved through other similar means, which are not enumerated here.
Considering that the form of the execution result varies with different user requirements, the similarity calculation should be adapted to the form of the result. Taking a standard execution result that is a target image containing target image content as an example, the actual execution result may first be converted into an actual image of the same size as the target image, and the image similarity between the target image and the actual image then calculated; taking a standard execution result that is a target text containing target semantics as an example, the actual execution result may be converted into an actual text in the same language as the target text, and the text similarity and semantic similarity between the target text and the actual text then calculated. Similarity calculations for execution results of other forms may be selected adaptively and are not enumerated here.
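A sketch of such form-dependent similarity follows, using cosine similarity over hypothetical `embed_image` / `embed_text` feature extractors and assuming PIL-style image objects; the disclosure leaves the concrete metrics open:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def execution_similarity(standard, actual, kind: str, embed_image=None, embed_text=None) -> float:
    """Compare a standard and an actual execution result according to their form."""
    if kind == "image":
        actual = actual.resize(standard.size)  # first bring it to the target image's size
        return cosine(embed_image(standard), embed_image(actual))
    if kind == "text":
        # text similarity and semantic similarity could be mixed; shown here: semantic only
        return cosine(embed_text(standard), embed_text(actual))
    raise ValueError(f"unsupported result form: {kind}")
```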
Meanwhile, for complex tasks involving critical intermediate steps, the similarity between images at each key moment, the similarity between tool-application invocation sequences and the similarity between action execution durations may be flexibly combined when comparing an actual execution result with a standard execution result, and the control parameters of other devices driven by the model may even be incorporated to determine a comprehensive similarity, so that an evaluation as accurate as possible is obtained. Take the task of fetching a bottle of black tea from a refrigerator and placing it on a table as an example. To complete this task, the target generative large language model controls, through corresponding driver programs, a series of actions such as moving, opening and closing the refrigerator door, taking the item from its position in the refrigerator, and moving and placing the item while holding it. If only the before-and-after result of one more bottle on the table were compared, the execution accuracy of the whole action flow could not be comprehensively evaluated. Therefore the relevant key actions can be aggregated: with the help of cameras and execution logs, parameters recorded as images, text or other forms can be flexibly combined with the execution parameters of the key actions, and compared for similarity against the expected standard images, texts and parameters, to obtain the final comprehensive similarity.
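For such multi-step tasks, the comprehensive similarity could be a weighted mix of the individual signals named above; the weights in this sketch are purely illustrative:

```python
def comprehensive_similarity(keyframe_sims: list[float], call_seq_sim: float,
                             duration_sim: float, param_sim: float,
                             weights=(0.4, 0.3, 0.1, 0.2)) -> float:
    """Combine key-moment image similarities, tool-invocation-sequence similarity,
    action-duration similarity and control-parameter similarity into one score."""
    keyframe_avg = sum(keyframe_sims) / max(len(keyframe_sims), 1)
    w_img, w_seq, w_dur, w_par = weights
    return w_img * keyframe_avg + w_seq * call_seq_sim + w_dur * duration_sim + w_par * param_sim
```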
Step 204: training the second model through reinforcement learning from human feedback, to obtain the target generative large language model.
Based on step 203, this step aims at training, by the execution body, the second model through reinforcement learning from human feedback (RLHF), to obtain the target generative large language model. Specifically, the model trained with RLHF need not be the second model itself; it may also be another model obtained by further optimizing and adjusting the second model, so as to improve the effect of the finally obtained target generative large language model.
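The disclosure does not detail the RLHF step; as a reminder of the standard recipe only, the reward-modeling half is usually a pairwise preference loss like the following sketch (all interfaces hypothetical), after which the trained reward model supplies the reward signal for a policy-gradient update such as PPO of the second model:

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, prompt, preferred, rejected):
    """Bradley-Terry style objective: the human-preferred response should score
    higher than the rejected one for the same prompt."""
    r_pref = reward_model(prompt, preferred)  # scalar score tensors
    r_rej = reward_model(prompt, rejected)
    return -F.logsigmoid(r_pref - r_rej).mean()
```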
According to the training method for the task-solution-oriented generative large language model provided by this embodiment, the model is first trained on the first training samples constructed from different tool applications and their corresponding function description documents, so that the first model preliminarily learns what capabilities each tool application has and how those capabilities are realized. The first model is then controlled, in a controllable running environment, to solve the task requirements represented by preset user inputs by exploring and attempting to invoke various tool applications, and is updated through reinforcement learning according to the similarity between the actual and standard execution results, so that the resulting second model learns how to accurately invoke tool applications to fulfill user requirements. Finally, the second model is trained through reinforcement learning from human feedback, yielding the task-solution-oriented target generative large language model. Because no dedicated model-invocation interfaces or interface standards need to be designed for a large number of different tool applications in the process of obtaining the second model, and the model instead learns, without supervision, how to invoke tool applications to fulfill user requirements by exploring and attempting in the controllable running environment, the cost and time of constructing high-quality training samples are greatly reduced, and the efficiency of training the finally usable target generative large language model is improved.
Referring to FIG. 3, FIG. 3 is a flowchart of a method for constructing the first training samples according to an embodiment of the present disclosure, which provides a specific implementation of step 201 in the flow 200 shown in FIG. 2; the other steps of the flow 200 are unchanged, and a new complete embodiment is obtained by substituting this specific implementation for step 201. The flow 300 comprises the following steps:
Step 301: extracting function description content from at least one of official documents, user guides, online tutorials, operation manuals, developer community Q&A, instruction manuals and usage guides of each tool application, to obtain a function description text sequence;
Step 302: establishing a correspondence between the tool application name of each tool application and the corresponding function description text sequence, to obtain a first sample pair;
Step 303: constructing the first training samples based on the first sample pairs.
That is, this embodiment provides, through steps 301-303, a concrete scheme for constructing the first training samples: the function description content of each tool application is obtained from multiple channels and integrated into a function description text sequence, a first sample pair of "tool application name - function description text sequence" is established, and finally the first training samples are constructed from a large number of first sample pairs.
It should be noted that, when integrating content from at least one of official documents, user guides, online tutorials, operation manuals, developer community Q&A, instruction manuals and usage guides, a large number of divergent descriptions and much duplicated content will appear. The duplicated content can be removed by deduplication, and among divergent descriptions the credible ones can be selected based on information source, credibility and the like, finally yielding a function description text sequence that is as concise and accurately descriptive as possible.
In addition, the above embodiment integrates the function description content into a function description text sequence because function description information is usually expressed as text. For functions of a non-text-processing nature, the integrated function description content may instead be expressed as vectors or matrices converted from other forms such as images and speech, to facilitate recognition and processing by the model.
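A minimal sketch of steps 301-303 with the deduplication described above; the (name, text sequence) pair layout is an assumption, since the disclosure fixes no serialization format:

```python
def build_first_training_samples(tool_docs: dict[str, list[str]]) -> list[tuple[str, str]]:
    """tool_docs maps a tool application name to description snippets gathered from
    official documents, tutorials, manuals, developer community Q&A, etc."""
    samples = []
    for tool_name, snippets in tool_docs.items():
        deduped = list(dict.fromkeys(s.strip() for s in snippets))  # order-preserving dedup
        description_sequence = " ".join(deduped)  # the function description text sequence
        samples.append((tool_name, description_sequence))  # the first sample pair
    return samples
```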
Referring to FIG. 4, FIG. 4 is a flowchart of a method for constructing second training samples and training to obtain a third model according to an embodiment of the present disclosure, which provides a supplementary scheme inserted after step 201 and before step 202 of the flow 200 shown in FIG. 2. The flow 400 comprises the following steps:
Step 401: inputting sample user inputs into each tool application respectively, to obtain sample tool application outputs;
This step aims at inputting, by the execution body, pre-selected sample user inputs into the different tool applications respectively, and obtaining the sample tool application output that each tool application returns for the received sample user input.
Step 402: generating positive samples composed of a sample user input and a valid output, and negative samples composed of a sample user input and an invalid output, according to whether each sample tool application output is a valid output;
A valid output is a sample tool application output returned by a tool application for an input sample user input that it can recognize and process; an invalid output is one returned for an input that it cannot recognize or process.
Based on step 401, this step aims at generating, by the execution body, positive samples composed of a sample user input and a valid output and negative samples composed of a sample user input and an invalid output, according to whether each sample tool application output is a valid output. A positive sample indicates that the corresponding tool application can recognize and effectively process the current sample user input, i.e., that the sample user input matches and is associated with the tool application; a negative sample indicates that the corresponding tool application cannot recognize or effectively process the current sample user input, i.e., that the sample user input does not match the tool application and the tool application cannot respond to it effectively.
Step 403: constructing a second training sample based on the positive and negative samples;
step 404: and training the first model by using the second training sample to obtain a third model.
This step aims at training, by the execution body, the first model with the second training samples, which characterize whether different sample user inputs match and are associated with different tool applications, so that the resulting third model further learns, on the basis of the first model, what types of user input each tool application can correctly recognize, process and effectively respond to, giving the third model better task processing capability.
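Steps 401-404 might be realized along these lines; `call_tool` and `is_valid_output` are hypothetical helpers standing in for the sandboxed invocation and the validity check:

```python
def build_second_training_samples(tools, sample_user_inputs, call_tool, is_valid_output):
    """Probe every tool with every sample user input and label each pair as a
    positive (valid output) or negative (invalid output) sample."""
    positives, negatives = [], []
    for tool in tools:
        for user_input in sample_user_inputs:
            output = call_tool(tool, user_input)  # step 401
            if is_valid_output(output):           # step 402
                positives.append((user_input, tool, output))
            else:
                negatives.append((user_input, tool, output))
    return positives, negatives                   # step 403 builds samples from both
```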
Through the two training schemes provided by this embodiment, the LLM can automatically learn and explore the various APIs and functions of tool applications without manual intervention. This automatic exploration approach has broad application prospects: it can greatly reduce the cost of integrating tool applications and improve the generality and versatility of that integration.
Based on the supplementary scheme of steps 401-404, step 202 of the flow 200 correspondingly becomes: controlling the third model to automatically explore and invoke multiple tool applications in an attempt to fulfill, for the preset user input, the user requirement that the input represents, and obtaining the actual execution result finally output by the third model after multi-step invocation;
Step 203 correspondingly becomes: updating the third model through reinforcement learning according to the similarity between the standard execution result and the actual execution result corresponding to the same preset user input, to obtain the second model.
Referring to FIG. 5, FIG. 5 is a flowchart of a method for constructing third training samples and training to obtain a fourth model according to an embodiment of the present disclosure, which provides a supplementary scheme inserted after step 203 and before step 204 of the flow 200 shown in FIG. 2. The flow 500 comprises the following steps:
Step 501: determining, according to the similarities, the optimal actual execution result with the highest similarity among multiple actual execution results corresponding to the same preset user input;
Step 502: acquiring the optimal prompt chain with which the second model generated the optimal actual execution result;
The prompt chain refers to the chain of prompts the second model followed in the process of finally generating the optimal actual execution result, i.e., how multiple prompts are arranged in sequence; it can be concretely understood as the invocation order and invocation parameters of the multiple tool applications involved.
It should be noted that the second model described in this step may be the second model obtained with the supplementary scheme of the embodiment shown in FIG. 4, or the second model obtained without that scheme.
Step 503: and training the second model by using a third training sample formed by the preset user input and the matched optimal prompt chain to obtain a fourth model.
That is, through steps 501-503, this embodiment provides a scheme for constructing the third training samples and training the fourth model with them: the actual execution result with the highest similarity to the standard execution result is selected according to the calculated similarities and named the optimal actual execution result; the prompt chain with which the second model generated that optimal result is then traced back; and finally the second model is trained with third training samples of the form "preset user input - optimal prompt chain", so that the resulting fourth model, after this training, understands such content more deeply.
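Steps 501-503 could be sketched as follows, assuming each exploration record keeps the prompt chain (the ordered tool invocations with their parameters) alongside its similarity score:

```python
def build_third_training_samples(records_by_input: dict[str, list[dict]]) -> list[tuple[str, list[str]]]:
    """For every preset user input, pick the actual execution result with the
    highest similarity and pair the input with the prompt chain that produced it."""
    samples = []
    for user_input, records in records_by_input.items():
        best = max(records, key=lambda r: r["similarity"])  # step 501: optimal actual result
        samples.append((user_input, best["prompt_chain"]))  # steps 502-503
    return samples
```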
Based on the supplementary scheme of steps 501-503, step 204 of the flow 200 correspondingly becomes: training the fourth model through reinforcement learning from human feedback, to obtain the target generative large language model.
The above embodiments describe, from various aspects, how the target generative large language model is trained. To highlight, from an actual usage scenario, the effect exerted by the trained model, the present disclosure further provides a scheme for solving practical problems with the trained target generative large language model. A task processing method based on the generative large language model comprises the following steps:
Step 601: acquiring a task processing request described by a user in a natural language form;
In general, the task processing request should at least include a task processing instruction. When the instruction does not reference a file to be processed, the request need not include information on how to acquire such a file; when the instruction indicates that a file is to be processed, the request should at least also include information indicating how the file to be processed is to be acquired.
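As an illustrative data shape only (the disclosure defines no wire format), such a request could be modeled as:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaskProcessingRequest:
    instruction: str                   # the natural-language task processing instruction
    file_source: Optional[str] = None  # how to acquire the file to be processed, if any

# Without a file: TaskProcessingRequest("Summarize today's news")
# With a file:    TaskProcessingRequest("Convert this picture to PDF", "upload://input_image.jpg")
```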
Step 602: invoking a preset target generation type large language model to process a task processing request to obtain a returned task processing result;
step 603: and returning the task processing result to the user initiating the task request processing.
This embodiment provides a concrete way of fulfilling a user's task processing request with the target generative large language model: the task processing request is input into the target generative large language model, which, relying on its trained ability to understand the task and to decide automatically how and which tool applications to invoke to solve the user's requirement, finally returns the task processing result.
To enhance understanding, the present disclosure also gives specific implementation procedures by the following two examples:
example 1:
When the user input is specifically: "Please add a piece of text at an appropriate position of this picture: 'Welcome to the AI age!', add a shadow effect to the text, and after that convert the new picture into a PDF file", with the picture to be processed uploaded together with the task instruction text:
First, the target generative large language model automatically selects a picture editing tool (for example, a tool application named ImageEditorPro) to add the text and the shadow effect to the picture. Specifically, it generates the following command to call the tool, adding the text and shadow effect to the input picture (input_image.jpg) and saving the result as output_image_with_text.jpg:
image_editor_pro_add_text input_image.jpg output_image_with_text.jpg "Welcome to the AI age!" shadow_effect
Next, the target generative large language model automatically selects a picture-to-PDF tool (for example, a tool application named ImageToPdfConverter) to convert the picture into a PDF file:
image_to_pdf_converter output_image_with_text.jpg final_output.pdf
Finally, the target generative large language model receives the final PDF file, provides it to the user, and generates the corresponding reply text.
Example 2:
When the user inputs: "Identify all plant names in this poster and give a brief description of each plant", together with the accompanying picture file:
First, the target generative large language model automatically selects an image recognition tool (for example, a tool application named PlantIdentifier) to recognize the plant names in the picture; specifically, it may generate the following call command to recognize the plant names from the input picture input_image.jpg and save them to plant_names.json:
plant_identifier input_image.jpg plant_names.json
Next, the target generative large language model automatically selects an encyclopedia tool (for example, a tool application named PlantEncyclopedia) to query a brief description of each plant:
plant_encyclopedia_search plant_names.json plant_descriptions.json
Finally, the target generative large language model arranges the plant names and brief descriptions into a user-readable format, provides them to the user, and generates the corresponding reply text.
With further reference to FIG. 7 and FIG. 8, as implementations of the methods shown in the foregoing figures, the present disclosure provides an embodiment of a training apparatus for the task-solution-oriented generative large language model and an embodiment of a task processing apparatus based on the generative large language model, respectively. The training apparatus embodiment corresponds to the training method embodiment shown in FIG. 2, and the task processing apparatus embodiment corresponds to the task processing method embodiment shown in FIG. 6. The apparatuses can be applied to various electronic devices.
As shown in FIG. 7, the training apparatus 700 for the task-solution-oriented generative large language model of this embodiment may include: a first model training unit 701, an automatic exploration unit 702, a second model training unit 703, and a target generative large language model training unit 704. The first model training unit 701 is configured to train a general-purpose generative large language model based on first training samples constructed from different tool applications and corresponding function description documents, to obtain a first model, the general-purpose generative large language model being a generative large language model trained on general-purpose training samples; the automatic exploration unit 702 is configured to control, in a controllable running environment integrating the tool applications, the first model to automatically explore and invoke multiple tool applications in an attempt to fulfill, for a preset user input, the user requirement that the input represents, and to obtain the actual execution result finally output by the first model after multi-step invocation; the second model training unit 703 is configured to update the first model through reinforcement learning according to the similarity between the standard execution result and the actual execution result corresponding to the same preset user input, to obtain a second model; and the target generative large language model training unit 704 is configured to train the second model through reinforcement learning from human feedback, to obtain the target generative large language model.
In this embodiment, in the training apparatus 700 of the task solution-oriented generative large language model: specific processes and technical effects of the first model training unit 701, the automatic exploration unit 702, the second model training unit 703, and the object-generation-type large language model training unit 704 may refer to the relevant descriptions of steps 201-204 in the corresponding embodiment of fig. 2, and are not repeated herein.
In some optional implementations of the present embodiment, the first model training unit 701 may be further configured to:
extracting function description content from at least one of official documents, user guides, online tutorials, operation manuals, developer community Q&A, instruction manuals and usage guides of each tool application, to obtain a function description text sequence;
establishing a corresponding relation between a tool application name of each tool application and a corresponding function description text sequence to obtain a first sample pair;
a first training sample is constructed based on the first sample pair.
In some optional implementations of the present embodiment, the training apparatus 700 for generating a large language model for task solution may further include:
a controllable running environment creation unit configured to create a controllable running environment that accommodates, supports running each tool application and is convenient to be called by the first model; the controllable running environment limits the range of objects which each tool application can control, the type of executing operation and is isolated from the normal running environment.
In some alternative implementations of the present embodiment, the controllable operating environment is a sandbox environment created in the form of a sandbox.
In some optional implementations of the present embodiment, the second model training unit 703 may include:
a similarity calculating subunit configured to calculate a similarity between the standard execution result and the actual execution result corresponding to the same preset user input;
a reinforcement learning training control subunit configured to control the first model to adjust model parameters in a reinforcement learning manner in a direction that tends to be able to output actual execution results with a greater degree of similarity; wherein the actual execution result with higher similarity has higher score in the reinforcement learning training process.
In some optional implementations of the present embodiment, the similarity calculation subunit may be further configured to:
converting the actual execution result into an actual image with the same size as the target image in response to the standard execution result being the target image containing the target image content;
and calculating the image similarity between the target image and the actual image.
In some optional implementations of the present embodiment, the similarity calculation subunit may be further configured to:
Converting the actual execution result into an actual text in the same language as the target text in response to the standard execution result being the target text containing target semantics;
and calculating the text similarity and the semantic similarity between the target text and the actual text.
In some optional implementations of the present embodiment, the training apparatus 700 for generating a large language model for task solution may further include:
the sample input/output acquisition unit is configured to input sample user input into each tool application respectively to obtain sample tool application output;
a positive and negative sample acquisition unit configured to generate positive samples composed of a sample user input and a valid output, and negative samples composed of a sample user input and an invalid output, according to whether the sample tool application output is a valid output; a valid output is a sample tool application output returned by a tool application for an input sample user input that it can recognize and process, and an invalid output is one returned for an input that it cannot recognize or process;
a second training sample construction unit configured to construct a second training sample based on the positive sample and the negative sample;
The third model training unit is configured to train the first model by using the second training sample to obtain a third model;
correspondingly, the automatic exploration unit 702 may be further configured to:
controlling the third model to automatically explore and invoke multiple tool applications in an attempt to fulfill, for the preset user input, the user requirement that the input represents, and obtaining the actual execution result finally output by the third model after multi-step invocation;
correspondingly, the second model training unit 703 may be further configured to:
and updating the third model in a reinforcement learning mode according to the similarity between the standard execution result and the actual execution result which correspond to the same preset user input to obtain a second model.
In some optional implementations of the present embodiment, the training apparatus 700 for generating a large language model for task solution may further include:
an optimal actual execution result determining unit configured to determine, according to the similarity, the optimal actual execution result with the highest similarity from a plurality of actual execution results corresponding to the same preset user input;
an optimal prompt chain acquisition unit configured to acquire the optimal prompt chain with which the second model generated the optimal actual execution result;
a fourth model training unit configured to train the second model by using a third training sample composed of the preset user input and the matched optimal prompt chain to obtain a fourth model;
correspondingly, the target generative large language model training unit 704 may be further configured to:
train the fourth model in a reinforcement learning manner based on human feedback to obtain the target generative large language model.
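A compact sketch of selecting the optimal actual execution result and its prompt chain, assuming each exploration run over the same preset user input is recorded as a dictionary with illustrative similarity, user_input, and prompt_chain fields:

```python
def build_third_training_sample(runs: list[dict]) -> dict:
    """From several runs over the same preset user input, keep the prompt
    chain behind the actual execution result with the highest similarity."""
    best = max(runs, key=lambda run: run["similarity"])
    return {
        "user_input": best["user_input"],      # the preset user input
        "prompt_chain": best["prompt_chain"],  # the matched optimal prompt chain
    }
```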
As shown in fig. 8, the task processing device 800 based on the generative large language model of the present embodiment may include: a task processing request acquisition unit 801, a model calling and processing unit 802, and a processing result returning unit 803. The task processing request acquisition unit 801 is configured to acquire a task processing request described by a user in natural language form; the model calling and processing unit 802 is configured to call a preset target generative large language model to process the task processing request to obtain a returned task processing result; and the processing result returning unit 803 is configured to return the task processing result to the user who initiated the task processing request.
In the task processing device 800 based on the generative large language model of the present embodiment, the specific processing of the task processing request acquisition unit 801, the model calling and processing unit 802, and the processing result returning unit 803 corresponds to the relevant descriptions in the method embodiments, and is not repeated here.
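The three units of device 800 map naturally onto a thin wrapper around the trained model. A minimal sketch, assuming the target generative large language model exposes a hypothetical generate method that plans and executes tool calls internally:

```python
class TaskProcessor:
    """Mirrors device 800: request acquisition (801), model calling and
    processing (802), and result return (803)."""

    def __init__(self, target_model) -> None:
        self.model = target_model  # the trained target generative large language model

    def process(self, request_text: str) -> str:
        request = request_text.strip()          # unit 801: natural-language request
        result = self.model.generate(request)   # unit 802: model handles tool calls
        return result                           # unit 803: returned to the requesting user
```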
The training device for the task-solution-oriented generative large language model and the task processing device based on the generative large language model provided by the above embodiments exist as device embodiments corresponding to the method embodiments. The first training sample constructed from different tool applications and their corresponding functional description documents is used to train the model, so that the first model preliminarily learns what capabilities each tool application has and the associated knowledge of how to realize those capabilities. Under a controllable running environment, the first model is controlled to try and explore invoking various tool applications to solve the task requirements characterized by the preset user input, and is updated in a reinforcement learning manner according to the similarity between the actual execution result and the standard execution result, so that the resulting second model further learns the associated knowledge of how to accurately invoke tool applications to solve user requirements. Finally, the second model is trained in a reinforcement learning manner based on human feedback, yielding the task-solution-oriented target generative large language model. Because no dedicated interfaces or interface standards for model calls need to be designed for a massive number of different tool applications in the process of obtaining the second model, and the model automatically learns, without supervision, how to invoke tool applications to solve user requirements by exploring and trying in the controllable running environment, the cost and time of constructing high-quality training samples are greatly reduced, and the efficiency of training the finally available target generative large language model is improved.
According to an embodiment of the present disclosure, the present disclosure further provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to implement the training method of the task-solution-oriented generative large language model and/or the task processing method based on the generative large language model described in any of the above embodiments.
According to an embodiment of the present disclosure, there is further provided a readable storage medium storing computer instructions which, when executed, enable a computer to implement the training method of the task-solution-oriented generative large language model and/or the task processing method based on the generative large language model described in any of the above embodiments.
An embodiment of the present disclosure further provides a computer program product including a computer program which, when executed by a processor, implements the steps of the training method of the task-solution-oriented generative large language model and/or the steps of the task processing method based on the generative large language model described in any of the above embodiments.
Fig. 9 shows a schematic block diagram of an example electronic device 900 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the apparatus 900 includes a computing unit 901 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
Various components in device 900 are connected to I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, or the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, an optical disk, or the like; and a communication unit 909 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunications networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 901 performs the respective methods and processes described above, for example, a training method of a generative large language model for task solution and/or a task processing method based on the generative large language model. For example, in some embodiments, the training method of the generative large language model for task solution and/or the task processing method based on the generative large language model may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the above-described training method of the generative large language model for task solution and/or the task processing method based on the generative large language model may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured by any other suitable means (e.g., by means of firmware) to perform a training method of the generative large language model for task solution and/or a task processing method based on the generative large language model.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that overcomes the defects of difficult management and weak service scalability in traditional physical host and Virtual Private Server (VPS) services.
According to the technical solution of the embodiments of the present disclosure, the model is trained with the first training samples constructed from different tool applications and their corresponding functional description documents, so that the first model preliminarily learns what capabilities each tool application has and how to realize them. Under a controllable running environment, the first model tries and explores invoking a plurality of tool applications to solve the task requirements characterized by the preset user input, and is updated in a reinforcement learning manner according to the similarity between the actual execution results and the standard execution results, so that the updated second model further learns the associated knowledge of how to accurately invoke tool applications to solve user requirements. Finally, the second model is trained in a reinforcement learning manner based on human feedback to obtain the task-solution-oriented target generative large language model. Because no dedicated interfaces or interface standards for model calls need to be designed for a massive number of different tool applications in the process of obtaining the second model, and the model automatically learns, without supervision, how to invoke tool applications to solve user requirements by exploring and trying in the controllable running environment, the cost and time of constructing high-quality training samples are greatly reduced, and the efficiency of training the finally available target generative large language model is improved.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (25)

1. A training method of a task-solution-oriented generative large language model, comprising the following steps:
training a general generative large language model based on a first training sample constructed from different tool applications and corresponding functional description documents to obtain a first model; wherein the general generative large language model is generated based on general training samples;
under a controllable running environment integrated with each tool application, controlling the first model to, for a preset user input, automatically explore and invoke a plurality of tool applications in an attempt to fulfill the user requirement characterized by the preset user input, and obtaining an actual execution result that is output by the first model and finally obtained after multi-step invocation;
updating the first model in a reinforcement learning manner according to the similarity between the standard execution result and the actual execution result corresponding to the same preset user input, to obtain a second model;
training the second model in a reinforcement learning manner based on human feedback to obtain a target generative large language model.
2. The method of claim 1, wherein constructing the first training sample based on the different tool applications and the corresponding functional description documents comprises:
extracting functional description content from at least one of the official documents, user guides, online tutorials, operation manuals, developer community posts, instruction manuals, and usage guides of each tool application to obtain a functional description text sequence;
establishing a corresponding relation between a tool application name of each tool application and a corresponding function description text sequence to obtain a first sample pair;
the first training sample is constructed based on the first sample pair.
3. The method of claim 1, further comprising:
creating a controllable operating environment that accommodates and supports running each tool application and that can be conveniently invoked by the first model; wherein the controllable operating environment limits the range of objects each tool application can control and the types of operations it can perform, and is isolated from the normal operating environment.
4. A method according to claim 3, wherein the controllable operating environment is a sandbox environment created in the form of a sandbox.
5. The method of claim 1, wherein updating the first model in a reinforcement learning manner according to a similarity between a standard execution result and an actual execution result corresponding to the same preset user input comprises:
calculating the similarity between the standard execution result and the actual execution result which correspond to the same preset user input;
controlling the first model to adjust its model parameters, in a reinforcement learning manner, in the direction that tends to output actual execution results of greater similarity; wherein an actual execution result with higher similarity receives a higher score during the reinforcement learning training process.
6. The method of claim 5, wherein calculating the similarity between the standard execution result and the actual execution result corresponding to the same preset user input comprises:
converting the actual execution result into an actual image with the same size as the target image in response to the standard execution result being the target image containing target image content;
and calculating the image similarity between the target image and the actual image.
7. The method of claim 5, wherein calculating the similarity between the standard execution result and the actual execution result corresponding to the same preset user input comprises:
converting the actual execution result into an actual text in the same language as the target text in response to the standard execution result being a target text containing target semantics;
and calculating the text similarity and the semantic similarity between the target text and the actual text.
8. The method of claim 1, wherein after training the general generative large language model based on the first training sample constructed from different tool applications and corresponding functional description documents, and before controlling the first model to, for the preset user input, automatically explore and invoke a plurality of tool applications in an attempt to fulfill the user requirement characterized by the preset user input, the method further comprises:
respectively inputting sample user input into each tool application to obtain sample tool application output;
generating, according to whether the sample tool application output is a valid output, a positive sample composed of the sample user input and the valid output and a negative sample composed of the sample user input and the invalid output; wherein the valid output is a sample tool application output returned by the sample tool application for an input sample user input it can recognize and process, and the invalid output is a sample tool application output returned by the sample tool application for an input sample user input it cannot recognize or process;
Constructing a second training sample based on the positive sample and the negative sample;
training the first model by using the second training sample to obtain a third model;
correspondingly, the controlling the first model to, for the preset user input, automatically explore and invoke a plurality of tool applications in an attempt to fulfill the user requirement characterized by the preset user input, and obtaining the actual execution result that is output by the first model and finally obtained after multi-step invocation comprises:
controlling the third model to, for the preset user input, automatically explore and invoke a plurality of tool applications in an attempt to fulfill the user requirement characterized by the preset user input, and obtaining the actual execution result that is output by the third model and finally obtained after multi-step invocation;
correspondingly, the updating the first model in a reinforcement learning manner according to the similarity between the standard execution result and the actual execution result corresponding to the same preset user input to obtain the second model comprises:
updating the third model in a reinforcement learning manner according to the similarity between the standard execution result and the actual execution result corresponding to the same preset user input, to obtain the second model.
9. The method according to any one of claims 1-8, wherein after updating the first model in a reinforcement learning manner according to the similarity between the standard execution result and the actual execution result corresponding to the same preset user input, and before training the second model in a reinforcement learning manner based on human feedback to obtain the target generative large language model, the method further comprises:
According to the similarity, determining an optimal actual execution result with highest similarity from a plurality of actual execution results corresponding to the same preset user input;
acquiring an optimal prompt chain with which the second model generated the optimal actual execution result;
training the second model by using a third training sample formed by the preset user input and the matched optimal prompt chain to obtain a fourth model;
correspondingly, the training the second model in a reinforcement learning manner based on human feedback to obtain the target generative large language model comprises:
training the fourth model in a reinforcement learning manner based on human feedback to obtain the target generative large language model.
10. A task processing method based on a generative large language model, comprising the following steps:
acquiring a task processing request described by a user in a natural language form;
invoking a preset target generation type large language model to process the task processing request to obtain a returned task processing result; wherein the target generative large language model is obtained according to the training method of the task solution-oriented generative large language model of any one of claims 1 to 9;
and returning the task processing result to the user who initiated the task processing request.
11. The method according to claim 10, wherein, in response to the task processing request instructing processing of a file to be processed, the task processing request includes acquisition mode information for indicating how the file to be processed is acquired.
12. A training device for a task-solution-oriented generative large language model, comprising:
a first model training unit configured to train a general generative large language model based on a first training sample constructed from different tool applications and corresponding functional description documents to obtain a first model; wherein the general generative large language model is generated based on general training samples;
an automatic exploration unit configured to, under a controllable running environment integrated with each tool application, control the first model to, for a preset user input, automatically explore and invoke a plurality of tool applications in an attempt to fulfill the user requirement characterized by the preset user input, and obtain an actual execution result that is output by the first model and finally obtained after multi-step invocation;
a second model training unit configured to update the first model in a reinforcement learning manner according to the similarity between the standard execution result and the actual execution result corresponding to the same preset user input to obtain a second model;
and a target generative large language model training unit configured to train the second model in a reinforcement learning manner based on human feedback to obtain a target generative large language model.
13. The apparatus of claim 12, wherein the first model training unit is further configured to:
extracting functional description content from at least one of the official documents, user guides, online tutorials, operation manuals, developer community posts, instruction manuals, and usage guides of each tool application to obtain a functional description text sequence;
establishing a corresponding relation between a tool application name of each tool application and a corresponding function description text sequence to obtain a first sample pair;
the first training sample is constructed based on the first sample pair.
14. The apparatus of claim 12, further comprising:
a controllable running environment creation unit configured to create a controllable running environment that accommodates and supports running each tool application and that can be conveniently invoked by the first model; wherein the controllable running environment limits the range of objects each tool application can control and the types of operations it can perform, and is isolated from the normal operating environment.
15. The apparatus of claim 14, wherein the controllable operating environment is a sandbox environment created in a sandbox form.
16. The apparatus of claim 12, wherein the second model training unit comprises:
a similarity calculating subunit configured to calculate a similarity between the standard execution result and the actual execution result corresponding to the same preset user input;
a reinforcement learning training control subunit configured to control the first model to adjust its model parameters, in a reinforcement learning manner, in the direction that tends to output actual execution results of greater similarity; wherein an actual execution result with higher similarity receives a higher score during the reinforcement learning training process.
17. The apparatus of claim 16, wherein the similarity calculation subunit is further configured to:
converting the actual execution result into an actual image with the same size as the target image in response to the standard execution result being the target image containing target image content;
and calculating the image similarity between the target image and the actual image.
18. The apparatus of claim 16, wherein the similarity calculation subunit is further configured to:
converting the actual execution result into an actual text in the same language as the target text in response to the standard execution result being a target text containing target semantics;
and calculating the text similarity and the semantic similarity between the target text and the actual text.
19. The apparatus of claim 12, further comprising:
a sample input/output acquisition unit configured to input the sample user input into each tool application respectively to obtain a sample tool application output;
a positive and negative sample acquisition unit configured to generate, according to whether the sample tool application output is a valid output, a positive sample composed of the sample user input and the valid output and a negative sample composed of the sample user input and the invalid output; wherein the valid output is a sample tool application output returned by the sample tool application for an input sample user input it can recognize and process, and the invalid output is a sample tool application output returned by the sample tool application for an input sample user input it cannot recognize or process;
a second training sample construction unit configured to construct a second training sample based on the positive sample and the negative sample;
a third model training unit configured to train the first model by using the second training sample to obtain a third model;
correspondingly, the automatic exploration unit is further configured to:
controlling the third model to, for the preset user input, automatically explore and invoke a plurality of tool applications in an attempt to fulfill the user requirement characterized by the preset user input, and obtaining the actual execution result that is output by the third model and finally obtained after multi-step invocation;
correspondingly, the second model training unit is further configured to:
updating the third model in a reinforcement learning manner according to the similarity between the standard execution result and the actual execution result corresponding to the same preset user input, to obtain the second model.
20. The apparatus of any of claims 12-19, further comprising:
an optimal actual execution result determining unit configured to determine, according to the similarity, the optimal actual execution result with the highest similarity from a plurality of actual execution results corresponding to the same preset user input;
an optimal prompt chain acquisition unit configured to acquire the optimal prompt chain with which the second model generated the optimal actual execution result;
a fourth model training unit configured to train the second model by using a third training sample composed of the preset user input and the matched optimal prompt chain to obtain a fourth model;
correspondingly, the target generative large language model training unit is further configured to:
train the fourth model in a reinforcement learning manner based on human feedback to obtain the target generative large language model.
21. A task processing device based on a generative large language model, comprising:
a task processing request acquisition unit configured to acquire a task processing request described in a natural language form by a user;
a model calling and processing unit configured to call a preset target generative large language model to process the task processing request to obtain a returned task processing result; wherein the target generative large language model is obtained according to the training device of the task-solution-oriented generative large language model of any one of claims 12-20;
and a processing result returning unit configured to return the task processing result to the user who initiated the task processing request.
22. The apparatus of claim 21, wherein, in response to the task processing request instructing processing of a file to be processed, the task processing request includes acquisition mode information for indicating how the file to be processed is acquired.
23. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the training method of the task-solution-oriented generative large language model according to any one of claims 1 to 9 and/or the task processing method based on the generative large language model according to claim 10 or 11.
24. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the training method of the task-solution-oriented generative large language model according to any one of claims 1 to 9 and/or the task processing method based on the generative large language model according to claim 10 or 11.
25. A computer program product comprising a computer program which, when executed by a processor, implements the steps of the training method of the task-solution-oriented generative large language model according to any one of claims 1 to 9 and/or the steps of the task processing method based on the generative large language model according to claim 10 or 11.
CN202310622577.5A 2023-05-29 2023-05-29 Training method and using method of task solution-oriented generation type large language model Pending CN116756564A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310622577.5A CN116756564A (en) 2023-05-29 2023-05-29 Training method and using method of task solution-oriented generation type large language model

Publications (1)

Publication Number Publication Date
CN116756564A true CN116756564A (en) 2023-09-15

Family

ID=87958116

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310622577.5A Pending CN116756564A (en) 2023-05-29 2023-05-29 Training method and using method of task solution-oriented generation type large language model

Country Status (1)

Country Link
CN (1) CN116756564A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116992006A (en) * 2023-09-26 2023-11-03 武汉益模科技股份有限公司 Chain type natural language interaction method and system driven by large language model
CN116992006B (en) * 2023-09-26 2024-01-12 武汉益模科技股份有限公司 Chain type natural language interaction method and system driven by large language model
CN117111917A (en) * 2023-10-23 2023-11-24 福建自贸试验区厦门片区Manteia数据科技有限公司 Interaction method and device of medical auxiliary system, electronic equipment and storage medium
CN117111917B (en) * 2023-10-23 2024-02-27 福建自贸试验区厦门片区Manteia数据科技有限公司 Interaction method and device of medical auxiliary system, electronic equipment and storage medium
CN118469023A (en) * 2024-07-10 2024-08-09 阿里云飞天(杭州)云计算技术有限公司 Intelligent agent-based problem solving method, equipment, storage medium and product

Similar Documents

Publication Publication Date Title
KR102653312B1 (en) Method and apparatus for extracting event argument, electronic device
KR20210037619A (en) Multimodal content processing method, apparatus, device and storage medium
CN111414482B (en) Event argument extraction method and device and electronic equipment
CN116756564A (en) Training method and using method of task solution-oriented generation type large language model
US11749255B2 (en) Voice question and answer method and device, computer readable storage medium and electronic device
CN113378835B (en) Labeling model training, sample labeling method and related device
CN114861889B (en) Deep learning model training method, target object detection method and device
EP4057283A2 (en) Method for detecting voice, method for training, apparatuses and smart speaker
JP7384943B2 (en) Training method for character generation model, character generation method, device, equipment and medium
CN111241838B (en) Semantic relation processing method, device and equipment for text entity
US20220027575A1 (en) Method of predicting emotional style of dialogue, electronic device, and storage medium
CN116303962B (en) Dialogue generation method, training method, device and equipment for deep learning model
CN117114063A (en) Method for training a generative large language model and for processing image tasks
WO2023142451A1 (en) Workflow generation methods and apparatuses, and electronic device
CN112148836A (en) Multi-modal information processing method, device, equipment and storage medium
EP3961433A2 (en) Data annotation method and apparatus, electronic device and storage medium
CN116402914B (en) Method, device and product for determining stylized image generation model
CN112559715B (en) Attitude identification method, device, equipment and storage medium
CN113656546A (en) Multimodal search method, apparatus, device, storage medium, and program product
CN114078274A (en) Face image detection method and device, electronic equipment and storage medium
CN113792876A (en) Backbone network generation method, device, equipment and storage medium
CN117391067A (en) Content quality inspection method, device, equipment and storage medium
CN115880506B (en) Image generation method, model training method and device and electronic equipment
CN109002498B (en) Man-machine conversation method, device, equipment and storage medium
EP4145306A1 (en) Method and apparatus of processing data, electronic device, and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination