CN116861877A - Template construction method, device, equipment and storage medium based on reinforcement learning - Google Patents

Template construction method, device, equipment and storage medium based on reinforcement learning

Info

Publication number
CN116861877A
CN116861877A
Authority
CN
China
Prior art keywords
template
initial
candidate
text
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310831809.8A
Other languages
Chinese (zh)
Inventor
王俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202310831809.8A
Publication of CN116861877A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/166 - Editing, e.g. inserting or deleting
    • G06F40/186 - Templates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of artificial intelligence and the field of digital medical treatment, and discloses a template construction method, device, equipment and storage medium based on reinforcement learning, wherein the method comprises the following steps: generating an initial Prompt template; acquiring an initial text generated by a large language model based on the initial Prompt template; determining template evaluation information of the initial Prompt template according to text difference characteristics between the initial text and the real text; judging whether the initial Prompt template is the preferred Prompt template according to its template evaluation information; if the initial Prompt template is not the preferred Prompt template, acquiring a plurality of candidate Prompt templates learned by a preset reinforcement learning algorithm based on the initial Prompt template; acquiring candidate texts generated by the large language model based on the candidate Prompt templates, and determining the template evaluation information corresponding to each candidate Prompt template according to text difference characteristics between the candidate text and the real text; and taking the candidate Prompt template ranked first in generalization capability as the preferred Prompt template. The invention can improve the efficiency of constructing Prompt templates.

Description

Template construction method, device, equipment and storage medium based on reinforcement learning
Technical Field
The invention relates to the technical field of artificial intelligence and the field of digital medical treatment, and in particular to a template construction method, device, equipment and storage medium based on reinforcement learning.
Background
A Prompt template is an input text or query for a Large Language Model (LLM) that guides the model toward generating a desired output or response. A Prompt thus acts as a lightweight form of programming that allows the model's output and the interaction with the model to be customized.
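For illustration, a Prompt template is typically a text skeleton with one or more slots that are filled with the task input before being sent to the model. The following Python snippet is a minimal hypothetical example in the spirit of the translation task used later in this disclosure; the wording of the template is an assumption, not taken from the disclosure:

# A hypothetical Prompt template for an English-to-French translation task.
# The {input_text} slot is filled at query time before the string is sent
# to the large language model.
PROMPT_TEMPLATE = (
    "Translate the following English sentence into French.\n"
    "English: {input_text}\n"
    "French:"
)
print(PROMPT_TEMPLATE.format(input_text="I like this book."))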
However, in existing template construction methods a developer must construct the Prompt template manually, which is time-consuming and labor-intensive, prolongs construction, does nothing for construction efficiency, and adds to the developer's workload. The reason is that manual construction relies on human experience and knowledge to design a suitable text template, which has the following disadvantages:
1. Manually constructing a Prompt template takes a lot of time and effort, and the effect is difficult to guarantee.
2. A manually constructed Prompt template may be biased or misleading, causing the large language model to generate inaccurate content or output irrelevant content.
3. A manually constructed Prompt template is difficult to adapt to changes across different tasks or data sets and requires constant adjustment or redesign.
Therefore, in the prior art the process of constructing a Prompt template is cumbersome, and the constructed Prompt template does little to improve generalization capability.
Disclosure of Invention
The invention provides a template construction method, device, computer equipment and storage medium based on reinforcement learning, which solve the problems that the process of constructing a Prompt template is cumbersome and that the constructed Prompt template does little to improve generalization capability.
In a first aspect, a method for constructing a template based on reinforcement learning is provided, including:
acquiring a preset input text and a preset output label, and generating an initial Prompt template by using a preset meta-model, the input text and the output label;
inputting the initial Prompt template into a pre-trained large language model, and acquiring an initial text generated by the large language model based on the initial Prompt template;
determining template evaluation information of the initial Prompt template according to text difference characteristics between the initial text and the real text;
judging whether the initial Prompt template is the preferred Prompt template according to the template evaluation information of the initial Prompt template;
if the initial Prompt template is not the preferred Prompt template, acquiring a plurality of candidate Prompt templates learned by a preset reinforcement learning algorithm based on the initial Prompt template;
acquiring a candidate text generated by the large language model based on each candidate Prompt template, and determining the template evaluation information corresponding to the candidate Prompt template according to text difference characteristics between the candidate text and the real text;
and determining the generalization capability of each candidate Prompt template according to the template evaluation information corresponding to each candidate Prompt template, ranking the generalization capabilities from high to low, and taking the candidate Prompt template ranked first as the preferred Prompt template.
In a second aspect, there is provided a template construction apparatus based on reinforcement learning, including:
the acquisition module is used for acquiring a preset input text and a preset output label, and generating an initial Prompt template by using a preset meta-model, the input text and the output label;
the input module is used for inputting the initial Prompt template into a pre-trained large language model and acquiring an initial text generated by the large language model based on the initial Prompt template;
the determining module is used for determining template evaluation information of the initial Prompt template according to text difference characteristics between the initial text and the real text;
the judging module is used for judging whether the initial Prompt template is the preferred Prompt template according to the template evaluation information of the initial Prompt template;
the learning module is used for acquiring a plurality of candidate Prompt templates learned by a preset reinforcement learning algorithm based on the initial Prompt template if the initial Prompt template is not the preferred Prompt template;
the evaluation module is used for acquiring a candidate text generated by the large language model based on each candidate Prompt template and determining the template evaluation information corresponding to the candidate Prompt template according to text difference characteristics between the candidate text and the real text;
and the preference module is used for determining the generalization capability of each candidate Prompt template according to the template evaluation information corresponding to each candidate Prompt template, ranking the generalization capabilities from high to low, and taking the candidate Prompt template ranked first as the preferred Prompt template.
In a third aspect, a computer device is provided, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the reinforcement-learning-based template construction method described above when executing the computer program.
In a fourth aspect, a computer-readable storage medium is provided, in which a computer program is stored, which when executed by a processor, implements the steps of the reinforcement learning-based template construction method described above.
In the solutions implemented by the above template construction method, device, equipment and storage medium based on reinforcement learning, on the one hand, a preset input text and a preset output label are acquired, and an initial Prompt template can be generated by using a preset meta-model, the input text and the output label, which removes the cumbersome manual construction process and improves the efficiency of constructing Prompt templates; on the other hand, the generalization capabilities are ranked from high to low and the candidate Prompt template ranked first is taken as the preferred Prompt template, so that the preferred Prompt template is identified among the constructed Prompt templates; the preferred Prompt template improves generalization capability and thus the quality of the text output by the large language model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and that other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a schematic view of an application environment of a reinforcement learning-based template construction method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a reinforcement learning-based template construction method according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating step S21 in FIG. 2;
FIG. 4 is a flowchart illustrating step S24 in FIG. 2;
FIG. 5 is a flowchart illustrating step S25 in FIG. 2;
FIG. 6 is a schematic diagram of a template construction apparatus based on reinforcement learning according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a computer device according to an embodiment of the invention;
FIG. 8 is a schematic diagram of another structure of a computer device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention is made clearly and fully with reference to the accompanying drawings. It is evident that the described embodiments are some, but not all, embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort are intended to fall within the scope of the invention.
The template construction method based on reinforcement learning provided by the embodiment of the invention can be applied to an application environment as shown in fig. 1, wherein a client communicates with a server through a network.
The server side can acquire a preset input text and a preset output label through the client side, and generate an initial Prompt template by using a preset meta-model, the input text and the output label;
input the initial Prompt template into a pre-trained large language model, and acquire an initial text generated by the large language model based on the initial Prompt template;
determine template evaluation information of the initial Prompt template according to text difference characteristics between the initial text and the real text;
judge whether the initial Prompt template is the preferred Prompt template according to the template evaluation information of the initial Prompt template;
if the initial Prompt template is not the preferred Prompt template, acquire a plurality of candidate Prompt templates learned by a preset reinforcement learning algorithm based on the initial Prompt template;
acquire a candidate text generated by the large language model based on each candidate Prompt template, and determine the template evaluation information corresponding to the candidate Prompt template according to text difference characteristics between the candidate text and the real text;
and determine the generalization capability of each candidate Prompt template according to the template evaluation information corresponding to each candidate Prompt template, rank the generalization capabilities from high to low, and take the candidate Prompt template ranked first as the preferred Prompt template.
The method and the device have the advantage that, on the one hand, a preset input text and a preset output label are acquired, and an initial Prompt template can be generated by using a preset meta-model, the input text and the output label, which removes the cumbersome manual construction process and improves the efficiency of constructing Prompt templates; on the other hand, the generalization capabilities are ranked from high to low and the candidate Prompt template ranked first is taken as the preferred Prompt template, so that the preferred Prompt template is identified among the constructed Prompt templates; the preferred Prompt template improves generalization capability and thus the quality of the text output by the large language model.
The clients may be, but are not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
The server may be implemented by a stand-alone server or a server cluster formed by a plurality of servers. The present invention will be described in detail with reference to specific examples.
Referring to FIG. 2, FIG. 2 is a schematic flow chart of a template construction method based on reinforcement learning according to an embodiment of the invention, which includes the following steps:
S21, acquiring a preset input text and a preset output label, and generating an initial Prompt template by using a preset meta-model, the input text and the output label.
Specifically, a preset input text and a preset output label are acquired, the loss function of the meta-model is acquired, and whether the value of the loss function of the meta-model is within a preset range is judged;
and if the value of the loss function of the meta-model is within the preset range, the initial Prompt template is generated by using the meta-model, the input text and the output label.
Wherein the loss function of the meta-model is:

$$\mathcal{L}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\ell_{\mathrm{CE}}\bigl(y_i,\hat{y}_i\bigr)$$

where $\mathcal{L}(\theta)$ represents the loss function of the meta-model, $\theta$ represents the parameters of the meta-model, $N$ represents the number of training data, $\ell_{\mathrm{CE}}$ represents the cross-entropy loss function, $y_i$ represents the real Prompt template of the i-th training datum, and $\hat{y}_i$ represents the initial Prompt template generated from the i-th training datum.
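As a minimal Python sketch of this objective (token-level cross-entropy over the generated template and the tensor shapes are assumptions; the disclosure does not fix these details):

import torch
import torch.nn.functional as F

def meta_model_loss(template_logits: torch.Tensor,
                    real_template_ids: torch.Tensor) -> torch.Tensor:
    """Average cross-entropy between generated and real Prompt templates.

    template_logits:   (N, seq_len, vocab_size) meta-model logits, one
                       sequence of template tokens per training datum.
    real_template_ids: (N, seq_len) token ids of the real Prompt templates.
    """
    n, seq_len, vocab = template_logits.shape
    # Flatten so every token position contributes, then average over the
    # N training data (and their token positions).
    return F.cross_entropy(template_logits.reshape(n * seq_len, vocab),
                           real_template_ids.reshape(n * seq_len))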
S22, inputting the initial Prompt template into a pre-trained large language model, and acquiring an initial text generated by the large language model based on the initial Prompt template;
S23, determining template evaluation information of the initial Prompt template according to text difference characteristics between the initial text and the real text;
S24, judging whether the initial Prompt template is the preferred Prompt template according to the template evaluation information of the initial Prompt template.
Specifically, the parameter value of the evaluation index in the template evaluation information of the initial Prompt template is acquired;
if the parameter value of the evaluation index in the template evaluation information of the initial Prompt template is not within a preset range, it is judged that the initial Prompt template is not the preferred Prompt template;
and if the parameter value of the evaluation index in the template evaluation information of the initial Prompt template is within the preset range, the initial Prompt template is judged to be the preferred Prompt template.
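A minimal Python sketch of this judgment (the index name "f1" and the bounds of the preset range are assumptions for illustration):

def is_preferred(evaluation_info: dict, preset_range=(0.9, 1.0)) -> bool:
    """Judge whether the initial Prompt template is the preferred Prompt
    template by checking whether the parameter value of its evaluation
    index lies within the preset range."""
    low, high = preset_range
    return low <= evaluation_info["f1"] <= high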
S25, if the initial Prompt template is not the preferred Prompt template, acquiring a plurality of candidate Prompt templates learned by a preset reinforcement learning algorithm based on the initial Prompt template;
For ease of illustration, the reinforcement learning process is as follows:
Initialization: the parameters of the policy model are randomly initialized, and the initial state is set to the initial Prompt template.
Loop: the following steps are repeated until the maximum number of iterations is reached or the termination condition is met:
(1) Sampling: an action is sampled based on the current state and the policy model.
(2) Execution: the corresponding operation is executed according to the sampled action to obtain a candidate Prompt template.
(3) Evaluation: a candidate text is generated from the candidate Prompt template and the large language model, and the reward is calculated according to the evaluation index.
(4) Storage: the current state, the sampled action, the reward and the new state are stored in an experience replay pool.
(5) Update: a batch of data is randomly drawn from the experience replay pool, and the parameters of the policy model are updated according to the reward signal.
(6) Transfer: the new state is assigned to the current state, and the next round of the loop begins.
Output: the preferred Prompt template and its corresponding evaluation index are output.
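A compact Python sketch of this loop follows; the policy, action, large-language-model and reward interfaces are assumptions for illustration, since the disclosure names the steps but not concrete APIs:

import random

def optimize_prompt_template(initial_template, policy, llm, reward_fn,
                             max_iters=100, reward_threshold=0.9):
    """Reinforcement-learning loop over Prompt templates, following the
    sample -> execute -> evaluate -> store -> update -> transfer steps."""
    state = initial_template            # initial state = initial Prompt template
    replay_pool = []                    # experience replay pool
    best_template, best_reward = state, float("-inf")
    for _ in range(max_iters):
        action = policy.sample(state)              # (1) sample an action
        candidate = action.apply(state)            # (2) execute -> candidate template
        candidate_text = llm.generate(candidate)   # (3) generate candidate text
        reward = reward_fn(candidate_text)         #     reward from the evaluation index
        replay_pool.append((state, action, reward, candidate))   # (4) store
        batch = random.sample(replay_pool, min(32, len(replay_pool)))
        policy.update(batch)                       # (5) update policy parameters
        state = candidate                          # (6) transfer to the new state
        if reward > best_reward:
            best_template, best_reward = candidate, reward
        if reward >= reward_threshold:             # termination condition
            break
    return best_template, best_reward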
The application adopts a hierarchical structure, dividing the construction and use of the Prompt template into two levels: the high level is responsible for generating or selecting the initial Prompt template, and the low level is responsible for optimizing, combining and adjusting the Prompt template. In this way, manual knowledge and data information can be fully utilized, exploration and exploitation are balanced, and the flexibility and stability of the Prompt template are improved.
S26, acquiring a candidate text generated by the large language model based on each candidate Prompt template, and determining the template evaluation information corresponding to the candidate Prompt template according to the text difference characteristics between the candidate text and the real text;
S27, determining the generalization capability of each candidate Prompt template according to the template evaluation information corresponding to each candidate Prompt template, ranking the generalization capabilities from high to low, and taking the candidate Prompt template ranked first as the preferred Prompt template.
For example, a score for the generalization capability of each candidate Prompt template is determined according to its corresponding template evaluation information, the scores are ranked, and the candidate Prompt template with the highest score is taken as the preferred Prompt template.
Wherein S27 includes:
determining the generalization capability of each candidate Prompt template according to the evaluation index in its corresponding template evaluation information, wherein the evaluation index comprises one of, or a combination of, accuracy, recall and F1 value;
and ranking the generalization capabilities from high to low, and taking the candidate Prompt template ranked first as the preferred Prompt template.
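A minimal Python sketch of this ranking step (using the F1 value from the template evaluation information as the generalization score is an assumption of this sketch; any of the listed indexes or a combination could serve):

def select_preferred_template(candidates):
    """Rank candidate Prompt templates by generalization capability and
    return the top-ranked one as the preferred Prompt template.

    candidates: list of (template, evaluation_info) pairs, where
    evaluation_info is a dict such as {"f1": 0.92, "recall": 0.90}.
    """
    ranked = sorted(candidates, key=lambda c: c[1]["f1"], reverse=True)
    return ranked[0][0]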
Wherein, after determining the generalization capability of each candidate Prompt template according to the template evaluation information corresponding to each candidate Prompt template, ranking the generalization capabilities from high to low, and taking the candidate Prompt template ranked first as the preferred Prompt template, the method further comprises:
inputting the preferred Prompt template into a pre-trained large language model, and acquiring the preferred text generated by the large language model based on the preferred Prompt template.
For ease of illustration, an example of initial Prompt template generation and optimization is given below:
Assume that an English-to-French translation task is to be solved, given the input text:
I like this book.
It is desired to generate an appropriate Prompt template that directs the large language model to produce the correct French translation.
First, given the input text and the output label, the meta-model is used to generate an initial Prompt template.
Then, using the large language model, a candidate text is generated given the initial Prompt template, and the reward is calculated according to the evaluation index. Assume that the output text generated by the large language model is:
J'aime ce livre.
The F1 value is calculated from the output text and the real text (J'aime ce livre.), giving the reward. If the reward reaches a preset threshold, the Prompt template is used directly; otherwise, the next step of optimization is entered. For example, the F1 value calculated from the output text and the real text (J'aime ce livre.) gives a reward of 1.0; if the preset threshold is 0.9, the reward reaches the threshold and the Prompt template is used directly.
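A minimal Python sketch of such a token-level F1 reward (whitespace tokenization is an assumption; the disclosure does not specify how texts are tokenized):

from collections import Counter

def f1_reward(output_text: str, real_text: str) -> float:
    """Token-level F1 between the model output and the real text, used
    as the reward signal."""
    out_tokens, ref_tokens = output_text.split(), real_text.split()
    overlap = sum((Counter(out_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(out_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Identical output and reference give a reward of 1.0, which meets the
# preset threshold of 0.9 in the example above.
print(f1_reward("J'aime ce livre.", "J'aime ce livre."))  # 1.0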
Then, the initial Prompt template is optimized, combined and adjusted by the reinforcement learning method to improve its effect. The reinforcement learning process is as follows:
(1) Initialization: the parameters of the policy model are randomly initialized, and the initial state is set to the initial Prompt template.
(2) Loop: the following steps are repeated until the maximum number of iterations is reached or the termination condition is met:
sampling: an action is sampled based on the current state and the policy model.
For example, the action is: add an example;
the object is: English: He likes dogs.
French: Il aime les chiens.
The content is: a group of input-output examples is added at the end of the example section of the current Prompt template.
Execution: the corresponding operation is executed according to the sampled action to obtain a candidate Prompt template.
Evaluation: a candidate text is generated from the candidate Prompt template and the large language model, and the reward is calculated according to the evaluation index. Assume that the output text generated by the large language model is:
J'aime ce livre.
The F1 value is calculated from the output text and the real text (J'aime ce livre.), giving a reward of 1.0.
Storage: the current state, the sampled action, the reward and the new state are stored in the experience replay pool.
Update: a batch of data is randomly drawn from the experience replay pool, and the parameters of the policy model are updated according to the reward signal.
Transfer: the new state is assigned to the current state, and the next round of the loop begins.
(3) Output: the final Prompt template and its corresponding evaluation index are output.
The method and the device have the advantage that, on the one hand, a preset input text and a preset output label are acquired, and an initial Prompt template can be generated by using a preset meta-model, the input text and the output label, which removes the cumbersome manual construction process and improves the efficiency of constructing Prompt templates; on the other hand, the generalization capabilities are ranked from high to low and the candidate Prompt template ranked first is taken as the preferred Prompt template, so that the preferred Prompt template is identified among the constructed Prompt templates; the preferred Prompt template improves generalization capability and thus the quality of the text output by the large language model.
Referring to FIG. 3, FIG. 3 is a flowchart of step S21 in FIG. 2, which is described in detail below:
S31, acquiring a plurality of tasks or data sets, and training the meta-model by using the plurality of tasks or data sets;
S32, acquiring a preset input text and a preset output label, and generating an initial Prompt template by using the trained meta-model, the input text and the output label.
In the embodiment of the invention, the preset input text and the preset output label are acquired, and the initial Prompt template is generated by using the trained meta-model, the input text and the output label, which removes the cumbersome construction process and reduces the cost and difficulty of the Prompt template: the time and energy spent manually writing and adjusting the Prompt template are reduced, the requirements on expertise and skill are lowered, and constructing and using the Prompt template becomes more convenient and efficient.
Referring to FIG. 4, FIG. 4 is a flowchart of step S24 in FIG. 2, which is described in detail below:
S41, acquiring the evaluation index in the template evaluation information of the initial Prompt template, wherein the evaluation index comprises one of, or a combination of, accuracy, recall and F1 value;
S42, judging whether the initial Prompt template is the preferred Prompt template according to the evaluation index in the template evaluation information of the initial Prompt template.
In the embodiment of the invention, whether the initial Prompt template is the preferred Prompt template is judged according to the evaluation index in the template evaluation information of the initial Prompt template, which helps determine the preferred Prompt template quickly.
Referring to FIG. 5, FIG. 5 is a flowchart of step S25 in FIG. 2, which is described in detail below:
S51, if the initial Prompt template is not the preferred Prompt template, inputting the initial Prompt template into a preset reinforcement learning algorithm;
S52, acquiring a first candidate Prompt template learned by the reinforcement learning algorithm based on the initial Prompt template;
wherein S52 includes: acquiring the current state of the initial Prompt template and a preset policy model in the reinforcement learning algorithm;
generating a learning action according to the current state and the policy model;
and adding a group of input-output examples at the end of the example section of the initial Prompt template according to the learning action to obtain the first candidate Prompt template learned based on the initial Prompt template.
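A minimal Python sketch of this "add an example" action on a text template; the template layout (an example section above a "---" separator) is a hypothetical assumption, as the disclosure does not fix one:

def add_example_action(template: str, example_input: str,
                       example_output: str) -> str:
    """Append one input-output example at the end of the example section
    of the current Prompt template, yielding a candidate Prompt template."""
    example_section, separator, query_section = template.partition("---")
    new_example = f"English: {example_input}\nFrench: {example_output}\n"
    return example_section + new_example + separator + query_section

template = ("Translate English to French.\n"
            "English: I like this book.\nFrench: J'aime ce livre.\n"
            "---\nEnglish: {input_text}\nFrench:")
candidate = add_example_action(template, "He likes dogs.", "Il aime les chiens.")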
S53, taking the first candidate Prompt template as the next input Prompt template, using the reinforcement learning algorithm to mine the next input Prompt template and generate a second candidate Prompt template, and performing iterative loop processing based on the second candidate Prompt template to obtain a plurality of candidate Prompt templates generated by the reinforcement learning algorithm during the iterative loop processing.
In the embodiment of the invention, a plurality of candidate Prompt templates generated by the reinforcement learning algorithm during the iterative loop processing are obtained, so that Prompt templates can be automatically generated and optimized by the reinforcement learning method, making them better fit task requirements and language conventions and improving the accuracy, fluency and diversity of the text output by the large language model based on the Prompt templates.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an execution order; the execution order of the processes should be determined by their functions and internal logic, and should not limit the implementation of the embodiments of the present invention.
In one embodiment, a template construction apparatus based on reinforcement learning is provided, and the reinforcement-learning-based template construction apparatus corresponds one-to-one to the reinforcement-learning-based template construction method in the above embodiment.
Referring to FIG. 6, FIG. 6 is a schematic structural diagram of a template construction apparatus based on reinforcement learning according to an embodiment of the present invention. As shown in FIG. 6, the template construction apparatus based on reinforcement learning includes an acquisition module 101, an input module 102, a determining module 103, a judging module 104, a learning module 105, an evaluation module 106 and a preference module 107. The functional modules are described in detail as follows:
the acquisition module 101 is configured to acquire a preset input text and a preset output label, and generate an initial Prompt template by using a preset meta-model, the input text and the output label;
the input module 102 is configured to input the initial Prompt template into a pre-trained large language model, and obtain an initial text generated by the large language model based on the initial Prompt template;
the determining module 103 is configured to determine template evaluation information of the initial Prompt template according to text difference characteristics between the initial text and the real text;
the judging module 104 is configured to judge whether the initial Prompt template is the preferred Prompt template according to the template evaluation information of the initial Prompt template;
the learning module 105 is configured to obtain a plurality of candidate Prompt templates learned by a preset reinforcement learning algorithm based on the initial Prompt template if the initial Prompt template is not the preferred Prompt template;
the evaluation module 106 is configured to obtain a candidate text generated by the large language model based on each candidate Prompt template, and determine the template evaluation information corresponding to the candidate Prompt template according to text difference characteristics between the candidate text and the real text;
and the preference module 107 is configured to determine the generalization capability of each candidate Prompt template according to the template evaluation information corresponding to each candidate Prompt template, rank the generalization capabilities from high to low, and take the candidate Prompt template ranked first as the preferred Prompt template.
In one embodiment, the acquisition module 101 is specifically configured to:
acquire a plurality of tasks or data sets, and train the meta-model by using the plurality of tasks or data sets;
and acquire a preset input text and a preset output label, and generate an initial Prompt template by using the trained meta-model, the input text and the output label.
In one embodiment, the judging module 104 is specifically configured to:
acquire the evaluation index in the template evaluation information of the initial Prompt template, wherein the evaluation index comprises one of, or a combination of, accuracy, recall and F1 value;
and judge whether the initial Prompt template is the preferred Prompt template according to the evaluation index in the template evaluation information of the initial Prompt template.
In one embodiment, the learning module 105 is specifically configured to:
if the initial Prompt template is not the preferred Prompt template, input the initial Prompt template into a preset reinforcement learning algorithm;
acquire a first candidate Prompt template learned by the reinforcement learning algorithm based on the initial Prompt template;
and take the first candidate Prompt template as the next input Prompt template, use the reinforcement learning algorithm to mine the next input Prompt template and generate a second candidate Prompt template, and perform iterative loop processing based on the second candidate Prompt template to obtain a plurality of candidate Prompt templates generated by the reinforcement learning algorithm during the iterative loop processing.
In one embodiment, the preference module 107 is specifically configured to:
determine the generalization capability of each candidate Prompt template according to the evaluation index in its corresponding template evaluation information, wherein the evaluation index comprises one of, or a combination of, accuracy, recall and F1 value;
and rank the generalization capabilities from high to low, and take the candidate Prompt template ranked first as the preferred Prompt template.
In one embodiment, the learning module 105 is further specifically configured to:
acquire the current state of the initial Prompt template and a preset policy model in the reinforcement learning algorithm;
generate a learning action according to the current state and the policy model;
and add a group of input-output examples at the end of the example section of the initial Prompt template according to the learning action to obtain the first candidate Prompt template learned based on the initial Prompt template.
In an embodiment, the reinforcement-learning-based template construction apparatus further includes:
a generation module, configured to input the preferred Prompt template into a pre-trained large language model, and obtain the preferred text generated by the large language model based on the preferred Prompt template.
The method and the device have the advantage that, on the one hand, a preset input text and a preset output label are acquired, and an initial Prompt template can be generated by using a preset meta-model, the input text and the output label, which removes the cumbersome manual construction process and improves the efficiency of constructing Prompt templates; on the other hand, the generalization capabilities are ranked from high to low and the candidate Prompt template ranked first is taken as the preferred Prompt template, so that the preferred Prompt template is identified among the constructed Prompt templates; the preferred Prompt template improves generalization capability and thus the quality of the text output by the large language model.
For specific limitations regarding the reinforcement learning-based template construction apparatus, reference may be made to the above limitations regarding the reinforcement learning-based template construction method, and no further description is given here.
The respective modules in the reinforcement-learning-based template construction apparatus described above may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware in, or independent of, a processor in the computer device, or may be stored in software in a memory of the computer device, so that the processor can call and execute the operations corresponding to the above modules.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a computer device according to an embodiment of the present invention, and in one embodiment, a computer device is provided, where the computer device may be a server, and an internal structure diagram of the computer device may be as shown in fig. 7. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus.
Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes non-volatile and/or volatile storage media and internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the computer device is for communicating with an external client via a network connection. The computer program, when executed by a processor, performs functions or steps on the server side of a reinforcement learning-based template construction method.
Referring to fig. 8, fig. 8 is another schematic structural diagram of a computer device according to an embodiment of the present invention, and in one embodiment, a computer device is provided, and the computer device may be a client, and an internal structure diagram thereof may be as shown in fig. 8. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the computer device is for communicating with an external server via a network connection. The computer program, when executed by a processor, performs the functions or steps of a client-side of a reinforcement learning based template construction method.
In one embodiment, a computer device is provided, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer program:
acquiring a preset input text and a preset output label, and generating an initial Prompt template by using a preset meta-model, the input text and the output label;
inputting the initial Prompt template into a pre-trained large language model, and acquiring an initial text generated by the large language model based on the initial Prompt template;
determining template evaluation information of the initial Prompt template according to text difference characteristics between the initial text and the real text;
judging whether the initial Prompt template is the preferred Prompt template according to the template evaluation information of the initial Prompt template;
if the initial Prompt template is not the preferred Prompt template, acquiring a plurality of candidate Prompt templates learned by a preset reinforcement learning algorithm based on the initial Prompt template;
acquiring a candidate text generated by the large language model based on each candidate Prompt template, and determining the template evaluation information corresponding to the candidate Prompt template according to text difference characteristics between the candidate text and the real text;
and determining the generalization capability of each candidate Prompt template according to the template evaluation information corresponding to each candidate Prompt template, ranking the generalization capabilities from high to low, and taking the candidate Prompt template ranked first as the preferred Prompt template.
The method and the device have the advantage that, on the one hand, a preset input text and a preset output label are acquired, and an initial Prompt template can be generated by using a preset meta-model, the input text and the output label, which removes the cumbersome manual construction process and improves the efficiency of constructing Prompt templates; on the other hand, the generalization capabilities are ranked from high to low and the candidate Prompt template ranked first is taken as the preferred Prompt template, so that the preferred Prompt template is identified among the constructed Prompt templates; the preferred Prompt template improves generalization capability and thus the quality of the text output by the large language model.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the computer program implements the following steps:
acquiring a preset input text and a preset output label, and generating an initial Prompt template by using a preset meta-model, the input text and the output label;
inputting the initial Prompt template into a pre-trained large language model, and acquiring an initial text generated by the large language model based on the initial Prompt template;
determining template evaluation information of the initial Prompt template according to text difference characteristics between the initial text and the real text;
judging whether the initial Prompt template is the preferred Prompt template according to the template evaluation information of the initial Prompt template;
if the initial Prompt template is not the preferred Prompt template, acquiring a plurality of candidate Prompt templates learned by a preset reinforcement learning algorithm based on the initial Prompt template;
acquiring a candidate text generated by the large language model based on each candidate Prompt template, and determining the template evaluation information corresponding to the candidate Prompt template according to text difference characteristics between the candidate text and the real text;
and determining the generalization capability of each candidate Prompt template according to the template evaluation information corresponding to each candidate Prompt template, ranking the generalization capabilities from high to low, and taking the candidate Prompt template ranked first as the preferred Prompt template.
The method and the device have the advantage that, on the one hand, a preset input text and a preset output label are acquired, and an initial Prompt template can be generated by using a preset meta-model, the input text and the output label, which removes the cumbersome manual construction process and improves the efficiency of constructing Prompt templates; on the other hand, the generalization capabilities are ranked from high to low and the candidate Prompt template ranked first is taken as the preferred Prompt template, so that the preferred Prompt template is identified among the constructed Prompt templates; the preferred Prompt template improves generalization capability and thus the quality of the text output by the large language model.
It should be noted that, the functions or steps implemented by the computer readable storage medium or the computer device may correspond to the relevant descriptions of the server side and the client side in the foregoing method embodiments, and are not described herein for avoiding repetition.
The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a graphics processor (Graphics Processing Unit, GPU for short), a network processor (Network Processor, NP for short), etc.; but also digital signal processors (Digital Signal Processing, DSP for short), application specific integrated circuits (Application Specific Integrated Circuit, ASIC for short), field-programmable gate arrays (Field-Programmable Gate Array, FPGA for short) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
The above description and the drawings sufficiently illustrate embodiments of the disclosure to enable those skilled in the art to practice them. Other embodiments may involve structural, logical, electrical, process and other changes. The embodiments represent only possible variations. Individual components and functions are optional unless explicitly required, and the sequence of operations may vary. Portions and features of some embodiments may be included in or substituted for portions and features of other embodiments. Moreover, the terminology used in the present application is for the purpose of describing embodiments only and is not intended to limit the claims. As used in the description of the embodiments and the claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Similarly, the term "and/or" as used in this disclosure is meant to encompass any and all possible combinations of one or more of the associated listed items. In addition, when used in this disclosure, the terms "comprises," "comprising," and/or variations thereof mean the presence of the stated feature, integer, step, operation, element and/or component, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. Without further limitation, an element defined by the phrase "comprising one …" does not exclude the presence of other like elements in a process, method or apparatus comprising such an element. In this document, each embodiment is described with emphasis on its differences from the other embodiments, and the same or similar parts of the various embodiments may be referred to one another. For the methods, products and the like disclosed in the embodiments, where they correspond to the method sections disclosed in the embodiments, the description of the method sections may be consulted for the relevant details.
Those of skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. The skilled person may use different methods for each particular application to achieve the described functionality, but such implementation should not be considered to be beyond the scope of the embodiments of the present disclosure. It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not described herein again.
In the embodiments disclosed herein, the disclosed methods and products (including but not limited to devices, apparatuses, etc.) may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division of units may be merely a logical functional division, and there may be other divisions in actual implementation; multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed between parts may be an indirect coupling or communication connection through some interface, device or unit, and may be electrical, mechanical or in another form. The units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to implement the present embodiment. In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. In the description corresponding to the flowcharts and block diagrams in the figures, operations or steps corresponding to different blocks may also occur in different orders than that disclosed in the description, and sometimes no specific order exists between different operations or steps. For example, two consecutive operations or steps may actually be performed substantially in parallel, they may sometimes be performed in reverse order, which may be dependent on the functions involved. Each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Claims (10)

1. A template construction method based on reinforcement learning, characterized by comprising the following steps:
acquiring a preset input text and a preset output label, and generating an initial Prompt template by using a preset meta-model, the input text and the output label;
inputting the initial Prompt template into a pre-trained large language model, and acquiring an initial text generated by the large language model based on the initial Prompt template;
determining template evaluation information of the initial Prompt template according to text difference characteristics between the initial text and the real text;
judging whether the initial Prompt template is the preferred Prompt template according to the template evaluation information of the initial Prompt template;
if the initial Prompt template is not the preferred Prompt template, acquiring a plurality of candidate Prompt templates learned by a preset reinforcement learning algorithm based on the initial Prompt template;
acquiring a candidate text generated by the large language model based on each candidate Prompt template, and determining the template evaluation information corresponding to the candidate Prompt template according to text difference characteristics between the candidate text and the real text;
and determining the generalization capability of each candidate Prompt template according to the template evaluation information corresponding to each candidate Prompt template, ranking the generalization capabilities from high to low, and taking the candidate Prompt template ranked first as the preferred Prompt template.
2. The template construction method according to claim 1, wherein the acquiring a preset input text and a preset output label, and generating an initial Prompt template by using a preset meta-model, the input text and the output label, comprises:
acquiring a plurality of tasks or data sets, and training the meta-model by using the plurality of tasks or data sets;
and acquiring a preset input text and a preset output label, and generating an initial Prompt template by using the trained meta-model, the input text and the output label.
3. The template construction method according to claim 1, wherein the judging whether the initial Prompt template is the preferred Prompt template according to the template evaluation information of the initial Prompt template comprises:
acquiring the evaluation index in the template evaluation information of the initial Prompt template, wherein the evaluation index comprises one of, or a combination of, accuracy, recall and F1 value;
and judging whether the initial Prompt template is the preferred Prompt template according to the evaluation index in the template evaluation information of the initial Prompt template.
4. The template construction method according to claim 1, wherein the acquiring a plurality of candidate Prompt templates learned by a preset reinforcement learning algorithm based on the initial Prompt template if the initial Prompt template is not the preferred Prompt template comprises:
if the initial Prompt template is not the preferred Prompt template, inputting the initial Prompt template into a preset reinforcement learning algorithm;
acquiring a first candidate Prompt template learned by the reinforcement learning algorithm based on the initial Prompt template;
and taking the first candidate Prompt template as the next input Prompt template, using the reinforcement learning algorithm to mine the next input Prompt template and generate a second candidate Prompt template, and performing iterative loop processing based on the second candidate Prompt template to obtain a plurality of candidate Prompt templates generated by the reinforcement learning algorithm during the iterative loop processing.
5. The template construction method according to claim 1, wherein the determining the generalization capability of each candidate Prompt template according to the template evaluation information corresponding to each candidate Prompt template, ranking the generalization capabilities from high to low, and taking the candidate Prompt template ranked first as the preferred Prompt template comprises:
determining the generalization capability of each candidate Prompt template according to the evaluation index in its corresponding template evaluation information, wherein the evaluation index comprises one of, or a combination of, accuracy, recall and F1 value;
and ranking the generalization capabilities from high to low, and taking the candidate Prompt template ranked first as the preferred Prompt template.
6. The template construction method according to claim 4, wherein the acquiring the first candidate Prompt template learned by the reinforcement learning algorithm based on the initial Prompt template comprises:
acquiring the current state of the initial Prompt template and a preset policy model in the reinforcement learning algorithm;
generating a learning action according to the current state and the policy model;
and adding a group of input-output examples at the end of the example section of the initial Prompt template according to the learning action to obtain the first candidate Prompt template learned based on the initial Prompt template.
7. The template construction method according to any one of claims 1 to 6, wherein, after the determining the generalization capability of each candidate Prompt template according to the template evaluation information corresponding to each candidate Prompt template, ranking the generalization capabilities from high to low, and taking the candidate Prompt template ranked first as the preferred Prompt template, the method further comprises:
inputting the preferred Prompt template into a pre-trained large language model, and acquiring the preferred text generated by the large language model based on the preferred Prompt template.
8. A reinforcement learning-based template construction apparatus, comprising:
the acquisition module is used for acquiring a preset input text and a preset output label, and generating an initial Prompt template by using a preset meta-model, the input text and the output label;
the input module is used for inputting the initial Prompt template into a pre-trained large language model and acquiring an initial text generated by the large language model based on the initial Prompt template;
the determining module is used for determining template evaluation information of the initial Prompt template according to text difference characteristics between the initial text and the real text;
the judging module is used for judging whether the initial Prompt template is the preferred Prompt template according to the template evaluation information of the initial Prompt template;
the learning module is used for acquiring a plurality of candidate Prompt templates learned by a preset reinforcement learning algorithm based on the initial Prompt template if the initial Prompt template is not the preferred Prompt template;
the evaluation module is used for acquiring a candidate text generated by the large language model based on each candidate Prompt template and determining the template evaluation information corresponding to the candidate Prompt template according to text difference characteristics between the candidate text and the real text;
and the preference module is used for determining the generalization capability of each candidate Prompt template according to the template evaluation information corresponding to each candidate Prompt template, ranking the generalization capabilities from high to low, and taking the candidate Prompt template ranked first as the preferred Prompt template.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the template construction method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the template construction method according to any one of claims 1 to 7.
CN202310831809.8A 2023-07-06 2023-07-06 Template construction method, device, equipment and storage medium based on reinforcement learning Pending CN116861877A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310831809.8A CN116861877A (en) 2023-07-06 2023-07-06 Template construction method, device, equipment and storage medium based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310831809.8A CN116861877A (en) 2023-07-06 2023-07-06 Template construction method, device, equipment and storage medium based on reinforcement learning

Publications (1)

Publication Number Publication Date
CN116861877A 2023-10-10

Family

ID=88222943

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310831809.8A Pending CN116861877A (en) 2023-07-06 2023-07-06 Template construction method, device, equipment and storage medium based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN116861877A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117573846A (en) * 2024-01-16 2024-02-20 宏景科技股份有限公司 Output optimization method of large language model
CN117573846B (en) * 2024-01-16 2024-05-28 宏景科技股份有限公司 Output optimization method of large language model


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination