CN116303974B - Response method and device based on a target generative response language model


Info

Publication number
CN116303974B
Authority
CN
China
Prior art keywords
prompt
response
language model
model
generation type
Prior art date
Legal status
Active
Application number
CN202310486966.XA
Other languages
Chinese (zh)
Other versions
CN116303974A
Inventor
费军波
张丽颖
张云云
张莹
曾令仿
陈�光
程稳
Current Assignee
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date
Filing date
Publication date
Application filed by Zhejiang Lab
Priority to CN202310486966.XA
Publication of CN116303974A
Application granted
Publication of CN116303974B
Legal status: Active (current)
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G06F 40/35 Discourse or dialogue representation
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06N 20/20 Ensemble learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The application relates to a response method and device based on a target generative response language model. The method comprises the following steps: training an initial generative prompt language model and an initial generative response language model based on a prompt data set in an educational device; scoring the predictions of the two models with scoring models; based on a weighted combination of the score values, further training the initial generative prompt language model and the initial generative response language model through reinforcement learning and adversarial learning to obtain a target generative response language model; and inputting text data to be tested, collected by the educational device, into the target generative response language model, which concatenates the text data to be tested with the dialogue data to obtain a corresponding response. With this method, a variety of new prompts can be generated and the interaction between the generative prompt language model and the generative response language model is strengthened, thereby further alleviating the unexpected-behavior problem of generative language models.

Description

Response method and device based on a target generative response language model
Technical Field
The application relates to the technical field of artificial intelligence and deep learning, and in particular to a response method and device based on a target generative response language model.
Background
With the development of technology, intelligent dialogue systems are no longer limited to hand-crafted rules and have become increasingly intelligent. This has improved the quality of intelligent dialogue, making it more human-like and varied and therefore more appealing to users. Intelligent dialogue technology based on generative language models is currently popular, and many educational products use it. A generative language model can perform a series of natural language processing tasks, such as named entity recognition, relationship extraction, and question answering, through prompts entered by the user. However, these models often exhibit unexpected behaviors, such as producing incomplete information, generating biased text, or failing to follow the user's intent, which can adversely affect the mental development of children. The main reason for this problem is that the commonly used pre-training method for generative language models (i.e., training the model to predict the next word of a given text) does not lead the model to follow basic rules.
In order to solve the above problem, the prior art introduces reinforcement learning to guide the generative language model to follow basic rules and avoid unexpected behaviors. First, the generative language model is trained with supervised learning so that it can generate responses from prompts; second, a scoring model is trained on the responses generated by the model together with human experts' feedback on those responses; finally, the scoring model replaces the human experts in scoring the responses generated by the generative language model, and the parameters of the generative language model are updated through reinforcement learning.
However, owing to the diversity of language expressions, the data set used for pre-training cannot cover all possible prompt expressions, so the unexpected-behavior problem of generative language models persists.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a response method and device based on a target generative response language model that can generate a variety of new prompts and further alleviate the unexpected-behavior problem of generative response language models.
In a first aspect, the present application provides a response method based on a target generative response language model. The method comprises the following steps:
training an initial generative prompt language model and an initial generative response language model based on a prompt data set in an educational device, wherein the initial generative prompt language model has the capability of generating a new prompt from a prompt, and the initial generative response language model has the capability of generating a response from a prompt;
training a prompt scoring model based on paired prompt data sampled in the educational device, and training a response scoring model based on paired response data sampled in the educational device;
scoring the predicted prompt output by the initial generative prompt language model with the prompt scoring model to obtain a prompt score value, and scoring the predicted response output by the initial generative response language model with the response scoring model to obtain a response score value;
performing weighted calculations on the prompt score value and the response score value, and further training the initial generative prompt language model and the initial generative response language model through reinforcement learning and adversarial learning based on the weighted calculation results to obtain a target generative response language model; and
inputting text data to be tested, collected by the educational device, into the target generative response language model, which concatenates the text data to be tested with dialogue data to obtain a corresponding response.
In one embodiment, training the initial generative prompt language model and the initial generative response language model based on the prompt data set in the educational device includes:
acquiring the prompt data set in the educational device, and acquiring a new prompt and a response preset according to a prompt sampled from the prompt data set;
taking the prompt as the model input and the preset new prompt as the training target, and obtaining the initial generative prompt language model through supervised learning; and
taking the prompt as the model input and the preset response as the training target, and obtaining the initial generative response language model through supervised learning.
In one embodiment, training the prompt scoring model based on paired prompt data sampled in the educational device includes:
sampling a prompt from the prompt data set in the educational device and inputting it into the initial generative prompt language model to obtain a new prompt generated by the model;
acquiring a first score value preset in the educational device for the new prompt generated by the model; and
training the prompt scoring model based on the prompt, the new prompt generated by the model, and the first score value.
In one embodiment, training the response scoring model based on paired response data sampled in the educational device includes:
sampling a prompt from the prompt data set in the educational device and inputting it into the initial generative response language model to obtain a response generated by the model;
acquiring a second score value preset in the educational device for the response generated by the model; and
training the response scoring model based on the prompt, the response generated by the model, and the second score value.
In one embodiment, scoring the predicted response output by the initial generative response language model with the response scoring model to obtain a response score value includes:
acquiring a new prompt data set;
inputting the new prompt data set into the initial generative response language model to obtain a first predicted response, and scoring the first predicted response with the response scoring model to obtain a first response score value; and
inputting the new prompt data set into the initial generative prompt language model to obtain a newly generated prompt, inputting the newly generated prompt into the initial generative response language model to obtain a second predicted response, and scoring the second predicted response with the response scoring model to obtain a second response score value.
In one embodiment, performing weighted calculations on the prompt score value and the response score value, and further training the initial generative prompt language model and the initial generative response language model through reinforcement learning and adversarial learning based on the weighted calculation results to obtain the target generative response language model, includes:
obtaining a score for the initial generative prompt language model and a score for the initial generative response language model, respectively, based on different weighted calculations on the prompt score value and the response score value;
updating parameters of the initial generative prompt language model based on the score of the initial generative prompt language model to obtain a target generative prompt language model; and
updating parameters of the initial generative response language model based on the score of the initial generative response language model to obtain the target generative response language model.
In a second aspect, the present application also provides a response device based on the target generative response language model. The device comprises:
an initial-model training module, configured to train an initial generative prompt language model and an initial generative response language model based on a prompt data set in an educational device, wherein the initial generative prompt language model has the capability of generating a new prompt from a prompt, and the initial generative response language model has the capability of generating a response from a prompt;
a scoring-model training module, configured to train a prompt scoring model based on paired prompt data sampled in the educational device, and to train a response scoring model based on paired response data sampled in the educational device;
a scoring-model application module, configured to score the predicted prompt output by the initial generative prompt language model with the prompt scoring model to obtain a prompt score value, and to score the predicted response output by the initial generative response language model with the response scoring model to obtain a response score value;
a target-model training module, configured to perform weighted calculations on the prompt score value and the response score value, and to further train the initial generative prompt language model and the initial generative response language model through reinforcement learning and adversarial learning based on the weighted calculation results to obtain a target generative response language model; and
a target-model application module, configured to input text data to be tested, collected by the educational device, into the target generative response language model, which concatenates the text data to be tested with dialogue data to obtain a corresponding response.
In a third aspect, the present application also provides a computer device, including a memory storing a computer program and a processor that, when executing the computer program, implements the steps of the response method based on the target generative response language model described in the first aspect.
In a fourth aspect, the present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the response method based on the target generative response language model described in the first aspect.
In a fifth aspect, the present application further provides a computer program product, including a computer program which, when executed by a processor, implements the steps of the response method based on the target generative response language model described in the first aspect.
According to the response method and device based on the target generative response language model, a generative prompt language model is introduced to generate a variety of new prompts, a prompt scoring model is introduced to evaluate the quality of the generated new prompts, and the interaction between the two generative language models is strengthened through adversarial learning, yielding a target generative response language model that follows basic rules and thereby further alleviating the unexpected-behavior problem of generative language models in educational devices.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a block diagram of the hardware architecture for the response method based on a target generative response language model in one embodiment;
FIG. 2 is a flow diagram of the response method based on a target generative response language model in one embodiment;
FIG. 3 is a schematic diagram of the training process of the initial generative prompt language model and the initial generative response language model;
FIG. 4 is a schematic diagram of the training process of the prompt scoring model and the response scoring model;
FIG. 5 is a schematic diagram of the process of training the target generative prompt language model and the target generative response language model based on reinforcement learning and adversarial learning;
FIG. 6 is a block diagram of the response device based on a target generative response language model in one embodiment.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present application more apparent, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein serve only to illustrate the present application and are not intended to limit it. All other embodiments obtained by those of ordinary skill in the art without inventive effort, based on the embodiments provided herein, fall within the scope of protection of the present application.
It is apparent that the drawings in the following description are only some examples or embodiments of the present application, and those of ordinary skill in the art may apply the present application to other similar situations according to these drawings without inventive effort. Moreover, it should be appreciated that although such a development effort might be complex and time-consuming, it would nevertheless be a routine undertaking of design, fabrication, or manufacture for those of ordinary skill having the benefit of this disclosure, and should not be construed as going beyond the scope of this disclosure.
Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is to be expressly and implicitly understood by those of ordinary skill in the art that the embodiments described herein can be combined with other embodiments without conflict.
Unless defined otherwise, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar terms herein do not denote a limitation of quantity, but rather denote the singular or plural. The terms "comprising," "including," "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to only those steps or elements but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. The terms "connected," "coupled," and the like in this application are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as used herein refers to two or more. "and/or" describes an association relationship of an association object, meaning that there may be three relationships, e.g., "a and/or B" may mean: a exists alone, A and B exist together, and B exists alone. The character "/" generally indicates that the context-dependent object is an "or" relationship. The terms "first," "second," "third," and the like, as used herein, are merely distinguishing between similar objects and not representing a particular ordering of objects.
The method embodiments provided in the present embodiment may be executed in a terminal, a computer, or similar computing device. For example, the method runs on a terminal, and fig. 1 is a block diagram of a hardware structure of a response method based on a target-generated response language model in this embodiment. As shown in fig. 1, the terminal may include one or more (only one is shown in fig. 1) processors 102 and a memory 104 for storing data, wherein the processors 102 may include, but are not limited to, a microprocessor MCU, a programmable logic device FPGA, or the like. The terminal may also include a transmission device 106 for communication functions and an input-output device 108. It will be appreciated by those skilled in the art that the structure shown in fig. 1 is merely illustrative and is not intended to limit the structure of the terminal. For example, the terminal may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.
The memory 104 may be used to store computer programs, such as software programs and modules of application software, such as computer programs corresponding to the response method based on the target-generated response language model in the present embodiment, and the processor 102 executes the computer programs stored in the memory 104 to perform various functional applications and data processing, that is, to implement the above-described method. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102, which may be connected to the terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. The network includes a wireless network provided by a communication provider of the terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.
In this embodiment, a response method based on a target generative response language model is provided. FIG. 2 is a schematic flow chart of the response method based on the target generative response language model of this embodiment. As shown in FIG. 2, the flow includes the following steps:
Step S210, training an initial generative prompt language model and an initial generative response language model based on a prompt data set in the educational device; the initial generative prompt language model has the capability of generating new prompts from prompts, and the initial generative response language model has the capability of generating responses from prompts.
Here, the prompt data set refers to a collection of pre-collected prompts, and a prompt refers to a piece of text input to the model, e.g. "The weather is really nice today!". A new prompt is another expression that is semantically consistent with the prompt; e.g., for the prompt "The weather is really nice today!" the model generates "Not a cloud in the sky today, and the sun is shining bright!". A response is a piece of text that is semantically consistent with the prompt; e.g., for the prompt "The weather is really nice today!" the model generates "Yes, it's a great day to go out for a walk.". To better alleviate the unexpected-behavior problem of language models, the prompt data set in the educational device includes more exploratory content, such as questions frequently asked by children.
Specifically, supervised learning may be used to train the initial generative prompt language model and the initial generative response language model. A prompt in the prompt data set is taken as the model input and a new prompt preset for that prompt is taken as the model's output target, and the initial generative prompt language model is obtained by training. Likewise, a prompt in the prompt data set is taken as the model input and a response preset for that prompt is taken as the model's output target, and the initial generative response language model is obtained by training.
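To make this concrete, the sketch below shows one way the supervised step could look, assuming a Hugging Face causal language model (the small gpt2 checkpoint) stands in for both generative models; the patent does not specify an architecture, so the model name, example pair, and training loop are illustrative assumptions rather than the patent's implementation.

```python
# Illustrative sketch only: supervised training of a generative model that maps a
# prompt to a preset target (a new prompt or a response), as described in step S210.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # assumed stand-in for the generative models
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Hypothetical training pairs: (prompt, preset target), where the target is either a
# preset new prompt (for the prompt model) or a preset response (for the response model).
pairs = [
    ("The weather is really nice today!",
     "Not a cloud in the sky today, and the sun is shining bright!"),
]

model.train()
for prompt, target in pairs:
    # Concatenate prompt and target so the causal LM learns to continue the prompt
    # with the preset target text. A fuller implementation would mask the prompt
    # tokens out of the loss so that only the target tokens are supervised.
    text = prompt + tokenizer.eos_token + target + tokenizer.eos_token
    enc = tokenizer(text, return_tensors="pt")
    out = model(**enc, labels=enc["input_ids"])     # standard next-token cross-entropy
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```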
The new prompts and responses preset for the prompts may be written by data annotators or generated with an existing language model.
The trained initial generative prompt language model and initial generative response language model have a certain capability of generating new prompts or responses from prompts. In order to keep improving this capability with reinforcement learning and adversarial learning, scoring models must first be trained; a scoring model assigns a score based on the model's input and output.
Step S220, training a prompt scoring model based on paired prompt data sampled in the educational device, and training a response scoring model based on paired response data sampled in the educational device.
Specifically, a prompt is sampled from the prompt data set and input into the initial generative prompt language model, which generates several new prompts for it; a new prompt is sampled from the model output and combined with the corresponding prompt into a pair of prompt data, so that multiple pairs of prompt data can be obtained. A pre-stored first score value is acquired for the paired prompt data, and the prompt scoring model is trained on the first score values and the corresponding paired prompt data.
The first score value may be obtained through manual or automatic evaluation. In manual evaluation, data annotators score the relative quality of the paired prompt data; in automatic evaluation, the paired prompt data are scored by means of statistical evaluation metrics or an evaluation model.
Likewise, a prompt is sampled from the prompt data set and input into the initial generative response language model, which generates several responses for it; a response is sampled from the model output and combined with the corresponding prompt into a pair of response data, so that multiple pairs of response data can be obtained. A pre-stored second score value is acquired for the paired response data, and the response scoring model is trained on the second score values and the corresponding paired response data. Similarly, the second score value may be obtained through manual or automatic evaluation.
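One plausible realization of such a scoring model, sketched under the assumption that it is a regressor over a (prompt, generated text) pair encoded by a pretrained encoder, is shown below; the encoder checkpoint, loss, and example score are assumptions, not details taken from the patent.

```python
# Illustrative sketch only: a scoring model that takes (prompt, generated text)
# pairs and regresses onto the preset score values described in step S220.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class PairScorer(nn.Module):
    def __init__(self, name: str = "bert-base-chinese"):   # assumed encoder checkpoint
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        out = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask,
                           token_type_ids=token_type_ids)
        cls = out.last_hidden_state[:, 0]                   # [CLS] representation of the pair
        return self.head(cls).squeeze(-1)                   # scalar score

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
scorer = PairScorer()
optimizer = torch.optim.AdamW(scorer.parameters(), lr=2e-5)
loss_fn = nn.MSELoss()

# Hypothetical paired data: (prompt, model-generated new prompt or response, preset score).
batch = [("The weather is really nice today!",
          "Not a cloud in the sky today, and the sun is shining bright!", 0.9)]

for prompt, generated, score in batch:
    enc = tokenizer(prompt, generated, return_tensors="pt")  # encode the pair jointly
    pred = scorer(**enc)
    loss = loss_fn(pred, torch.tensor([score]))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```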
Step S230, scoring the predicted prompt output by the initial generative prompt language model with the prompt scoring model to obtain a prompt score value, and scoring the predicted response output by the initial generative response language model with the response scoring model to obtain a response score value.
Specifically, a new prompt data set in the educational device is acquired and used as input to the initial generative prompt language model to obtain newly generated prompts, which are scored by the prompt scoring model to obtain prompt score values; the new prompt data set and the newly generated prompts are then used as inputs to the initial generative response language model to obtain newly generated responses, which are scored by the response scoring model to obtain response score values.
Step S240, performing weighted calculations on the prompt score value and the response score value, and further training the initial generative prompt language model and the initial generative response language model through reinforcement learning and adversarial learning based on the weighted calculation results to obtain the target generative response language model.
Specifically, a score for the initial generative prompt language model and a score for the initial generative response language model are obtained, respectively, from different weighted calculations on the prompt score value and the response score value; the parameters of the initial generative prompt language model are updated based on its score to obtain the target generative prompt language model, and the parameters of the initial generative response language model are updated based on its score to obtain the target generative response language model.
Reinforcement learning and adversarial learning are both methods of training a model. In reinforcement learning, the model is given an input and its output is scored, with the aim of training the model to produce higher-scoring outputs for a given input. Adversarial learning involves two models and aims to improve both models simultaneously by pitting them against each other.
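The patent names reinforcement learning but not a specific algorithm; the following deliberately simplified, REINFORCE-style sketch shows how a scalar score from a scoring model could serve as the reward for updating a generative model, reusing the hypothetical gpt2 stand-in from the earlier sketch. A production system would more likely use PPO with a KL penalty against the supervised model.

```python
# Illustrative sketch only: a single REINFORCE-style update in which the scoring
# model's scalar score acts as the reward for a sampled generation (step S240).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # assumed stand-in model
policy = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-6)

def reinforce_step(prompt: str, reward: float) -> None:
    """Increase the log-likelihood of the sampled continuation in proportion to its reward."""
    enc = tokenizer(prompt, return_tensors="pt")
    prompt_len = enc["input_ids"].shape[1]
    with torch.no_grad():
        sampled = policy.generate(**enc, do_sample=True, max_new_tokens=30,
                                  pad_token_id=tokenizer.eos_token_id)
    out = policy(sampled, labels=sampled)                     # re-score the sampled sequence with gradients
    logits = out.logits[:, :-1]
    targets = sampled[:, 1:]
    logp = torch.log_softmax(logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    logp_generated = logp[:, prompt_len - 1:].sum()           # log-prob of the generated tokens only
    loss = -reward * logp_generated                           # policy-gradient surrogate loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# The reward would come from the (weighted) scoring-model outputs described below.
reinforce_step("The weather is really nice today!", reward=0.7)
```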
Step S250, inputting text data to be tested, collected by the educational device, into the target generative response language model, which concatenates the text data to be tested with the dialogue data to obtain a corresponding response.
The dialogue data consists of a start token and the input/output data of the target generative response language model, mainly the data accumulated in the multi-turn dialogue before a given prediction; its purpose is to ensure that the target generative response language model maintains continuity across a multi-turn dialogue. For example, if the text data to be tested collected by the educational device is "Hello", it is concatenated with the start token "start" to form "start" + "Hello", which is fed into the forward network of the target generative response language model to obtain the model's prediction "Hello". If the device then collects "How is the weather today?", the model input becomes "start" + "Hello" + "Hello" + "How is the weather today?", the model returns a reply about today's weather, and so on for every further input the educational device collects.
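A minimal sketch of this splicing logic is given below; the start token, the "+" separator, and the generate_reply wrapper are hypothetical placeholders for whatever interface the trained target generative response language model actually exposes.

```python
# Illustrative sketch only: maintaining dialogue data and concatenating it with each
# newly collected utterance before querying the target generative response language
# model (step S250). `generate_reply` is a hypothetical wrapper around the model.
from typing import Callable, List

START_TOKEN = "start"                       # assumed start token

class DialogueSession:
    def __init__(self, generate_reply: Callable[[str], str]):
        self.generate_reply = generate_reply
        self.history: List[str] = [START_TOKEN]

    def respond(self, collected_text: str) -> str:
        # Concatenate the dialogue data with the text to be tested.
        self.history.append(collected_text)
        model_input = " + ".join(self.history)
        reply = self.generate_reply(model_input)
        # Store the model output so later turns keep the dialogue's continuity.
        self.history.append(reply)
        return reply

# Toy stand-in for the trained model, used only to show the data flow.
def fake_model(model_input: str) -> str:
    return "It is sunny today." if "weather" in model_input else "Hello!"

session = DialogueSession(fake_model)
print(session.respond("Hello"))                      # -> "Hello!"
print(session.respond("How is the weather today?"))  # -> "It is sunny today."
```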
In a more complex scenario, suppose user 1 interacts with the educational device through a voice dialogue. After user 1 says "Hello", the educational device collects the input "Hello" and returns "Hello" using the target generative response language model. User 2 then asks user 1, "How is your homework coming along?", and user 1 replies, "I haven't started writing it yet." Because user 1 and user 2 are talking by voice, this exchange is also picked up by the educational device and fed into the target generative response language model, concretely as "start" + "Hello" + "Hello" + "How is your homework coming along?" + "I haven't started writing it yet." The target generative response language model analyzes the new utterances together with the preceding dialogue data, recognizes that this exchange is between the users rather than addressed to the device, and does not respond.
In the above method, the initial generative prompt language model and the initial generative response language model are first trained on the prompt data set in the educational device; a prompt scoring model and a response scoring model are then trained to score the two generative language models; and the two models are further trained on those scores using reinforcement learning and adversarial learning, which strengthens the interaction between them and yields a target generative response language model that follows basic rules, thereby further alleviating the unexpected-behavior problem of generative language models in educational devices.
In one embodiment, based on step S210 above, training the initial generative prompt language model and the initial generative response language model based on the prompt data set in the educational device may specifically include the following steps:
Step S211, acquiring the prompt data set in the educational device, and acquiring a new prompt and a response preset according to a prompt sampled from the prompt data set.
The preset new prompt is another expression, written to be consistent with the prompt's semantics, and may be written by a data annotator or generated with an existing language model; the preset response is a piece of text, written to be consistent with the prompt's semantics, and may likewise be written by a data annotator or generated with an existing language model.
Step S212, taking the prompt as the model input and the preset new prompt as the training target, and obtaining the initial generative prompt language model through supervised learning.
Step S213, taking the prompt as the model input and the preset response as the training target, and obtaining the initial generative response language model through supervised learning.
Through steps S211 to S213, with samples from the prompt data set as inputs and the preset new prompts and responses as training targets, supervised learning is used to train a generative prompt language model capable of generating new prompts from prompts and a generative response language model capable of generating responses from prompts, providing a pre-training basis for more effective generative language models.
In one embodiment, based on step S220 above, training the prompt scoring model based on paired prompt data sampled in the educational device may specifically include the following steps:
Step S221, sampling a prompt from the prompt data set in the educational device and inputting it into the initial generative prompt language model to obtain a new prompt generated by the model.
Step S222, acquiring a first score value preset in the educational device for the new prompt generated by the model.
Specifically, after the new prompt generated by the model is obtained, its quality can be scored; the resulting first score value is stored in advance in the computer device and read when the prompt scoring model is to be trained. Illustratively, the evaluation process scores the new prompts generated by the model by means of statistical evaluation metrics or an evaluation model. Alternatively, the generated new prompt can be sent to data annotators, who score its quality.
Step S223, training the prompt scoring model based on the prompt, the new prompt generated by the model, and the first score value.
In steps S221 to S223, a prompt scoring model with more accurate predictions is trained with reference to the first score values corresponding to the new prompts, which improves the pre-training effect of the generative language models.
In one embodiment, based on step S220 above, training the response scoring model based on paired response data sampled in the educational device may specifically include the following steps:
Step S224, sampling a prompt from the prompt data set in the educational device and inputting it into the initial generative response language model to obtain a response generated by the model.
Step S225, acquiring a second score value preset in the educational device for the response generated by the model.
Specifically, after the response generated by the model is obtained, its quality can be scored; the resulting second score value is stored in advance in the computer device and read when the response scoring model is to be trained. Illustratively, the evaluation process scores the responses generated by the model by means of statistical evaluation metrics or an evaluation model. Alternatively, the generated response can be sent to data annotators, who score its quality.
Step S226, training the response scoring model based on the prompt, the response generated by the model, and the second score value.
In steps S224 to S226, a response scoring model with more accurate predictions is trained with reference to the second score values corresponding to the responses, which improves the pre-training effect of the generative language models.
In one embodiment, based on step S230 above, the method may further include the following steps:
Step S231, acquiring a new prompt data set;
Step S232, inputting the new prompt data set into the initial generative response language model to obtain a first predicted response, and scoring the first predicted response with the response scoring model to obtain a first response score value;
Step S233, inputting the new prompt data set into the initial generative prompt language model to obtain a newly generated prompt, inputting the newly generated prompt into the initial generative response language model to obtain a second predicted response, and scoring the second predicted response with the response scoring model to obtain a second response score value.
To strengthen the effect of the adversarial learning, the newly generated prompt in step S233 is preferably scored by the prompt scoring model, and only newly generated prompts with high score values are used as input to the initial generative response language model in step S233. Keeping only the new prompts with high prompt score values better guides the output quality of the two generative language models, so that they mutually improve each other's text generation.
In steps S231 to S233, the trained initial generative prompt language model is used to obtain newly generated prompts, providing richer training data for the initial generative response language model; and because different response score values are obtained from the different input data sets, the weighting parameters for each score value can be set flexibly.
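The sketch below ties together the two scoring paths of steps S231 to S233 with the threshold filter described above; prompt_model, response_model, the two scorers, and the threshold value are hypothetical stand-ins for the trained models and are not specified by the patent.

```python
# Illustrative sketch only: computing the first and second response score values
# (steps S231-S233), keeping only newly generated prompts whose prompt score
# exceeds a threshold before they are fed to the response model.
from typing import Callable, List, Tuple

def score_both_paths(
    new_prompts: List[str],
    prompt_model: Callable[[str], str],        # generates a new prompt from a prompt
    response_model: Callable[[str], str],      # generates a response from a prompt
    prompt_scorer: Callable[[str, str], float],
    response_scorer: Callable[[str, str], float],
    threshold: float = 0.5,                    # assumed prompt-score threshold
) -> List[Tuple[float, float, float]]:
    results = []
    for prompt in new_prompts:
        # Path 1: score the response generated directly from the sampled prompt.
        first_response = response_model(prompt)
        r_r1 = response_scorer(prompt, first_response)

        # Path 2: generate a new prompt, keep it only if its prompt score is high,
        # then score the response generated from it.
        generated_prompt = prompt_model(prompt)
        r_p = prompt_scorer(prompt, generated_prompt)
        if r_p > threshold:
            second_response = response_model(generated_prompt)
            r_r2 = response_scorer(generated_prompt, second_response)
        else:
            r_r2 = 0.0                          # assumed convention for discarded low-score prompts
        results.append((r_p, r_r1, r_r2))
    return results
```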
In one embodiment, based on step S240 above, performing weighted calculations on the prompt score value and the response score value, and further training the initial generative prompt language model and the initial generative response language model through reinforcement learning and adversarial learning based on the weighted calculation results to obtain the target generative response language model, may specifically include the following steps:
Step S241, obtaining a score for the initial generative prompt language model and a score for the initial generative response language model, respectively, based on different weighted calculations on the prompt score value and the response score value.
Here the response score values can be obtained through steps S231 to S233 of the above embodiment: the first response score value is denoted r_r1 and the second response score value r_r2; the prompt score value is denoted r_p. Illustratively, r_p + α(r_r1 - r_r2) is taken as the score of the initial generative prompt language model, and r_r1 + β·r_r2 as the score of the initial generative response language model. The adversarial component can be introduced selectively by adjusting the parameters α and β according to the models' generation quality during training.
Specifically, when α = 0 and β = 0, the initial generative prompt language model and the initial generative response language model are each trained purely with reinforcement learning. When either α or β is non-zero, the two models are trained with reinforcement learning and adversarial learning simultaneously. In practice, a certain amount of separate training may be performed first, with the adversarial training gradually increased afterwards.
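These weighted scores are simple to compute; the short sketch below restates them in code together with an assumed warm-up schedule for α and β (the patent does not prescribe a particular schedule), using made-up score values purely for illustration.

```python
# Illustrative sketch only: the weighted scores used as rewards in step S241.
def prompt_model_score(r_p: float, r_r1: float, r_r2: float, alpha: float) -> float:
    # Reward the prompt model for prompts that score well in themselves (r_p) and
    # that are harder for the response model (large r_r1 - r_r2).
    return r_p + alpha * (r_r1 - r_r2)

def response_model_score(r_r1: float, r_r2: float, beta: float) -> float:
    # Reward the response model for answering both the original and the generated prompt well.
    return r_r1 + beta * r_r2

def schedule(step: int, warmup_steps: int = 1000, max_value: float = 1.0) -> float:
    # Assumed schedule: pure reinforcement learning (alpha = beta = 0) during warm-up,
    # then a linear ramp that gradually introduces the adversarial term.
    return 0.0 if step < warmup_steps else min(max_value, (step - warmup_steps) / warmup_steps)

# Example with assumed numbers: r_p = 0.8, r_r1 = 0.6, r_r2 = 0.4, alpha = beta = 0.5.
print(prompt_model_score(0.8, 0.6, 0.4, 0.5))   # 0.9
print(response_model_score(0.6, 0.4, 0.5))      # 0.8
```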
Step S242, updating the parameters of the initial generative prompt language model based on its score to obtain the target generative prompt language model.
The target generative prompt language model is used to explore more expressions so as to provide diverse inputs to the generative response language model.
Step S243, updating the parameters of the initial generative response language model based on its score to obtain the target generative response language model.
In steps S241 to S243, separate scores are set for the initial generative prompt language model and the initial generative response language model. The degree of reinforcement learning and adversarial learning can be adjusted through the weight values, and the model parameters can be updated in time according to the models' generation quality to complete the further training. As a result, the initial generative prompt language model explores new prompts that are harder for the initial generative response language model, the initial generative response language model takes more richly expressed prompts as input, and the two models mutually improve each other's text generation, yielding a target generative prompt language model and a target generative response language model that output more standard text.
The present embodiment is described and illustrated below by way of preferred embodiments.
FIG. 3 is a schematic diagram of the training process of the initial generative prompt language model and the initial generative response language model.
Step 1.1, sampling a prompt from the prompt data set in the educational device: "The weather is really nice today!";
Step 1.2, acquiring the new prompt preset for that prompt: "Not a cloud in the sky today, and the sun is shining bright!";
Step 1.3, taking the prompt sampled in step 1.1 as the model input and the new prompt preset in step 1.2 as the model's training target, and obtaining the initial generative prompt language model through supervised learning.
Step 2.1, sampling a prompt from the prompt data set in the educational device: "The weather is really nice today!";
Step 2.2, acquiring the response preset for that prompt: "Yes, it's a great day to go out for a walk.";
Step 2.3, taking the prompt sampled in step 2.1 as the model input and the response preset in step 2.2 as the model's training target, and obtaining the initial generative response language model through supervised learning.
The preset new prompts and responses may be written by data annotators or generated with an existing language model.
FIG. 4 is a schematic diagram of the training process of the prompt scoring model and the response scoring model.
Step 3.1, sampling a prompt from the prompt data set in the educational device: "The weather is really nice today!";
Step 3.2, inputting the prompt sampled in step 3.1 into the initial generative prompt language model to obtain model outputs such as "Feeling great today!" and "It is a crisp autumn day with a cool breeze!";
Step 3.3, sampling several generated new prompts from the model output of step 3.2, e.g. "Feeling great today!" and "It is a crisp autumn day with a cool breeze!";
Step 3.4, acquiring the preset first score values corresponding to the new prompts in step 3.3. The first score values may be obtained through manual or automatic evaluation; "Feeling great today!" and "It is a crisp autumn day with a cool breeze!" each have a first score value, and because "It is a crisp autumn day with a cool breeze!" fits the semantics of the prompt in step 3.1 better than "Feeling great today!", its first score value is higher;
Step 3.5, training the prompt scoring model from the prompt sampled in step 3.1, the new prompts sampled in step 3.3, and the first score values obtained in step 3.4.
Step 4.1, sampling a prompt from the prompt data set in the educational device: "The weather is really nice today!";
Step 4.2, inputting the prompt sampled in step 4.1 into the initial generative response language model to obtain model outputs such as "I really want to go out for a walk!" and "My football is lost!";
Step 4.3, sampling several generated responses from the model output of step 4.2, e.g. "I really want to go out for a walk!" and "My football is lost!";
Step 4.4, acquiring the preset second score values corresponding to the responses in step 4.3. The second score values may be obtained through manual or automatic evaluation; "I really want to go out for a walk!" and "My football is lost!" each have a second score value, and because "I really want to go out for a walk!" fits the semantics of the prompt in step 4.1 better than "My football is lost!", its second score value is higher;
Step 4.5, training the response scoring model from the prompt sampled in step 4.1, the responses sampled in step 4.3, and the second score values obtained in step 4.4.
FIG. 5 is a schematic diagram of the process of training the target generative prompt language model and the target generative response language model based on reinforcement learning and adversarial learning.
Step 5.1, sampling a prompt from the new prompt data set: "The weather is really nice today!";
Step 5.2, inputting the prompt sampled in step 5.1, "The weather is really nice today!", into the initial generative prompt language model to obtain the new prompt generated by the model: "How pleasant it is today!";
Step 5.3, inputting the new prompt generated in step 5.2, "How pleasant it is today!", into the prompt scoring model to obtain the prompt score value r_p of the new prompt;
Step 5.4, inputting the prompt from step 5.1, "The weather is really nice today!", into the initial generative response language model to obtain the response generated by the model: "I really want to go out for a walk!";
Step 5.5, inputting the response generated in step 5.4, "I really want to go out for a walk!", into the response scoring model to obtain the first response score value r_r1 of the response;
Step 5.6, if the prompt score value r_p from step 5.3 is above the threshold, inputting the new prompt generated in step 5.2, "How pleasant it is today!", into the initial generative response language model to obtain a response generated by the model;
Step 5.7, inputting the response generated in step 5.6 into the response scoring model to obtain the second response score value r_r2 of that response;
Step 5.8, taking r_p + α(r_r1 - r_r2) as the score of the initial generative prompt language model and r_r1 + β·r_r2 as the score of the initial generative response language model, and updating the parameters of the corresponding models to obtain the target generative prompt language model and the target generative response language model.
Step 6, inputting the text data to be tested, collected by the educational device, into the target generative response language model, which concatenates the text data to be tested with the dialogue data to obtain a corresponding response.
Compared with the prior art, the prompt expressions used to train the model are more varied: the initial generative prompt language model can generate new prompts with different forms of expression from the existing prompts, and these new prompts serve as input to the initial generative response language model, enriching the training data set. The model is also more robust: by introducing adversarial learning, the generative prompt language model is encouraged to explore prompts that are harder for the generative response language model, which in turn pushes the generative response language model to handle prompts in more varied forms, so that the initial generative prompt language model and the initial generative response language model mutually improve each other's text generation.
It should be understood that, although the steps in the flowcharts related to the above embodiments are sequentially shown as indicated by arrows, these steps are not necessarily sequentially performed in the order indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of steps or a plurality of stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of the steps or stages is not necessarily performed sequentially, but may be performed alternately or alternately with at least some of the other steps or stages.
Based on the same inventive concept, this embodiment further provides a response device based on the target generative response language model, which is used to implement the foregoing embodiments and preferred implementations; what has already been described is not repeated here. The terms "module", "unit", "sub-unit", and the like as used below may refer to a combination of software and/or hardware that performs a predetermined function. Although the device described in the following embodiments is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
In one embodiment, as shown in FIG. 6, a response device based on a target generative response language model is provided, comprising an initial-model training module 61, a scoring-model training module 62, a scoring-model application module 63, a target-model training module 64, and a target-model application module 65, wherein:
the initial-model training module 61 is configured to train an initial generative prompt language model and an initial generative response language model based on the prompt data set in the educational device, wherein the initial generative prompt language model has the capability of generating new prompts from prompts, and the initial generative response language model has the capability of generating responses from prompts;
the scoring-model training module 62 is configured to train a prompt scoring model based on paired prompt data sampled in the educational device, and to train a response scoring model based on paired response data sampled in the educational device;
the scoring-model application module 63 is configured to score the predicted prompt output by the initial generative prompt language model with the prompt scoring model to obtain a prompt score value, and to score the predicted response output by the initial generative response language model with the response scoring model to obtain a response score value;
the target-model training module 64 is configured to perform weighted calculations on the prompt score value and the response score value, and to further train the initial generative prompt language model and the initial generative response language model through reinforcement learning and adversarial learning based on the weighted calculation results to obtain the target generative response language model; and
the target-model application module 65 is configured to input the text data to be tested, collected by the educational device, into the target generative response language model, which concatenates the text data to be tested with the dialogue data to obtain a corresponding response.
In one embodiment, training the initial model module 61 further comprises obtaining a reminder dataset in the educational apparatus and obtaining new reminders and responses preset according to the reminders based on the reminders sampled from the reminder dataset; inputting prompts serving as models, taking preset new prompts as training targets, and obtaining an initial generation type prompt language model by using supervised learning training; and taking the prompt as a model input, taking the preset response as a training target, and obtaining an initial generation type response language model by using supervised learning training.
In one embodiment, training scoring model module 62 further includes sampling a reminder in the reminder dataset in the educational apparatus, inputting the reminder into the initially generated reminder language model, resulting in a new reminder for the model generation; acquiring a first grading value preset by a new prompt generated according to a model in educational equipment; training to obtain a prompt scoring model based on the prompt, the new prompt generated by the model and the first scoring value.
In one embodiment, training scoring model module 62 further includes sampling a reminder in the reminder dataset in the educational apparatus, inputting the reminder into the initially generated response language model, resulting in a model generated response; acquiring a second grading value preset according to a response generated by the model in the education equipment; and training to obtain a response scoring model based on the prompt, the response generated by the model and the second scoring value.
In one embodiment, the application scoring model module 63 further includes obtaining a new hint data set; inputting the new prompt data set into an initial generation type response language model to obtain a first prediction response, and scoring the first prediction response by using a response scoring model to obtain a first response scoring value; inputting the new prompt data set into the initial generation type prompt language model to obtain a new generated prompt, inputting the new generated prompt into the initial generation type response language model to obtain a second prediction response, and scoring the second prediction response by using the response scoring model to obtain a second response scoring value.
In one embodiment, the training target model module 64 is further configured to: obtain a score of the initial generation type prompt language model and a score of the initial generation type response language model, respectively, based on different weighted calculations of the prompt scoring value and the response scoring value; update parameters of the initial generation type prompt language model based on the score of the initial generation type prompt language model to obtain a target generation type prompt language model; and update parameters of the initial generation type response language model based on the score of the initial generation type response language model to obtain the target generation type response language model.
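A minimal sketch of the weighted update follows, assuming a REINFORCE-style surrogate loss; the patent specifies reinforcement learning combined with adversarial learning but not a concrete algorithm or weighting, so the weights, scores, and update rule shown are illustrative assumptions and the adversarial component is omitted.

```python
import torch

def weighted_rewards(prompt_score, response_score, w_prompt=(0.7, 0.3), w_response=(0.3, 0.7)):
    """Two different weightings yield one score for the prompt LM and one for the response LM."""
    prompt_lm_score = w_prompt[0] * prompt_score + w_prompt[1] * response_score
    response_lm_score = w_response[0] * prompt_score + w_response[1] * response_score
    return prompt_lm_score, response_lm_score

# Log-probabilities of the generated new prompt / predicted response under each model;
# in a real run these would come from the generative models' decoders.
logp_new_prompt = torch.tensor(-2.3, requires_grad=True)
logp_response = torch.tensor(-1.7, requires_grad=True)

prompt_score, response_score = 0.6, 0.8          # outputs of the two scoring models
r_prompt, r_response = weighted_rewards(prompt_score, response_score)

# Policy-gradient surrogate: raise the likelihood of outputs that receive a high weighted score.
loss_prompt_lm = -r_prompt * logp_new_prompt
loss_response_lm = -r_response * logp_response
(loss_prompt_lm + loss_response_lm).backward()   # these gradients would drive the parameter updates
print(logp_new_prompt.grad, logp_response.grad)
```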
The respective modules in the above response device based on the target generation type response language model may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided, including a memory and a processor, the memory storing a computer program; when the processor executes the computer program, the steps of the response method based on a target generation type response language model in any of the above embodiments are implemented.
In one embodiment, a computer readable storage medium is provided, on which a computer program is stored; when the computer program is executed by a processor, the steps of the response method based on a target generation type response language model in any of the above embodiments are implemented.
In one embodiment, a computer program product is provided, including a computer program; when the computer program is executed by a processor, the steps of the response method based on a target generation type response language model in any of the above embodiments are implemented.
It should be noted that the user information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or fully authorized by all parties.
Those skilled in the art will appreciate that implementing all or part of the above methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium; when executed, the computer program may include the flows of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory may include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM may take various forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of relational databases and non-relational databases. Non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be, but are not limited to, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, and quantum-computing-based data processing logic devices.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered to be within the scope of this specification.
The above embodiments represent only a few implementations of the present application, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of the present application. It should be noted that those skilled in the art may make various modifications and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (9)

1. A response method based on a target generation type response language model, the method comprising:
training to obtain an initial generation type prompt language model and an initial generation type response language model based on a prompt data set in the educational equipment; the initial generation type prompt language model has the capability of generating a new prompt according to the prompt, and the initial generation type response language model has the capability of generating a response according to the prompt;
training to obtain a prompt scoring model based on paired prompt data sampled in the educational equipment; training to obtain a response scoring model based on paired response data sampled in the educational equipment;
scoring the predicted prompt output by the initial generation type prompt language model by using the prompt scoring model to obtain a prompt scoring value; scoring the predicted response output by the initial generation type response language model by using the response scoring model to obtain a response scoring value;
performing weighted calculation on the prompt scoring value and the response scoring value, and further training the initial generation type prompt language model and the initial generation type response language model through reinforcement learning and adversarial learning based on the weighted calculation result to obtain a target generation type response language model;
inputting the text data to be tested collected by the educational equipment into the target generation type response language model, wherein the target generation type response language model splices the text data to be tested with the dialogue data to obtain a corresponding response.
2. The response method based on a target generation type response language model according to claim 1, wherein the training to obtain an initial generation type prompt language model and an initial generation type response language model based on the prompt data set in the educational equipment comprises:
acquiring the prompt data set in the educational equipment, and acquiring, based on a prompt sampled from the prompt data set, a new prompt and a response preset according to the prompt;
taking the prompt as a model input and the preset new prompt as a training target, and obtaining an initial generation type prompt language model by using supervised learning training;
and taking the prompt as a model input, taking the preset response as a training target, and obtaining an initial generation type response language model by using supervised learning training.
3. The response method based on a target generation type response language model according to claim 1, wherein the training to obtain a prompt scoring model based on paired prompt data sampled in the educational equipment comprises:
sampling a prompt from the prompt data set in the educational equipment, and inputting the prompt into the initial generation type prompt language model to obtain a new prompt generated by the model;
acquiring a first scoring value preset in the educational equipment according to the new prompt generated by the model;
training to obtain a prompt scoring model based on the prompt, the new prompt generated by the model and the first scoring value.
4. The response method based on a target generation type response language model according to claim 1, wherein the training to obtain a response scoring model based on paired response data sampled in the educational equipment comprises:
sampling a prompt from the prompt data set in the educational equipment, and inputting the prompt into the initial generation type response language model to obtain a response generated by the model;
acquiring a second scoring value preset in the educational equipment according to the response generated by the model;
and training to obtain a response scoring model based on the prompt, the response generated by the model and the second scoring value.
5. The method of claim 1, wherein the scoring the predicted response output by the initial generation type response language model using the response scoring model to obtain a response scoring value comprises:
acquiring a new prompt data set;
inputting the new prompt data set into the initial generation type response language model to obtain a first prediction response, and scoring the first prediction response by using the response scoring model to obtain a first response scoring value;
and inputting the new prompt data set into the initial generation type prompt language model to obtain a newly generated prompt, inputting the newly generated prompt into the initial generation type response language model to obtain a second prediction response, and scoring the second prediction response by using the response scoring model to obtain a second response scoring value.
6. The method of claim 1, wherein the performing weighted calculation on the prompt scoring value and the response scoring value, and further training the initial generation type prompt language model and the initial generation type response language model through reinforcement learning and adversarial learning based on the weighted calculation result to obtain the target generation type response language model comprises:
based on different weighted calculations of the prompt scoring value and the response scoring value, respectively obtaining a score of the initial generation type prompt language model and a score of the initial generation type response language model;
updating parameters of the initial generation type prompt language model based on the score of the initial generation type prompt language model to obtain a target generation type prompt language model;
and updating parameters of the initial generation type response language model based on the score of the initial generation type response language model to obtain a target generation type response language model.
7. A response device based on a target generation type response language model, the device comprising:
the training initial model module is used for training to obtain an initial generation type prompt language model and an initial generation type response language model based on the prompt data set in the educational equipment; the initial generation type prompt language model has the capability of generating a new prompt according to the prompt, and the initial generation type response language model has the capability of generating a response according to the prompt;
the training scoring model module is used for training to obtain a prompt scoring model based on paired prompt data sampled in the educational equipment, and training to obtain a response scoring model based on paired response data sampled in the educational equipment;
the application scoring model module is used for scoring the predicted prompt output by the initial generation type prompt language model by using the prompt scoring model to obtain a prompt scoring value, and scoring the predicted response output by the initial generation type response language model by using the response scoring model to obtain a response scoring value;
the training target model module is used for performing weighted calculation on the prompt scoring value and the response scoring value, and further training the initial generation type prompt language model and the initial generation type response language model through reinforcement learning and adversarial learning based on the weighted calculation result to obtain a target generation type response language model;
and the application target model module is used for inputting the text data to be tested collected by the educational equipment into the target generation type response language model, wherein the target generation type response language model splices the text data to be tested with the dialogue data to obtain a corresponding response.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any one of claims 1 to 6.
CN202310486966.XA 2023-05-04 2023-05-04 Response method and device based on target generation type response language model Active CN116303974B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310486966.XA CN116303974B (en) 2023-05-04 2023-05-04 Response method and device based on target generation type response language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310486966.XA CN116303974B (en) 2023-05-04 2023-05-04 Response method and device based on target generation type response language model

Publications (2)

Publication Number Publication Date
CN116303974A (en) 2023-06-23
CN116303974B (en) 2023-08-01

Family

ID=86798005

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310486966.XA Active CN116303974B (en) 2023-05-04 2023-05-04 Response method and device based on target generation type response language model

Country Status (1)

Country Link
CN (1) CN116303974B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115795009A (en) * 2022-11-24 2023-03-14 北京智谱华章科技有限公司 Cross-language question-answering system construction method and device based on generating type multi-language model

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9721004B2 (en) * 2014-11-12 2017-08-01 International Business Machines Corporation Answering questions via a persona-based natural language processing (NLP) system
CN105551328A (en) * 2016-01-28 2016-05-04 北京聚力互信教育科技有限公司 Language teaching coaching and study synchronization integration system on the basis of mobile interaction and big data analysis
US11475054B2 (en) * 2020-04-24 2022-10-18 Roblox Corporation Language detection of user input text for online gaming
WO2021243706A1 (en) * 2020-06-05 2021-12-09 中山大学 Method and apparatus for cross-language question generation
CN112379849B (en) * 2021-01-18 2021-04-09 之江实验室 Parallel deep learning training data input method and system based on sequence predictability
CN115188376A (en) * 2022-06-30 2022-10-14 星河智联汽车科技有限公司 Personalized voice interaction method and system

Also Published As

Publication number Publication date
CN116303974A (en) 2023-06-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant