CN117786070A - Customer service question-answering model training method, question-answering method, system, equipment and medium - Google Patents


Info

Publication number
CN117786070A
Authority
CN
China
Prior art keywords
customer service
model
answering
question
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311740739.1A
Other languages
Chinese (zh)
Inventor
邓从健
刘杰
张明东
陈茂强
温琪
汤冬儿
李礼红
林霆锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Yunqu Information Technology Co ltd
Original Assignee
Guangzhou Yunqu Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Yunqu Information Technology Co ltd filed Critical Guangzhou Yunqu Information Technology Co ltd
Priority to CN202311740739.1A priority Critical patent/CN117786070A/en
Publication of CN117786070A publication Critical patent/CN117786070A/en
Pending legal-status Critical Current


Abstract

The application relates to a customer service question-answering model training method, a question-answering method, a system, a device and a medium. The method comprises the following steps: acquiring a chain-of-thought prompt sentence and a chain-of-thought reply sentence; taking the chain-of-thought prompt sentence as the input of the model and the chain-of-thought reply sentence as the output of the model, performing a first, supervised training on an initial customer service question-answering model to obtain a first customer service question-answering model; and performing a second training on the first customer service question-answering model based on a reinforcement learning algorithm to obtain a converged second customer service question-answering model. The chain-of-thought prompt sentence is a prompt sentence constructed based on a question sentence in an initial dataset and used to guide the reply to the question sentence; the chain-of-thought reply sentence is an expected reply sentence, constructed based on the chain-of-thought prompt sentence, for replying to the question sentence. The reply sentences output by the model of the application closely match human preferences, and are more accurate, professional and reliable.

Description

Customer service question-answering model training method, question-answering method, system, equipment and medium
Technical Field
The present application relates to the field of customer service, and more particularly, to a customer service question-answering model training method, a customer service question-answering method, a customer service question-answering system, an electronic device, and a computer-readable storage medium.
Background
With the rapid development of the internet and the continuous progress of artificial intelligence technology, more and more enterprises have begun to adopt automated customer service systems to meet customers' consultation needs. However, existing customer service question-answering systems often have limitations, such as insufficient ability to handle multi-turn dialogues and cross-domain knowledge, which impairs the accuracy and professionalism of the output information. In addition, when a traditional customer service question-answering system processes complex questions, the model produces larger deviations, and the transparency and reliability of the output results are lower. Therefore, how to improve the accuracy, professionalism and reliability of customer service question-answering systems has become a problem to be solved urgently.
Disclosure of Invention
It is an object of the present application to provide a new solution for training a customer service question-answering model.
According to a first aspect of the present application, there is provided a customer service question-answering model training method, including:
acquiring a chain-of-thought prompt sentence and a chain-of-thought reply sentence;
taking the chain-of-thought prompt sentence as the input of the model and the chain-of-thought reply sentence as the output of the model, performing a first, supervised training on an initial customer service question-answering model to obtain a first customer service question-answering model;
performing a second training on the first customer service question-answering model based on a reinforcement learning algorithm to obtain a converged second customer service question-answering model;
wherein the chain-of-thought prompt sentence is a prompt sentence constructed based on a question sentence in an initial dataset and used to guide the reply to the question sentence;
and the chain-of-thought reply sentence is an expected reply sentence, constructed based on the chain-of-thought prompt sentence, for replying to the question sentence.
Optionally, the performing the second training on the first customer service question-answering model based on the reinforcement learning algorithm to obtain the converged second customer service question-answering model includes:
determining, according to the chain-of-thought prompt sentence and the reply sentence output by the first customer service question-answering model, the reward output by a reward model for the reply sentence;
determining a first loss function based on the reward;
and performing the second training on the first customer service question-answering model based on the reinforcement learning algorithm using the first loss function to obtain the converged second customer service question-answering model.
Optionally, the first loss function is a policy loss function, and the determining the first loss function according to the reward includes:
determining an advantage estimation term based on the reward;
and determining the policy loss function according to the advantage estimation term.
Optionally, the first loss function is a value function loss function, and the determining the first loss function according to the reward further includes:
determining, according to the reward, the cumulative reward actually obtained in the current state;
and determining the value function loss function based on the actually obtained cumulative reward and the expected cumulative reward.
Optionally, the performing the second training on the first customer service question-answering model based on the reinforcement learning algorithm using the first loss function to obtain the converged second customer service question-answering model further includes:
determining a second loss function, wherein the second loss function is an entropy loss function;
and performing the second training on the first customer service question-answering model based on the reinforcement learning algorithm using the first loss function and the second loss function to obtain the converged second customer service question-answering model.
Optionally, the reward model is a model obtained by removing the last non-embedding layer from the Transformer-based first customer service question-answering model.
According to a second aspect of the present application, there is provided a customer service question-answering method, including:
acquiring a question sentence input by a user;
obtaining a reply sentence corresponding to the question sentence based on a customer service question-answering model, wherein the customer service question-answering model is the converged second customer service question-answering model obtained according to the training method of any one of the first aspects of the present application;
and outputting the reply sentence to the user.
According to a third aspect of the present application, there is provided a customer service question-answering system, including:
a first acquisition module, configured to acquire a question sentence input by a user;
a second acquisition module, configured to obtain a reply sentence corresponding to the question sentence based on a customer service question-answering model, where the customer service question-answering model is the converged second customer service question-answering model obtained according to the training method of any one of the first aspects of the present application;
and an output module, configured to output the reply sentence to the user.
According to a fourth aspect of the present application, there is provided an electronic device comprising a memory for storing computer instructions and a processor for invoking the computer instructions from the memory to perform the training method according to any one of the first aspects of the present application or the customer service question-answering method according to the second aspect of the present application.
According to a fifth aspect of the present application, there is provided a computer-readable storage medium storing a computer program executable by a processor to implement the training method according to any one of the first aspects of the present application or the customer service question-answering method according to the second aspect of the present application.
According to the method of the embodiments of the present application, the chain-of-thought prompt sentence and the chain-of-thought reply sentence are acquired; taking the chain-of-thought prompt sentence as the input of the model and the chain-of-thought reply sentence as the output of the model, a first, supervised training is performed on the initial customer service question-answering model to obtain the first customer service question-answering model; the first customer service question-answering model is then trained based on a reinforcement learning algorithm to obtain the converged second customer service question-answering model. The reply sentences output by the model closely match human preferences. Meanwhile, by updating the chain-of-thought prompt sentences, the customer service question-answering model of the embodiments of the present application can be seamlessly integrated with various knowledge bases, thereby effectively overcoming the shortcomings of existing question-answering systems in handling multi-turn dialogues and cross-domain knowledge, and ensuring the accuracy, professionalism and reliability of the output.
Other features of the present application and its advantages will become apparent from the following detailed description of exemplary embodiments of the present application, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description, serve to explain the principles of the application.
Fig. 1 is a schematic flowchart of a customer service question-answering model training method provided in an embodiment of the present application.
Fig. 2 is a schematic flowchart of another customer service question-answering model training method provided in an embodiment of the present application.
Fig. 3 is a schematic flowchart of another customer service question-answering model training method provided in an embodiment of the present application.
Fig. 4 is a schematic flowchart of a customer service question-answering method provided in an embodiment of the present application.
Fig. 5 is a schematic block diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
Various exemplary embodiments of the present application will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present application unless it is specifically stated otherwise.
The following description of at least one exemplary embodiment is merely exemplary in nature and is in no way intended to limit the application, its application, or uses.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any specific values should be construed as merely illustrative, and not a limitation. Thus, other examples of exemplary embodiments may have different values.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
In the customer service field, users expect immediate, personalized resolution of their questions. However, conventional customer service question-answering systems often struggle to meet these needs in complex contexts. Current customer service question-answering systems are typically based on rules, knowledge graphs, or deep learning.
Rule-based question-answering systems rely on pre-written rules and templates to generate replies. Although they achieve a certain accuracy in a specific field, their lack of generalization capability makes it difficult to cope with diversified and cross-domain questions.
Knowledge-graph-based question-answering systems rely on structured knowledge graphs to provide answers; such graphs are costly to build and maintain, and these systems face challenges in handling unstructured data and cross-domain questions.
Question-answering systems based on deep neural networks generate natural language replies; although improved in fluency and flexibility of generation, they may still produce inaccurate or irrelevant replies when dealing with long sequences and complex questions.
The applicant's studies have found that the problem with existing customer service question-answering models is a lack of deep semantic understanding and sound logical reasoning when handling actual dialogues. On this basis, the present application provides a new customer service question-answering model training method, so as to provide a customer service question-answering system that better understands user intent and offers higher reliability, stronger interpretability and a high degree of customization.
The customer service question-answering model in the embodiments of the present application is a generative language model. As shown in fig. 1, an embodiment of the present application provides a customer service question-answering model training method. The training method may include steps S110 to S130.
Before the training method is executed, historical customer service dialogue data containing product information and after-sales policies can be downloaded in advance from a cloud FTP file system. The historical customer service dialogue data is read by the data processing system and cleaned, removing meaningless symbols, vocabulary, duplicate data and the like, to obtain an initial dataset. The initial dataset includes question sentences and a reply sentence corresponding to each question sentence.
Step S110, acquiring a chain-of-thought prompt sentence and a chain-of-thought reply sentence.
In the embodiments of the present application, the chain of thought is a CoT (Chain-of-Thought). The chain-of-thought prompt sentence is a prompt sentence constructed based on a question sentence in the initial dataset and used to guide the reply to the question sentence. The chain-of-thought prompt sentence comprises the question sentence and a small number of steps or examples guiding the reply to the question sentence, so that the customer service question-answering model is guided to reason more accurately according to the guidance and generate a reply sentence.
As one example, the question sentence in the initial dataset is "In what ways is product 4 better than product 2?". The constructed chain-of-thought prompt sentence may be the following sentence:
“System:
Answer the customer's question step by step according to the following guidelines. The customer's question is delimited by the separator (##).
Step one ## Identify the user's intent and determine whether the customer is asking for product information or seeking pre-sales/after-sales support.
Step two ## Determine whether the product information the customer asks about, or the pre-sales/after-sales support rules sought, is in the knowledge base information list below.
Knowledge base product information list:
Product 1: price; product description; return policy;
Product 2: price; product description; return policy;
Product 3: price; product description; return policy;
……
Step three ## List the assumptions made by the customer.
Step four ## Determine whether the customer's assumptions are correct based on the provided knowledge base product information list.
Step five ## If a customer assumption is incorrect, or the customer mentions a product that is not in the provided knowledge base information list, politely remind the customer. If the customer's assumptions are correct, reply to the customer's question in a friendly manner based on the provided knowledge base product information list.
Question ## In what ways is product 4 better than product 2? ##”
In the embodiments of the present application, the chain-of-thought reply sentence is an expected reply sentence for replying to the question sentence, constructed based on the chain-of-thought prompt sentence; that is, a reply sentence generated by reasoning step by step according to the guidance of the chain-of-thought prompt sentence. The chain-of-thought reply sentence comprises the reply sentence in the initial dataset.
As an example, the reply sentence in the initial dataset is "Hello, at present we sell only product 2; product 2 is suitable for people who work facing an electronic screen for long periods." The constructed chain-of-thought reply sentence may be the following sentence:
“Step one ## Based on the customer's message, we can determine that the customer wants information comparing product 4 and product 2.
Step two ## The customer is asking for information on products 4 and 2; product 4 is not in the provided knowledge base information list, while product 2 is in the provided knowledge base information list.
Step three ## The customer assumes that both products, product 4 and product 2, exist.
Step four ## The customer's assumption is not entirely correct: there is no product 4 in the knowledge base information list, but there is a product 2.
Step five ## Hello, at present we sell only product 2; product 2 is suitable for people who work facing an electronic screen for long periods.”
In the embodiments of the present application, the initial customer service question-answering model can be trained using a dataset formed from the chain-of-thought prompt sentences and the chain-of-thought reply sentences. The initial customer service question-answering model may be a pre-trained generative language model.
In the embodiments of the present application, the same chain-of-thought prompt sentence may have multiple corresponding chain-of-thought reply sentences; that is, reply sentences with the same meaning may be expressed in various ways. As an example, "no product matching the customer's inquiry" can be expressed as "We are very sorry, we do not sell such products", or "Hello, we do not have such products in the store", or "There is no knowledge about what you mention in my knowledge base".
Step S120, taking the chain-of-thought prompt sentence as the input of the model and the chain-of-thought reply sentence as the output of the model, performing a supervised first training on the initial customer service question-answering model to obtain a first customer service question-answering model.
In the embodiments of the present application, the loss function L used by the supervised first training may be predefined. As one example, the formula for the loss function L is as follows:

L = −(1/N) Σ_{i=1}^{N} Σ_{p=1}^{P} log P(y_{i,p} | x_i, y_{i,<p})

where x_i is the input sequence corresponding to the chain-of-thought prompt sentence of the i-th training sample, and y_i is the target sequence corresponding to the chain-of-thought reply sentence of the i-th training sample. For each given x_i, the model predicts a y_i; thus, for sequence x_i, a predicted probability distribution P(y_{i,p} | x_i, y_{i,<p}) can be derived from the model, where y_{i,<p} is the part of the target sequence y_i preceding position p. N is the number of training samples in the dataset, P is the length of the target sequence y_i, and y_{i,p} is the one-hot encoding of the target sequence at position p in the i-th training sample.
The pre-trained customer service question-answering model is supervised fine-tuned using the loss function L, iteratively updating the weights of the model to obtain the first customer service question-answering model. The performance indicators of the model (e.g., accuracy, recall and F1 score) may be calculated by comparing the generated reply sentences with the constructed expected reply sentences.
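As an illustrative sketch (not part of the application), the supervised loss above can be computed as follows, assuming the per-token probabilities P(y_{i,p} | x_i, y_{i,<p}) that the model assigns to the correct target tokens are already available:

```python
import math

def sft_loss(batch_token_probs):
    # batch_token_probs[i][p] is the model probability P(y_{i,p} | x_i, y_{i,<p})
    # assigned to the correct target token at position p of sample i.
    # The loss is the average negative log-likelihood over the N samples.
    total = 0.0
    for token_probs in batch_token_probs:
        total += -sum(math.log(p) for p in token_probs)
    return total / len(batch_token_probs)
```

A perfectly confident model (all probabilities 1.0) yields a loss of 0; less confident predictions increase it.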
Step S130, performing a second training on the first customer service question-answering model based on a reinforcement learning algorithm to obtain a converged second customer service question-answering model.
In this embodiment, the input of the model is the chain-of-thought prompt sentence in the initial dataset and the output of the model is the chain-of-thought reply sentence in the initial dataset, and the second training is performed on the first customer service question-answering model based on the reinforcement learning algorithm until a converged second customer service question-answering model is obtained. It will be appreciated that the second training is a further optimization on top of the first training.
In the embodiments of the present application, the reply sentences output by the model closely match human preferences. Meanwhile, by updating the chain-of-thought prompt sentences, the customer service question-answering model of the embodiments of the present application can be seamlessly integrated with various knowledge bases, thereby effectively overcoming the shortcomings of existing question-answering systems in handling multi-turn dialogues and cross-domain knowledge, and ensuring the accuracy, professionalism and reliability of the output.
In some embodiments, as shown in fig. 2, step S130 may include steps S230 to S232.
Step S230, determining, according to the chain-of-thought prompt sentence and the reply sentence output by the first customer service question-answering model, the reward output by the reward model for the reply sentence.
In this embodiment of the present application, the reward model may be a model trained by taking the chain-of-thought prompt sentence and the actual reply sentence output by the first customer service question-answering model as the input of the model and the expected reward as the output of the model. The reward model may be a model obtained by removing the last non-embedding layer from the first customer service question-answering model. As one example, the reward model may be a model obtained by removing the last non-embedding layer from the Transformer-based first customer service question-answering model.
In embodiments of the present application, the expected reward may be determined based on the degree of human preference for the actual reply sentence. The degree of human preference for an actual reply sentence may be embodied in the ranking of the actual reply sentence.
As one example, professional reviewers may be invited in advance to rank and score the reply sentences output by the supervised fine-tuned customer service question-answering model in terms of helpfulness, harmlessness and honesty, and the ranking and scoring data are collected. Then, a mapping table of chain-of-thought prompt sentences, chain-of-thought reply sentences and ranking data is established as the dataset of the reward model, as shown in table 1.
TABLE 1
In this embodiment, the reward model may be trained using a cross-entropy loss function. As one example, the loss function of the reward model is:

L_RW(θ) = −(1/C(K,2)) · E_{(x, y_w, y_l)∼D} [ log σ( r_θ(x, y_w) − r_θ(x, y_l) ) ]

where θ is a parameter of the reward model; K is the number of target sequences corresponding to the reply sentences predicted by the supervised fine-tuned first customer service question-answering model for the same chain-of-thought prompt sentence; C(K,2) denotes the number of ways of selecting any two of the K target sequences as a pair; and x is the chain-of-thought prompt sentence. The K target sequences are divided into groups of sequence pairs, and each pair is compared. In any pair, y_w is the target sequence judged better after comparison, i.e., the one that more closely matches human preference, and y_l is the target sequence judged worse relative to y_w. r_θ(x, y) is the output of the reward model, representing the reward given the prompt sentence x when the reply sentence y is generated. D is the dataset of the reward model. σ(·) is a sigmoid function that converts an input to a value between 0 and 1. E_{(x, y_w, y_l)∼D} denotes the expectation computed over (x, y_w, y_l) tuples sampled from the dataset D.
In the present embodiment, the cross-entropy loss function L_RW(θ) described above can be used to train the reward model to obtain a converged reward model.
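The pairwise loss can be sketched as follows (an illustrative reading of the formula, not the application's implementation); the helper assumes the K replies for one prompt have already been scored by the reward model and sorted from most to least human-preferred, so in every pair the earlier entry plays the role of y_w:

```python
import math
from itertools import combinations

def reward_model_loss(ranked_rewards):
    # ranked_rewards holds r_theta(x, y) for K replies to one prompt,
    # ordered from most to least preferred. Average -log(sigmoid(r_w - r_l))
    # over all C(K, 2) ordered pairs.
    pairs = list(combinations(ranked_rewards, 2))
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    return -sum(math.log(sigmoid(rw - rl)) for rw, rl in pairs) / len(pairs)
```

Rewards consistent with the human ranking (higher score for the preferred reply) give a lower loss than rewards that contradict it.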
Step S231, determining the first loss function according to the reward corresponding to the reply sentence output by the reward model.
In this embodiment, the first loss function is determined according to the reward corresponding to the reply sentence output by the converged reward model.
In some embodiments, the first loss function may be a policy loss function L_POLICY. In the present embodiment, step S231 may include steps S330 to S332.
Step S330, determining the advantage estimation term according to the reward output by the reward model.
In the embodiments of the present application, the advantage estimation term Â_t is determined according to the reward output by the converged reward model. The advantage estimation term Â_t represents the expected benefit of taking action a_t in state s_t, where a_t is the next token sequence to be generated and s_t is the current token sequence (the initial token sequence is the token sequence of the chain-of-thought prompt).
As an example, the calculation formula of Â_t is as follows:

Â_t = δ_t + γ·δ_{t+1} + γ²·δ_{t+2} + … + γ^{T−t−1}·δ_{T−1}

where γ is a discount factor and T is the total number of time steps. Each δ is a temporal difference (TD) error: δ_t is the TD error of sequence a_t at time step t, δ_{t+1} is the TD error at time step t+1, and so on.
The calculation formula of δ_t can be:

δ_t = r_t + γ·V(s_{t+1}) − V(s_t)

where V(s_t) represents the expected cumulative reward in state s_t, V(s_{t+1}) represents the expected cumulative reward in state s_{t+1}, and r_t denotes the reward obtained at time step t.
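The two formulas above can be sketched as follows (illustrative only; the function name and the value estimates passed in are assumptions). The TD errors δ_t are computed first, and each advantage Â_t is the discounted sum of the δ terms from t onward, accumulated backwards for efficiency:

```python
def advantage_estimates(rewards, values, gamma=0.99):
    # values[t] approximates V(s_t); it has one extra entry for the state
    # reached after the final step. delta_t = r_t + gamma*V(s_{t+1}) - V(s_t),
    # and A_t = delta_t + gamma*delta_{t+1} + ... accumulated from the end.
    T = len(rewards)
    deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        running = deltas[t] + gamma * running
        advantages[t] = running
    return deltas, advantages
```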
Step S332, determining the policy loss function according to the advantage estimation term.
The policy loss function of this embodiment not only considers the difference between the new policy and the old policy, but also considers the advantage estimation term under the policy, encouraging the model to take better actions.
As one example, the calculation formula of the policy loss function is as follows:

L_POLICY = −E_t [ min( ρ_t(θ)·Â_t, clip(ρ_t(θ), 1−ε, 1+ε)·Â_t ) ], with ρ_t(θ) = π_θ(a_t|s_t) / π_θold(a_t|s_t)

where θ is the new policy parameter and θ_old is the old policy parameter. π_θ(a_t|s_t) is the probability of taking a_t given s_t under the new policy parameter θ, and π_θold(a_t|s_t) is the probability of taking a_t given s_t under the old policy parameter θ_old. ε is a small positive number (e.g., 0.2 or 0.3). The clip(·) function limits the size of the policy update to a trust region.
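A minimal sketch of the clipped policy loss (illustrative only; the per-step action probabilities under the new and old policies are assumed given):

```python
def ppo_policy_loss(new_probs, old_probs, advantages, eps=0.2):
    # For each step, the probability ratio pi_theta / pi_theta_old is clipped
    # to [1 - eps, 1 + eps]; taking the min of the clipped and unclipped
    # surrogate keeps the policy update inside a trust region.
    total = 0.0
    for p_new, p_old, adv in zip(new_probs, old_probs, advantages):
        ratio = p_new / p_old
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)  # negated: minimising the loss maximises the surrogate
```

When the new policy equals the old one the ratio is 1 and the loss reduces to the negated mean advantage; a large ratio is clipped so it cannot dominate the update.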
In this embodiment, the fine-tuned first customer service question-answering model may be trained based on a reinforcement learning algorithm using the policy loss function, so as to obtain a converged second customer service question-answering model.
In other embodiments, the first loss function may be a value function loss function. In the present embodiment, step S231 may further include steps S430 to S432.
Step S430, determining the cumulative reward actually obtained in the current state according to the reward output by the reward model.
In the embodiments of the present application, the cumulative reward R̂(s) actually obtained in the current state is determined according to the reward output by the converged reward model.
As an example, the calculation formula of R̂(s) is as follows:

R̂(s) = Σ_{t=0}^{T} γ^t · r̂_t, with s_0 = s

where γ is a discount factor used to reduce the value of future rewards, γ^t represents the discount after t time steps, and r̂_t is the reward actually obtained at time step t. s_0 = s means that the whole sequence starts from state s.
Step S432, determining the value function loss function according to the actually obtained cumulative reward and the expected cumulative reward.
The value function loss function L_VF of the present embodiment embodies the mean squared error between the predicted and actual values of the discounted total return. The value function predicts the expected cumulative reward under a given condition, which encourages the model to estimate future rewards more accurately.
As an example, the calculation formula of L_VF is as follows:

L_VF = E [ ( V_θ(s) − R̂(s) )² ]

where V_θ(s) is the cumulative reward expected to be obtained in the current state.
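The actually obtained discounted returns and the value function loss can be sketched as follows (illustrative only; function names are assumptions):

```python
def discounted_returns(rewards, gamma=0.99):
    # Cumulative reward actually obtained from each time step onward,
    # accumulated backwards: R_t = r_t + gamma * R_{t+1}.
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def value_loss(predicted_values, returns):
    # Mean squared error between V_theta(s) and the realised return.
    n = len(returns)
    return sum((v - g) ** 2 for v, g in zip(predicted_values, returns)) / n
```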
Step S232, performing the second training on the first customer service question-answering model based on the reinforcement learning algorithm using the first loss function to obtain the converged second customer service question-answering model.
In this embodiment, the fine-tuned first customer service question-answering model may be trained using the value function loss function to obtain the converged second customer service question-answering model. The fine-tuned first customer service question-answering model may also be trained by combining the policy loss function and the value function loss function to obtain the converged second customer service question-answering model.
In some embodiments, as shown in fig. 3, step S232 may further include steps S530 to S532.
In step S530, a second loss function is determined.
In the embodiment of the present application, the second loss function is an entropy loss function L_ENT. The entropy loss function L_ENT is used to encourage the reinforcement learning algorithm to remain somewhat exploratory. The calculation formula of the entropy loss function L_ENT is:

L_ENT = −entropy(π_θ(·|s_t))

where "·" represents all possible actions, i.e., the next token sequence that may be generated, and π_θ(·|s_t) is the policy distribution over all possible actions in state s_t. In this function, the entropy is computed from the probability distribution of π_θ; the minus sign means that minimizing L_ENT maximizes the entropy of the policy.
For a discrete probability distribution p, the entropy is calculated as:

H(p) = −Σ_i p(i)·log p(i)

Thus, the entropy loss function L_ENT can be expressed as:

L_ENT = Σ_a π_θ(a|s_t)·log π_θ(a|s_t)

where π_θ(a|s_t) is the probability of taking action a in state s_t under the new policy parameter θ.
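The entropy and the corresponding entropy loss over a discrete action distribution can be sketched as follows (illustrative only; zero-probability actions are skipped, taking 0·log 0 as 0):

```python
import math

def policy_entropy(action_probs):
    # H(pi) = -sum_a pi(a|s_t) * log pi(a|s_t); larger values mean a more
    # exploratory (less peaked) policy.
    return -sum(p * math.log(p) for p in action_probs if p > 0.0)

def entropy_loss(action_probs):
    # L_ENT is the negated entropy, so minimising it rewards exploration.
    return -policy_entropy(action_probs)
```

A deterministic policy has entropy 0, and a more uniform policy has a lower (more negative) entropy loss.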
Step S532, performing a second training on the first customer service questioning and answering model based on the reinforcement learning algorithm by using the first loss function and the second loss function to obtain a converged second customer service questioning and answering model.
Preferably, the policy loss function, the value function loss function, and the entropy loss function are combined to obtain the loss function L_RL of the reinforcement learning algorithm.

The loss function L_RL of the reinforcement learning algorithm is calculated as:

L_RL = L_POLICY + c_1 L_VF + c_2 L_ENT

where L_POLICY is the policy loss function, L_VF is the value function loss function, L_ENT is the entropy loss function, and c_1 and c_2 are hyper-parameters used to balance the respective parts of the loss function.
The fine-tuned first customer service question-answering model is then iteratively trained with this loss function. In each iteration, back-propagation is used to update the model weights. Once the model weights are updated, the reinforcement learning algorithm starts a new cycle. Finally, a converged second customer service question-answering model is obtained through iterative training. The customer service question-answering model of this embodiment is better aligned with human preference and better meets the requirements of customer service tasks.
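The combined loss and the per-iteration update described above can be sketched as follows (the coefficient defaults, the training_step helper, and the model.losses hook are illustrative assumptions, not details from the patent):

```python
def rl_loss(l_policy, l_vf, l_ent, c1=0.5, c2=0.01):
    # L_RL = L_POLICY + c1 * L_VF + c2 * L_ENT, where c1 and c2 are
    # hyper-parameters balancing the value-function and entropy terms.
    # The default values here are common PPO-style choices, not the patent's.
    return l_policy + c1 * l_vf + c2 * l_ent

def training_step(model, batch, optimizer):
    # One iteration of the cycle: compute the three loss terms on a batch
    # of prompt/reply rollouts, back-propagate the combined loss, and
    # update the model weights before the next rollout cycle begins.
    # `model.losses` is a hypothetical hook returning the three terms
    # as autograd tensors in a PyTorch-like framework.
    l_policy, l_vf, l_ent = model.losses(batch)
    loss = rl_loss(l_policy, l_vf, l_ent)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```

The scalar arithmetic in rl_loss is exactly the formula above; everything framework-specific is confined to training_step.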
The customer service question-answering model can be applied to various customer service scenarios, such as online customer service, telephone customer service, social media customer service, and smart home device customer service.
The customer service question-answering model also offers strong customization capability. By integrating retrieval-augmented generation (RAG), knowledge from multiple domains can be incorporated into the chain-of-thought prompt sentences, enabling seamless fusion with various knowledge bases and maintaining high professionalism and accuracy across diverse customer service scenarios.
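One way the retrieval-augmented prompt construction might look in code (the template wording and the function name build_cot_prompt are our assumptions; the patent does not specify a prompt format):

```python
def build_cot_prompt(question, retrieved_passages):
    # Fold retrieved knowledge-base passages into a chain-of-thought
    # prompt so the model can ground its step-by-step reply in domain
    # knowledge. The template text here is purely illustrative.
    context = "\n".join("- " + p for p in retrieved_passages)
    return (
        "Known information:\n" + context + "\n"
        "Question: " + question + "\n"
        "Please reason step by step before giving the final answer."
    )

prompt = build_cot_prompt(
    "How do I request a refund?",
    ["Refunds are available within 7 days of purchase."],
)
```

A real pipeline would obtain retrieved_passages from a retriever over the knowledge base; here they are passed in directly to keep the sketch self-contained.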
The embodiment of the application also provides a customer service question-answering method. As shown in fig. 4, the method includes steps S610 to S630.
Step S610, a question sentence input by a user is acquired.
Step S620, obtaining a reply sentence corresponding to the question sentence based on a customer service question-answering model.

In the embodiment of the present application, the customer service question-answering model is the converged second customer service question-answering model obtained according to the training method of any one of the embodiments of the present application.
Step S630, outputting the reply sentence to the user.
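Steps S610 to S630 amount to a simple request/reply flow, sketched below (the EchoModel stand-in and the generate(prompt) interface are our assumptions for illustration; a real deployment would load the converged customer service question-answering model):

```python
def answer_question(model, question):
    # S610: the question sentence input by the user is received;
    # S620: the customer service question-answering model produces
    #       the corresponding reply sentence;
    # S630: the reply sentence is returned for output to the user.
    return model.generate(question)

class EchoModel:
    # Minimal stand-in exposing the assumed generate(prompt) -> str
    # interface, used only to make the flow runnable end to end.
    def generate(self, prompt):
        return "Reply to: " + prompt
```

The same three-step flow underlies the system modules described next: the first acquisition module covers S610, the second acquisition module covers S620, and the output module covers S630.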
The embodiment of the application also provides a customer service question-answering system. The system may include a first acquisition module, a second acquisition module, and an output module.

The first acquisition module is configured to acquire a question sentence input by a user.

The second acquisition module is configured to acquire a reply sentence corresponding to the question sentence based on a customer service question-answering model.

In this embodiment of the present application, the customer service question-answering model is the converged second customer service question-answering model obtained according to the training method described in any one of the embodiments of the present application.

The output module is configured to output the reply sentence to the user.
The embodiment of the present application further provides an electronic device. As shown in fig. 5, the electronic device 1000 includes a memory 1100 and a processor 1200, where the memory 1100 is configured to store computer instructions, and the processor 1200 is configured to call the computer instructions from the memory 1100 to execute the training method according to any one of the embodiments of the present application or the customer service question-answering method according to the embodiments of the present application.
Embodiments of the present application also provide a computer-readable storage medium storing a computer program executable by a processor to implement the training method described in any one of the embodiments of the present application or the customer service question-answering method described in the embodiments of the present application.
The present application may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present application.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punch card or a raised structure in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (e.g., a light pulse through a fiber optic cable), or an electrical signal transmitted through a wire.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
Computer program instructions for performing the operations of the present application may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present application are implemented by personalizing electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), with state information of the computer readable program instructions, and the electronic circuitry may execute the computer readable program instructions.
Various aspects of the present application are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, implementation by software, and implementation by a combination of software and hardware are all equivalent.
The embodiments of the present application have been described above. The foregoing description is exemplary rather than exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or their technical improvement over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the application is defined by the appended claims.

Claims (10)

1. A customer service question-answering model training method, characterized by comprising the following steps:

acquiring a chain-of-thought prompt sentence and a chain-of-thought reply sentence;

taking the chain-of-thought prompt sentence as an input of a model and the chain-of-thought reply sentence as an output of the model, and performing first supervised training on an initial customer service question-answering model to obtain a first customer service question-answering model; and

performing second training on the first customer service question-answering model based on a reinforcement learning algorithm to obtain a converged second customer service question-answering model;

wherein the chain-of-thought prompt sentence is a prompt sentence constructed based on a question sentence in an initial data set and used for guiding a reply to the question sentence;

and the chain-of-thought reply sentence is an expected reply sentence, constructed based on the chain-of-thought prompt sentence, for replying to the question sentence.
2. The method of claim 1, wherein the performing second training on the first customer service question-answering model based on the reinforcement learning algorithm to obtain a converged second customer service question-answering model comprises:

determining, by a reward model, a reward corresponding to a reply sentence output by the first customer service question-answering model, according to the chain-of-thought prompt sentence and the reply sentence;

determining a first loss function based on the reward; and

performing the second training on the first customer service question-answering model based on the reinforcement learning algorithm by using the first loss function, to obtain the converged second customer service question-answering model.
3. The method of claim 2, wherein the first loss function is a policy loss function, and wherein the determining the first loss function based on the reward comprises:

determining an advantage estimation term based on the reward; and

determining the policy loss function according to the advantage estimation term.
4. The method of claim 2, wherein the first loss function is a value function loss function, and wherein determining the first loss function based on the reward further comprises:

determining, according to the reward, a cumulative reward actually obtained in a current state; and

determining the value function loss function based on the actual cumulative reward and an expected cumulative reward.
5. The method of any one of claims 2 to 4, wherein the performing second training on the first customer service question-answering model based on the reinforcement learning algorithm by using the first loss function to obtain a converged second customer service question-answering model further comprises:

determining a second loss function, wherein the second loss function is an entropy loss function; and

performing the second training on the first customer service question-answering model based on the reinforcement learning algorithm by using the first loss function and the second loss function, to obtain the converged second customer service question-answering model.
6. The method of claim 2, wherein the reward model is a model obtained by removing the last unembedding layer from the Transformer-based first customer service question-answering model.
7. A customer service question-answering method, characterized by comprising:

acquiring a question sentence input by a user;

obtaining a reply sentence corresponding to the question sentence based on a customer service question-answering model, wherein the customer service question-answering model is the converged second customer service question-answering model obtained according to the training method of any one of claims 1 to 6; and

outputting the reply sentence to the user.
8. A customer service question-answering system, characterized by comprising:

a first acquisition module, configured to acquire a question sentence input by a user;

a second acquisition module, configured to obtain a reply sentence corresponding to the question sentence based on a customer service question-answering model, wherein the customer service question-answering model is the converged second customer service question-answering model obtained according to the training method of any one of claims 1 to 6; and

an output module, configured to output the reply sentence to the user.
9. An electronic device, comprising a memory for storing computer instructions and a processor for calling the computer instructions from the memory to perform the training method of any one of claims 1 to 6 or the customer service question-answering method of claim 7.
10. A computer readable storage medium storing a computer program executable by a processor to implement the training method of any one of claims 1 to 6 or the customer service question-answering method of claim 7.
CN202311740739.1A 2023-12-15 2023-12-15 Customer service question-answering model training method, question-answering method, system, equipment and medium Pending CN117786070A (en)

Publications (1)

Publication Number: CN117786070A
Publication Date: 2024-03-29

