CN111061846A

CN111061846A - Electric power new installation and capacity increase conversation customer service system and method based on layered reinforcement learning

Info

Publication number: CN111061846A
Application number: CN201911137278.2A
Authority: CN
Inventors: 高曦莹; 张冶; 蔡颖凯; 王浩淼; 曹世龙; 李强; 田睿; 宋晓文; 张雯舒; 李丹; 宋锦春; 叶宁
Original assignee: State Grid Corp of China SGCC
Current assignee: State Grid Corp of China SGCC
Priority date: 2019-11-19
Filing date: 2019-11-19
Publication date: 2020-04-24

Abstract

The invention belongs to the technical field of strategy optimization for online text dialogue systems, and in particular relates to an electric power new-installation and capacity-increase dialogue customer service system and method based on hierarchical reinforcement learning. More specifically, it relates to an online implementation method, based on hierarchical reinforcement learning, for a task-oriented electricity customer service dialogue system with mixed task attributes. The system comprises a power service understanding module, a dialogue state tracker, a dialogue strategy and a power service feedback module. The invention decomposes subtasks that require a certain professional background into multiple layers and connects them to a database for that professional domain, so that the corresponding slot value information can be looked up at any time. Customer service dialogue with a professional background is thereby realized, and the dialogue success rate and coherence are remarkably improved. The invention can save cost, improve the success rate of electric power new-installation and capacity-increase dialogue services, make the dialogue more fluent, and noticeably improve the user experience.

Description

Electric power new installation and capacity increase conversation customer service system and method based on layered reinforcement learning

Technical Field

The invention belongs to the technical field of strategy optimization for online text dialogue systems, and in particular relates to an electric power new-installation and capacity-increase dialogue customer service system and method based on hierarchical reinforcement learning. More specifically, it relates to an online implementation method, based on hierarchical reinforcement learning, for a task-oriented electricity customer service dialogue system with mixed task attributes.

Background

With the rapid development of artificial intelligence technology, dialogue systems are widely applied in fields such as smartphones, smart homes and unmanned vehicles, and internet companies and research institutions at home and abroad have invested substantial resources in dialogue systems as a research hotspot. In general, there are three types of dialogue system: question-answering, task-oriented and open-domain. Task-oriented dialogue systems focus on specific task goals and are the type mainly used for customer service dialogue systems. New-installation and capacity-increase electricity services are a main service item of electric power business halls and currently occupy considerable human resources. Traditional multi-task dialogue systems can only process simple preset tasks and struggle to complete customer service dialogues that have a certain professional character.

Therefore, the traditional customer service dialogue system has the following defects: insufficient consideration of the relevance between subtasks, sparse reward values, failure to satisfy the semantic constraints of the subtasks, and poor user experience or even dialogue failure caused by frequent switching between different subtasks.

Disclosure of Invention

In view of the above technical problems, the invention provides an electric power new-installation and capacity-increase dialogue customer service system and method based on hierarchical reinforcement learning. The system is an intelligent customer service system that aims to complete dialogue tasks under a hybrid framework based on hierarchical reinforcement learning. It is intended to handle professional customer service dialogue that requires certain professional background knowledge, in which multiple subtasks with a professional background are interrelated and must all be completed. The system adapts well to different users and can incorporate a certain amount of professional knowledge.

To achieve the above purpose, the invention adopts the following technical scheme:

The electric power new-installation and capacity-increase dialogue customer service system based on hierarchical reinforcement learning comprises a power service understanding module, a dialogue state tracker, a dialogue strategy and a power service feedback module, wherein:

the power service understanding module is used for understanding and identifying the specific demand information of the power customer and transmitting the information to the dialogue state tracker;

the dialogue state tracker is used for tracking and recording the current dialogue state so that the state information can be called at any time;

the dialogue strategy is used for generating an optimized response to the power customer, and is continuously updated through iterative optimization;

the power service feedback module is used for translating the response generated by the dialogue strategy into information the user can understand and feeding it back to the power customer.

The electric power new-installation and capacity-increase dialogue customer service method based on hierarchical reinforcement learning comprises the following steps:

Step 1: the dialogue system obtains the service corpus from the power service understanding module; slot values can be extracted from text converted from speech, or extracted directly from online text;

Step 2: when the power customer talks with the intelligent customer service, the agent dialogue data is obtained from a shared multi-domain general dialogue corpus and an electric power English item corpus;

Step 3: after the new-installation or capacity-increase electricity application is successfully received, the intelligent dialogue agent responds to the power customer's information according to the multi-standard hierarchical reinforcement learning dialogue strategy until the customer's requirement is met, at which point the dialogue is judged successful.

The dialogue system is used for extracting the text corpus of the power customer. Because electricity-related professional knowledge is involved, the dialogue strategy is decomposed into a class 1 standard strategy and a class 2 standard strategy, each with its own reward value. The class 1 standard strategy is associated with what is called the external reward value and may comprise multiple layers; the electric power professional knowledge is decomposed repeatedly until the knowledge in the corpus and the database covers the entire content. The class 2 standard strategy is associated with what is called the internal reward value and covers the decomposed subtasks and actions. The two reward values are optimized separately by reinforcement learning and guide the learning of the customer service system.

The corpus information comprises the serial number of each dialogue, the flag marking the success or failure of the dialogue, the power-related information from the user, and the power-related information replied by the system.

The slot value information decomposes the goal of the power customer in the dialogue into a series of slot values, as follows:

a new-installation or capacity-increase slot value expresses the power customer's new-installation or capacity-increase requirement;

a request slot value expresses the information the power customer asks the dialogue system for;

the slot values needed for the power customer's goal come from a database of dialogues between daily power business hall service staff and real power customers;

all the slot values appearing in a dialogue passage are extracted; if a slot has several values, it is regarded as a soft constraint of the power customer, who may later change the choice to explore other options in the dialogue; if a slot has only one value, it is a hard constraint that cannot be negotiated; if a slot value is empty, it may be something the power customer is requesting; and if a value is not present in the database, the slot value is removed from the power customer's possible goal; the whole capacity-increase process requires at least two steps, determining the capacity-increase value and determining the engineering design unit, and both the capacity-increase value and the engineering design unit can take several values.

The multi-standard hierarchical reinforcement learning dialogue strategy comprises a multi-layer class 1 standard dialogue strategy π_{gn} and a single-layer class 2 standard dialogue strategy π_{a,gn}.

The class 1 standard strategy π_{gn} obtains a state s from the environment and selects a subtask g; the subtask can be further decomposed, and the number of decomposition layers is denoted by n. Every executable subtask that has a reward value and a termination condition is carried out by the class 2 standard strategy π_{a,gn}.

The class 2 standard strategy takes a state s and a subtask gn as input and outputs a primitive action a. The subtask gn remains a constant input to the class 2 standard strategy π_{a,gn} until a termination condition is reached and the subtask gn ends. An internal evaluation mechanism in the dialogue manager provides the internal reward value r_t^i(gn_t); this reward signal indicates whether the subtask gn is about to be completed and is also used to optimize the class 2 standard strategy π_{a,gn}. The state s contains the global information of the dialogue and the tracking information of all subtasks. To optimize the class 2 standard strategy π_{a,gn}, the expected cumulative internal reward is maximized at each step t:

In the above formula, r_{t+k}^i represents the internal evaluation reward at step t+k. The class 1 standard strategy π_{gn} optimizes the cumulative reward value at step t:

In the above formula, r_{t+k}^e represents the reward value received externally from the environment at step t+k when a new subtask starts. The internal and external reward values work together so that the dialogue learning strategy selects the appropriate dialogue action.

The class 1 standard strategy π_{gn} and the class 2 standard strategy π_{a,gn} are learned with the deep Q-learning method, where the optimal Q function of the class 1 standard dialogue strategy satisfies:

In the above formula, N represents the number of steps the class 2 standard dialogue strategy π_{a,gn} needs to complete a subtask, and gn' represents the next subtask in state s_{t+N}.

The optimal Q function of the class 2 standard dialogue strategy π_{a,gn} satisfies:

In the above formula, Q₁*(s, gn) and Q₂*(s, a, gn) are represented by neural networks parameterized by θ₁ and θ₂ as Q₁(s, gn; θ₁) and Q₂(s, a, gn; θ₂).

To optimize the performance of the dialogue system, a loss function for training the network is defined, which increases the probability of actions with positive reward values and decreases the probability of actions with negative reward values.

The loss function of the class 1 standard dialogue strategy at each iteration i is:

where

In the above formula, r^e = Σ γ^k r_{t+k}^e represents the discounted sum of rewards collected by the time the subgoal gn is completed, and N represents the number of steps at completion.

The loss function to be minimized for the class 2 standard dialogue strategy is:

where

r^i represents the reward value including the discount factor. The loss function is minimized by stochastic gradient descent.

The dialogue strategy is updated through the cumulative reward reflected in the Q value, thereby realizing effective dialogue with the customer.

When stochastic gradient descent is used to minimize the loss function, the gradient for the class 1 standard dialogue strategy is:

In the formula, the gradient of the loss function is expressed in terms of the expectation E, the experience replay buffer D, the discount factor γ, and the gradient of the function Q₁ with respect to its parameters.

The gradient for the class 2 standard dialogue strategy is:

The formula likewise includes the gradient of the function Q₂ with respect to its parameters.

To further improve dialogue strategy performance, two heuristic methods are used: a target network and experience replay. The experience replay tuples are (s, g, r^e, s') and (s, g, a, r^i, s'). With each round of dialogue, the Q function is updated and the dialogue strategy function is iteratively updated until it finally converges.

Compared with the prior art, the invention has the advantages and beneficial effects that:

The invention realizes customer service dialogue with a certain professional background. The dialogue success rate and the coherence of the dialogue are both significantly improved. Labor cost is saved. By replacing the professional database, the implementation method can also realize intelligent customer service dialogue in other professional fields; and because the strategy is trained on a large amount of data, the dialogue success rate can be further improved as more domain knowledge data becomes available.

The new-installation and capacity-increase customer service system of the power grid comprises at least the following two subtasks. The first subtask determines the capacity amount of the new installation or capacity increase. The second subtask selects a design company and the approximate engineering quantity determined according to the capacity. There are time, cost and logical relations among the subtasks, but all of them must be completed together within one dialogue in the customer service system; the dialogue task cannot be completed if any step is missing. The invention adopts a dialogue manager built with a hierarchical deep reinforcement learning method, which can solve complex tasks in the dialogue at different scales, improves the relevance of the dialogue, adapts well to different users, and can incorporate a certain amount of professional knowledge.

The invention relates to an implementation method, based on hierarchical reinforcement learning, for an online power customer service dialogue system with mixed task-completion attributes. In the method, the dialogue strategy decomposes a target task over several layers until it is broken down into subtasks that the system can understand and execute, and a deep reinforcement learning method is used for learning and training. The greatest advantage of the method is that subtasks with a certain professional background are decomposed over multiple layers and connected to the database for that professional background, so that the corresponding slot value information can be looked up at any time. The invention can improve the success rate of the electric power new-installation and capacity-increase dialogue service, make the dialogue more fluent, and improve the user experience.

Drawings

To help those of ordinary skill in the art understand and practice the present invention, the invention is described in detail below with reference to the accompanying drawings and specific embodiments. The following examples illustrate the present invention, but it should be understood that the scope of the present invention is not limited by the specific embodiments.

FIG. 1 is a system block diagram of the present invention;

FIG. 2 is an overview of the dialog method of the present invention;

FIG. 3 is a schematic diagram of a class 1 standard dialog strategy learner of the present invention;

FIG. 4 is a schematic diagram of the class 2 standard dialog strategy learner of the present invention.

Detailed Description

The invention relates to an electric power new-installation and capacity-increase dialogue customer service system and method based on hierarchical reinforcement learning. The system comprises a power service understanding module, a dialogue state tracker, a dialogue strategy and a power service feedback module.

In the implementation method of the system, the power customer's text information is converted into slot value information by the power service understanding module and transmitted to the dialogue manager. The dialogue manager, composed of the dialogue state tracker and the dialogue strategy, transmits response information to the power service feedback module, which generates semantic text the power customer can understand and feeds it back to the customer. Professional knowledge can be queried and updated through the database. FIG. 1 is the system configuration diagram of the present invention.

The implementation method of the electric power new-installation and capacity-increase dialogue customer service system based on hierarchical reinforcement learning comprises the following steps:

Step 1: the dialogue system obtains the service corpus from the power service understanding module; the corpus can be extracted from text converted from speech, or extracted directly from online text.

Step 2: when the power customer talks with the intelligent customer service, the agent dialogue data is obtained from a shared multi-domain general dialogue corpus and an electric power English item corpus.

Step 3: after the new-installation or capacity-increase application is successfully received, the intelligent dialogue agent responds to the power customer's information according to the multi-standard hierarchical reinforcement learning dialogue strategy until the customer's requirement is met, at which point the dialogue is regarded as successful.

According to step 1, the dialogue system is used for extracting the text corpus of the power customer. Because electricity-related professional knowledge is involved, the dialogue strategy is decomposed into a class 1 standard strategy π_{gn} and a class 2 standard strategy π_{a,gn}, each with its own reward value. The class 1 standard strategy π_{gn} is associated with what is called the external reward value and may contain multiple layers. The electric power professional knowledge is decomposed repeatedly until the knowledge in the corpus and the database covers the entire content. The class 2 standard strategy π_{a,gn} is associated with what is called the internal reward value and covers the decomposed subtasks and actions. The two reward values are optimized separately by reinforcement learning and guide the learning of the customer service system.

According to step 1, the slot value information decomposes the goal of the power customer in the dialogue into a series of slot values. For example, the new-installation or capacity-increase slot value dst_cap = 10 kVA expresses the power customer's new-installation or capacity-increase demand. A request slot value, e.g. Price?, expresses the information the power customer asks the dialogue system for. The slot values needed for the power customer's goal come from a database of dialogues between daily power business hall service staff and real power customers. All the slot values appearing in a dialogue passage are extracted. If a slot has several values, for example an additional or_cap = 20 kVA, it is regarded as a soft constraint of the power customer, who may later change the choice to explore other options in the dialogue. If a slot has only one value, it is a hard constraint that cannot be negotiated. If a slot value is empty, it may be something the power customer is requesting, and if a value is not present in the database, the slot value is removed from the power customer's possible goal. The whole capacity-increase process requires at least two steps: determining the capacity-increase value and determining the engineering design unit. Both the new-installation capacity value and the engineering design unit can take several values.
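The slot rules above can be illustrated with a minimal Python sketch. The function name `classify_slots`, the slot names other than dst_cap, and the sample database are hypothetical, and the sketch assumes that values missing from the database are dropped before the soft/hard decision is made, which the text does not state explicitly.

```python
from typing import Dict, List, Set


def classify_slots(
    mentions: Dict[str, List[str]], database: Dict[str, Set[str]]
) -> Dict[str, str]:
    """Label each slot 'soft', 'hard' or 'request' (illustrative rules only)."""
    labels: Dict[str, str] = {}
    for slot, values in mentions.items():
        if not values:
            # Empty slot: treated as information the customer is requesting.
            labels[slot] = "request"
            continue
        # Keep only values that exist in the (hypothetical) business database.
        known = [v for v in values if v in database.get(slot, set())]
        if not known:
            # No known value: the slot is removed from the customer's goal.
            continue
        # Several values form a soft constraint, a single value a hard one.
        labels[slot] = "soft" if len(known) > 1 else "hard"
    return labels


mentions = {
    "dst_cap": ["10kVA", "20kVA"],   # two candidate capacities
    "design_unit": ["Unit A"],       # one fixed choice
    "price": [],                     # customer asks for the price
    "voltage": ["999kV"],            # not in the database
}
database = {
    "dst_cap": {"10kVA", "20kVA", "50kVA"},
    "design_unit": {"Unit A", "Unit B"},
}
labels = classify_slots(mentions, database)
print(labels)
```

In this sketch the slot whose only value is absent from the database simply disappears from the result, mirroring the removal rule in the text.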

According to step 2, the corpus information comprises the serial number of each dialogue, the flag marking the success or failure of the dialogue, the power-related information from the user, and the power-related information replied by the system. Because electric power professional information evolves over time, the electric power database needs to be queried during the dialogue to complete and supplement this information.

According to step 3, the multi-standard hierarchical reinforcement learning dialogue strategy is updated with a deep reinforcement learning method. Deep reinforcement learning needs a large amount of data and corpus material for training, and a simulator can be used to train the network.

According to step 3, the multi-standard hierarchical reinforcement learning dialogue strategy comprises a multi-layer class 1 standard dialogue strategy π_{gn} and a single-layer class 2 standard dialogue strategy π_{a,gn}.

The class 1 standard strategy π_{gn} obtains the state s from the environment and selects a subtask g, which can be further decomposed; the number of decomposition layers is denoted by n. Every executable subtask that has a reward value and a termination condition is carried out by the class 2 standard strategy π_{a,gn}.

The class 2 standard strategy takes the state s and the subtask gn as input and outputs a primitive action a. The subtask gn remains a constant input to the class 2 standard strategy π_{a,gn} until the termination condition is reached and the subtask gn ends. An internal evaluation mechanism in the dialogue manager provides the internal reward value r_t^i(gn_t); this reward signal indicates whether the subtask gn is about to be completed and is also used to optimize the strategy π_{a,gn}. The state s contains the global information of the dialogue and the tracking information of all subtasks. To optimize the class 2 standard strategy π_{a,gn}, the expected cumulative internal reward is maximized at each step t:
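The formula itself is not reproduced in this text. A reconstruction consistent with the surrounding definitions, assuming the usual discounted-return objective with discount factor γ (g_n stands for the subtask written gn in the text), would be:

```latex
\max_{\pi_{a,g_n}} \;
\mathbb{E}_{\pi_{a,g_n}}\!\left[ \sum_{k \ge 0} \gamma^{k}\, r^{i}_{t+k}
\;\middle|\; s_t = s,\; g_{n,t} = g_n \right]
```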

In the above formula, r_{t+k}^i represents the internal evaluation reward at step t+k. The class 1 standard strategy π_{gn} optimizes the cumulative reward value at step t:
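This formula is also missing from the text. Under the same assumption of a discounted return, now over the external reward, a plausible form is:

```latex
\max_{\pi_{g_n}} \;
\mathbb{E}_{\pi_{g_n}}\!\left[ \sum_{k \ge 0} \gamma^{k}\, r^{e}_{t+k}
\;\middle|\; s_t = s \right]
```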

In the above formula, r_{t+k}^e represents the reward value received externally from the environment at step t+k when a new subtask starts. The internal and external reward values work together so that the dialogue learning strategy selects the appropriate dialogue action.

The class 1 standard strategy π_{gn} and the class 2 standard strategy π_{a,gn} are learned with the deep Q-learning method, where the optimal Q function of the class 1 standard dialogue strategy π_{gn} satisfies:
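The equation is not preserved here. Assuming the Bellman optimality form commonly used in hierarchical deep Q-learning, and using N and g_n' as they are defined in the following paragraph, it can be reconstructed as:

```latex
Q_1^{*}(s, g_n) = \mathbb{E}\!\left[ \sum_{k=0}^{N-1} \gamma^{k}\, r^{e}_{t+k}
+ \gamma^{N} \max_{g_n'} Q_1^{*}\!\left(s_{t+N},\, g_n'\right)
\;\middle|\; s_t = s,\; g_{n,t} = g_n \right]
```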

In the above formula, N represents the number of steps the class 2 standard dialogue strategy π_{a,gn} needs to complete the subtask, and gn' represents the next subtask in state s_{t+N}.

The optimal Q function of the class 2 standard dialogue strategy satisfies:
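As with the class 1 case, the equation is absent from the text. A one-step Bellman form, given as an assumption that matches the inputs (s, a, gn) named for Q₂, would be:

```latex
Q_2^{*}(s, a, g_n) = \mathbb{E}\!\left[ r^{i}_{t}
+ \gamma \max_{a'} Q_2^{*}\!\left(s_{t+1},\, a',\, g_n\right)
\;\middle|\; s_t = s,\; a_t = a,\; g_{n,t} = g_n \right]
```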

In the above formula, Q₁*(s, gn) and Q₂*(s, a, gn) are represented by neural networks parameterized by θ₁ and θ₂ as Q₁(s, gn; θ₁) and Q₂(s, a, gn; θ₂). The neural network selected in the present invention is a DQN (deep Q-network).

To optimize the performance of the dialogue system, a loss function for training the network is defined, which increases the probability of actions with positive reward values and decreases the probability of actions with negative reward values. The loss function to be minimized for the class 1 standard dialogue strategy π_{gn} at each iteration i is:

wherein
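The loss and its target term are not reproduced in this text. Assuming the squared temporal-difference error usual in deep Q-learning, with D the experience replay buffer and θ_{1,i-1} the parameters of the target network (the indexing is an assumption), they can be sketched as:

```latex
L_1(\theta_{1,i}) = \mathbb{E}_{(s,\, g_n,\, r^{e},\, s') \sim D}
\!\left[ \left( y_i - Q_1(s, g_n; \theta_{1,i}) \right)^{2} \right],
\qquad
y_i = r^{e} + \gamma^{N} \max_{g_n'} Q_1\!\left(s', g_n'; \theta_{1,i-1}\right)
```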

In the above formula, r^e = Σ γ^k r_{t+k}^e represents the discounted sum of rewards collected by the time the subgoal gn is completed, and N represents the number of steps at completion.
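As a worked illustration of r^e, the short Python sketch below computes the discounted sum for a made-up reward sequence and then a class 1 learning target. The target form r^e + γ^N · max Q₁ is an assumption in line with standard hierarchical deep Q-learning, and the function names and numbers are hypothetical.

```python
from typing import Sequence


def discounted_sum(rewards: Sequence[float], gamma: float) -> float:
    """r^e = sum over k of gamma^k * r_{t+k}, for one completed subtask."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


def class1_target(
    rewards: Sequence[float], gamma: float, next_q_values: Sequence[float]
) -> float:
    """Assumed target: r^e + gamma^N * max over next subtasks of Q1(s', g')."""
    n_steps = len(rewards)  # N: steps the subtask took to complete
    return discounted_sum(rewards, gamma) + (gamma ** n_steps) * max(next_q_values)


# Made-up example: two turn penalties, then a reward when the subtask completes.
external_rewards = [-1.0, -1.0, 10.0]
gamma = 0.5
r_e = discounted_sum(external_rewards, gamma)                # -1 - 0.5 + 2.5
target = class1_target(external_rewards, gamma, [4.0, 8.0])  # r_e + 0.125 * 8
print(r_e, target)
```

With these numbers the subtask takes N = 3 steps, so the bootstrap term is discounted by γ³ rather than by γ.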

The loss function to be minimized for the class 2 standard dialogue strategy π_{a,gn} is:

wherein
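The class 2 loss is likewise missing. By analogy with the class 1 case, and again assuming a squared temporal-difference error with a one-step target, a plausible form is:

```latex
L_2(\theta_{2,i}) = \mathbb{E}_{(s,\, g_n,\, a,\, r^{i},\, s') \sim D}
\!\left[ \left( y_i - Q_2(s, a, g_n; \theta_{2,i}) \right)^{2} \right],
\qquad
y_i = r^{i} + \gamma \max_{a'} Q_2\!\left(s', a', g_n; \theta_{2,i-1}\right)
```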

r^i represents the reward value including the discount factor. The loss function is minimized by stochastic gradient descent.

The dialogue strategy is updated through the cumulative reward reflected in the Q value, thereby realizing effective dialogue with the customer.

When stochastic gradient descent is used to minimize the loss function, the gradient for the class 1 standard dialogue strategy π_{gn} is:
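The gradient expression is not preserved. Differentiating a squared temporal-difference loss with the target held fixed gives the following, offered as a reconstruction under that assumption; the constant factor is often absorbed into the learning rate, so the original figure may omit it:

```latex
\nabla_{\theta_{1,i}} L_1(\theta_{1,i}) =
-2\, \mathbb{E}_{(s,\, g_n,\, r^{e},\, s') \sim D}\!\left[
\left( r^{e} + \gamma^{N} \max_{g_n'} Q_1(s', g_n'; \theta_{1,i-1})
- Q_1(s, g_n; \theta_{1,i}) \right)
\nabla_{\theta_{1,i}} Q_1(s, g_n; \theta_{1,i}) \right]
```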

The formula includes the gradient of the function Q₁ with respect to its parameters.

The gradient for the class 2 standard dialogue strategy π_{a,gn} is:
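This expression is missing as well. Under the same assumptions as for the class 1 gradient (squared error, fixed target), it would read:

```latex
\nabla_{\theta_{2,i}} L_2(\theta_{2,i}) =
-2\, \mathbb{E}_{(s,\, g_n,\, a,\, r^{i},\, s') \sim D}\!\left[
\left( r^{i} + \gamma \max_{a'} Q_2(s', a', g_n; \theta_{2,i-1})
- Q_2(s, a, g_n; \theta_{2,i}) \right)
\nabla_{\theta_{2,i}} Q_2(s, a, g_n; \theta_{2,i}) \right]
```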

The formula includes the gradient of the function Q₂ with respect to its parameters.

To further improve dialogue strategy performance, two heuristic methods are used: a target network and experience replay. The experience replay tuples are (s, g, r^e, s') and (s, g, a, r^i, s'). With each round of dialogue, the Q function is updated and the dialogue strategy function is iteratively updated until it finally converges.
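The two heuristics can be sketched in a few lines of Python. This is a minimal illustration and not the patented implementation: the class and function names are hypothetical, network parameters are stood in for by plain dictionaries, and the buffer capacity is arbitrary.

```python
import random
from collections import deque, namedtuple

# One tuple type per strategy level, matching (s, g, r^e, s') and (s, g, a, r^i, s').
Class1Transition = namedtuple("Class1Transition", ["s", "g", "r_e", "s_next"])
Class2Transition = namedtuple("Class2Transition", ["s", "g", "a", "r_i", "s_next"])


class ReplayBuffer:
    """Fixed-size experience replay buffer with uniform sampling."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        self._items = deque(maxlen=capacity)  # oldest items are evicted first
        self._rng = random.Random(seed)

    def add(self, transition) -> None:
        self._items.append(transition)

    def sample(self, batch_size: int) -> list:
        # Never ask for more items than the buffer currently holds.
        k = min(batch_size, len(self._items))
        return self._rng.sample(list(self._items), k)

    def __len__(self) -> int:
        return len(self._items)


def sync_target(online_params: dict, target_params: dict) -> None:
    """Copy the online network parameters into the target network (in place)."""
    target_params.clear()
    target_params.update(online_params)


buffer1 = ReplayBuffer(capacity=2)
for step in range(3):
    buffer1.add(Class1Transition(f"s{step}", "determine_capacity", 1.0, f"s{step + 1}"))
batch = buffer1.sample(5)

online, target = {"w": 1.0}, {"w": 0.0, "stale": 3.0}
sync_target(online, target)
print(len(buffer1), len(batch), target)
```

Keeping a separate buffer per level lets each Q function be trained from its own tuple type, while the periodic copy into the target parameters keeps the learning targets stable between updates.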

FIG. 2 is an overall view of the dialogue method of the present invention, giving a general overview of the method for completing dialogues based on hybrid tasks.

The dialogue learning strategy is formed by two types of hierarchical reinforcement learning agents. The class 2 standard strategy is associated with an internal evaluation mechanism, which can be updated iteratively in single steps, receives the dialogue actions of the class 2 standard strategy, and provides the strategy with the internal reward value r^i. The class 1 standard strategy receives the external reward value r^e fed back by the power customer and, at the same time, the dialogue state s of the power customer. The class 1 standard strategy π_{gn} can decompose subgoals deeply over multiple layers, and the number of layers can exceed 2, until the subgoals can be handled by the class 2 standard strategy π_{a,gn}. The class 2 standard strategy π_{a,gn} works in single steps, receives a subgoal, and applies the selected dialogue action to the power customer. A general overview of the method for completing dialogues based on hybrid tasks is shown in FIG. 2.
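The interaction between the two levels can be sketched as a nested loop. The Python example below is a simplified illustration under stated assumptions: both strategies choose at random instead of using learned Q functions, only one decomposition layer is shown, and the subtask names, action names and reward values are hypothetical.

```python
import random

# Hypothetical subtasks and the primitive dialogue actions each one needs.
SUBTASKS = {
    "determine_capacity": ["ask_current_capacity", "ask_target_capacity", "confirm_capacity"],
    "select_design_unit": ["list_design_units", "confirm_design_unit"],
}


def class1_policy(state, rng):
    """Upper level: pick an unfinished subtask (random stand-in for Q1)."""
    pending = [g for g in SUBTASKS if g not in state["done"]]
    return rng.choice(pending)


def class2_policy(state, subtask, rng):
    """Lower level: pick a primitive action for the given subtask (stand-in for Q2)."""
    remaining = [a for a in SUBTASKS[subtask] if a not in state["taken"]]
    return rng.choice(remaining)


def internal_critic(state, subtask):
    """Return the internal reward and whether the subtask has terminated."""
    finished = all(a in state["taken"] for a in SUBTASKS[subtask])
    return (1.0 if finished else -0.1), finished


def run_dialogue(seed=0):
    rng = random.Random(seed)
    state = {"done": set(), "taken": set()}
    trace = []
    while len(state["done"]) < len(SUBTASKS):
        g = class1_policy(state, rng)  # subtask stays fixed until it terminates
        finished = False
        while not finished:
            a = class2_policy(state, g, rng)
            state["taken"].add(a)
            r_i, finished = internal_critic(state, g)
            trace.append((g, a, r_i))
        state["done"].add(g)
    return trace


trace = run_dialogue(seed=1)
for subtask, action, reward in trace:
    print(subtask, action, reward)
```

Because the subtask is held constant in the inner loop, the dialogue never switches subtasks midway, which is the behavior the hierarchical design is meant to encourage.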

FIG. 3 is a schematic diagram of the learner for the class 1 standard dialogue strategy π_{gn}, and FIG. 4 is a schematic diagram of the learner for the class 2 standard dialogue strategy π_{a,gn}. They show the agents that learn the class 1 standard and class 2 standard levels of the hierarchical dialogue strategy, respectively.

For example, a customer applies for a capacity increase of 10 kVA to 20 kVA. The dialogue strategy treats this as a composite task. It first selects one subtask, determining the capacity, which is itself multi-layered, and takes a series of actions to collect the relevant information until all the information the customer needs has been collected and the subtask ends; the options in this process include handling 20 kVA in a single application or increasing the capacity by 10 kVA in each of two applications. It then moves to the next subtask, selecting a design company, and continues until the information for all subtasks has been collected. The dialogue strategy is realized by combining deep reinforcement learning with a hierarchical value function, and the hierarchical decomposition can break the electric power professional content down into options that can be evaluated directly and have a definite order.

An option here is a generalized notion of action that comprises a strategy π and a state-dependent termination function (attenuation coefficient) γ.

π is the strategy function in the customer service dialogue system, s is the state, a is the action, γ is the attenuation (discount) factor and k is its exponent, and r is the reward or penalty value. A slot is a well-defined attribute tracked by the agent.

The invention decomposes tasks deeply into subtasks and, compared with traditional deep reinforcement learning, greatly improves the dialogue success rate. The subtask decomposition is more thorough, semantic coherence is better, learning is faster, and convergence is better.

Embodiments of the present application may be provided as methods, systems, or computer program products. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of the disclosure, including the claims, is limited to these examples; within the idea of the invention, also features in the above embodiments or in different embodiments may be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the invention as described above, which are not provided in detail for the sake of brevity.

The embodiments of the invention are intended to embrace all such alternatives, modifications and variations that fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements and the like made without departing from the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims

1. An electric power new installation and capacity increase conversation customer service system based on layered reinforcement learning, characterized in that the system comprises: a power service understanding module, a conversation state tracker, a conversation strategy and a power service feedback module; wherein:

2. An electric power new installation and capacity increase conversation customer service method based on layered reinforcement learning, characterized in that the method comprises the following steps:

3. The electric power new installation and capacity increase conversation customer service method based on layered reinforcement learning as claimed in claim 2, characterized in that: the dialogue system extracts the text corpus of the electricity customer; because professional knowledge of electricity use is involved, the conversation strategy is decomposed into a class 1 standard strategy and a class 2 standard strategy, each with its own reward value, wherein the class 1 standard strategy corresponds to the external reward value and comprises multiple layers, the electric power professional knowledge being decomposed repeatedly until the knowledge in the corpus and the database covers all the content; the class 2 standard strategy corresponds to the internal reward value and comprises the decomposed subtasks and actions; and the two reward values are optimized separately by reinforcement learning and guide the learning of the customer service system.

4. The electric power new installation and capacity increase conversation customer service method based on layered reinforcement learning as claimed in claim 2, characterized in that: the corpus information comprises the number of dialogues, the number of success or failure markers of the dialogues, the user's power-related information, and the power-related information replied by the system.

5. The electric power new installation and capacity increase conversation customer service method based on layered reinforcement learning as claimed in claim 2, characterized in that: the slot value information is used for decomposing the goal of the power customer in the conversation into a series of slot values, and comprises the following steps:

6. The electric power new installation and capacity increase conversation customer service method based on layered reinforcement learning as claimed in claim 2, characterized in that: the multi-standard layered reinforcement learning dialogue strategy comprises: a multi-layer class 1 standard dialogue strategy $\pi_{g_n}$ and a single-layer class 2 standard dialogue strategy $\pi_{a,g_n}$.

7. The electric power new installation and capacity increase conversation customer service method based on layered reinforcement learning as claimed in claim 6, wherein: the class 1 standard strategy $\pi_{g_n}$ obtains a state s from the environment and selects a subtask g, wherein the subtask can be further decomposed and n denotes the number of decomposition layers; every executable subtask that has a reward value and a termination condition requires the use of the class 2 standard strategy $\pi_{a,g_n}$.

8. The electric power new installation and capacity increase conversation customer service method based on layered reinforcement learning as claimed in claim 6, wherein: the class 2 standard strategy takes a state s and a subtask $g_n$ as input and outputs a basic action a; the subtask $g_n$ remains a constant input to the class 2 standard strategy $\pi_{a,g_n}$ until a termination condition is reached and the subtask $g_n$ ends; an internal rating mechanism in the dialog manager provides the internal reward value $r_t^i(g_{n,t})$; this reward signal indicates whether the subtask $g_n$ is about to be completed and is also used for optimizing the class 2 standard strategy $\pi_{a,g_n}$; the state s contains global information of the conversation and tracking information of all subtasks; to optimize the class 2 standard strategy $\pi_{a,g_n}$, the expected cumulative internal reward is maximized at each step t.
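The control flow described in claims 7 and 8 can be illustrated with a minimal sketch. Everything named below is an assumption made for illustration and does not come from the claims: the subtask names, the slot names, the reward magnitudes (+1.0 on completion, -0.05 per unfinished turn), the discount factor, and the greedy stand-in for $\pi_{a,g_n}$. The sketch only shows the subtask being held fixed as an input while an internal critic decides when it terminates.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy setup: each subtask lists the slots it must fill.
SUBTASK_SLOTS: Dict[str, List[str]] = {
    "collect_applicant_info": ["customer_name", "address"],
    "collect_capacity_request": ["requested_capacity_kva"],
}

State = Dict[str, bool]  # slot name -> whether the slot has been filled


def internal_critic(state: State, subtask: str) -> Tuple[float, bool]:
    """Assumed internal reward: +1.0 once all slots of the subtask are filled, else -0.05."""
    done = all(state.get(slot, False) for slot in SUBTASK_SLOTS[subtask])
    return (1.0 if done else -0.05), done


def greedy_low_level_policy(state: State, subtask: str) -> str:
    """Stand-in for the class 2 strategy: request the first unfilled slot of the subtask."""
    for slot in SUBTASK_SLOTS[subtask]:
        if not state.get(slot, False):
            return f"request({slot})"
    return "confirm()"


def run_subtask(
    state: State,
    subtask: str,
    policy: Callable[[State, str], str],
    gamma: float = 0.95,
    max_turns: int = 10,
) -> Tuple[List[str], List[float], float]:
    """Keep the subtask fixed as policy input until the critic signals termination."""
    actions: List[str] = []
    rewards: List[float] = []
    for _ in range(max_turns):
        action = policy(state, subtask)
        actions.append(action)
        if action.startswith("request("):
            # Toy user simulator: the user always answers the requested slot.
            state[action[len("request("):-1]] = True
        reward, done = internal_critic(state, subtask)
        rewards.append(reward)
        if done:
            break
    # Discounted sum of internal rewards collected while the subtask ran.
    discounted = sum(gamma ** k * r for k, r in enumerate(rewards))
    return actions, rewards, discounted
```

Calling `run_subtask({}, "collect_applicant_info", greedy_low_level_policy)` runs one subtask from an empty state; in a learned system the greedy stand-in would be replaced by the trained class 2 strategy.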

9. The electric power new installation and capacity increase conversation customer service method based on layered reinforcement learning as claimed in claim 2, characterized in that: the class 1 standard strategy $\pi_{g_n}$ and the class 2 standard strategy $\pi_{a,g_n}$ are learned using a deep Q-learning method; wherein the optimized Q function of the class 1 standard dialogue strategy needs to satisfy:

the optimized Q function of the class 2 standard dialogue strategy $\pi_{a,g_n}$ satisfies:
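For orientation only, the conditions usually written for a two-level deep Q-learning scheme of this kind take the following form. This is an assumed reconstruction based on the common hierarchical Q-learning formulation, not the formulas of the claim, which are not reproduced in this text. Here $Q_1$ scores subtasks, $Q_2$ scores basic actions within a subtask, $\gamma$ is the discount factor, and $N$ (a symbol introduced here to avoid a clash with the layer index n) is the number of steps the selected subtask takes to terminate.

```latex
Q_1^{*}(s, g_n) = \mathbb{E}\Big[\sum_{k=0}^{N-1} \gamma^{k} r^{e}_{t+k}
    + \gamma^{N} \max_{g'} Q_1^{*}(s_{t+N}, g') \,\Big|\, s_t = s,\ g_{n,t} = g_n \Big]

Q_2^{*}(s, a, g_n) = \mathbb{E}\Big[ r^{i}_{t}
    + \gamma \max_{a'} Q_2^{*}(s_{t+1}, a', g_n) \,\Big|\, s_t = s,\ a_t = a,\ g_{n,t} = g_n \Big]
```

In this form the class 1 function accumulates external reward over the whole duration of a subtask, while the class 2 function is an ordinary one-step recursion on the internal reward with the subtask held fixed.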

10. The electric power new installation and capacity increase conversation customer service method based on layered reinforcement learning as claimed in claim 1, characterized in that: to optimize the performance of the dialogue system, a loss function of the training network is defined that increases the probability of actions with positive reward values and decreases the probability of actions with negative reward values;

wherein

In the above formula, $r^e = \sum_{k} \gamma^{k} r^{e}_{t+k}$ is the discounted sum of rewards when the sub-target $g_n$ completes; n represents the number of steps at completion;

the loss function minimized for the class 2 standard dialogue strategy is:

wherein

in the formula:

represents the gradient descent function of the $Q_1$ function;

the class 2 standard dialog strategy is:

in the formula:

represents the gradient descent function of the $Q_2$ function;
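As a hedged illustration of the kind of loss and gradient step referred to in claim 10, the sketch below uses a squared temporal-difference error, which is the usual choice in deep Q-learning; the claim's own formulas are not reproduced in this text, so this form is an assumption. A dictionary stands in for the Q network, and all function names, sample layouts and numeric values are hypothetical. The same functions serve both levels: the class 1 strategy uses the discounted external reward $r^e$ with a multi-step discount, and the class 2 strategy uses the internal reward with a single step.

```python
from typing import Dict, List, Tuple

# Tabular stand-in for a Q network: (state, choice) -> value. Hypothetical simplification.
QTable = Dict[Tuple[str, str], float]
# One replay sample: (state, choice, reward, next_state, steps, terminal).
# "choice" is a subtask for the class 1 strategy or a basic action for the class 2 strategy.
Sample = Tuple[str, str, float, str, int, bool]


def discounted_external_reward(rewards: List[float], gamma: float) -> float:
    """r^e = sum_k gamma^k * r^e_{t+k} over the steps the subtask took."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


def td_target(q: QTable, sample: Sample, choices: List[str], gamma: float) -> float:
    """reward + gamma^steps * max_c Q(next_state, c); steps is 1 for the class 2 strategy."""
    _, _, reward, next_state, steps, terminal = sample
    if terminal:
        return reward
    best_next = max(q.get((next_state, c), 0.0) for c in choices)
    return reward + gamma ** steps * best_next


def squared_td_loss(q: QTable, batch: List[Sample], choices: List[str], gamma: float) -> float:
    """Mean of 0.5 * (target - Q(s, c))^2 over the batch."""
    errors = [td_target(q, s, choices, gamma) - q.get((s[0], s[1]), 0.0) for s in batch]
    return sum(0.5 * e ** 2 for e in errors) / len(batch)


def gradient_step(q: QTable, batch: List[Sample], choices: List[str],
                  gamma: float, lr: float) -> None:
    """Per-sample descent step with targets held fixed: Q(s, c) += lr * (target - Q(s, c))."""
    targets = [td_target(q, s, choices, gamma) for s in batch]  # computed before any update
    for sample, y in zip(batch, targets):
        key = (sample[0], sample[1])
        old = q.get(key, 0.0)
        q[key] = old + lr * (y - old)
```

A positive error raises the value of the chosen subtask or action and a negative error lowers it, which is how a value-based learner makes well-rewarded choices more likely under a greedy or epsilon-greedy selection rule.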