Detailed Description of the Embodiments
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a method for optimizing an online dialogue state tracking model according to an embodiment of the present invention. The method includes the following steps:
S11: pre-training an auxiliary dialogue state tracking model with a reinforcement learning algorithm to determine a teacher model, wherein the auxiliary dialogue state tracking model includes a statistics-based dialogue state tracking model and a rule-based dialogue state tracking model, and the statistics-based dialogue state tracking model includes the online dialogue state tracking model;
S12: extracting semantic features of a sentence input by a user, determining a first belief state (confidence state) of the semantic features according to the teacher model, and determining a second belief state of the semantic features according to the online dialogue state tracking model;
S13: determining, from the difference between the first belief state and the second belief state, the gap between the search spaces of the teacher model and the online dialogue state tracking model, and from it the benchmark score of a positive reward;
S14: determining a feedback dialogue for the user's input sentence according to the online belief state, and using the speech duration of the feedback dialogue to determine the cost score of a negative reward, wherein the speech duration of the dialogue is directly proportional to the cost score;
S15: optimizing the online dialogue state tracking model with the reinforcement learning algorithm, based on the semantic features together with the benchmark score and the cost score, so as to optimize the search space of the online dialogue state tracking model and the speech duration of the feedback dialogue.
In this embodiment, the output of the dialogue state tracking model is a belief state. In other words, when a reinforcement learning method is used to optimize the dialogue state tracking model online, the action space is the entire dialogue state space. The reward signal of the dialogue policy module alone is not enough to learn the tracking policy directly. The online dialogue state tracking model is deployed on an electronic device for users and receives the speech signals input by users in real time.
For step S11, the auxiliary dialogue state tracking model is pre-trained with a reinforcement learning algorithm to determine the teacher model. Inspired by companion teaching, this implementation adds an auxiliary dialogue state tracking model that serves as the teacher model. This model can be any form of dialogue state tracking model, rule-based or statistical, whereas the online dialogue state tracking model being optimized is represented by a fully connected neural network. The auxiliary model also sends a reward signal to the online tracking policy: dialogue states that lie far from those of the auxiliary model are penalized, which narrows the search space of the online dialogue state tracking module being optimized. Because the teacher model can be any form of dialogue state tracking model, it can be obtained by training either a statistics-based or a rule-based dialogue state tracking model.
For step S12, after the user's speech signal is received, the semantic features of the sentence in the signal are extracted. Based on these semantic features, the teacher model determined in step S11 produces the auxiliary belief state, which serves as the first belief state, and the online dialogue state tracking model produces the online belief state, which serves as the second belief state. For example, the auxiliary belief state used as the first belief state is denoted $b^a_t$, and the online belief state used as the second belief state is denoted $b^e_t$.
For step S13, the difference between the first belief state and the second belief state determined in step S12 gives the gap between the search spaces of the teacher model and the online dialogue state tracking model. This gap determines the benchmark score, which serves as the positive reward parameter for optimizing the online dialogue state tracking model.
For step S14, the speech duration of the feedback dialogue for the user's input sentence is determined according to the online belief state, and from it the cost score of the negative reward. After the speech signal input by the user is received, the online dialogue state tracking model determines the belief state, which consists of the candidate feedback dialogues and the confidence of each. The feedback dialogue with the best confidence is then selected. Because feedback dialogues differ in length, their speech durations differ as well. Feedback dialogues of different durations may all answer the user's question, but in terms of time cost, the shorter the feedback dialogue, the less time it consumes. The speech duration of the feedback dialogue is therefore used to determine the cost score, which serves as the negative reward parameter for optimizing the online dialogue state tracking model.
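The proportional relationship between speech duration and cost score in step S14 can be sketched as follows. This is only an illustration: the function names and the proportionality constant `cost_per_second` are assumptions, since the text states only that the cost score is directly proportional to the speech duration.

```python
def cost_score(speech_duration_s: float, cost_per_second: float = 0.1) -> float:
    """Cost score directly proportional to the speech duration of the feedback.

    cost_per_second is a hypothetical proportionality constant; the source
    does not give a value.
    """
    if speech_duration_s < 0:
        raise ValueError("speech duration must be non-negative")
    return cost_per_second * speech_duration_s


def negative_reward(speech_duration_s: float, cost_per_second: float = 0.1) -> float:
    """The cost enters the reward with a negative sign (assumed convention)."""
    return -cost_score(speech_duration_s, cost_per_second)
```

Under these assumptions a feedback dialogue that takes twice as long incurs twice the cost, so the optimizer is pushed toward shorter feedback.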
For step S15, the online dialogue state tracking model is optimized with the reinforcement learning algorithm, using the semantic features determined in step S12 together with the benchmark score determined in step S13 and the cost score determined in step S14. This optimizes both the search space of the online dialogue state tracking model and the speech duration of the feedback dialogue.
As this implementation shows, an auxiliary dialogue state tracking model is added as the teacher model, and the teacher model sends a reward signal to the online dialogue state tracking model. Dialogue states of the online model that are far from the teacher model are penalized, and the speech duration of the feedback dialogue is taken into account during optimization. This narrows the search space of the online dialogue state tracking model and thereby improves the dialogue policy of the online dialogue state tracking module.
As an implementation, in this embodiment, determining the gap between the search spaces of the teacher model and the online dialogue state tracking model from the difference between the first belief state and the second belief state, and from it the benchmark score of the positive reward, includes:
when the absolute value of the difference between the first belief state and the second belief state does not exceed a preset threshold, setting the benchmark score to 0;
when the absolute value of the difference between the first belief state and the second belief state exceeds the preset threshold, taking the negative of the absolute value of the difference as the benchmark score.
In this embodiment, the gap between the search spaces of the teacher model and the online dialogue state tracking model is determined from the difference between the first belief state and the second belief state, which gives the benchmark score $r_{bs}$ of the positive reward.
When the absolute value of the difference satisfies $\|b^e_t - b^a_t\| \le \epsilon$, where $\epsilon$ is the preset threshold, the benchmark score of the positive reward is $r_{bs} = 0$.
When the absolute value of the difference satisfies $\|b^e_t - b^a_t\| > \epsilon$, the benchmark score of the positive reward is $r_{bs} = -\|b^e_t - b^a_t\|$.
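The thresholded benchmark score can be written as a short function. This is a minimal sketch: the source does not say which norm is meant, so the Euclidean norm is assumed here, and the name `benchmark_score` and the list representation of belief states are illustrative choices.

```python
import math
from typing import Sequence


def benchmark_score(b_online: Sequence[float],
                    b_teacher: Sequence[float],
                    epsilon: float) -> float:
    """Return r_bs: 0 within the threshold, otherwise minus the distance.

    The Euclidean norm is an assumption; the source only writes ||.||.
    """
    if len(b_online) != len(b_teacher):
        raise ValueError("belief states must have the same dimension")
    distance = math.sqrt(sum((e - a) ** 2 for e, a in zip(b_online, b_teacher)))
    return 0.0 if distance <= epsilon else -distance
```

The score is never positive: belief states inside the threshold are left alone, and those outside it are penalized by how far they stray from the teacher.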
As the above implementation shows, the added auxiliary dialogue state tracking model acts as the teacher model, and dialogue states of the online tracking model that lie far from the teacher model are penalized. A concrete penalty scheme is given, which narrows the search space of the online dialogue state tracking module.
As an implementation, in this embodiment, the method further includes:
collecting the user's evaluation result for the feedback dialogue;
determining the assessment score of a positive reward according to the evaluation result;
optimizing the online dialogue state tracking model with the reinforcement learning algorithm, based on the semantic features together with the assessment score, the benchmark score, and the cost score, so as to optimize the search space of the online dialogue state tracking model, the speech duration of the feedback dialogue, and the feedback quality.
In this embodiment, after the electronic device carrying the online dialogue state tracking model responds to the sentence input by the user, the user's evaluation result for the feedback dialogue is collected. The means of evaluation can be provided by the online dialogue state tracking model itself. For example, after giving the feedback dialogue, the model presents the user with an evaluation box containing preset options such as "Great!", "Satisfied", "Average", and "Irrelevant answer". Once the user has chosen an option, the evaluation result for the feedback dialogue is collected.
The assessment score is determined according to the evaluation result. For example, when the evaluation result is "Great!", the assessment score can be relatively high; when the evaluation result is "Average", the assessment score can be relatively low.
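One way to turn the preset evaluation options into an assessment score is a lookup table. The sketch below is purely illustrative: the option labels are English renderings of the preset options, and the numeric values are hypothetical, since the text fixes only their relative order (a "Great!" rating scores higher than an "Average" one).

```python
# Hypothetical score for each preset evaluation option. Only the ordering
# (better rating, higher score) is taken from the description.
ASSESSMENT_SCORES = {
    "Great!": 1.0,
    "Satisfied": 0.7,
    "Average": 0.3,
    "Irrelevant answer": 0.0,
}


def assessment_score(evaluation: str) -> float:
    """Map a user's evaluation option to the assessment score."""
    try:
        return ASSESSMENT_SCORES[evaluation]
    except KeyError:
        raise ValueError(f"unknown evaluation option: {evaluation!r}") from None
```

A table keeps the mapping easy to retune if the set of options or their weights changes.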
The online dialogue state tracking model is then optimized with the reinforcement learning algorithm, using the determined semantic features together with the assessment score, the benchmark score, and the cost score. This optimizes the search space of the online dialogue state tracking model as well as the speech duration and quality of the feedback dialogue.
As the above implementation shows, this embodiment constrains the online dialogue state tracking model with an additional parameter. The assessment score is determined from the user's evaluation result and indicates whether the feedback dialogue has reached the user's goal. This further optimizes the search space of the online dialogue state tracking model and improves the dialogue policy of the online dialogue state tracking module.
As an implementation, in this embodiment, the reinforcement learning algorithm includes the deep deterministic policy gradient algorithm and/or the deep Q-network algorithm.
In this embodiment, because the dialogue state is continuous, DDPG (Deep Deterministic Policy Gradient) is used to optimize the network parameters of the tracking policy of the online dialogue state model, which limits the gradient of the penalty over the state space. After the online dialogue state tracking module converges, the dialogue policy is then optimized jointly. With DQN (Deep Q-Network), the deep neural network produces effective uncertainty estimates, scales to large parallel systems, and orders information across multiple time steps to preserve its diversity, with low computational cost, high learning efficiency, and good performance.
As this embodiment shows, optimizing the online dialogue state tracking model with a specific reinforcement learning algorithm can further limit the search space of the online dialogue state tracking model.
The overall effect of the scheme is described below. Discriminative machine learning methods are the state of the art in DST (Dialogue State Tracking), but they have several limitations. First, they are SL (Supervised Learning) methods and require a large amount of annotated offline data. This is expensive and also makes online learning infeasible. Second, given limited labeled data, SL methods are prone to overfitting, which leads to poor generalization. Third, because SL-based DST methods are independent of the dialogue policy, the DST module cannot adapt dynamically to the user's habits. These limitations prevent the DST module from being updated online. To solve this problem, a DRL (Deep Reinforcement Learning) framework is used to optimize DST through online interaction.
RL (Reinforcement Learning) is widely used to update the dialogue policy module in task-oriented dialogue systems. However, apart from a few approaches that learn DST and the policy jointly, RL has not yet been applied specifically to the DST module. Under the RL framework, DST is treated as an agent, called the tracking agent, and the rest of the dialogue system is treated as the environment. A reinforcement learning framework dedicated to online DST optimization is thus obtained.
Unlike the policy agent, the tracking agent makes decisions (belief states) that are continuous. DST is therefore treated as a continuous control problem. Because the belief state is both continuous and high-dimensional, applying existing RL algorithms directly does not work well.
Here, a new DST framework is built on the idea of companion teaching. An auxiliary, well-trained dialogue state tracker, such as a traditional tracker trained offline, is used as a teacher to guide the optimization of the actual DST agent, which avoids overfitting and gives stable and fast convergence. As shown in Fig. 2, $b^a_t$ is the auxiliary belief state produced by the auxiliary DST model, and $b^e_t$ is the exploration belief state produced by the tracking agent. The difference between $b^a_t$ and $b^e_t$ is fed into the reward signal, which substantially reduces the search space of the tracking agent. First, the modular structure of this framework allows more flexible and interpretable dialogue management models to be used. For example, an interpretable dialogue policy (a rule-based policy) can easily be used together with any DST model, and this flexibility is very useful in practice. Second, because a teacher DST model is used, optimizing the tracking agent needs very little dialogue data and training is more stable.
To avoid confusion with the concepts of the policy agent, the state and action of the tracking agent are referred to here as its input and output, respectively. Only the semantic-level dialogue manager is considered in this work. The input is therefore the semantic features of each slot, extracted from the system action, the SLU (Spoken Language Understanding) output, and the context of the previous turn. The output of the tracking agent is the belief state of the corresponding slot in the current turn. In contrast to the system action of the policy agent, the output of the tracking agent, that is, the belief state, is continuous. In Fig. 2, the input of the tracking agent is denoted $S_t$ and its output is denoted $b^e_t$.
The tracking policy is the mapping function between $S_t$ and $b^e_t$ and aims to maximize the expected cumulative reward. Because the search space of the tracking agent is continuous, as in robot control problems, a deterministic reinforcement learning algorithm (such as the DDPG algorithm) is used to optimize the tracking policy.
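A deterministic tracking policy of this kind can be pictured as a small fully connected network that maps a slot feature vector $S_t$ to a belief state $b^e_t$. The sketch below is an assumption-laden illustration in plain Python: the class name `TrackingPolicy`, the single hidden layer, the ReLU activation, the softmax output, and the layer sizes are all hypothetical, and no training step is shown.

```python
import math
import random
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax, so the output sums to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


class TrackingPolicy:
    """Hypothetical one-hidden-layer fully connected tracking policy."""

    def __init__(self, n_in: int, n_hidden: int, n_out: int, seed: int = 0):
        rng = random.Random(seed)
        # Small random weights; biases start at zero.
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
                   for _ in range(n_out)]
        self.b2 = [0.0] * n_out

    def act(self, s_t: Sequence[float]) -> List[float]:
        """Deterministically map slot features S_t to a belief state b^e_t."""
        hidden = [max(0.0, sum(w * x for w, x in zip(row, s_t)) + b)
                  for row, b in zip(self.w1, self.b1)]
        logits = [sum(w * h for w, h in zip(row, hidden)) + b
                  for row, b in zip(self.w2, self.b2)]
        return softmax(logits)
```

Because the mapping is deterministic, the same $S_t$ always yields the same $b^e_t$; in DDPG, exploration is typically obtained by adding noise to this output during training.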
In the cumulative reward above, the dialogue system reward is usually defined as a combination of a turn penalty and a success reward. These two reward signals are enough to optimize the policy agent effectively. For the tracking agent, however, the large search space caused by its continuous output means that the two signals are not enough for fast and robust convergence. To solve this problem, a basic score reward signal is also provided to constrain the search space of the tracking agent. The overall reward of the tracking agent therefore includes three kinds of signal:
(1) The turn penalty, denoted $r_{tp}$, is a negative constant that penalizes long dialogues. It reflects a preference for short dialogues.
(2) The success reward, denoted $r_{sr}$, is a delayed reward for the whole dialogue that is given at the last turn. When the conversation between the user and the machine ends, the user gives an evaluation that judges the performance of the dialogue system. If the dialogue as a whole does not reach the user's goal, the success reward is 0; otherwise it is a positive value.
(3) The basic score, denoted $r_{bs}$, is used to reduce the search space of the tracking agent. An auxiliary teacher DST is used, and its auxiliary belief state $b^a_t$ guides the exploration of the tracking agent. If the exploration belief state $b^e_t$ is far from the auxiliary belief state and the distance exceeds the threshold, the basic score applies a penalty according to the formula:
$r_{bs} = -\|b^e_t - b^a_t\|$.
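The three signals can be combined into a per-turn reward as sketched below. The additive combination, the function name `turn_reward`, and the convention of adding the success reward only at the last turn are assumptions made for illustration; the text lists the three signals but does not spell out how they are combined.

```python
def turn_reward(r_tp: float,
                r_bs: float,
                r_sr: float = 0.0,
                is_last_turn: bool = False) -> float:
    """Combine turn penalty, basic score, and (delayed) success reward.

    r_tp: negative constant applied at every turn.
    r_bs: basic score for this turn (0 or a negative distance).
    r_sr: success reward, added only at the last turn (assumed convention).
    """
    reward = r_tp + r_bs
    if is_last_turn:
        reward += r_sr
    return reward
```

For instance, with a turn penalty of -1.0 and a basic score of -0.5, an intermediate turn yields -1.5, and a final successful turn with a success reward of 20.0 yields 18.5.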
In the companion-teaching RL-DST framework, the auxiliary DST can be any well-trained DST model, and the tracking agent can be optimized with any deterministic reinforcement learning algorithm. The dialogue task and a concrete algorithm implementation are described here.
The proposed framework is evaluated on task-oriented dialogue systems in a given domain. These are slot-based dialogue systems with three slot types: goal constraints, requested slots, and the search method. Goal constraints are the constraints on the domain information the user is looking for. The search method describes how the user is trying to interact with the system. Requested slots are the requests issued by the user. Only goal constraints are considered here; the extension to the search method and requested slots is straightforward. A goal tracking agent is therefore used in place of the polynomial goal tracker, while the search method and requested slots are still tracked by polynomial trackers. The final overall output combines the output of the goal tracking agent with the outputs of the other two polynomial trackers.
Auxiliary polynomial tracker: a polynomial tracker is used as the auxiliary DST. It is also called CMBP (Constrained Markov Bayesian Polynomial) and is a hybrid model that combines data-driven and rule-based models. CMBP has few parameters and generalizes well. In CMBP, the belief state of the current turn is assumed to depend on the observation of the current turn and the belief state of the previous turn.
The three slot types (goal, request, method) in a given domain do not affect one another. Therefore, only the goal-constraint part of the DST tracking agent is considered for the task in that domain, and the goal tracking agent takes the form of a deep neural network rather than a polynomial.
To optimize the goal tracking agent, whose output space is continuous and high-dimensional, the DDPG (Deep Deterministic Policy Gradient) algorithm is used here. It is an actor-critic algorithm based on the deterministic policy gradient, and it combines the actor-critic method with the replay buffer and soft-update strategy of the DQN (Deep Q-Network) algorithm.
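The soft update mentioned here moves each target-network parameter a small step toward the corresponding online-network parameter, $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$. The sketch below shows that rule on flat parameter lists; the function name, the list representation, and the default value of `tau` are illustrative assumptions.

```python
from typing import List, Sequence


def soft_update(target_params: Sequence[float],
                online_params: Sequence[float],
                tau: float = 0.001) -> List[float]:
    """Return new target parameters: tau * online + (1 - tau) * target."""
    if len(target_params) != len(online_params):
        raise ValueError("parameter lists must have the same length")
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    return [tau * w + (1.0 - tau) * w_target
            for w_target, w in zip(target_params, online_params)]
```

A small `tau` makes the target networks change slowly, which is what keeps the critic's learning targets stable.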
During training, the goal tracking agent maintains an experience memory. The data in the memory has the format $(S_t, b^e_t, r_t)$, where $S_t$ is the slot feature vector and $b^e_t$ is the exploration belief state of the corresponding slot. The immediate reward $r_t$ is produced by the reward function $R(S_t, b^e_t, b^a_t)$, which provides a partial reward at every turn.
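An experience memory holding $(S_t, b^e_t, r_t)$ tuples can be sketched with a bounded queue. The class and field names below are hypothetical, and the tuple follows the three-field format given in the text; a full DDPG implementation would usually also store the next input for bootstrapping, which is not shown here.

```python
import random
from collections import deque, namedtuple

# One stored experience: slot features, exploration belief state, reward.
Experience = namedtuple("Experience", ["s_t", "b_e_t", "r_t"])


class ExperienceMemory:
    """Fixed-capacity memory; the oldest experiences are dropped first."""

    def __init__(self, capacity, seed=None):
        self._buffer = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def store(self, s_t, b_e_t, r_t):
        """Append one experience, evicting the oldest if the memory is full."""
        self._buffer.append(Experience(s_t, b_e_t, r_t))

    def sample(self, batch_size):
        """Draw a random mini-batch without replacement."""
        size = min(batch_size, len(self._buffer))
        return self._rng.sample(list(self._buffer), size)

    def __len__(self):
        return len(self._buffer)
```

Sampling mini-batches at random from the memory breaks the correlation between consecutive turns of the same dialogue.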
While the tracking agent is learning, the dialogue policy is fixed and the tracker keeps changing. Because DST is part of the environment of the dialogue policy agent, that environment also changes when the tracking agent is optimized. The policy can therefore be optimized further to improve the performance of the dialogue system.
As shown in Fig. 3, for each of the three slot types (goal, request, method) and for the combination of all three, optimization with the DDPG algorithm is followed by further optimization with the DQN algorithm. Each of these configurations shows a clear improvement, which in turn further improves the return of dialogue management. Here:
TA_G is a DST tracking agent that estimates only the belief state of the goal constraints; the other two parts of the belief state are produced by polynomial trackers.
TA_R is a DST tracking agent that estimates only the belief state of the requested slots; the other two parts of the belief state are produced by polynomial trackers.
TA_M is a DST tracking agent that estimates only the belief state of the search method; the other two parts of the belief state are produced by polynomial trackers.
TA_ALL is a DST tracking agent in which the entire belief state is produced directly by the three tracking agents above.
A schematic structural diagram of a system for optimizing an online dialogue state tracking model according to an embodiment of the present invention is also provided. The technical solution of this embodiment is applicable to devices that run the method for optimizing the online dialogue state tracking model. The system can perform the optimization method described in any of the above embodiments and is configured in a terminal.
The system for optimizing an online dialogue state tracking model provided in this embodiment includes: a teacher model determination program module 11, a belief state determination program module 12, a benchmark score determination program module 13, a cost score determination program module 14, and an optimization program module 15.
The teacher model determination program module 11 is used to pre-train an auxiliary dialogue state tracking model with a reinforcement learning algorithm to determine a teacher model, wherein the auxiliary dialogue state tracking model includes a statistics-based dialogue state tracking model and a rule-based dialogue state tracking model, and the statistics-based dialogue state tracking model includes the online dialogue state tracking model. The belief state determination program module 12 is used to extract semantic features of a sentence input by a user, determine a first belief state of the semantic features according to the teacher model, and determine a second belief state of the semantic features according to the online dialogue state tracking model. The benchmark score determination program module 13 is used to determine, from the difference between the first belief state and the second belief state, the gap between the search spaces of the teacher model and the online dialogue state tracking model, and from it the benchmark score of a positive reward. The cost score determination program module 14 is used to determine a feedback dialogue for the user's input sentence according to the online belief state, and to determine the cost score of a negative reward from the speech duration of the feedback dialogue, wherein the speech duration of the dialogue is directly proportional to the cost score. The optimization program module 15 is used to optimize the online dialogue state tracking model with the reinforcement learning algorithm, based on the semantic features together with the benchmark score and the cost score, so as to optimize the search space of the online dialogue state tracking model and the speech duration of the feedback dialogue.
Further, the benchmark score determination program module is used to:
set the benchmark score to 0 when the absolute value of the difference between the first belief state and the second belief state does not exceed a preset threshold;
take the negative of the absolute value of the difference as the benchmark score when the absolute value of the difference between the first belief state and the second belief state exceeds the preset threshold.
Further, the system also includes:
an assessment score determination program module, used to collect the user's evaluation result for the feedback dialogue and to determine the assessment score of a positive reward according to the evaluation result;
an optimization program module, used to optimize the online dialogue state tracking model with the reinforcement learning algorithm, based on the semantic features together with the assessment score, the benchmark score, and the cost score, so as to optimize the search space of the online dialogue state tracking model, the speech duration of the feedback dialogue, and the feedback quality.
Further, the reinforcement learning algorithm includes the deep deterministic policy gradient algorithm and/or the deep Q-network algorithm.
An embodiment of the present invention further provides a non-volatile computer storage medium storing computer-executable instructions that can perform the method for optimizing the online dialogue state tracking model in any of the above method embodiments.
As an implementation, the non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
pre-train an auxiliary dialogue state tracking model with a reinforcement learning algorithm to determine a teacher model, wherein the auxiliary dialogue state tracking model includes a statistics-based dialogue state tracking model and a rule-based dialogue state tracking model, and the statistics-based dialogue state tracking model includes the online dialogue state tracking model;
extract semantic features of a sentence input by a user, determine a first belief state of the semantic features according to the teacher model, and determine a second belief state of the semantic features according to the online dialogue state tracking model;
determine, from the difference between the first belief state and the second belief state, the gap between the search spaces of the teacher model and the online dialogue state tracking model, and from it the benchmark score of a positive reward;
determine a feedback dialogue for the user's input sentence according to the online belief state, and determine the cost score of a negative reward from the speech duration of the feedback dialogue, wherein the speech duration of the dialogue is directly proportional to the cost score;
optimize the online dialogue state tracking model with the reinforcement learning algorithm, based on the semantic features together with the benchmark score and the cost score, so as to optimize the search space of the online dialogue state tracking model and the speech duration of the feedback dialogue.
As a non-volatile computer-readable storage medium, it can store non-volatile software programs and non-volatile computer-executable programs and modules, such as the program instructions/modules corresponding to the optimization method in the embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the method for optimizing the online dialogue state tracking model in any of the above method embodiments.
The non-volatile computer-readable storage medium may include a program storage area and a data storage area. The program storage area can store an operating system and the application programs required by at least one function; the data storage area can store data created through use of the device, and so on. In addition, the non-volatile computer-readable storage medium may include high-speed random access memory and may also include non-volatile memory, for example at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. In some embodiments, the non-volatile computer-readable storage medium optionally includes memory located remotely from the processor, and such remote memory can be connected to the device over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present invention further provides an electronic device, including at least one processor and a memory communicatively connected to the at least one processor. The memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that it can perform the steps of the method for optimizing the online dialogue state tracking model of any embodiment of the present invention.
The client of the embodiments of the present application exists in a variety of forms, including but not limited to:
(1) Mobile communication devices: these devices have mobile communication functions, and their main purpose is to provide voice and data communication. Such terminals include smartphones (for example, iPhone), multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices: these devices belong to the category of personal computers, have computing and processing functions, and generally also have mobile Internet access. Such terminals include PDA, MID, and UMPC devices, for example iPad.
(3) Portable entertainment devices: these devices can display and play multimedia content. Such devices include audio and video players (for example, iPod), handheld game consoles, e-book readers, smart toys, and portable in-vehicle navigation devices.
(4) Other electronic devices with voice functions.
Herein, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include" and "comprise" cover not only the listed elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes the element.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement it without creative effort.
From the above description of the embodiments, a person skilled in the art can clearly understand that each embodiment can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware. Based on this understanding, the above technical solution, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments or in certain parts of the embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.