CN108962221A

CN108962221A - Optimization method and system for an online dialogue state tracking model

Info

Publication number: CN108962221A
Application number: CN201810763146.XA
Authority: CN
Inventors: 俞凯; 陈志�
Original assignee: Shanghai Jiaotong University; AI Speech Ltd
Current assignee: Sipic Technology Co Ltd
Priority date: 2018-07-12
Filing date: 2018-07-12
Publication date: 2018-12-07
Anticipated expiration: 2038-07-12
Also published as: CN108962221B

Abstract

An embodiment of the present invention provides a method for optimizing an online dialogue state tracking model. The method comprises: pre-training an auxiliary dialogue state tracking model to determine a teacher model; extracting the semantic features of a user's input utterance, determining a first confidence state according to the teacher model, and determining a second confidence state according to the online dialogue state tracking model; determining the gap between the search spaces and, from it, a benchmark score; determining, according to the online confidence state, a feedback dialogue for the user's input utterance and determining a cost score from the speech duration of that feedback dialogue; and, based on the semantic features together with the benchmark score and the cost score, optimizing the search space of the online dialogue state tracking model and the speech duration of the feedback dialogue by a reinforcement learning algorithm. An embodiment of the present invention also provides a system for optimizing an online dialogue state tracking model. By adding a teacher model, the embodiments reduce the search space of the online dialogue state tracking model and thereby improve the dialogue policy of the online dialogue state tracking module.

Description

Optimization method and system for an online dialogue state tracking model

Technical field

The present invention relates to the field of intelligent voice dialogue, and in particular to a method and system for optimizing an online dialogue state tracking model.

Background

A spoken dialogue system usually consists of an input module, a control module, and an output module. The input module, composed of ASR (Automatic Speech Recognition) and SLU (Spoken Language Understanding), extracts semantic-level dialogue acts from the user's speech signal. The control module has two tasks. One is to maintain the dialogue state, which is the machine's encoding of its understanding of the dialogue; once information arrives from the input module, the dialogue state is updated by DST (Dialogue State Tracking). The other is to select a semantic-level machine dialogue act with which to respond to the user, which is the dialogue decision policy. The output module, composed of NLG (Natural Language Generation) and TTS (Text To Speech) synthesis, converts the generated natural language into speech.

Because the DSTC (Dialog State Tracking Challenge) provides labeled dialogue state tracking data together with a common evaluation framework and test platform, a variety of machine learning methods for DST have been proposed.

In the course of realizing the present invention, the inventors found at least the following problems in the related art:

These methods depend strictly on labeled offline data. Because the labeled data is offline, the learning process of these supervised learning methods is independent of the dialogue policy module. Because labels are not available online, these methods are not easily adapted to updating the DST online.

In some methods based on NABC (Natural Actor and Belief Critic), the dialogue state tracking model is represented by a Bayesian network, so the Bayesian-network-based dialogue state tracking model has to be redesigned for every different dialogue task. As the complexity of the dialogue task grows, the number of parameters of the Bayesian network increases significantly, and representing the parameter distribution with a probabilistic model does not necessarily achieve the expected optimization effect. This makes such methods unsuitable for complex dialogue tasks and limits the scalability of the model.

An end-to-end dialogue management system omits the SLU module of the dialogue system, obtains the dialogue state directly from the text of the human-machine dialogue, and selects the machine act with which to reply to the user. It is difficult to add prior knowledge to such an end-to-end dialogue system: a large amount of labeled data is needed to learn that prior knowledge, and flexibility is greatly reduced compared with a modular dialogue system. Because the dialogue state module and the dialogue policy module are connected directly by a neural network, optimization with deep reinforcement learning can be very unstable. The need for a large labeled corpus to train the model does not fit the requirements of practical dialogue systems. Moreover, for special cases that arise in real dialogues, prior knowledge cannot be added manually; the corresponding behavior can only be obtained by training on a large corpus, which makes the training process unstable and the system inflexible.

Summary of the invention

The present invention aims at least to solve the following problems of the prior art in training an online dialogue state tracking model: a complex network has to be designed, the model is unsuitable for complex dialogue tasks, the model scales poorly, a large labeled corpus is needed for training, prior knowledge is difficult to add, the training process is unstable, and flexibility is poor. The applicant unexpectedly found that, because the output of the dialogue state tracking module is a confidence state, the action space is the entire dialogue state space when the online dialogue state tracking module is optimized by reinforcement learning. Inspired by companion teaching, an auxiliary dialogue state tracking module is added as a teacher model that issues reward signals and penalties to the online tracking policy, thereby reducing the search space of the online dialogue state tracking system and solving the above problems.

In a first aspect, an embodiment of the present invention provides a method for optimizing an online dialogue state tracking model, comprising:

pre-training an auxiliary dialogue state tracking model by a reinforcement learning algorithm to determine a teacher model, wherein the auxiliary dialogue state tracking model includes a statistical dialogue state tracking model and a rule-based dialogue state tracking model, and the statistical dialogue state tracking model includes an online dialogue state tracking model;

extracting the semantic features of a user's input utterance, determining a first confidence state of the semantic features according to the teacher model, and determining a second confidence state of the semantic features according to the online dialogue state tracking model;

determining, from the difference between the first confidence state and the second confidence state, the gap between the search spaces of the teacher model and the online dialogue state tracking model, and from it the benchmark score of the positive reward;

determining a feedback dialogue for the user's input utterance according to the online confidence state, and determining the cost score of the negative reward from the speech duration of the feedback dialogue, wherein the speech duration of the dialogue is directly proportional to the cost score;

optimizing the online dialogue state tracking model by a reinforcement learning algorithm, based on the semantic features together with the benchmark score and the cost score, so as to optimize the search space of the online dialogue state tracking model and the speech duration of the feedback dialogue.

In a second aspect, an embodiment of the present invention provides a system for optimizing an online dialogue state tracking model, comprising:

a teacher model determination program module, configured to pre-train an auxiliary dialogue state tracking model by a reinforcement learning algorithm to determine a teacher model, wherein the auxiliary dialogue state tracking model includes a statistical dialogue state tracking model and a rule-based dialogue state tracking model, and the statistical dialogue state tracking model includes an online dialogue state tracking model;

a confidence state determination program module, configured to extract the semantic features of a user's input utterance, determine a first confidence state of the semantic features according to the teacher model, and determine a second confidence state of the semantic features according to the online dialogue state tracking model;

a benchmark score determination program module, configured to determine, from the difference between the first confidence state and the second confidence state, the gap between the search spaces of the teacher model and the online dialogue state tracking model, and from it the benchmark score of the positive reward;

a cost score determination program module, configured to determine a feedback dialogue for the user's input utterance according to the online confidence state and to determine the cost score of the negative reward from the speech duration of the feedback dialogue, wherein the speech duration of the dialogue is directly proportional to the cost score;

an optimization program module, configured to optimize the online dialogue state tracking model by a reinforcement learning algorithm, based on the semantic features together with the benchmark score and the cost score, so as to optimize the search space of the online dialogue state tracking model and the speech duration of the feedback dialogue.

In a third aspect, an electronic device is provided, comprising at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the steps of the method for optimizing an online dialogue state tracking model of any embodiment of the present invention.

In a fourth aspect, an embodiment of the present invention provides a storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method for optimizing an online dialogue state tracking model of any embodiment of the present invention.

The beneficial effects of the embodiments of the present invention are as follows. An auxiliary dialogue state tracking model is added as a teacher model, and the teacher model issues reward signals to the online dialogue state tracking model, so that dialogue states of the online dialogue state tracking model that lie far from the teacher model are penalized. The speech duration of the feedback dialogue is taken into account during optimization, and an assessment score is determined from the user's evaluation result. Together these reduce the search space of the online dialogue state tracking model and thereby improve the dialogue policy of the online dialogue state tracking module.

Brief description of the drawings

To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow chart of a method for optimizing an online dialogue state tracking model according to an embodiment of the present invention;

Fig. 2 is a model structure diagram of a method for optimizing an online dialogue state tracking model according to an embodiment of the present invention;

Fig. 3 is a chart of optimization results of a method for optimizing an online dialogue state tracking model according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a system for optimizing an online dialogue state tracking model according to an embodiment of the present invention.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.

Fig. 1 shows a flow chart of a method for optimizing an online dialogue state tracking model according to an embodiment of the present invention. The method includes the following steps:

S11: pre-training an auxiliary dialogue state tracking model by a reinforcement learning algorithm to determine a teacher model, wherein the auxiliary dialogue state tracking model includes a statistical dialogue state tracking model and a rule-based dialogue state tracking model, and the statistical dialogue state tracking model includes an online dialogue state tracking model;

S12: extracting the semantic features of a user's input utterance, determining a first confidence state of the semantic features according to the teacher model, and determining a second confidence state of the semantic features according to the online dialogue state tracking model;

S13: determining, from the difference between the first confidence state and the second confidence state, the gap between the search spaces of the teacher model and the online dialogue state tracking model, and from it the benchmark score of the positive reward;

S14: determining a feedback dialogue for the user's input utterance according to the online confidence state, and determining the cost score of the negative reward from the speech duration of the feedback dialogue, wherein the speech duration of the dialogue is directly proportional to the cost score;

S15: optimizing the online dialogue state tracking model by a reinforcement learning algorithm, based on the semantic features together with the benchmark score and the cost score, so as to optimize the search space of the online dialogue state tracking model and the speech duration of the feedback dialogue.

In this embodiment, the output of the dialogue state tracking model is a confidence state; that is, when the dialogue state tracking model is optimized online by reinforcement learning, its action space is the entire dialogue state space. With only the reward signal from the dialogue policy module, the tracking policy for dialogue state tracking cannot be learned directly. The online dialogue state tracking model is deployed on an electronic device for users to use and receives the speech signal input by the user in real time.

For step S11, the auxiliary dialogue state tracking model is pre-trained by a reinforcement learning algorithm to determine the teacher model. Inspired by companion teaching, this implementation adds an auxiliary dialogue state tracking model as the teacher model. The teacher model can be any form of dialogue state tracking model, rule-based or statistical, whereas the online dialogue state tracking model being optimized is represented by a fully connected neural network. The auxiliary dialogue state tracking model also issues reward signals to the online tracking policy and penalizes dialogue states that lie far from its own, thereby reducing the search space of the online dialogue state tracking module being optimized. Because the teacher model can be any form of dialogue state tracking model, it can be obtained by training either a statistical or a rule-based dialogue state tracking model.

For step S12, after the user's speech signal is received, the semantic features of the utterance in the input speech signal are extracted. Based on these semantic features, the teacher model determined in step S11 determines the auxiliary confidence state of the semantic features, which serves as the first confidence state, and the online dialogue state tracking model determines the online confidence state of the semantic features, which serves as the second confidence state. For example, the auxiliary confidence state serving as the first confidence state is denoted $b^a_t$, and the online confidence state serving as the second confidence state is denoted $b^e_t$.

For step S13, the gap between the search spaces of the teacher model and the online dialogue state tracking model is determined from the difference between the first confidence state and the second confidence state obtained in step S12. This gap determines the benchmark score, which serves as the positive reward parameter for optimizing the online dialogue state tracking model.

For step S14, the speech duration of the feedback dialogue for the user's input utterance is determined according to the online confidence state, and the cost score of the negative reward is determined from it. After the speech signal input by the user is received, the online dialogue state tracking model determines a confidence state, which consists of several candidate feedback dialogues and their respective confidence levels; the feedback dialogue with the best confidence is then selected. Because feedback dialogues differ in length, their speech durations differ as well. Since feedback dialogues of different speech durations can all answer the user's question, time cost is taken into account: the shorter the feedback dialogue, the smaller the time cost. The cost score is therefore determined from the speech duration of the feedback dialogue and serves as the negative reward parameter for optimizing the online dialogue state tracking model.
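Step S14 can be illustrated with a minimal sketch. The function names, the dictionary representation of candidate feedback dialogues, and the `cost_per_second` coefficient are assumptions made for illustration; the text only states that the cost score is directly proportional to the speech duration.

```python
from typing import Dict


def best_feedback(confidences: Dict[str, float]) -> str:
    """Pick the candidate feedback dialogue with the highest confidence."""
    if not confidences:
        raise ValueError("at least one candidate feedback dialogue is required")
    return max(confidences, key=confidences.get)


def cost_score(speech_seconds: float, cost_per_second: float = 0.1) -> float:
    """Cost score proportional to speech duration.

    cost_per_second is a hypothetical proportionality constant; the
    source does not give a value.
    """
    if speech_seconds < 0:
        raise ValueError("speech duration must be non-negative")
    return cost_per_second * speech_seconds
```

Because the cost score acts as a negative reward, a training loop under these assumptions would subtract it from the reward for the turn.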

For step S15, the online dialogue state tracking model is optimized by a reinforcement learning algorithm, based on the semantic features determined in step S12 together with the benchmark score determined in step S13 and the cost score determined in step S14, thereby optimizing the search space of the online dialogue state tracking model and the speech duration of the feedback dialogue.

As this implementation shows, an auxiliary dialogue state tracking model is added as a teacher model, and the teacher model issues reward signals to the online dialogue state tracking model, so that dialogue states of the online model that lie far from the teacher model are penalized. The speech duration of the feedback dialogue is also taken into account during optimization. This reduces the search space of the online dialogue state tracking model and thereby improves the dialogue policy of the online dialogue state tracking module.

As one implementation, in this embodiment, determining the gap between the search spaces of the teacher model and the online dialogue state tracking model from the difference between the first confidence state and the second confidence state, and from it the benchmark score of the positive reward, includes:

when the absolute value of the difference between the first confidence state and the second confidence state does not exceed a preset threshold, setting the benchmark score to 0;

when the absolute value of the difference between the first confidence state and the second confidence state exceeds the preset threshold, taking the negative of the absolute value of the difference as the benchmark score.

In this embodiment, the gap between the search spaces of the teacher model and the online dialogue state tracking model is determined from the difference between the first confidence state and the second confidence state, which in turn determines the benchmark score $r^{bs}$ of the positive reward.

When $\| b^e_t - b^a_t \| \le \epsilon$, where $\epsilon$ is the threshold, the benchmark score of the positive reward is 0.

When $\| b^e_t - b^a_t \| > \epsilon$, the benchmark score of the positive reward is $r^{bs} = -\| b^e_t - b^a_t \|$.

As the above implementation shows, with the added auxiliary dialogue state tracking model acting as the teacher model, dialogue states of the online tracking model that lie far from the teacher model are penalized, and a specific way of computing the penalty is given, thereby reducing the search space of the online dialogue state tracking module.
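The threshold rule for the benchmark score can be sketched as follows. The text writes the distance as a norm without naming it, so the Euclidean norm used here is an assumption, as are the function and parameter names.

```python
import math
from typing import Sequence


def benchmark_score(
    b_online: Sequence[float],
    b_teacher: Sequence[float],
    epsilon: float,
) -> float:
    """Return r_bs: 0 within the threshold, otherwise minus the distance.

    The Euclidean norm is an assumption; the source only writes ||.||.
    """
    if len(b_online) != len(b_teacher):
        raise ValueError("confidence states must have the same length")
    distance = math.sqrt(
        sum((e - a) ** 2 for e, a in zip(b_online, b_teacher))
    )
    return 0.0 if distance <= epsilon else -distance
```

With this definition the online tracker is left unpenalized while it stays within `epsilon` of the teacher, and the penalty grows linearly with the distance once it leaves that region.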

As one implementation, in this embodiment, the method further includes:

collecting the user's evaluation result for the feedback dialogue;

determining the assessment score of the positive reward according to the evaluation result;

optimizing the online dialogue state tracking model by a reinforcement learning algorithm, based on the semantic features together with the assessment score, the benchmark score, and the cost score, so as to optimize the search space of the online dialogue state tracking model, the speech duration of the feedback dialogue, and the feedback effect.

In this embodiment, after the electronic device carrying the online dialogue state tracking model has responded to the utterance input by the user, the user's evaluation result for the feedback dialogue is collected. The means of evaluation can be provided by the online dialogue state tracking model itself. For example, after giving the feedback dialogue, the model presents the user with an evaluation box containing preset options such as "Excellent!", "Satisfied", "Average", and "Irrelevant answer". Once the user has chosen an option, the user's evaluation result for the feedback dialogue is collected.

The assessment score is determined from the evaluation result. For example, when the evaluation result is "Excellent!", the assessment score is relatively high; when the evaluation result is "Average", the assessment score is relatively low.
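A minimal sketch of the mapping from evaluation option to assessment score follows. The option keys and every numeric value are illustrative assumptions; the text fixes only the ordering, with better evaluations receiving higher scores.

```python
# Hypothetical scores for the four example evaluation options.
# Only their ordering is taken from the text.
ASSESSMENT_SCORES = {
    "excellent": 1.0,
    "satisfied": 0.7,
    "average": 0.3,
    "irrelevant": 0.0,
}


def assessment_score(evaluation: str) -> float:
    """Map a collected evaluation option to a positive-reward score."""
    try:
        return ASSESSMENT_SCORES[evaluation]
    except KeyError:
        raise ValueError(f"unknown evaluation option: {evaluation!r}") from None
```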

Based on the determined semantic features together with the determined assessment score, benchmark score, and cost score, the online dialogue state tracking model is optimized by a reinforcement learning algorithm, thereby optimizing the search space of the online dialogue state tracking model, the speech duration of the feedback dialogue, and the feedback effect.

As the above implementation shows, it constrains the online dialogue state tracking model with a parameter from a further aspect: an assessment score is determined from the user's evaluation result to judge whether the feedback dialogue reached the user's goal. This further optimizes the search space of the online dialogue state tracking model and improves the dialogue policy of the online dialogue state tracking module.

As one implementation, in this embodiment, the reinforcement learning algorithm includes the deep deterministic policy gradient algorithm and/or the deep Q-network algorithm.

In this embodiment, because the dialogue state is continuous, DDPG (Deep Deterministic Policy Gradient) is used to optimize the parameters of the tracking policy network of the online dialogue state model, so as to limit the gradient over the penalty space. After the online dialogue state tracking module has converged, it is then optimized jointly with the dialogue policy using DQN (Deep Q-Network, a deep Q-learning algorithm). With DQN, the deep neural network produces effective uncertainty estimates and can be extended to large-scale parallel systems; information is ordered over multiple time steps to preserve its diversity; the computation cost is low, the learning efficiency is high, and the performance is good.

As this implementation shows, optimizing the online dialogue state tracking model with specific reinforcement learning algorithms can further limit the search space of the online dialogue state tracking model.

The overall effect of the scheme is explained below. Discriminative machine learning methods are the state of the art in DST (Dialogue State Tracking), but they have several limitations. First, they are SL (Supervised Learning) methods and need a large amount of annotated offline data, which is expensive and makes online learning infeasible. Second, given limited labeled data, SL methods are prone to overfitting, which leads to poor generalization. Third, because SL-based DST methods are independent of the dialogue policy, the DST module cannot adapt dynamically to the user's habits. These limitations prevent the DST module from being updated online. To solve this problem, a DRL (Deep Reinforcement Learning) framework is used that optimizes the DST through online interaction.

RL (Reinforcement Learning) is popular for updating the dialogue policy module in task-oriented dialogue systems. However, apart from a few approaches that learn DST and policy jointly, RL has not yet been applied specifically to the DST module. Under the RL framework, the DST is treated as an agent, called the tracking agent, and the other parts of the dialogue system are treated as the environment. The result is a reinforcement learning framework dedicated to optimizing the online DST.

Unlike the policy agent, the decision made by the tracking agent (a confidence state) is continuous. DST is therefore treated as a continuous control problem. Because the confidence state is both continuous and high-dimensional, applying existing RL algorithms directly works poorly.

Here, a new DST framework is constructed using the idea of companion teaching. An auxiliary, well-trained dialogue state tracker, such as a traditional tracker trained offline, serves as a teacher that guides the optimization of the actual DST agent, so as to avoid overfitting and to achieve robust and fast convergence. As shown in Fig. 2, $b^a_t$ is the auxiliary confidence state generated by the auxiliary DST model, and $b^e_t$ is the exploration confidence state generated by the tracking agent. The difference between $b^a_t$ and $b^e_t$ is fed into the reward signal, which substantially reduces the search space of the tracking agent. The modular structure of this framework allows more flexible and interpretable dialogue management models to be used. For example, an interpretable dialogue policy (a rule-based policy) can easily be used together with any DST model, and this flexibility is very useful in practice. In addition, because a teacher DST model is used, the optimization of the tracking agent needs very little dialogue data, and training is more robust.

To avoid confusion with the concepts of the policy agent, the state and action of the tracking agent are referred to here as its input and output, respectively. In this work, only a semantic-level dialogue manager is considered. The input is therefore the semantic features of each slot, extracted from the system act, the SLU (Spoken Language Understanding) output, and the context of the previous turn. The output of the tracking agent is the confidence state of the corresponding slot in the current turn. In contrast to the system acts of the policy agent, the output of the tracking agent, that is, the confidence state, is continuous. In Fig. 2, the input of the tracking agent is denoted $S_t$ and its output is denoted $b^e_t$.

The tracking policy is the mapping function from $S_t$ to $b^e_t$ and is intended to maximize the expected cumulative reward. Because the search space of the tracking agent is continuous, as in robot control problems, a deterministic reinforcement learning algorithm (such as DDPG) is used to optimize the tracking policy.
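The mapping from slot features to an exploration confidence state can be sketched in plain Python. Everything specific here is an assumption made for illustration: the single hidden layer, the tanh activation, the softmax output that treats the confidence state as a distribution over slot values, and the Gaussian exploration noise. The text only says that the online tracker is a fully connected network trained with a deterministic algorithm.

```python
import math
import random
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]  # one weight row per output unit


def dense(x: Sequence[float], weights: Matrix, bias: Sequence[float]) -> List[float]:
    """Fully connected layer: one dot product plus bias per output unit."""
    return [sum(xi * wi for xi, wi in zip(x, row)) + b
            for row, b in zip(weights, bias)]


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def tracking_policy(slot_features: Sequence[float], w1: Matrix,
                    b1: Sequence[float], w2: Matrix,
                    b2: Sequence[float]) -> List[float]:
    """Deterministic mapping S_t -> confidence state (no exploration noise)."""
    hidden = [math.tanh(h) for h in dense(slot_features, w1, b1)]
    return softmax(dense(hidden, w2, b2))


def explore(belief: Sequence[float], noise_std: float,
            rng: random.Random) -> List[float]:
    """Perturb the deterministic output and renormalise it."""
    noisy = [max(p + rng.gauss(0.0, noise_std), 1e-8) for p in belief]
    total = sum(noisy)
    return [p / total for p in noisy]
```

Calling `tracking_policy` twice with the same features returns the same confidence state, which is what makes the policy deterministic; exploration comes only from the separate `explore` step.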

As described above, the dialogue system reward in the cumulative reward is usually defined as a combination of a turn penalty and a success reward. These two reward signals are sufficient to optimize the policy agent effectively. For the tracking agent, however, the large search space caused by its continuous output means that the two signals are not enough to achieve fast and robust convergence. To solve this problem, a base score reward signal is also provided to constrain the search space of the tracking agent. The overall reward of the tracking agent therefore comprises three signals:

(1) Turn penalty, denoted $r^{tp}$: a negative constant that penalizes long dialogues, reflecting a preference for short dialogues.

(2) Success reward, denoted $r^{sr}$: a delayed reward for the entire dialogue, given at the last turn. When the conversation between the user and the machine ends, the user provides an evaluation that judges the performance of the dialogue system. If the conversation as a whole did not reach the user's goal, the success reward is 0; otherwise it is a positive value.

(3) Base score, denoted $r^{bs}$: used to reduce the search space of the tracking agent. An auxiliary teacher DST is used, and its auxiliary confidence state $b^a_t$ guides the exploration of the tracking agent. If the exploration confidence state $b^e_t$ is far from the auxiliary confidence state, beyond a threshold, the base score imposes a penalty according to the formula:

$r^{bs} = -\| b^e_t - b^a_t \|$.
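The three signals can be combined per turn as in the sketch below. Summing them is an assumption, since the text lists the signals without stating how they are combined, and the default values of `turn_penalty` and `success_reward` are hypothetical. The distances are the norms between the exploration and auxiliary confidence states, computed elsewhere.

```python
from typing import List, Sequence


def dialogue_rewards(
    distances: Sequence[float],
    epsilon: float,
    success: bool,
    turn_penalty: float = -1.0,
    success_reward: float = 20.0,
) -> List[float]:
    """Per-turn reward r_t = r_tp + r_bs + r_sr for one dialogue.

    distances[t] is the distance between the exploration and auxiliary
    confidence states at turn t. The success reward is delayed: it is
    added only on the last turn, and only if the dialogue succeeded.
    """
    rewards = []
    last_turn = len(distances) - 1
    for turn, distance in enumerate(distances):
        r_bs = -distance if distance > epsilon else 0.0
        r_sr = success_reward if (success and turn == last_turn) else 0.0
        rewards.append(turn_penalty + r_bs + r_sr)
    return rewards
```

For a three-turn successful dialogue with distances 0.0, 2.0, and 0.5 and a threshold of 1.0, only the second turn is penalized for straying from the teacher, and only the third receives the success reward.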

In the companion teaching RL-DST framework, the auxiliary DST can be any well-trained DST model, and the tracking agent can be optimized by any deterministic reinforcement learning algorithm. The dialogue task and the specific algorithm implementation are described below.

The proposed framework is evaluated on task-oriented dialogue systems in a particular domain. These are slot-based dialogue systems with three slot types: goal constraints, request slots, and the search method. Goal constraints are constraints on the domain information the user is looking for. The search method describes the way the user is trying to interact with the system. Request slots are requests issued by the user. Only goal constraints are considered here; the extension to the search method and request slots is straightforward. A goal tracking agent is therefore used in place of the polynomial goal tracker, while the search method and request slots are still tracked by polynomial trackers. The final overall output combines the output of the goal tracking agent with the outputs of the other two polynomial trackers.

Auxiliary polynomial tracker: a polynomial tracker is used as the auxiliary DST. It is also known as CMBP (Constrained Markov Bayesian Polynomial), a hybrid model that combines data-driven and rule-based models. CMBP has few parameters and generalizes well. In CMBP, the confidence state of the current turn is assumed to depend on the observation of the current turn and the confidence state of the previous turn.

The three slot types (goal, request, method) in a domain do not influence one another. Accordingly, only the goal constraint part of the DST tracking agent in the domain task is considered, and the goal tracking agent takes the form of a deep neural network rather than a polynomial.

To optimize the goal tracking agent, which has a continuous, high-dimensional output space, the DDPG (Deep Deterministic Policy Gradient) algorithm is used here. DDPG is an actor-critic algorithm based on the deterministic policy gradient; it combines the actor-critic method with the DQN (Deep Q-Network) algorithm, which has a replay buffer and uses a soft-update strategy.
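The soft-update strategy mentioned above can be sketched as follows: a slowly moving target copy of the parameters is blended toward the trained parameters at each step. The function name `soft_update`, the flat list of floats standing in for network weights, and the default mixing rate `tau` are assumptions made for illustration.

```python
def soft_update(target_params, online_params, tau=0.01):
    """Blend target parameters toward the online ones:
    target <- tau * online + (1 - tau) * target."""
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    if len(target_params) != len(online_params):
        raise ValueError("parameter lists must have the same length")
    # A small tau keeps the target nearly fixed, which stabilizes the critic's targets.
    return [tau * w + (1.0 - tau) * t for t, w in zip(target_params, online_params)]
```

Setting `tau=1.0` copies the online parameters outright, while `tau=0.0` freezes the target.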

During training, the goal tracking agent keeps an experience memory. Each entry in the memory has the format $(S_t, b^e_t, r_t)$, where $S_t$ is the slot feature vector and $b^e_t$ is the exploration belief state of the corresponding slot. The immediate reward $r_t$ is generated by the reward function $R(S_t, b^e_t, b^a_t)$ and is given at every turn.
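A memory holding entries of the form $(S_t, b^e_t, r_t)$ can be sketched as a bounded buffer with uniform sampling. The class name `ExperienceMemory`, the field names and the fixed-capacity eviction policy are assumptions; the text specifies only the format of the stored data.

```python
import random
from collections import deque, namedtuple

# One stored entry: slot feature vector, explored belief state, immediate reward.
Experience = namedtuple("Experience", ["slot_features", "explored_belief", "reward"])


class ExperienceMemory:
    """Bounded experience memory; the oldest entries are dropped when full."""

    def __init__(self, capacity):
        self._buffer = deque(maxlen=capacity)

    def add(self, slot_features, explored_belief, reward):
        self._buffer.append(Experience(slot_features, explored_belief, reward))

    def sample(self, batch_size):
        # Uniform sampling without replacement, capped at the current size.
        size = min(batch_size, len(self._buffer))
        return random.sample(list(self._buffer), size)

    def __len__(self):
        return len(self._buffer)
```

A training loop would call `add` once per turn and draw a batch with `sample` before each update of the tracking agent.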

While the tracking agent is learning, the dialogue policy is fixed and the tracker keeps changing. Because the DST is part of the environment of the dialogue policy agent, that environment also changes when the tracking agent is optimized. Therefore, the policy can optionally be optimized further to improve the performance of the dialogue system.

As shown in Fig. 3, for each of the three slot types (goal, request, method) and for the combination of all three, the DQN algorithm is applied for further optimization after the DDPG algorithm. Each of these optimization variants shows a clear improvement, which in turn further increases the return of the dialogue management. Wherein:

TA_G is a DST tracking agent that estimates only the belief state of the goal constraints; the other two parts of the belief state are produced by polynomial trackers.

TA_R is a DST tracking agent that estimates only the belief state of the request slots; the other two parts of the belief state are produced by polynomial trackers.

TA_M is a DST tracking agent that estimates only the belief state of the search method; the other two parts of the belief state are produced by polynomial trackers.

TA_ALL is a DST tracking agent in which the entire belief state is produced directly by the above three tracking agents.

A structural schematic diagram of an optimization system for an online dialogue state tracking model is provided by an embodiment of the present invention. The technical solution of this embodiment is applicable to optimizing the online dialogue state tracking model of a device; the system can execute the optimization method for the online dialogue state tracking model described in any of the above embodiments, and is configured in a terminal.

The optimization system for an online dialogue state tracking model provided in this embodiment includes: a tutor model determination program module 11, a confidence state determination program module 12, a benchmark score determination program module 13, a cost score determination program module 14 and an optimization program module 15.

The tutor model determination program module 11 is used to pre-train an auxiliary dialogue state tracking model by a reinforcement learning algorithm to determine a tutor model, wherein the auxiliary dialogue state tracking model includes a statistics-based dialogue state tracking model and a rule-based dialogue state tracking model, and the statistics-based dialogue state tracking model includes the online dialogue state tracking model. The confidence state determination program module 12 is used to extract the semantic feature of a user input sentence, determine a first confidence state of the semantic feature according to the tutor model, and determine a second confidence state of the semantic feature according to the online dialogue state tracking model. The benchmark score determination program module 13 is used to determine, according to the difference between the first confidence state and the second confidence state, the gap between the search spaces of the tutor model and the online dialogue state tracking model, and thereby determine the benchmark score serving as a positive reward. The cost score determination program module 14 is used to determine the feedback dialogue for the user input sentence according to the online confidence state, and to determine the cost score serving as a negative reward from the speech duration of the feedback dialogue, wherein the speech duration of the dialogue is proportional to the cost score. The optimization program module 15 is used to optimize the online dialogue state tracking model by a reinforcement learning algorithm, based on the semantic feature together with the benchmark score and the cost score, so as to optimize the search space of the online dialogue state tracking model and the speech duration of the feedback dialogue.

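Further, the benchmark score determination program module is used for: setting the benchmark score to 0 when the absolute value of the difference between the first confidence state and the second confidence state does not exceed a preset threshold; and taking the negative of the absolute value of the difference as the benchmark score when that absolute value exceeds the preset threshold.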

Further, the system also includes:

an assessment score determination program module, for collecting the user's evaluation result of the feedback dialogue,

an optimization program module, for optimizing the online dialogue state tracking model by a reinforcement learning algorithm, based on the semantic feature together with the assessment score, the benchmark score and the cost score, so as to optimize the search space of the online dialogue state tracking model, the speech duration of the feedback dialogue and the feedback effect.
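How the three scores might be combined into one training signal can be sketched as below. The text states only that the cost score is proportional to the speech duration of the feedback dialogue and acts as a negative reward; the plain sum, the function names `cost_score` and `total_reward`, and the rate `cost_per_second` are assumptions made for illustration.

```python
def cost_score(duration_seconds, cost_per_second=0.05):
    """Cost proportional to the speech duration of the feedback dialogue."""
    if duration_seconds < 0:
        raise ValueError("duration must be non-negative")
    return cost_per_second * duration_seconds


def total_reward(assessment, benchmark, duration_seconds, cost_per_second=0.05):
    """Assumed combination: user assessment plus benchmark score minus cost.

    The benchmark score is expected to be 0 or negative, so it lowers the
    total when the online tracker strays from the tutor model."""
    return assessment + benchmark - cost_score(duration_seconds, cost_per_second)
```

Under this sketch, a shorter feedback dialogue with the same user assessment yields a higher total reward.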

Further, the reinforcement learning algorithm includes: the deep deterministic policy gradient algorithm and/or the deep Q-network algorithm.

An embodiment of the present invention further provides a non-volatile computer storage medium. The computer storage medium stores computer-executable instructions, and the computer-executable instructions can execute the optimization method for the online dialogue state tracking model in any of the above method embodiments;

As an implementation, the non-volatile computer storage medium of the present invention stores computer-executable instructions, the computer-executable instructions being configured to:

As a non-volatile computer-readable storage medium, it can be used to store non-volatile software programs and non-volatile computer-executable programs and modules, such as the program instructions/modules corresponding to the test software method in the embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the optimization method for the online dialogue state tracking model in any of the above method embodiments.

The non-volatile computer-readable storage medium may include a program storage area and a data storage area, wherein the program storage area can store an operating system and the application program required by at least one function, and the data storage area can store data created through use of the test software device, and the like. In addition, the non-volatile computer-readable storage medium may include high-speed random access memory, and may also include non-volatile memory, for example at least one magnetic disk storage device, a flash memory device or another non-volatile solid-state storage device. In some embodiments, the non-volatile computer-readable storage medium optionally includes memory located remotely from the processor, and this remote memory can be connected to the test software device through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks and combinations thereof.

An embodiment of the present invention further provides an electronic device, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the steps of the optimization method for the online dialogue state tracking model of any embodiment of the present invention.

The client of the embodiment of the present application exists in a variety of forms, including but not limited to:

(1) Mobile communication devices: devices of this kind have mobile communication functions, and their main purpose is to provide voice and data communication. Such terminals include smart phones (such as the iPhone), multimedia phones, feature phones and low-end phones.

(2) Ultra-mobile personal computer devices: devices of this kind belong to the category of personal computers, have computing and processing functions, and generally also have mobile Internet access. Such terminals include PDA, MID and UMPC devices, such as the iPad.

(3) Portable entertainment devices: devices of this kind can display and play multimedia content. Such devices include audio and video players (such as the iPod), handheld game consoles, e-book readers, smart toys and portable in-vehicle navigation devices.

(4) other electronic devices with phonetic function.

Herein, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include" and "comprise" cover not only the listed elements but also other elements that are not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including ..." does not exclude the presence of other identical elements in the process, method, article or device that includes the element.

The device embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.

From the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and of course also by hardware. Based on this understanding, the above technical solution, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments or in certain parts of the embodiments.

Finally, it should be noted that the above embodiments only illustrate the technical solutions of the present invention and do not limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. An optimization method for an online dialogue state tracking model, comprising:

pre-training an auxiliary dialogue state tracking model by a reinforcement learning algorithm to determine a tutor model, wherein the auxiliary dialogue state tracking model comprises a statistics-based dialogue state tracking model and a rule-based dialogue state tracking model, and the statistics-based dialogue state tracking model comprises the online dialogue state tracking model;

extracting a semantic feature of a user input sentence, determining a first confidence state of the semantic feature according to the tutor model, and determining a second confidence state of the semantic feature according to the online dialogue state tracking model;

determining, according to the difference between the first confidence state and the second confidence state, the gap between the search spaces of the tutor model and the online dialogue state tracking model, and thereby determining a benchmark score serving as a positive reward;

determining a feedback dialogue for the user input sentence according to the online confidence state, and determining a cost score serving as a negative reward from the speech duration of the feedback dialogue, wherein the speech duration of the dialogue is proportional to the cost score;

optimizing the online dialogue state tracking model by a reinforcement learning algorithm, based on the semantic feature together with the benchmark score and the cost score, so as to optimize the search space of the online dialogue state tracking model and the speech duration of the feedback dialogue.

2. The method according to claim 1, wherein determining, according to the difference between the first confidence state and the second confidence state, the gap between the search spaces of the tutor model and the online dialogue state tracking model, and thereby determining the benchmark score serving as a positive reward, comprises:

when the absolute value of the difference between the first confidence state and the second confidence state does not exceed a preset threshold, the benchmark score is 0,

when the absolute value of the difference between the first confidence state and the second confidence state exceeds the preset threshold, the negative of the absolute value of the difference is taken as the benchmark score.

3. The method according to claim 1, wherein the method further comprises:

collecting the user's evaluation result of the feedback dialogue;

optimizing the online dialogue state tracking model by a reinforcement learning algorithm, based on the semantic feature together with the assessment score, the benchmark score and the cost score, so as to optimize the search space of the online dialogue state tracking model, the speech duration of the feedback dialogue and the feedback effect.

4. The method according to any one of claims 1-3, wherein the reinforcement learning algorithm comprises: the deep deterministic policy gradient algorithm and/or the deep Q-network algorithm.

5. An optimization system for an online dialogue state tracking model, comprising:

a tutor model determination program module, for pre-training an auxiliary dialogue state tracking model by a reinforcement learning algorithm to determine a tutor model, wherein the auxiliary dialogue state tracking model comprises a statistics-based dialogue state tracking model and a rule-based dialogue state tracking model, and the statistics-based dialogue state tracking model comprises the online dialogue state tracking model;

a confidence state determination program module, for extracting a semantic feature of a user input sentence, determining a first confidence state of the semantic feature according to the tutor model, and determining a second confidence state of the semantic feature according to the online dialogue state tracking model;

a benchmark score determination program module, for determining, according to the difference between the first confidence state and the second confidence state, the gap between the search spaces of the tutor model and the online dialogue state tracking model, and thereby determining a benchmark score serving as a positive reward;

a cost score determination program module, for determining a feedback dialogue for the user input sentence according to the online confidence state, and determining a cost score serving as a negative reward from the speech duration of the feedback dialogue, wherein the speech duration of the dialogue is proportional to the cost score;

an optimization program module, for optimizing the online dialogue state tracking model by a reinforcement learning algorithm, based on the semantic feature together with the benchmark score and the cost score, so as to optimize the search space of the online dialogue state tracking model and the speech duration of the feedback dialogue.

6. The system according to claim 5, wherein the benchmark score determination program module is used for:

7. The system according to claim 5, wherein the system further comprises:

an optimization program module, for optimizing the online dialogue state tracking model by a reinforcement learning algorithm, based on the semantic feature together with the assessment score, the benchmark score and the cost score, so as to optimize the search space of the online dialogue state tracking model, the speech duration of the feedback dialogue and the feedback effect.

8. The system according to any one of claims 5-7, wherein the reinforcement learning algorithm comprises: the deep deterministic policy gradient algorithm and/or the deep Q-network algorithm.

9. An electronic device, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the steps of the method of any one of claims 1-4.

10. A storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the method of any one of claims 1-4.