CN105529030A - Speech recognition processing method and device

Info

Publication number: CN105529030A
Application number: CN201511016852.0A
Authority: CN
Inventors: 吴世伟
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2015-12-29
Filing date: 2015-12-29
Publication date: 2016-04-27
Anticipated expiration: 2035-12-29
Also published as: CN105529030B

Abstract

The invention provides a speech recognition processing method and device. The method comprises: receiving a speech signal; extracting multiple pieces of feature information from the speech signal; calculating a feedback function according to the multiple pieces of feature information; and establishing a decision model for speech recognition according to the feedback function. The method can improve speech recognition accuracy, make voice interaction between a user and a speech recognition system smoother, and improve the user experience.

Description

Speech recognition processing method and device

Technical field

The present invention relates to the technical field of speech recognition, and in particular to a speech recognition processing method and device.

Background art

In human-machine voice interaction, a speech recognition system must handle a wide variety of voice requests, and its goal is to give the user the most comfortable feedback. However, because voice signals and external environments are diverse, the way the system responds must also adapt to the circumstances.

At present, after receiving a user's voice request, a speech recognition system usually performs speech recognition and semantic recognition on the request and, once the user's intent has been identified, performs the corresponding operation. The problem is that if the system fails to identify the user's intent from the voice request, the user has to perform an operation and then re-enter the request. This makes the system cumbersome to use, the recognition accuracy is low, the voice interaction is not smooth enough, and the user experience is poor.

Summary of the invention

The present invention aims to solve, at least to some extent, one of the technical problems in the related art.

To this end, a first object of the present invention is to propose a speech recognition processing method that can improve the accuracy of speech recognition, make voice interaction between the user and the speech recognition system smoother, and improve the user experience.

A second object of the present invention is to propose a speech recognition processing device.

To achieve the above objects, an embodiment of the first aspect of the present invention proposes a speech recognition processing method comprising the following steps: receiving a voice signal; extracting multiple pieces of feature information from the voice signal; calculating a feedback function according to the multiple pieces of feature information in the voice signal; and establishing a decision model for speech recognition according to the feedback function.

With the speech recognition processing method of the embodiment of the present invention, information such as the recognition result, the speech analysis result and the dialog state is extracted from the received voice signal and used to construct rejection rules, and the decision model is trained in a data-driven manner. When performing speech recognition, the speech recognition system can then respond according to the expected feedback computed by the decision model. Every input that the decision model deems valid receives explicit feedback instead of being treated as noise, which improves the accuracy of speech recognition, makes voice interaction between the user and the speech recognition system smoother, and improves the user experience.

To achieve the above objects, an embodiment of the second aspect of the present invention proposes a speech recognition processing device comprising: a receiving module for receiving a voice signal; an extraction module for extracting multiple pieces of feature information from the voice signal; a computing module for calculating a feedback function according to the multiple pieces of feature information in the voice signal; and an establishing module for establishing a decision model for speech recognition according to the feedback function.

With the speech recognition processing device of the embodiment of the present invention, information such as the recognition result, the speech analysis result and the dialog state is extracted from the received voice signal and used to construct rejection rules, and the decision model is trained in a data-driven manner. When performing speech recognition, the speech recognition system can then respond according to the expected feedback computed by the decision model. Every input that the decision model deems valid receives explicit feedback instead of being treated as noise, which improves the accuracy of speech recognition, makes voice interaction between the user and the speech recognition system smoother, and improves the user experience.

Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from that description or be learned through practice of the invention.

Brief description of the drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:

Fig. 1 is a flowchart of a speech recognition processing method according to one embodiment of the present invention;

Fig. 2 is a flowchart of a speech recognition processing method according to another embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a speech recognition processing device according to one embodiment of the present invention; and

Fig. 4 is a schematic structural diagram of a speech recognition processing device according to another embodiment of the present invention.

Detailed description

Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals denote the same or similar elements, or elements having the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the present invention and should not be construed as limiting it.

In addition, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "multiple" means two or more, unless specifically defined otherwise.

Any process or method described in a flowchart or otherwise described herein may be understood as representing a module, segment or portion of code that comprises one or more executable instructions for implementing a specific logical function or step of the process. The scope of the preferred embodiments of the present invention includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order, depending on the functions involved, as will be understood by those skilled in the art to which the embodiments of the present invention belong.

The speech recognition processing method and device according to the embodiments of the present invention are described below with reference to the accompanying drawings.

Fig. 1 is a flowchart of a speech recognition processing method according to one embodiment of the present invention.

As shown in Fig. 1, the speech recognition processing method comprises:

S101: receive a voice signal.

Specifically, a voice signal input by the user is received; the user may input the voice signal through a device such as a microphone.

S102: extract multiple pieces of feature information from the voice signal.

The multiple pieces of feature information include a rejection flag, a semantic parsing result, a semantic parsing confidence and a language model confidence.

Specifically, the voice signal input by the user is first divided into multiple short speech segments, the silence in these segments is removed, and the segments are then input separately to the speech recognition engine. The speech recognition engine dynamically selects a language model according to the context of the voice interaction dialog and processes each short segment to obtain either a recognition result or a rejection flag. The recognition result can then be input to a semantic parser for context-dependent semantic parsing, which yields the corresponding semantic parsing result. In addition, once processing of the voice signal is complete, feature information such as the speech analysis confidence and the language model confidence produced during speech analysis is also obtained.
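The extraction step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the energy-threshold segmentation, the names `split_on_silence`, `UtteranceFeatures` and `extract_features`, and the `recognize` and `parse` callables are all hypothetical stand-ins for the speech recognition engine and semantic parser, whose interfaces the text does not specify.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple


def split_on_silence(energies: List[float], threshold: float,
                     min_silence: int) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges of speech, end exclusive.

    A run of at least `min_silence` frames below `threshold` closes the
    current segment; leading and trailing silence is dropped.
    """
    segments = []
    start = None      # first frame of the segment being built
    last_voiced = -1  # index of the most recent voiced frame
    for i, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = i
            last_voiced = i
        elif start is not None and i - last_voiced >= min_silence:
            segments.append((start, last_voiced + 1))
            start = None
    if start is not None:
        segments.append((start, last_voiced + 1))
    return segments


@dataclass
class UtteranceFeatures:
    n_rej: int    # rejection flag: 1 = recognized, -1 = rejected
    n_sem: int    # semantic parsing result: 1, -1 or -2
    f_sem: float  # semantic parsing confidence
    f_lm: float   # language model confidence


def extract_features(
    segment: Any,
    recognize: Callable[[Any], Tuple[Optional[str], float]],
    parse: Callable[[str], Tuple[int, float]],
) -> UtteranceFeatures:
    """Run a hypothetical recognizer and parser on one short segment."""
    text, f_lm = recognize(segment)
    if text is None:
        # Rejected: there is nothing to parse. Recording this as a failed
        # parse (n_sem = -2, f_sem = 0.0) is an assumption of this sketch.
        return UtteranceFeatures(n_rej=-1, n_sem=-2, f_sem=0.0, f_lm=f_lm)
    n_sem, f_sem = parse(text)
    return UtteranceFeatures(n_rej=1, n_sem=n_sem, f_sem=f_sem, f_lm=f_lm)
```

A short pause inside an utterance (fewer than `min_silence` quiet frames) stays inside its segment, so only longer gaps split the signal.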

S103: calculate a feedback function according to the multiple pieces of feature information in the voice signal.

In one embodiment of the present invention, the feedback function is calculated according to the following formula:

$$R = -\left(w_i n_i + w_e n_e + w_f n_f + w_{rej} n_{rej} + w_{s1} n_{sem} + w_{s2} f_{sem} + w_{lm} f_{lm}\right)$$

where $R$ denotes the feedback function, $n_i$ the number of dialog turns, $n_e$ the number of errors, $n_f$ the number of known slots, $n_{rej}$ the rejection flag, $n_{sem}$ the semantic parsing result, $f_{sem}$ the semantic parsing confidence, $f_{lm}$ the language model confidence, and $w$ the parameters.

Specifically, the feedback function is calculated by combining all available feature information. That is, while the speech recognition system is recognizing the voice signal input by the user, user feedback is scored by evaluating the user's interactive input, for example how far the interactive dialog has progressed and whether the user has provided cooperative information.

While the speech recognition system is recognizing the voice signal input by the user, the feedback given by the user, which may be positive or negative, must be captured accurately. This requires a well-designed feedback function, such as the formula shown above. In that formula, $n_e$ is the number of errors, which takes a default value in the speech recognition system. $n_{rej}$ is the rejection flag and can be 1 or -1: 1 means the voice signal was recognized normally, and -1 means it was rejected. $n_{sem}$ is the semantic parsing result and can be 1, -1 or -2: 1 means semantic parsing produced a correct parse that fits the context, -1 means it produced a correct parse that does not fit the context, and -2 means semantic parsing failed. The feedback function can thus be calculated from the above formula using parameters such as the rejection flag $n_{rej}$, the semantic parsing result $n_{sem}$, the semantic parsing confidence $f_{sem}$ and the language model confidence $f_{lm}$, and the value of $R$ indicates whether the user's feedback is positive or negative.
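As a rough illustration of the feedback formula, the sketch below evaluates R for two turns. The function name `feedback`, the dictionary layout and every weight value are hypothetical; in the method described here the weights are obtained through training, and the values below were chosen only to make the arithmetic easy to follow.

```python
from typing import Dict

# (weight name, feature name) pairs for the seven terms of the formula.
TERMS = (
    ("w_i", "n_i"),      # number of dialog turns
    ("w_e", "n_e"),      # number of errors
    ("w_f", "n_f"),      # number of known slots
    ("w_rej", "n_rej"),  # rejection flag: 1 or -1
    ("w_s1", "n_sem"),   # semantic parsing result: 1, -1 or -2
    ("w_s2", "f_sem"),   # semantic parsing confidence
    ("w_lm", "f_lm"),    # language model confidence
)


def feedback(features: Dict[str, float], weights: Dict[str, float]) -> float:
    """R = -(sum of weight * feature over the seven terms)."""
    return -sum(weights[w] * features[n] for w, n in TERMS)


# Hypothetical weights, for illustration only.
weights = {"w_i": 1.0, "w_e": 2.0, "w_f": -1.0, "w_rej": -3.0,
           "w_s1": -2.0, "w_s2": -1.0, "w_lm": -1.0}

# A turn that was recognized and parsed in context.
accepted = {"n_i": 2, "n_e": 0, "n_f": 1, "n_rej": 1,
            "n_sem": 1, "f_sem": 0.5, "f_lm": 0.5}
# A turn that was rejected and could not be parsed.
rejected = {"n_i": 2, "n_e": 1, "n_f": 1, "n_rej": -1,
            "n_sem": -2, "f_sem": 0.0, "f_lm": 0.25}

print(feedback(accepted, weights))  # 5.0
print(feedback(rejected, weights))  # -9.75
```

With these illustrative weights the recognized turn yields a positive R and the rejected turn a negative one. Reading the sign of R as positive or negative feedback is one plausible interpretation of the text, not something the formula itself fixes.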

S104: establish a decision model for speech recognition according to the feedback function.

In one embodiment of the present invention, the decision model for speech recognition is established according to the following formula:

$$Q(s, a) = R(s, a) + r \sum_{s'} P(s' \mid s, a) \max_{a'} Q(s', a')$$

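where $Q$ denotes the expected feedback, $s$ and $s'$ denote system state nodes, $a$ and $a'$ denote decision actions, and $P$ denotes the probability of transitioning from state $s$ to state $s'$ under decision action $a$. The coefficient $r$ weights the expected future feedback; it plays the role of the discount factor in a standard Markov decision process.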

Specifically, after the feedback function has been calculated from the feedback given by the user, the user's positive feedback is rewarded and the user's negative feedback is penalized, and a Markov decision process is then used to establish the decision model according to the above formula. For this objective function, the standard value iteration algorithm can be used to solve for the parameters; the parameters of the feedback function and the state transition probabilities are obtained through training.
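The value iteration mentioned above can be sketched on a toy problem. The two states, two actions, feedback values and transition probabilities below are invented for illustration, as are the function name `q_value_iteration` and the default `r = 0.9`; the sketch assumes `R` and `P` are already known, whereas the text obtains them through training.

```python
from typing import Dict, List, Tuple

State = str
Action = str


def q_value_iteration(
    states: List[State],
    actions: List[Action],
    R: Dict[Tuple[State, Action], float],
    P: Dict[Tuple[State, Action], Dict[State, float]],
    r: float = 0.9,
    tol: float = 1e-9,
    max_iter: int = 10_000,
) -> Dict[Tuple[State, Action], float]:
    """Iterate Q(s,a) = R(s,a) + r * sum_s' P(s'|s,a) * max_a' Q(s',a')."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(max_iter):
        new_Q = {}
        delta = 0.0
        for s in states:
            for a in actions:
                # Expected best value over successor states; a missing
                # entry in P marks (s, a) as terminal.
                future = sum(
                    p * max(Q[(s2, a2)] for a2 in actions)
                    for s2, p in P.get((s, a), {}).items()
                )
                new_Q[(s, a)] = R[(s, a)] + r * future
                delta = max(delta, abs(new_Q[(s, a)] - Q[(s, a)]))
        Q = new_Q
        if delta < tol:
            break
    return Q


# Toy dialog: "ask" is an open question, "done" is terminal.
states = ["ask", "done"]
actions = ["guide", "confirm"]
R = {("ask", "guide"): -1.0, ("ask", "confirm"): -2.0,
     ("done", "guide"): 0.0, ("done", "confirm"): 0.0}
P = {("ask", "guide"): {"ask": 0.5, "done": 0.5},
     ("ask", "confirm"): {"done": 1.0}}

Q = q_value_iteration(states, actions, R, P)
best = max(actions, key=lambda a: Q[("ask", a)])
print(best)  # guide
```

In this toy setting guiding costs less per turn but may need repeating, confirming costs more but always finishes, and the converged Q values weigh the two against each other.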

Fig. 2 is a flowchart of a speech recognition processing method according to another embodiment of the present invention.

As shown in Fig. 2, the speech recognition processing method comprises:

S201: receive a voice signal.

S202: extract multiple pieces of feature information from the voice signal.

S203: calculate a feedback function according to the multiple pieces of feature information in the voice signal.

S204: establish a decision model for speech recognition according to the feedback function, using the following formula:

$$Q(s, a) = R(s, a) + r \sum_{s'} P(s' \mid s, a) \max_{a'} Q(s', a')$$

S205: obtain voice interaction information input by the user, process it according to the decision model, and select a corresponding interaction strategy for voice interaction with the user.

The interaction strategies may include, for example, a guidance strategy, an ignore strategy and a clarification strategy. When the speech recognition system identifies the user's voice interaction information as noise, it can actively guide the user to express themselves clearly; when the user's voice interaction information is ambiguous or only vaguely understood, the system should ask for confirmation. In other words, each dialog turn between the user and the speech recognition system may contain noise, an unclear answer, vague semantics or a complete response, and the speech recognition system can choose among strategies such as guiding, ignoring and clarifying.

For example, the voice interaction engine outputs the speech "In which city would you like to book a hotel?" The user inputs the speech "Um, um, ...". Based on the decision model, the voice interaction engine recognizes the user's input and judges it to be noise, so it selects the strategy of guiding the user and outputs the speech "Please say the name of the city where you would like to stay." The user then inputs the speech "How is the weather in Beijing?" Based on the decision model, the voice interaction engine judges that this input is not noise and that it contains a city name but is still ambiguous, so it selects the strategy of confirming the user's intent and outputs the speech "Would you like to book a hotel in Beijing?" The user then inputs the speech "Yes". Based on the decision model, the voice interaction engine recognizes this as an affirmative answer and goes on to output the speech "Where in Beijing would you like to book a hotel?", thereby continuing to guide the interaction between the user and the speech recognition system according to the user's intent.
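The hotel dialog above can be traced with a small strategy selector. This is only a sketch: the state labels, the `continue` action, the Q values and the name `select_strategy` are hypothetical, and a real system would derive the state from the extracted features and the Q values from the trained decision model.

```python
from typing import Dict, List, Tuple


def select_strategy(Q: Dict[Tuple[str, str], float], state: str,
                    actions: List[str]) -> str:
    """Pick the action with the highest expected feedback in `state`."""
    return max(actions, key=lambda a: Q[(state, a)])


ACTIONS = ["guide", "ignore", "clarify", "continue"]

# Hypothetical expected-feedback table for three dialog states.
Q = {
    ("noise", "guide"): 0.5, ("noise", "ignore"): 0.2,
    ("noise", "clarify"): -0.4, ("noise", "continue"): -1.0,
    ("ambiguous", "guide"): 0.1, ("ambiguous", "ignore"): -0.8,
    ("ambiguous", "clarify"): 0.7, ("ambiguous", "continue"): -0.2,
    ("affirmed", "guide"): -0.3, ("affirmed", "ignore"): -0.9,
    ("affirmed", "clarify"): -0.1, ("affirmed", "continue"): 0.9,
}

# The three user turns of the hotel example, with assumed state labels.
turns = [
    ("Um, um, ...", "noise"),
    ("How is the weather in Beijing?", "ambiguous"),
    ("Yes", "affirmed"),
]
chosen = [select_strategy(Q, state, ACTIONS) for _, state in turns]
print(chosen)  # ['guide', 'clarify', 'continue']
```

Choosing the action greedily from the Q table mirrors the `max` over actions in the decision model formula.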

With the speech recognition processing method of the embodiment of the present invention, the voice information input by the user is processed based on the decision model, and every piece of voice information identified as valid input receives explicit feedback instead of being treated as noise. The voice interaction system can therefore give the user the most comfortable feedback, which makes voice interaction between the user and the speech recognition system smoother and improves the user experience.

To implement the above embodiments, the present invention also proposes a speech recognition processing device.

Fig. 3 is a schematic structural diagram of a speech recognition processing device according to one embodiment of the present invention.

As shown in Fig. 3, the speech recognition processing device comprises: a receiving module 10, an extraction module 20, a computing module 30 and an establishing module 40.

The receiving module 10 is configured to receive a voice signal. Specifically, the receiving module 10 receives a voice signal input by the user; the user may input the voice signal through a device such as a microphone.

The extraction module 20 is configured to extract multiple pieces of feature information from the voice signal. The multiple pieces of feature information include a rejection flag, a semantic parsing result, a semantic parsing confidence and a language model confidence. Specifically, the voice signal input by the user is first divided into multiple short speech segments, the silence in these segments is removed, and the segments are then input separately to the extraction module 20. The extraction module 20 dynamically selects a language model according to the context of the voice interaction dialog and processes each short segment to obtain either a recognition result or a rejection flag. The recognition result can then be input to a semantic parser for context-dependent semantic parsing, which yields the corresponding semantic parsing result. In addition, after processing of the voice signal is complete, the extraction module 20 also obtains feature information such as the speech analysis confidence and the language model confidence produced during speech analysis.

The computing module 30 is configured to calculate a feedback function according to the multiple pieces of feature information in the voice signal, using the following formula:

$$R = -\left(w_i n_i + w_e n_e + w_f n_f + w_{rej} n_{rej} + w_{s1} n_{sem} + w_{s2} f_{sem} + w_{lm} f_{lm}\right)$$

where $R$ denotes the feedback function, $n_i$ the number of dialog turns, $n_e$ the number of errors, $n_f$ the number of known slots, $n_{rej}$ the rejection flag, $n_{sem}$ the semantic parsing result, $f_{sem}$ the semantic parsing confidence, $f_{lm}$ the language model confidence, and $w$ the parameters. Specifically, the computing module 30 calculates the feedback function by combining all available feature information. That is, while the speech recognition system is recognizing the voice signal input by the user, the computing module 30 scores user feedback by evaluating the user's interactive input, for example how far the interactive dialog has progressed and whether the user has provided cooperative information.

While the speech recognition system is recognizing the voice signal input by the user, the feedback given by the user, which may be positive or negative, must be captured accurately. This requires a well-designed feedback function, such as the formula shown above. In that formula, $n_e$ is the number of errors, which takes a default value in the speech recognition system. $n_{rej}$ is the rejection flag and can be 1 or -1: 1 means the voice signal was recognized normally, and -1 means it was rejected. $n_{sem}$ is the semantic parsing result and can be 1, -1 or -2: 1 means semantic parsing produced a correct parse that fits the context, -1 means it produced a correct parse that does not fit the context, and -2 means semantic parsing failed. The computing module 30 can thus calculate the feedback function from the above formula using parameters such as the rejection flag $n_{rej}$, the semantic parsing result $n_{sem}$, the semantic parsing confidence $f_{sem}$ and the language model confidence $f_{lm}$, and the value of $R$ indicates whether the user's feedback is positive or negative.

The establishing module 40 is configured to establish a decision model for speech recognition according to the feedback function, using the following formula:

$$Q(s, a) = R(s, a) + r \sum_{s'} P(s' \mid s, a) \max_{a'} Q(s', a')$$

Specifically, after the computing module 30 has calculated the feedback function from the feedback given by the user, the establishing module 40 rewards the user's positive feedback and penalizes the user's negative feedback, and then uses a Markov decision process to establish the decision model according to the above formula. For this objective function, the standard value iteration algorithm can be used to solve for the parameters; the parameters of the feedback function and the state transition probabilities are obtained through training.
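The text states that the state transition probabilities are obtained through training but does not say how. One common possibility, shown here purely as an assumption, is to estimate them as relative frequencies over logged dialog transitions; the function name `estimate_transitions` and the log format are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Transition = Tuple[str, str, str]  # (state, action, next state)


def estimate_transitions(
    log: Iterable[Transition],
) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Estimate P(s'|s,a) as the relative frequency of s' after (s, a)."""
    counts: Dict[Tuple[str, str], Counter] = defaultdict(Counter)
    for state, action, next_state in log:
        counts[(state, action)][next_state] += 1
    return {
        key: {s2: n / sum(c.values()) for s2, n in c.items()}
        for key, c in counts.items()
    }


# Hypothetical logged transitions from earlier dialogs.
log = [
    ("ask", "guide", "ask"),
    ("ask", "guide", "done"),
    ("ask", "guide", "done"),
    ("ask", "confirm", "done"),
]
P = estimate_transitions(log)
```

The resulting dictionary maps each observed (state, action) pair to a distribution over next states, which is the shape the decision model formula needs for its sum over s'.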

As shown in Fig. 4, the speech recognition processing device comprises: a receiving module 10, an extraction module 20, a computing module 30, an establishing module 40, an acquisition module 50 and a processing module 60.

The acquisition module 50 is configured to obtain voice interaction information input by the user. The processing module 60 is configured to process the voice interaction information input by the user according to the decision model and to select a corresponding interaction strategy for voice interaction with the user. The interaction strategies may include, for example, a guidance strategy, an ignore strategy and a clarification strategy. When the speech recognition system identifies the user's voice interaction information as noise, it can actively guide the user to express themselves clearly; when the user's voice interaction information is ambiguous or only vaguely understood, the system should ask for confirmation. In other words, each dialog turn between the user and the speech recognition system may contain noise, an unclear answer, vague semantics or a complete response, and the speech recognition system can choose among strategies such as guiding, ignoring and clarifying.
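One way to picture how the six modules fit together is as pluggable callables wired in the order the text describes. The class name `SpeechRecognitionDevice`, the method names and every stub below are hypothetical; the sketch only shows the data flow between the modules and says nothing about how any module is implemented.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class SpeechRecognitionDevice:
    receive: Callable[[], Any]          # receiving module 10
    extract: Callable[[Any], Any]       # extraction module 20
    compute: Callable[[Any], float]     # computing module 30
    establish: Callable[[float], Any]   # establishing module 40
    acquire: Callable[[], Any]          # acquisition module 50
    process: Callable[[Any, Any], str]  # processing module 60

    def build_model(self) -> Any:
        """Modules 10 to 40: signal -> features -> feedback -> model."""
        signal = self.receive()
        features = self.extract(signal)
        feedback = self.compute(features)
        return self.establish(feedback)

    def respond(self, model: Any) -> str:
        """Modules 50 and 60: pick a strategy for the next user input."""
        return self.process(model, self.acquire())


# Stub modules, for illustration only.
device = SpeechRecognitionDevice(
    receive=lambda: "raw audio",
    extract=lambda signal: {"n_rej": 1},
    compute=lambda features: -float(features["n_rej"]),
    establish=lambda feedback: {"feedback": feedback},
    acquire=lambda: "um, um",
    process=lambda model, utterance: "guide",
)
model = device.build_model()
strategy = device.respond(model)
```

Keeping each module behind a plain callable makes it easy to swap a stub for a real recognizer or a trained decision model without changing the wiring.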

With the speech recognition processing device of the embodiment of the present invention, the voice information input by the user is processed based on the decision model, and every piece of voice information identified as valid input receives explicit feedback instead of being treated as noise. The voice interaction system can therefore give the user the most comfortable feedback, which makes voice interaction between the user and the speech recognition system smoother and improves the user experience.

It should be understood that each part of the present invention may be implemented in hardware, software, firmware or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following techniques well known in the art may be used: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.

In the present invention, unless otherwise expressly specified and limited, terms such as "mounted", "connected" and "coupled" should be understood broadly. For example, a connection may be fixed, detachable or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediary; and it may be an internal communication between two elements or an interaction between two elements, unless otherwise expressly limited. Those of ordinary skill in the art can understand the specific meanings of these terms in the present invention according to the specific circumstances.

In this specification, a description referring to the terms "one embodiment", "some embodiments", "an example", "a specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. Schematic uses of these terms in this specification do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, provided they do not conflict, those skilled in the art may combine the different embodiments or examples described in this specification and the features of those different embodiments or examples.

Although embodiments of the present invention have been shown and described above, it should be understood that these embodiments are exemplary and should not be construed as limiting the present invention. Those of ordinary skill in the art may make changes, modifications, substitutions and variations to the above embodiments within the scope of the present invention.

Claims

1. A speech recognition processing method, characterized by comprising the following steps:

receiving a voice signal;

extracting multiple pieces of feature information from the voice signal;

calculating a feedback function according to the multiple pieces of feature information in the voice signal; and

establishing a decision model for speech recognition according to the feedback function.

2. The speech recognition processing method according to claim 1, characterized in that the multiple pieces of feature information include a rejection flag, a semantic parsing result, a semantic parsing confidence and a language model confidence.

3. The speech recognition processing method according to claim 1 or 2, characterized in that the feedback function is calculated according to the following formula:

$$R = -\left(w_i n_i + w_e n_e + w_f n_f + w_{rej} n_{rej} + w_{s1} n_{sem} + w_{s2} f_{sem} + w_{lm} f_{lm}\right)$$

where $R$ denotes the feedback function, $n_i$ the number of dialog turns, $n_e$ the number of errors, $n_f$ the number of known slots, $n_{rej}$ the rejection flag, $n_{sem}$ the semantic parsing result, $f_{sem}$ the semantic parsing confidence, $f_{lm}$ the language model confidence, and $w$ the parameters.

4. The speech recognition processing method according to claim 3, wherein the decision model for speech recognition is established according to the following formula:

$$Q(s, a) = R(s, a) + r \sum_{s'} P(s' \mid s, a) \max_{a'} Q(s', a')$$

5. The speech recognition processing method according to any one of claims 1 to 4, characterized in that, after the decision model for speech recognition is established according to the feedback function, the method further comprises:

obtaining voice interaction information input by a user, processing the voice interaction information input by the user according to the decision model, and selecting a corresponding interaction strategy for voice interaction with the user.

6. A speech recognition processing device, characterized by comprising:

a receiving module for receiving a voice signal;

an extraction module for extracting multiple pieces of feature information from the voice signal;

a computing module for calculating a feedback function according to the multiple pieces of feature information in the voice signal; and

an establishing module for establishing a decision model for speech recognition according to the feedback function.

7. The speech recognition processing device according to claim 6, characterized in that the multiple pieces of feature information include a rejection flag, a semantic parsing result, a semantic parsing confidence and a language model confidence.

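8. The speech recognition processing device according to claim 6 or 7, characterized in that the computing module calculates the feedback function according to the following formula:

$$R = -\left(w_i n_i + w_e n_e + w_f n_f + w_{rej} n_{rej} + w_{s1} n_{sem} + w_{s2} f_{sem} + w_{lm} f_{lm}\right)$$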

9. The speech recognition processing device according to claim 8, wherein the establishing module establishes the decision model for speech recognition according to the following formula:

$$Q(s, a) = R(s, a) + r \sum_{s'} P(s' \mid s, a) \max_{a'} Q(s', a')$$

10. The speech recognition processing device according to any one of claims 6 to 9, characterized by further comprising:

an acquisition module for obtaining voice interaction information input by a user; and

a processing module for processing the voice interaction information input by the user according to the decision model and selecting a corresponding interaction strategy for voice interaction with the user.