CN103680500A - Speech recognition method and device

Info

Publication number: CN103680500A
Application number: CN201210314129.0A
Authority: CN
Inventors: 钱胜
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2012-08-29
Filing date: 2012-08-29
Publication date: 2014-03-26
Anticipated expiration: 2032-08-29
Also published as: CN103680500B

Abstract

The invention provides a speech recognition method and device. In the method, a context-dependent hidden Markov model (HMM) is adopted when the decoding network is trained, a sil (silence) model is added at word ends in the decoding network, and the acoustic contexts of the HMM states before and after the sil model are adjusted; the HMM state transition sequence of the speech to be recognized is then obtained through the decoding network. Furthermore, a transition from the end of the language model to its head is added in the decoding network to model the effect of an inter-sentence pause on the context information of the language model. The method and device improve the speech recognition result.

Description

Method and device for speech recognition

[technical field]

The present invention relates to the field of computer application technology, and in particular to a method and device for speech recognition.

[background technology]

Speech recognition technology enables a machine to convert speech signals into the corresponding text or commands through a process of recognition and understanding. As hidden Markov model (HMM) technology has matured and been steadily refined, it has become the mainstream approach to speech recognition.

An HMM builds a statistical model of the time-series structure of the speech signal and can be regarded mathematically as a doubly stochastic process: one is a hidden stochastic process that uses a Markov chain with a finite number of states to model how the statistical properties of the speech signal change; the other is the stochastic process of the observation sequence associated with each state of the Markov chain. The former manifests itself through the latter, but its specific parameters cannot be measured directly. Human speech production is in fact such a doubly stochastic process, and the speech signal itself is an observable time-varying sequence. The HMM models this process reasonably well and is therefore a fairly ideal speech model.

HMM-based speech recognition works by finding the optimal transition sequence among all possible HMM state transition sequences and taking the corresponding text as the recognition result. The decoding network describes all possible HMM state transitions, so speech recognition is the process of searching for the best transition sequence on the decoding network, and the recognition result must be one of the possibilities the decoding network can describe. During recognition, a sequence of HMM state transitions is called a path. Taking recognition of only the simple isolated words "中" (zhong) and "国" (guo) as an example, the decoding network is shown in Fig. 1, where the HMM state transition sequence for "中" is "zh", "ong", the one for "国" is "g", "uo", and <s> and </s> are the start and end symbols of the language model, respectively.
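The search for the best path can be illustrated with a minimal Viterbi sketch. The patent does not name a search algorithm; Viterbi is used here as a standard choice, and the function name `viterbi`, the two-state network and all probabilities are illustrative assumptions.

```python
import math

NEG_INF = float("-inf")

def log(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (best_log_score, best_state_path) for an observation sequence."""
    # best[s] holds the score and path of the best partial path ending in s.
    best = {s: (log_start[s] + log_emit[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        nxt = {}
        for s in states:
            # Pick the predecessor that maximises the score of reaching s.
            score, path = max(
                ((best[p][0] + log_trans[p][s], best[p][1]) for p in states),
                key=lambda item: item[0],
            )
            nxt[s] = (score + log_emit[s][obs], path + [s])
        best = nxt
    return max(best.values(), key=lambda item: item[0])

# Hypothetical two-state network for "zh" -> "ong"; "x" and "y" stand in for
# acoustic observations. None of these numbers come from the patent.
states = ["zh", "ong"]
log_start = {"zh": log(1.0), "ong": log(0.0)}
log_trans = {
    "zh": {"zh": log(0.5), "ong": log(0.5)},
    "ong": {"zh": log(0.0), "ong": log(1.0)},
}
log_emit = {
    "zh": {"x": log(0.9), "y": log(0.1)},
    "ong": {"x": log(0.2), "y": log(0.8)},
}

score, path = viterbi(["x", "x", "y", "y"], states, log_start, log_trans, log_emit)
print(path)  # ['zh', 'zh', 'ong', 'ong']
```

Working in log probabilities avoids numerical underflow on long utterances, which is why the sketch adds scores rather than multiplying probabilities.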

When people speak, they often pause because of thinking, hesitation, coughing, surprise, stuttering and similar causes. In the speech signal, a pause appears as a period with no sound, or with sound that is not speech, such as a cough or a sneeze. Pauses in speech are divided into intra-sentence pauses and inter-sentence pauses. As the names suggest, an intra-sentence pause is a pause made while saying a single sentence, and an inter-sentence pause is a pause between sentences when a speaker says several sentences.

Existing speech recognition generally assumes that silence occurs only at the beginning and end of the speech and that there are no pauses in the middle. As a result, when the speech does contain a pause, the pause may be misrecognized as a word with semantic content. More seriously, because speech recognition expands forward from the current state, this error directly affects the subsequent recognition and causes further errors in the result. The key to solving this problem is to recognize pauses in the speech correctly so that the subsequent recognition proceeds from a correct result; and the prerequisite for recognizing pauses correctly is that the decoding network correctly describes all possible HMM state transitions.

The conventional approach is to add a silence model (sil model) at word ends in the decoding network. When a pause is encountered during recognition, the sil model competes with the other models that carry semantic content; if the sil model prevails, the segment is recognized as a pause (also described as being absorbed by the sil model). Fig. 2 is a schematic diagram of adding sil models to the decoding network, in which <s> and </s> are the start and end symbols of the language model, respectively.

In practice, however, a pause in speech affects the pronunciation of the neighboring sounds, and the longer the pause, the greater the effect. In addition, at an inter-sentence pause the context information of the language model changes abruptly. The prior-art recognition methods cannot address these problems, so their recognition performance is limited.

[summary of the invention]

The invention provides a method and device for speech recognition, so as to improve the speech recognition result.

The specific technical solution is as follows:

A method for speech recognition, the method comprising:

when training a decoding network, adopting a context-dependent hidden Markov model (HMM), adding a silence (sil) model at word ends in the decoding network, and adjusting the acoustic context of the HMM states before and after the sil model;

using the decoding network to obtain the HMM state transition sequence of the speech to be recognized.

According to a preferred embodiment of the present invention, the HMM states in the context-dependent HMM depend on the context of the phoneme;

adjusting the acoustic context of the HMM states before and after the sil model specifically comprises: replacing with sil the right context (the following phoneme) of the phoneme in the HMM state that precedes the sil model in the decoding network, and replacing with sil the left context (the preceding phoneme) of the phoneme in the HMM state that follows the sil model in the decoding network.

According to a preferred embodiment of the present invention, the method further comprises: adding, at the end of the language model in the decoding network, a transition to the head of the language model.

According to a preferred embodiment of the present invention, the method further comprises: after the language model is queried on the basis of the HMM state transition sequence and the optimal path is determined, determining that an inter-sentence pause exists if the optimal path contains a transition from the end of the language model to its head.

According to a preferred embodiment of the present invention, the method further comprises:

adding a punctuation mark at the position of the inter-sentence pause according to the optimal path of the speech to be recognized.

A device for speech recognition, the device comprising:

a network training unit, configured to adopt a context-dependent hidden Markov model (HMM) when training a decoding network, add a silence (sil) model at word ends in the decoding network, and adjust the acoustic context of the HMM states before and after the sil model;

a path determining unit, configured to use the decoding network to obtain the HMM state transition sequence of the speech to be recognized.

When adjusting the acoustic context of the HMM states before and after the sil model, the network training unit specifically replaces with sil the right context of the phoneme in the HMM state preceding the sil model in the decoding network, and replaces with sil the left context of the phoneme in the HMM state following the sil model in the decoding network.

According to a preferred embodiment of the present invention, the network training unit is further configured to add, at the end of the language model in the decoding network, a transition to the head of the language model.

According to a preferred embodiment of the present invention, the path determining unit is further configured to query the language model on the basis of the HMM state transition sequence to determine the optimal path;

the device further comprises:

a pause recognition unit, configured to determine that an inter-sentence pause exists if the optimal path determined by the path determining unit contains a transition from the end of the language model to its head.

According to a preferred embodiment of the present invention, the pause recognition unit is further configured to add a punctuation mark at the position of the inter-sentence pause according to the optimal path of the speech to be recognized.

As can be seen from the above technical solutions, the present invention adopts context-dependent HMMs when training the decoding network, adds a sil model at word ends in the decoding network, and adjusts the acoustic context of the HMM states before and after the sil model, thereby modeling the effect of pauses on the acoustic-model context. Speech recognition performed on the basis of this decoding network achieves an improved recognition result.

[brief description of the drawings]

Fig. 1 is a simplified example of a decoding network;

Fig. 2 is a schematic diagram of adding sil models to the decoding network in the prior art;

Fig. 3 is a schematic diagram of a decoding network provided by an embodiment of the present invention;

Fig. 4 is another schematic diagram of a decoding network provided by an embodiment of the present invention;

Fig. 5 is a structural diagram of the speech recognition device provided by an embodiment of the present invention.

[detailed description]

To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in detail below with reference to the drawings and specific embodiments.

Speech recognition in practice relies on a trained decoding network. That is, speech recognition comprises at least two processes: the first is training the decoding network, and the second is recognizing the speech to be recognized on the basis of the decoding network. The recognition process involves querying the acoustic model and querying the language model. The acoustic model query obtains the HMM state transition sequence of the speech to be recognized by querying the acoustic model on the basis of the decoding network (the acoustic model used in the embodiments of the present invention comprises the HMMs and the sil model); the language model query queries the language model on the basis of the decoding network, so as to determine the optimal path and obtain the recognition result.

In the embodiments of the present invention, when the decoding network is trained, a context-dependent HMM is adopted, a sil model is added at word ends in the decoding network, and the acoustic context of the HMM states before and after the sil model is adjusted.

Context-dependent HMMs are first introduced briefly. In a context-dependent HMM, the HMM that describes a given phoneme differs according to the phoneme's acoustic context. Take "中国" (China) as an example. When context-independent HMMs are used, the HMM state transition sequence is "zh", "ong", "g", "uo", and the decoding network is as shown in Fig. 1. When context-dependent HMMs are used, the HMM state transition sequence is "zh+ong", "zh-ong+g", "ong-g+uo", "g-uo", where "+" introduces the right context (the following phoneme) and "-" marks the left context (the preceding phoneme). For example, "zh+ong" denotes the state of "zh" when its right context is "ong", "zh-ong+g" denotes the state of "ong" whose left context is "zh" and whose right context is "g", and "g-uo" denotes the state of "uo" whose left context is "g".
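The "-"/"+" notation can be made concrete with a short sketch that derives context-dependent labels from a phoneme sequence. The function name `context_dependent_labels` is a hypothetical helper introduced here for illustration only.

```python
def context_dependent_labels(phonemes):
    """Build 'left-phone+right' labels; a missing context is simply omitted."""
    labels = []
    for i, phone in enumerate(phonemes):
        left = phonemes[i - 1] + "-" if i > 0 else ""
        right = "+" + phonemes[i + 1] if i < len(phonemes) - 1 else ""
        labels.append(f"{left}{phone}{right}")
    return labels

print(context_dependent_labels(["zh", "ong", "g", "uo"]))
# ['zh+ong', 'zh-ong+g', 'ong-g+uo', 'g-uo']
```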

In the embodiments of the present invention, a sil model is added at word ends in the decoding network. The sil model is the HMM used in speech recognition to describe silence, noise, non-speech sounds, pauses and the like. Because a pause in speech affects the pronunciation of neighboring sounds, after the sil model is added at word ends the acoustic context needs to be adjusted so that the acoustic context around the newly added sil model satisfies the context-dependency principle. Specifically, the right context of the phoneme in the HMM state preceding the sil model in the decoding network is replaced with sil, and the left context of the phoneme in the HMM state following the sil model in the decoding network is replaced with sil. As shown in Fig. 3, the right context of "ong" is replaced with "sil", the right context of "uo" is likewise replaced with "sil", the left context of "zh" is replaced with "sil", and the left context of "g" is replaced with "sil".

After the sil model is added and the acoustic context adjusted as above, and again taking "中国" as the example, the HMM state transition sequence is "sil-zh+ong", "zh-ong+sil", "sil", "sil-g+uo", "g-uo+sil".
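The context adjustment around an inserted sil model can be sketched as follows, under the "-"/"+" notation for left and right context. The helper name `labels_with_pauses` and the choice to treat the utterance boundaries as sil are assumptions made for this illustration.

```python
SIL = "sil"

def labels_with_pauses(words, pause_after):
    """Label phonemes with left/right context, inserting sil after chosen words.

    words:       list of phoneme lists, one per word
    pause_after: set of word indices that are followed by a pause
    """
    units = [SIL]  # assumed: the utterance starts in silence
    for idx, word in enumerate(words):
        units.extend(word)
        if idx in pause_after and idx < len(words) - 1:
            units.append(SIL)  # sil model added at the word end
    units.append(SIL)  # assumed: the utterance ends in silence

    labels = []
    for i in range(1, len(units) - 1):
        if units[i] == SIL:
            labels.append(SIL)  # the sil model itself carries no context
        else:
            # A neighbor that is sil becomes the left or right context.
            labels.append(f"{units[i - 1]}-{units[i]}+{units[i + 1]}")
    return labels

print(labels_with_pauses([["zh", "ong"], ["g", "uo"]], {0}))
# ['sil-zh+ong', 'zh-ong+sil', 'sil', 'sil-g+uo', 'g-uo+sil']
```

Without the pause (`pause_after=set()`), "ong" and "g" keep each other as context, which is the difference the adjustment is meant to capture.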

The sil model added in this way is a single, unified silence model. The effect of a pause on the acoustic context lies mainly in its influence on the pronunciation of the neighboring phonemes rather than in the silence phoneme itself. Adjusting the acoustic context around the inserted sil model describes exactly this influence and can therefore improve recognition.

During speech recognition, the way the acoustic model and the language model are queried and the way the optimal path is determined are unchanged. While the optimal path is being determined, the sil model competes with the other HMMs at each word end; if the sil model wins, the speech at that point is recognized as sil.

For the special case of an inter-sentence pause, the context information of the language model changes abruptly at the pause. Suppose the content of a segment of speech is W1, W2, W3, W4, with a pause between W2 and W3. If the segment is a single sentence, the pause is an intra-sentence pause and the corresponding optimal path is: <s> W1 W2 W3 W4 </s>. If the segment consists of two sentences, the pause is an inter-sentence pause and the corresponding optimal path is: <s> W1 W2 </s> <s> W3 W4 </s>. That is, the language-model right context of W2 changes from W3 to </s>, and the language-model left context of W3 changes from W2 to <s>. To recognize inter-sentence pauses, the embodiments of the present invention may further add, at the end </s> of the language model in the decoding network, a transition to the head <s> of the language model, as shown in Fig. 4.

During speech recognition, at a point where the language model may end, </s> competes with the other language-model successors as the right context; for an inter-sentence pause, </s> wins. Taking the speech W1, W2, W3, W4 again as the example, when W2 has been recognized, its language-model right context is decided by a competition between </s> and W3: if the pause is an inter-sentence pause, </s> wins; if it is an intra-sentence pause, W3 wins.
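The competition between the sentence-end symbol and W3 can be sketched with a toy bigram language model. The patent does not specify the language model type or any weights; the bigram form, the table `BIGRAM`, its made-up probabilities and the weight 1.0 on the end-to-head transition are all assumptions.

```python
import math

# Hypothetical bigram probabilities P(next | previous); values are made up.
BIGRAM = {
    ("<s>", "W1"): 0.5,
    ("W1", "W2"): 0.4,
    ("W2", "W3"): 0.05,   # W3 rarely follows W2 inside one sentence
    ("W2", "</s>"): 0.3,  # W2 often ends a sentence
    ("</s>", "<s>"): 1.0, # the added end-to-head transition (assumed weight)
    ("<s>", "W3"): 0.4,
    ("W3", "W4"): 0.5,
    ("W4", "</s>"): 0.6,
}

def path_log_prob(tokens):
    """Sum of log bigram probabilities along a token path."""
    return sum(math.log(BIGRAM[(a, b)]) for a, b in zip(tokens, tokens[1:]))

one_sentence = ["<s>", "W1", "W2", "W3", "W4", "</s>"]
two_sentences = ["<s>", "W1", "W2", "</s>", "<s>", "W3", "W4", "</s>"]

best = max([one_sentence, two_sentences], key=path_log_prob)
print(best == two_sentences)  # True with these made-up numbers
```

With these numbers the two-sentence path wins because the low probability of W3 following W2 outweighs the cost of ending and restarting the sentence; a real decoder would also combine acoustic scores.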

After the transition from the end </s> of the language model to its head <s> has been added to the decoding network, recognition with this decoding network proceeds as follows: the language model is queried on the basis of the HMM state transition sequence obtained from the acoustic model query, and once the optimal path is determined, an inter-sentence pause is determined to exist if the optimal path contains a transition from the end of the language model to its head. Take the decoding network shown in Fig. 4 as an example. Because a transition from the language model end </s> to its head <s> has been added, during optimal path computation the transition from "中" to "国" is joined by a competing transition from "中" to a pause. If the transition from "中" to the pause wins, "中" is the end of a sentence and the pause between "中" and "国" is an inter-sentence pause. In the optimal path this appears as the transition from the language model end </s> after "中" to the head <s>, and its marker in the recognition result is "</s><s>".
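Detecting an inter-sentence pause then amounts to looking for an end symbol immediately followed by a start symbol in the optimal path. The following sketch assumes the path is available as a list of tokens; the function names are hypothetical.

```python
def has_inter_sentence_pause(best_path):
    """True if the path contains an end-of-sentence to start-of-sentence jump."""
    return any(a == "</s>" and b == "<s>" for a, b in zip(best_path, best_path[1:]))

def split_sentences(best_path):
    """Split a token path into sentences at the <s> ... </s> boundaries."""
    sentences, current = [], []
    for token in best_path:
        if token == "<s>":
            current = []               # a new sentence starts
        elif token == "</s>":
            sentences.append(current)  # the current sentence is complete
        else:
            current.append(token)
    return sentences

path = ["<s>", "W1", "W2", "</s>", "<s>", "W3", "W4", "</s>"]
print(has_inter_sentence_pause(path))  # True
print(split_sentences(path))           # [['W1', 'W2'], ['W3', 'W4']]
```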

Once inter-sentence pauses have been recognized, punctuation marks can be added at their positions in the recognition result. The present invention does not limit the type of punctuation mark added; for example, different marks may be chosen according to pause duration, such as a comma for a shorter pause and a full stop for a longer one.
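Duration-based punctuation could look like the sketch below. The patent leaves the rule open, so the function `punctuate`, the 0.5-second threshold and the availability of pause durations in seconds are all assumptions.

```python
def punctuate(sentences, pause_durations, long_pause=0.5):
    """Join sentences, choosing a mark from the pause that follows each one.

    sentences:       list of word lists
    pause_durations: pause length in seconds after each sentence but the last
    long_pause:      hypothetical threshold separating comma from full stop
    """
    parts = []
    for i, words in enumerate(sentences):
        text = " ".join(words)
        if i < len(sentences) - 1:
            text += "." if pause_durations[i] >= long_pause else ","
        parts.append(text)
    return " ".join(parts)

print(punctuate([["W1", "W2"], ["W3", "W4"]], [0.8]))  # W1 W2. W3 W4
print(punctuate([["W1", "W2"], ["W3", "W4"]], [0.2]))  # W1 W2, W3 W4
```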

The method provided by the present invention has been described in detail above; the device provided by the present invention is described in detail below.

Fig. 5 is a structural diagram of the speech recognition device provided by an embodiment of the present invention. As shown in Fig. 5, the device may comprise a network training unit 500 and a path determining unit 510.

When training the decoding network, the network training unit 500 adopts a context-dependent HMM, adds a sil model at word ends in the decoding network, and adjusts the acoustic context of the HMM states before and after the sil model.

The path determining unit 510 uses the decoding network to obtain the HMM state transition sequence of the speech to be recognized.

In the context-dependent HMM described above, the HMM states depend on the context of the phoneme; that is, the HMM describing a given phoneme differs according to the phoneme's acoustic context. In this case, when adjusting the acoustic context of the HMM states before and after the sil model, the network training unit 500 specifically replaces with sil the right context of the phoneme in the HMM state preceding the sil model in the decoding network, and replaces with sil the left context of the phoneme in the HMM state following the sil model in the decoding network.

In addition to mitigating the effect of pauses on the acoustic context as described above, and to handle the abrupt change in language-model context information at an inter-sentence pause, the network training unit 500 is further configured to add, at the end </s> of the language model in the decoding network, a transition to the head <s> of the language model.

Obtaining the HMM state transition sequence as described above is the acoustic model query. The optimal path can additionally be determined in combination with the language model query: the path determining unit 510 is further configured to query the language model on the basis of the HMM state transition sequence and determine the optimal path.

Further, the device may also comprise a pause recognition unit 520, configured to determine that an inter-sentence pause exists if the optimal path determined by the path determining unit 510 contains a transition from the end </s> of the language model to its head <s>. A further application may be, for example, adding punctuation marks at the positions of inter-sentence pauses.

As can be seen from the above description, the method and device provided by the present invention have the following advantages:

1) The present invention adopts context-dependent HMMs when training the decoding network, adds a sil model at word ends in the decoding network, and adjusts the acoustic context of the HMM states before and after the sil model, thereby modeling the effect of pauses on the acoustic-model context. Speech recognition performed on the basis of this decoding network achieves an improved recognition result.

2) The present invention adds to the decoding network a transition from the end of the language model to its head in order to model inter-sentence pauses. This addresses the effect of the abrupt change in language-model context information caused by inter-sentence pauses and further improves the recognition result.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit the invention. Any modification, equivalent replacement, improvement or the like made within the spirit and principles of the present invention shall fall within the scope of protection of the invention.

Claims

1. A method for speech recognition, characterized in that the method comprises:

2. The method according to claim 1, characterized in that the HMM states in the context-dependent HMM depend on the context of the phoneme;

3. The method according to claim 1, characterized in that the method further comprises: adding, at the end of the language model in the decoding network, a transition to the head of the language model.

4. The method according to claim 3, characterized in that the method further comprises: after the language model is queried on the basis of the HMM state transition sequence and the optimal path is determined, determining that an inter-sentence pause exists if the optimal path contains a transition from the end of the language model to its head.

5. The method according to claim 4, characterized in that the method further comprises:

6. A device for speech recognition, characterized in that the device comprises:

7. The device according to claim 6, characterized in that the HMM states in the context-dependent HMM depend on the context of the phoneme;

8. The device according to claim 6, characterized in that the network training unit is further configured to add, at the end of the language model in the decoding network, a transition to the head of the language model.

9. The device according to claim 8, characterized in that the path determining unit is further configured to query the language model on the basis of the HMM state transition sequence to determine the optimal path;

the device further comprises:

10. The device according to claim 9, characterized in that the pause recognition unit is further configured to add a punctuation mark at the position of the inter-sentence pause according to the optimal path of the speech to be recognized.