CN104078039A

CN104078039A - Voice recognition system of domestic service robot on basis of hidden Markov model

Info

Publication number: CN104078039A
Application number: CN201310102175.9A
Authority: CN
Inventors: 刘治; 苏敏发; 谢杰腾
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2013-03-27
Filing date: 2013-03-27
Publication date: 2014-10-01

Abstract

The invention discloses a speech recognition system for a domestic service robot based on a hidden Markov model, and belongs to the field of speech recognition. The overall processing chain consists of speech signal filtering, sampling, quantization, windowing, endpoint detection, feature extraction, model training and threshold comparison. The filtering operation removes low-frequency interference. A speech signal is a continuous, time-varying analog signal, so it must be sampled and quantized to obtain a discrete digital signal. Framing divides the original signal into segments, which is equivalent to applying a rectangular window to the original signal in the time domain. Multiplication by the rectangular window in the time domain is equivalent to convolving the signal spectrum with the Fourier transform of the rectangular window in the frequency domain. Endpoint detection uses a double-threshold endpoint detection algorithm. Mel-frequency cepstral coefficients are used as the speech feature parameters, the feature parameters are trained with the hidden Markov model, the result is matched against an established template library, and the matching score is compared with a threshold to obtain the recognition result.

Description

Household service robot speech recognition system based on Hidden Markov Model (HMM)

Technical field

The invention belongs to the field of speech recognition systems, and specifically relates to a speech signal model training and recognition method based on the Hidden Markov Model (HMM).

Background technology

Speech recognition is the process by which a machine identifies a human speech signal and converts it into the corresponding text or command. Its ultimate goal is free man-machine dialogue, in which information is exchanged as in conversation between people. In other words, the machine is given a sense of hearing so that it can understand human language and distinguish both the content of the speech and the speaker, and can then act according to the person's intent, freeing people from heavy or dangerous work.

Research on speech recognition technology draws on many disciplines, including acoustics, linguistics, phonetics, physiology, digital signal processing, communication theory, electronics, computer science, pattern recognition and artificial intelligence. A speech recognition system with good recognition performance therefore has to take many factors into account, including the speaker's psychological state, the input device and the speaking environment.

In recent years, the most active topics in the field of speech recognition have been robust speech recognition, speaker adaptation, large-vocabulary keyword recognition, confidence evaluation algorithms for speech recognition, class-based language models and adaptive language models, and deep natural language understanding. Research is also increasingly focused on spoken dialogue systems. Speaker adaptation has made considerable progress, and some relatively mature techniques have emerged, such as vocal tract normalization, maximum likelihood linear regression (MLLR) and Bayesian adaptive estimation. Speaker-independent, large-vocabulary, continuous speech recognition remains the focus and the main difficulty of current speech recognition research.

Speech recognition technology mainly comprises the following modules: speech signal preprocessing, feature parameter extraction, template library construction, recognition decision and threshold comparison. The speech signal is input from a microphone and preprocessed; preprocessing comprises pre-filtering, sampling and quantization, pre-emphasis, windowing and endpoint detection. After preprocessing, feature parameters are extracted from the signal, and the extracted parameter sequences are used to build and store a speech parameter template library. During recognition, speech is input from the microphone and passes through preprocessing and feature extraction; the extracted feature parameters are scored probabilistically and matched against the template library, and the matching result is compared with a threshold to obtain the final recognition result.

Summary of the invention

The present invention is a speech recognition system based on Hidden Markov Model (HMM) training, and the system is simulated mainly in MATLAB. The speech signal is first filtered, then sampled and quantized to obtain a discrete digital signal, and then pre-emphasized; the purpose of pre-emphasis is to remove low-frequency interference. A speech signal is a typical non-stationary, time-varying signal, so it is divided into frames. Framing cuts the original signal into segments, which is equivalent to applying a rectangular window to the original signal in the time domain. Multiplication by a rectangular window in the time domain is equivalent to convolving the signal spectrum with the Fourier transform of the rectangular window in the frequency domain, so each frame is windowed after framing; this patent uses the Hamming window. The purpose of endpoint detection is to determine the start point and end point of speech within a signal segment that contains speech. Only when the start and end points of the speech segment are found accurately can the collected data be the speech signal that is actually to be analyzed; this patent uses a double-threshold endpoint detection algorithm. Speech recognition is a matching process: the input speech signal is analyzed, the required features are extracted, and matching templates are built from the extracted feature parameters. Feature parameters must therefore be extracted from the speech signal, and this patent adopts a feature parameter that reflects the human auditory mechanism well, the Mel-frequency cepstral coefficient (MFCC). Model training of the speech signal is the core of the speech recognition system. A hidden Markov model (HMM) is a doubly stochastic process: one stochastic process describes the statistical characteristics of the short-time stationary segments of a non-stationary signal (the instantaneous characteristics of the signal, which can be observed directly); the other describes how each short-time stationary segment transitions to the next, that is, the dynamics of the short-time statistical characteristics (which are implicit in the observation sequence). Human speech production is also such a doubly stochastic process, so an HMM describes the production of the speech signal very accurately.

Brief description of the drawings

Fig. 1 is an overall block diagram of the recognition process of the speech recognition system.

Fig. 2 is a block diagram of speech signal endpoint detection.

Fig. 3 is a block diagram of hidden Markov model training for the speech signal.

Embodiment

Before a speech signal can be processed it must be digitized, that is, passed through analog-to-digital (A/D) conversion. A/D conversion consists of two steps, sampling and quantization, and yields a signal that is discrete in both time and amplitude. According to the Nyquist sampling theorem, the sampling frequency must be more than twice the highest frequency of the original signal so that no information is lost during sampling and the waveform of the original signal can be reconstructed accurately from the samples.

1) Speech signal preprocessing

Before a speech signal is analyzed, it is generally boosted (pre-emphasized). The purpose is to remove low-frequency interference, especially 50 Hz or 60 Hz mains interference, and to boost the high-frequency part that is useful for speech recognition, so that the spectrum of the signal becomes flatter and spectral analysis or vocal tract parameter analysis is easier. Pre-emphasis passes the speech signal through a first-order high-pass filter 1 - 0.9375 z^{-1}, usually called the pre-emphasis filter. Its transfer function is:

H (z) = 1 - 0.9375 z^{-1}

If s(n) is the speech signal before pre-emphasis, the signal \bar{s}(n) obtained after the pre-emphasis filter is:

\bar{s} (n) = s (n) - 0.9375 s (n - 1)
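The difference equation above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `pre_emphasis` is hypothetical, and passing the first sample through unchanged (because s(-1) is not available) is an assumption not stated in the patent.

```python
def pre_emphasis(signal, coeff=0.9375):
    """Apply s_bar(n) = s(n) - coeff * s(n - 1) to a list of samples.

    Assumption: the first sample is passed through unchanged,
    because s(-1) is not defined.
    """
    if not signal:
        return []
    out = [signal[0]]
    for n in range(1, len(signal)):
        out.append(signal[n] - coeff * signal[n - 1])
    return out


# A constant (low-frequency) input is strongly attenuated after the first sample.
filtered = pre_emphasis([1.0, 1.0, 1.0, 2.0])
```

A slowly varying input is almost cancelled, while a jump between neighbouring samples is preserved, which is the high-pass behaviour described in the text.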

A speech signal is non-stationary and time-varying, but over a short interval (generally taken to be 10-30 ms) its characteristics remain essentially unchanged, so it can be regarded as a quasi-stationary process and divided into frames. The frame rate is typically about 33-100 frames per second, depending on the application. Frames can be taken as contiguous segments, but overlapping segments are generally used so that the transition between adjacent frames is smooth and continuity is preserved. The overlapping part of one frame and the next is called the frame shift, and the ratio of frame shift to frame length is generally taken as 0-0.5. Framing cuts the original signal into segments, which is equivalent to applying a rectangular window to the original signal in the time domain. Multiplication by a rectangular window in the time domain is equivalent to convolving the signal spectrum with the Fourier transform of the rectangular window in the frequency domain, which alters the spectrum of the original signal. For this reason each frame is windowed after framing, which gives the windowed speech signal s(w):

s (w) = \bar{s} (n) \cdot w (n)

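The window functions commonly used in digital speech processing are the Hanning window and the Hamming window. This patent uses the Hamming window, whose standard form for a window of length N is:

w (n) = 0.54 - 0.46 \cos (\frac{2 π n}{N - 1}), 0 \leq n \leq N - 1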
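Framing and Hamming windowing can be sketched as follows. The sketch assumes the standard Hamming definition w(n) = 0.54 - 0.46 cos(2πn/(N-1)); the names `hamming`, `frame_and_window`, `frame_len` and `hop` are hypothetical, and `hop` here means the step between the starts of consecutive frames, so the overlap is `frame_len - hop`.

```python
import math


def hamming(N):
    """Standard Hamming window of length N (assumed definition)."""
    if N == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]


def frame_and_window(signal, frame_len, hop):
    """Split signal into overlapping frames and multiply each by a Hamming window.

    hop is the step between frame starts (an assumption of this sketch);
    trailing samples that do not fill a whole frame are dropped.
    """
    w = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append([signal[start + n] * w[n] for n in range(frame_len)])
    return frames


# Ten samples, frames of 4 samples starting every 2 samples.
frames = frame_and_window([1.0] * 10, frame_len=4, hop=2)
```

Dropping the incomplete final frame keeps the sketch short; zero-padding the tail would be an equally reasonable choice.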

2) Speech signal endpoint detection

The purpose of speech signal endpoint detection is to determine the start point and end point of speech within a signal segment. Correct endpoint detection is a prerequisite for a high recognition rate, because only when the start and end points of the speech segment are found correctly can the collected data be the speech signal that is actually to be analyzed. This patent uses double-threshold endpoint detection. Unvoiced sound and noise are distinguished according to two characteristic parameters of the speech signal, the short-time energy and the zero-crossing rate, and endpoint detection is completed on that basis. Short-time average energy provides a basis for separating unvoiced and voiced segments, because the short-time average energy of an unvoiced segment is significantly smaller than that of a voiced segment. After the speech signal is divided into frames and the short-time average energy of each frame is computed, setting a single threshold is in principle enough to implement a simple endpoint detection algorithm.

The short-time energy of the speech signal is defined as:

E_{n} = Σ_{m = - \infty}^{\infty} T [x (m)] \cdot h (n - m) = Σ_{m = - \infty}^{\infty} [x (m) w (n - m)]^{2}

= Σ_{m = n - N + 1}^{n} x^{2} (m) \cdot h (n - m) = x^{2} (n) * h (n)

where T[x(m)] = x^{2}(m) is the squaring transform, h(n) = w^{2}(n), w(n) is the window function, N is the window length, and * denotes convolution.
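A minimal sketch of the short-time energy, written directly from the form E_n = Σ [x(m) w(n - m)]^2. It assumes the window is nonzero only for indices 0 to N - 1, so that only the N samples ending at sample n contribute; the function name `short_time_energy` is hypothetical.

```python
def short_time_energy(x, n, window):
    """E_n = sum over m of [x(m) * w(n - m)]^2.

    Assumption: window[i] is defined for i = 0..N-1 and zero elsewhere,
    so only samples m = n-N+1 .. n (clipped at 0) contribute.
    """
    N = len(window)
    total = 0.0
    for m in range(max(0, n - N + 1), n + 1):
        total += (x[m] * window[n - m]) ** 2
    return total


# Rectangular window of length 2 on a short ramp.
energy_at_3 = short_time_energy([1.0, 2.0, 3.0, 4.0], 3, [1.0, 1.0])
```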

The short-time zero-crossing rate is a simple way to estimate the frequency of a sinusoid. A zero crossing occurs whenever two adjacent samples of a discrete signal have different signs. The number of times the waveform crosses the zero level within one frame is called the zero-crossing rate. The short-time zero-crossing rate of the speech signal is defined as:

Z_{n} = Σ_{m = - \infty}^{\infty} | sgn [x (m)] - sgn [x (m - 1)] | \cdot w (n - m)

where sgn[·] is the sign function:

sgn [x (n)] = \begin{cases} 1, & x (n) \geq 0 \\ 0, & x (n) < 0 \end{cases}
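The zero-crossing count of one frame can be sketched as below. The sketch uses the sign convention given above (1 for non-negative samples, 0 otherwise) and assumes a rectangular window with w = 1 over the frame, so the result is simply the number of sign changes; the function names are hypothetical.

```python
def sgn(v):
    """Sign convention of the text: 1 for v >= 0, otherwise 0."""
    return 1 if v >= 0 else 0


def zero_crossings(frame):
    """Number of sign changes between adjacent samples of one frame.

    Assumption: rectangular window with w = 1 over the frame.
    """
    return sum(abs(sgn(frame[i]) - sgn(frame[i - 1])) for i in range(1, len(frame)))


crossings = zero_crossings([0.5, -0.2, -0.3, 0.4])
```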

The double-threshold endpoint detection algorithm proceeds as follows. Before endpoint detection starts, two thresholds are set for each of the short-time average energy and the zero-crossing rate. One is a low threshold: its value is small, it is sensitive to changes in the signal, and it is easily exceeded. The other is a high threshold: its value is larger, and the signal must reach a certain strength before it is exceeded. Exceeding the low threshold does not necessarily mark the start of speech, since it may be caused by short bursts of noise, whereas exceeding the high threshold is taken to be caused by the speech signal. Endpoint detection is divided into four stages: silence, transition, speech and end. In the program, a variable records the current state of the speech signal. In the silence stage, if the energy or the zero-crossing rate exceeds the low threshold, the start point is tentatively marked and the transition stage is entered. In the transition stage the parameter values are still small, so it is not certain that the signal is really in a speech segment; if both parameters fall back below the low threshold, the state returns to silence, and if either parameter exceeds the high threshold, the signal is taken to have entered the speech stage. Bursts of noise can also produce high short-time energy or zero-crossing rate, but they usually do not last long enough, so a minimum-duration threshold is introduced as well. In the speech stage, if both parameters fall below the low threshold and the total elapsed length is less than the minimum-duration threshold, the segment is regarded as noise and scanning of the subsequent speech signal continues. Otherwise the end point is marked, and endpoint detection finishes and returns.
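The state machine described above can be sketched per frame as follows. This is an illustrative reading of the text, not the patent's own program: the function name, the argument names and the choice to count duration in frames from the tentative start point are all assumptions.

```python
def detect_endpoints(energy, zcr, e_low, e_high, z_low, z_high, min_frames):
    """Double-threshold endpoint detection over per-frame energy and ZCR.

    Returns (start_frame, end_frame) of the first detected speech segment,
    or None if no segment of at least min_frames frames is found.
    """
    SILENCE, TRANSITION, SPEECH = 0, 1, 2
    state, start, count = SILENCE, None, 0
    for i, (e, z) in enumerate(zip(energy, zcr)):
        above_low = e > e_low or z > z_low
        above_high = e > e_high or z > z_high
        if state == SILENCE:
            if above_low:
                # tentatively mark the start point
                state, start, count = TRANSITION, i, 1
        elif state == TRANSITION:
            if above_high:
                state = SPEECH
                count += 1
            elif above_low:
                count += 1
            else:
                # both parameters fell below the low thresholds
                state, start, count = SILENCE, None, 0
        else:  # SPEECH
            if above_low:
                count += 1
            elif count < min_frames:
                # too short: treat as a noise burst and keep scanning
                state, start, count = SILENCE, None, 0
            else:
                return start, i - 1
    if state == SPEECH and count >= min_frames:
        return start, len(energy) - 1
    return None


segment = detect_endpoints(
    energy=[0, 0, 2, 6, 7, 6, 5, 0, 0], zcr=[0] * 9,
    e_low=1, e_high=5, z_low=10, z_high=20, min_frames=3,
)
```

In the usage example the energy rises above the low threshold at frame 2 and above the high threshold at frame 3, so the tentative start at frame 2 is kept.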

3) Feature parameter extraction from the speech signal

There are several methods for extracting feature parameters from a speech signal. Linear prediction coefficients (LPC) are based on the speech production mechanism and describe the characteristics of the vocal tract; linear prediction cepstral coefficients (LPCC) are parameters derived from LPC. Neither of these makes full use of the auditory characteristics of the human ear. The human auditory system is in fact a special nonlinear system whose sensitivity differs across frequencies and follows an approximately logarithmic relationship. This patent uses Mel-frequency cepstral coefficients (MFCC) as the feature parameters of the speech signal.

Mel-frequency cepstral coefficients (MFCC) are obtained from the spectrum of the signal by first mapping the frequency axis onto the Mel frequency scale and then transforming to the cepstral domain to obtain cepstral coefficients. The Mel is a unit of pitch, reflecting how the human auditory system perceives sound frequency. The relationship between the Mel frequency scale and frequency is:

f_{mel} = 2595 \log_{10} (1 + \frac{f}{700})

where f is the actual linear frequency and f_{mel} is the Mel frequency.
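The Mel mapping and its inverse can be written directly from the formula. The inverse `mel_to_hz` is not given in the text; it is obtained here by solving the formula for f, and both function names are hypothetical.

```python
import math


def hz_to_mel(f):
    """f_mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse mapping, derived by solving the formula above for f."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


mel_1k = hz_to_mel(1000.0)  # close to 1000 Mel by construction of the scale
```

The inverse is useful when filter center frequencies are spaced evenly on the Mel axis and then have to be mapped back to linear frequency.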

The MFCC feature parameters are computed as follows:

1. Preprocess the speech signal, dividing it into frames and windowing it to obtain short-time signals.

2. After the speech signal has been framed and windowed into short-time signals, the time-domain signals s(w) are converted by FFT into frequency-domain signals p(f), from which the short-time energy spectrum p(w) is computed:

p (w) = | p (f) |^{2} = | X (e^{jw}) |^{2}

3. p(w) is converted from a spectrum on the frequency (Hz) axis to p(Mel) on the Mel axis, where Mel denotes the Mel frequency, using the conversion:

f_{mel} = 2595 \log_{10} (1 + \frac{f}{700})

4. A bank of triangular band-pass filters is placed on the Mel axis to obtain the filter bank H_{m}(k), and the output of the energy spectrum p(Mel) on the Mel axis through this filter bank is computed:

θ_{m} = \ln [Σ_{k} | X (k) |^{2} H_{m} (k)], m = 1, 2, ..., K

where k is the frequency index, m indexes the filters, and K is the number of filters. H_{m}(k) is the m-th Mel filter; the center frequencies f(m), m = 1, 2, ..., K, are distributed along the Mel frequency axis starting from 0. The filters are designed as:

H_{m} (k) = \begin{cases} 0, & k < f (m - 1) \text{ or } k > f (m + 1) \\ \frac{k - f (m - 1)}{f (m) - f (m - 1)}, & f (m - 1) \leq k \leq f (m) \\ \frac{f (m + 1) - k}{f (m + 1) - f (m)}, & f (m) < k \leq f (m + 1) \end{cases}

θ_{m} is the output energy of the m-th filter. The Mel-frequency cepstrum C_{Mel}(n) can then be obtained from the Mel-scale spectrum with a modified inverse discrete cosine transform (IDCT):

C_{Mel} (n) = Σ_{m = 1}^{K} θ_{m} \cos (n (m - 0.5) \frac{π}{K}), 1 \leq n \leq P, P = K / 2
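The filter-bank step and the cosine transform can be sketched as follows, with filter index m and frequency-bin index k as in the piecewise definition of H_m(k). This is a sketch under stated assumptions: the filter edges are passed in as already-computed bin indices, the small floor inside the logarithm is added only to avoid log(0), and all function names are hypothetical.

```python
import math


def tri_filter(k, f_prev, f_center, f_next):
    """Triangular weight H_m(k) with edges f(m-1) < f(m) < f(m+1) (bin indices)."""
    if k < f_prev or k > f_next:
        return 0.0
    if k <= f_center:
        return (k - f_prev) / (f_center - f_prev)
    return (f_next - k) / (f_next - f_center)


def log_filterbank_energies(power, edges):
    """Log output energy of each filter; edges lists f(0), f(1), ..., f(K+1)."""
    thetas = []
    for m in range(1, len(edges) - 1):
        e = sum(p * tri_filter(k, edges[m - 1], edges[m], edges[m + 1])
                for k, p in enumerate(power))
        thetas.append(math.log(max(e, 1e-12)))  # floor is an assumption
    return thetas


def mel_cepstrum(thetas, P):
    """C(n) = sum over filters of theta * cos(n * (m - 0.5) * pi / K), n = 1..P."""
    K = len(thetas)
    return [sum(thetas[m - 1] * math.cos(n * (m - 0.5) * math.pi / K)
                for m in range(1, K + 1))
            for n in range(1, P + 1)]


# One filter centred on bin 2 of a flat five-bin power spectrum.
thetas = log_filterbank_energies([1.0] * 5, [0, 2, 4])
```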

5. The standard cepstral parameters reflect only the static characteristics of the speech parameters and treat the speech in different frames as uncorrelated. In fact, because of the physical constraints of articulation, speech changes continuously from frame to frame and adjacent frames are correlated. First-order difference Mel cepstral parameters are therefore also used as recognition parameters, defined as:

d_{Mel} (n) = \frac{1}{\sqrt{Σ_{i = - k}^{k} i^{2}}} Σ_{i = - k}^{k} i \cdot c (n + i)

where k is a constant, usually 2, and c and d each denote the parameters of one speech frame. In this patent the MFCC parameters and the first-order difference Mel cepstral parameters are merged into one vector, which serves as the parameter vector of one frame of the speech signal.
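The first-order difference can be sketched for a single cepstral coefficient tracked across frames. The text does not say how frames near the start and end of the utterance are handled; clamping the frame index at the edges is an assumption of this sketch, and the function name `delta` is hypothetical.

```python
import math


def delta(c, n, k=2):
    """First-order difference at frame n of one coefficient track c.

    d(n) = sum_{i=-k..k} i * c(n + i) / sqrt(sum_{i=-k..k} i^2)
    Assumption: frame indices outside the track are clamped to the edges.
    """
    norm = math.sqrt(sum(i * i for i in range(-k, k + 1)))
    total = 0.0
    for i in range(-k, k + 1):
        j = min(max(n + i, 0), len(c) - 1)
        total += i * c[j]
    return total / norm


# A linearly rising coefficient gives a positive difference.
d_mid = delta([0.0, 1.0, 2.0, 3.0, 4.0], 2)
```

Appending the difference values to the static MFCC values of the same frame gives the merged per-frame vector described above.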

4) Model training on the speech feature parameters

A hidden Markov model (HMM) is a doubly stochastic process: one stochastic process describes the statistical characteristics of the short-time stationary segments of a non-stationary signal; the other describes how each short-time stationary segment transitions to the next, that is, the dynamics of the short-time statistical characteristics. Human speech production is also such a doubly stochastic process. The speech signal itself is an observable sequence, and it is the parameter stream of the phonemes (words, sentences) emitted by the brain, which cannot be observed, according to what needs to be said and grammatical knowledge (state selection). The whole process matches the hidden Markov model well, so an HMM can accurately describe the production of the speech signal.

An HMM can be described by the following parameters:

1. N: the number of states of the model. The states are interconnected, and a state can be reached from other states; other connection schemes between states are also possible. The set of states is S = {S_1, S_2, ..., S_N}, and the state at time t is denoted q_t.

2. M: the number of observation symbols, that is, the number of distinct observation symbols that each state may output. The set of observation symbols is

V = {v_1, v_2, ..., v_M}.

3. T: the length of the observation sequence. The observation sequence produced by the hidden Markov model is O = {o_1, o_2, ..., o_T}.

4. A: the state transition probability distribution. This is the matrix of state transition probabilities, whose element a_{ij} is the probability that the state is S_i at time t and moves to state S_j at time t+1, that is, A = {a_{ij}}, a_{ij} = p[q_{t+1} = S_j | q_t = S_i], 1 ≤ i, j ≤ N.

5. B: the observation symbol probability distribution of state S_j. This is the matrix of observation symbol probabilities, whose element b_j(k) is the probability that state S_j outputs observation symbol v_k, that is, with the model in state S_j at time t,

B = {b_j(k)}, b_j(k) = p[o_t = v_k | q_t = S_j], 1 ≤ j ≤ N, 1 ≤ k ≤ M.

6. π: the initial state distribution, that is, the probability of being in each state at t = 1 (the initial time).

In practice the observation density is usually continuous, so this patent adopts an HMM with continuous observation densities, in which the observation density function is a Gaussian mixture. With a Gaussian mixture density, the probability density function of the observations takes the form:

b_{j} (o_{t}) = Σ_{m = 1}^{M} c_{jm} N_{0} (o_{t}, u_{jm}, s_{jm}), 1 \leq j \leq N

where o_t is the observation vector to be modeled, here the MFCC cepstral vector; c_{jm} is the m-th mixture coefficient of state j, that is, the mixture gain factor; N_0 is the Gaussian density function; u_{jm} is the mean vector of the m-th mixture component of state j; and s_{jm} is the covariance matrix of the m-th mixture component of state j. In practice the components of o_t are essentially uncorrelated, so s_{jm} becomes a diagonal covariance matrix and b_j(o_t) can be written as:

b_{j} (o_{t}) = Σ_{m = 1}^{M} c_{jm} \frac{Π_{d = 1}^{D} \exp [- {(o_{t} (d) - u_{jmd})}^{2} / (2 s_{jmd})] / \sqrt{2 π}}{{(Π_{d = 1}^{D} s_{jmd})}^{1 / 2}}

The above must satisfy the following statistical constraints:

\begin{cases} Σ_{m = 1}^{M} c_{jm} = 1, & 1 \leq j \leq N \\ c_{jm} \geq 0, & 1 \leq j \leq N, 1 \leq m \leq M \\ \int_{- \infty}^{\infty} b_{j} (x) dx = 1, & 1 \leq j \leq N \end{cases}
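The diagonal-covariance mixture density of one state can be sketched as below. The per-dimension Gaussian is accumulated in the log domain before exponentiating, which is a numerical choice of this sketch rather than something the patent specifies; the function and argument names are hypothetical, and `variances[m][d]` plays the role of the diagonal element s_{jmd}.

```python
import math


def gmm_density(o, weights, means, variances):
    """Mixture density b_j(o) of one state with diagonal covariances.

    weights[m]      : mixture coefficient c_jm
    means[m][d]     : mean u_jmd
    variances[m][d] : diagonal covariance element s_jmd
    """
    total = 0.0
    for c, mu, var in zip(weights, means, variances):
        log_p = 0.0
        for x, u, s in zip(o, mu, var):
            log_p += -0.5 * math.log(2.0 * math.pi * s) - (x - u) ** 2 / (2.0 * s)
        total += c * math.exp(log_p)
    return total


# One standard normal component evaluated at its mean.
peak = gmm_density([0.0], [1.0], [[0.0]], [[1.0]])
```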

A complete definition of the continuous mixture density HMM therefore requires choosing the following parameter values:

N: the number of states in the model

M: the number of Gaussian mixture components per state

D: the dimension of each observation vector

π: the initial state distribution probabilities

A: the state transition probabilities

C: the mixture gain matrix

μ: the mean matrix of the mixture components (the vectors u_{jm} above)

U: the covariance matrix of the mixture components (the matrices s_{jm} above)

The parameter set of the continuous mixture density HMM is denoted λ, so the HMM is written λ = (π, A, C, μ, U). Applying the HMM to speech recognition requires solving three problems:

(1) The evaluation problem. Given the observation sequence O = {o_1, o_2, ..., o_T} and the model λ = (π, A, C, μ, U), how is the probability P(O | λ) of producing O under the model λ computed? Solving the evaluation problem makes it possible to select the model that best matches a given observation sequence. This patent uses the forward-backward algorithm.

(2) The optimal state chain problem. Given the observation sequence O = {o_1, o_2, ..., o_T} and the model λ = (π, A, C, μ, U), how is the corresponding optimal state sequence, the one that best explains the observation sequence, chosen? This patent uses the Viterbi algorithm.

(3) The model parameter optimization problem. How are the model parameters λ = (π, A, C, μ, U) adjusted so that P(O | λ) is maximized? That is, the model parameters are adjusted so that the model describes a given observation sequence as well as possible, in the sense that the sequence is the one most likely to be generated by the optimized model. This patent uses the Baum-Welch algorithm.

Forward-backward algorithm

The forward probability is defined as α_t(i) = p(o_1 o_2 ... o_t, q_t = i | λ): given the HMM parameters λ, the probability of the partial observation sequence {o_1 o_2 ... o_t} together with being in state i at time t.

The forward probability α_t(i) can be computed with the following recursion:

(1) Initialization

α_{1} (i) = π_{i} b_{i} (o_{1}), 1 \leq i \leq N

(2) Iteration

α_{t + 1} (j) = [Σ_{i = 1}^{N} α_{t} (i) a_{ij}] b_{j} (o_{t + 1}), 1 \leq t \leq T - 1, 1 \leq j \leq N

(3) Termination

P (O | λ) = Σ_{i = 1}^{N} α_{T} (i)
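The three steps of the forward recursion can be sketched as follows. For brevity the sketch uses a discrete-observation HMM, where `B[j][k]` is the probability that state j emits symbol k; with the continuous model of this patent the lookup `B[j][obs[t]]` would be replaced by the mixture density b_j(o_t). The function name and the toy parameters are hypothetical.

```python
def forward(pi, A, B, obs):
    """Forward algorithm for a discrete-observation HMM.

    Returns (P(O | lambda), alpha) where alpha[t][i] is the forward
    probability of state i after the first t + 1 observations.
    """
    N = len(pi)
    # (1) initialization
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    # (2) iteration
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                      for j in range(N)])
    # (3) termination
    return sum(alpha[-1]), alpha


# Toy two-state model with two observation symbols (illustrative values only).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
prob, alpha = forward(pi, A, B, [0, 1])
```

For long observation sequences the products underflow, so practical implementations scale alpha at each step or work with logarithms; that refinement is omitted here.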

Corresponding to the forward probability there is also a backward probability, defined as:

β_t(i) = p(o_{t+1}, o_{t+2}, ..., o_T | q_t = i, λ), the probability that, given the HMM parameters λ and that the model is in state i at time t, the system outputs the observation sequence {o_{t+1}, o_{t+2}, ..., o_T}. The backward probability β_t(i) is computed with a similar recursion:

(1) Initialization

β_{T} (i) = 1, 1 \leq i \leq N

(2) Iteration

β_{t} (i) = Σ_{j = 1}^{N} a_{ij} b_{j} (o_{t + 1}) β_{t + 1} (j), t = T - 1, T - 2, ..., 1, 1 \leq i \leq N

Computing the output probability from the forward and backward probabilities

The forward and backward probabilities split the output probability of the whole observation sequence under the HMM into the product of the output probabilities of two partial observation sequences, each with its own recursion. The output probability is:

P (O | λ) = Σ_{i = 1}^{N} α_{T} (i)

= Σ_{i = 1}^{N} α_{t} (i) β_{t} (i), 1 \leq t \leq T - 1

= Σ_{i = 1}^{N} Σ_{j = 1}^{N} α_{t} (i) a_{ij} b_{j} (o_{t + 1}) β_{t + 1} (j), 1 \leq t \leq T - 1
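The identity between the forward-only and the combined forward-backward expressions can be checked numerically. The sketch below again assumes a discrete-observation HMM with emission table `B[j][k]`; the function name and the toy parameters are hypothetical.

```python
def forward_backward(pi, A, B, obs):
    """Return the alpha and beta tables of a discrete-observation HMM."""
    N, T = len(pi), len(obs)
    alpha = [[0.0] * N for _ in range(T)]
    beta = [[1.0] * N for _ in range(T)]  # beta at the last time step is 1
    for i in range(N):
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(N))
    return alpha, beta


pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
alpha, beta = forward_backward(pi, A, B, [0, 1, 1, 0])

# The sum over states of alpha * beta is the same at every time step.
per_step = [sum(a * b for a, b in zip(alpha[t], beta[t])) for t in range(4)]
```

Each entry of `per_step` equals P(O | λ), which is the quantity the three equivalent expressions above compute.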

Determining the optimal state chain. To avoid a large number of multiplications, the Viterbi algorithm is used in logarithmic form.

Viterbi algorithm:

(1) Preprocessing

{\bar{π}}_{i} = \log (π_{i}),

{\bar{b}}_{i} (o_{t}) = \log [b_{i} (o_{t})],

{\bar{a}}_{ij} = \log (a_{ij})

(2) Initialization

{\bar{δ}}_{1} (i) = \log [δ_{1} (i)] = {\bar{π}}_{i} + {\bar{b}}_{i} (o_{1}), 1 \leq i \leq N

(3) Iteration

{\bar{δ}}_{t} (j) = \log [δ_{t} (j)] = \max_{1 \leq i \leq N} [{\bar{δ}}_{t - 1} (i) + {\bar{a}}_{ij}] + {\bar{b}}_{j} (o_{t}), 2 \leq t \leq T, 1 \leq j \leq N

(4) Termination

{\bar{p}}^{*} = \max_{1 \leq i \leq N} [{\bar{δ}}_{T} (i)]

{q_{T}}^{*} = \underset{1 \leq i \leq N}{\arg \max} [{\bar{δ}}_{T} (i)]

(5) Backtrack along the optimal path
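The logarithmic Viterbi steps can be sketched as follows. The text does not spell out the backtracking step, so storing a table of best predecessors (`back`) is an assumption of this sketch, as are the discrete emission table `B[j][k]`, the function names and the toy parameters.

```python
import math


def safe_log(p):
    """Natural log that maps zero probability to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi_log(pi, A, B, obs):
    """Log-domain Viterbi; returns (best log probability, best state path)."""
    N = len(pi)
    # (2) initialization
    delta = [safe_log(pi[i]) + safe_log(B[i][obs[0]]) for i in range(N)]
    back = []  # back[t-1][j] = best predecessor of state j at step t (assumed bookkeeping)
    # (3) iteration
    for t in range(1, len(obs)):
        new_delta, ptr = [], []
        for j in range(N):
            best = max(range(N), key=lambda i: delta[i] + safe_log(A[i][j]))
            ptr.append(best)
            new_delta.append(delta[best] + safe_log(A[best][j]) + safe_log(B[j][obs[t]]))
        delta = new_delta
        back.append(ptr)
    # (4) termination
    last = max(range(N), key=lambda i: delta[i])
    # (5) backtracking
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return delta[last], path


pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
log_p, path = viterbi_log(pi, A, B, [0, 1, 1])
```

Working with sums of logarithms replaces the repeated multiplications of small probabilities and avoids numerical underflow on long sequences.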

Baum-Welch algorithm:

The basic idea of the Baum-Welch algorithm is to estimate a new model λ from the existing model λ' according to a parameter re-estimation formula such that P(O | λ') ≤ P(O | λ), and then to replace λ' with λ. This process is repeated until the model parameters converge, which yields the maximum likelihood model. The question is how to construct a re-estimation formula that guarantees P(O | λ') ≤ P(O | λ). Baum proved that this problem can be converted into finding the model λ that maximizes the auxiliary function Q(λ', λ), because

Q (λ^{'}, λ) \geq Q (λ^{'}, λ^{'}) \Rightarrow P (O | λ) \geq P (O | λ^{'})

where

Q (λ^{'}, λ) = Σ_{q} p (O, q | λ^{'}) \log p (O, q | λ)

p (O, q | λ) = π_{q_{0}} Π_{t = 1}^{T} a_{q_{t - 1} q_{t}} b_{q_{t}} (o_{t})

Substituting p(O, q | λ) into Q(λ', λ) gives

Q (λ^{'}, λ) = Q_{π} (λ^{'}, π) + Σ_{i = 1}^{N} Q_{a_{i}} (λ^{'}, a_{i}) + Σ_{j = 1}^{N} Q_{b_{j}} (λ^{'}, b_{j})

where

Q_{π} (λ^{'}, π) = Σ_{i = 1}^{N} p (O, q_{0} = i | λ^{'}) \log π_{i}

Q_{a_{i}} (λ^{'}, a_{i}) = Σ_{j = 1}^{N} Σ_{t = 1}^{T} p (O, q_{t - 1} = i, q_{t} = j | λ^{'}) \log a_{ij}

Q_{b_{j}} (λ^{'}, b_{j}) = Σ_{t = 1}^{T} p (O, q_{t} = j | λ^{'}) \log b_{j} (o_{t})

= Σ_{k = 1}^{K} Σ_{t = 1}^{T} p (O, q_{t} = j | λ^{'}) \log b_{j} (k) δ (o_{t}, v_{k})

where

δ (o_{t}, v_{k}) = \begin{cases} 1, & \text{if } o_{t} = v_{k} \\ 0, & \text{otherwise} \end{cases}

The parameters in these formulas must satisfy the following three constraints:

Σ_{j = 1}^{N} π_{j} = 1

Σ_{j = 1}^{N} a_{ij} = 1, \forall i

Σ_{k = 1}^{K} b_{j} (k) = 1, \forall j

Each individual term of the auxiliary function has the form

Σ_{j = 1}^{N} w_{j} \log y_{j},

where the variables

\{y_{j}\}_{j = 1}^{N}

satisfy

Σ_{j = 1}^{N} y_{j} = 1

It follows from a standard derivation that, subject to this constraint, each individual term attains its maximum when y_{j} = w_{j} / Σ_{i = 1}^{N} w_{i}. Maximizing each individual term of the auxiliary function Q(λ', λ) in this way yields the model that maximizes Q(λ', λ).
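The maximization of a single term can be worked out with a Lagrange multiplier. The following is a sketch of that standard derivation, assuming all weights w_j are non-negative and not all zero; the multiplier symbol \eta is introduced here only for illustration.

```latex
\begin{aligned}
L(y, \eta) &= \sum_{j=1}^{N} w_j \log y_j + \eta \Bigl( 1 - \sum_{j=1}^{N} y_j \Bigr) \\
\frac{\partial L}{\partial y_j} &= \frac{w_j}{y_j} - \eta = 0
  \;\Rightarrow\; y_j = \frac{w_j}{\eta} \\
\sum_{j=1}^{N} y_j = 1 &\;\Rightarrow\; \eta = \sum_{i=1}^{N} w_i
  \;\Rightarrow\; y_j = \frac{w_j}{\sum_{i=1}^{N} w_i}
\end{aligned}
```

Applied to the terms of the auxiliary function, this says that each re-estimated probability is its accumulated weight divided by the sum of the weights that share the same normalization constraint.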

Claims

1. A household service robot speech recognition system based on the Hidden Markov Model (HMM), characterized by comprising the following steps:

Step (1): filtering the input speech signal in order to remove low-frequency interference;

Step (2): because the speech signal is a continuous, time-varying analog signal, sampling and quantizing the speech signal after low-frequency interference has been removed, to obtain a discrete digital signal;

Step (3): framing cuts the original signal into segments, which is equivalent to applying a rectangular window to the original signal in the time domain, and multiplication by a rectangular window in the time domain is equivalent to convolving the signal spectrum with the Fourier transform of the rectangular window in the frequency domain, so the speech signal is windowed;

Step (4): performing endpoint detection on the windowed speech signal, because correctly detecting the endpoints of the speech signal is a prerequisite for speech recognition;

Step (5): extracting the feature parameters of the speech signal, as the basis for the model training on the feature parameters in the next step;

Step (6): training a model on the extracted speech feature parameters with the Hidden Markov Model (HMM);

Step (7): establishing a template library of speech signals, matching the feature parameters trained with the hidden Markov model against the template library, and obtaining the final recognition result by threshold comparison.

2. The household service robot speech recognition system based on the Hidden Markov Model (HMM) according to claim 1, characterized in that the endpoint detection of step (4) uses a double-threshold endpoint detection algorithm.

3. The household service robot speech recognition system based on the Hidden Markov Model (HMM) according to claim 1, characterized in that, in step (5), the standard Mel-frequency cepstral parameters reflect only the static characteristics of the speech parameters, whereas in fact, because of the physical constraints of articulation, speech changes continuously from frame to frame and adjacent frames are correlated, so first-order difference Mel cepstral parameters are also used as recognition parameters, defined as:

d_{Mel} (n) = \frac{1}{\sqrt{Σ_{i = - k}^{k} i^{2}}} Σ_{i = - k}^{k} i \cdot c (n + i)

where k is a constant, usually 2, and c and d each denote the parameters of one speech frame; in use, the MFCC parameters and the difference parameters are merged into one vector, which serves as the parameter vector of one frame of the speech signal.

4. The household service robot speech recognition system based on the Hidden Markov Model (HMM) according to claim 1, characterized in that training the extracted feature parameters with the Hidden Markov Model in step (6) requires solving three problems, namely the evaluation problem, the optimal state chain problem and the model parameter optimization problem, and the methods used to solve these three problems are, respectively, the forward-backward algorithm, the Viterbi algorithm and the Baum-Welch algorithm.