CN103065629A - Speech recognition system of humanoid robot - Google Patents

Speech recognition system of humanoid robot

Info

Publication number
CN103065629A
CN103065629A (application CN201210475180A)
Authority
CN
China
Prior art keywords
module
signal
speech recognition
recognition system
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 201210475180
Other languages
Chinese (zh)
Inventor
刘治
林俊潜
徐淑琼
章云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN 201210475180 priority Critical patent/CN103065629A/en
Publication of CN103065629A publication Critical patent/CN103065629A/en
Pending legal-status Critical Current

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a speech recognition system comprising a speech input module, a preprocessing module, a feature extraction module, a training module, a recognition module, a recognition decision module and a threshold comparison module. The output of the speech input module is connected to the input of the preprocessing module; the output of the preprocessing module is connected to the input of the feature extraction module; the output of the feature extraction module is connected to the inputs of the training module and the recognition module; and the training module is connected to the recognition module. The output of the recognition module is connected to the input of the recognition decision module, and the output of the recognition decision module is connected to the input of the threshold comparison module. The system combines a hidden Markov model (HMM) with wavelet transform and neural network techniques, makes its final decision by threshold comparison, and thereby improves the recognition rate.

Description

Speech recognition system for a humanoid robot
Technical field
The present invention is a speech recognition system for a humanoid robot. It can be used in intelligent robots, and also in intelligent systems, intelligent equipment, human-computer interaction devices and the like.
Background technology
Communicating with machines by voice, so that a machine understands what you say, is something people have long dreamed of: to have all kinds of machines understand human language and act on spoken commands, thereby achieving human-machine communication. With the development of science and technology, speech recognition technology has appeared and this ideal is gradually being realized, although fully achieving it still requires sustained human effort. Speech recognition technology enables a machine to identify and understand a speech signal and convert it into the corresponding text or command.
Speech recognition has been a very active research field in recent years. Its applications are extensive and include voice input systems, voice control systems, voice dialing systems and intelligent home appliances. In the near future, speech recognition may become an important means of human-computer interaction, supplementing or even replacing traditional input devices such as the keyboard and mouse for text entry and operational control on personal computers. In application scenarios such as handheld PDAs, intelligent appliances and industrial field control, speech recognition has even greater development potential. Especially in palmtop systems such as PDAs and mobile phones, the keyboard greatly hinders miniaturization, yet these systems increasingly tend toward intelligence and informatization: they must not only display large amounts of text and graphics but also provide convenient text input. Traditional keyboard input is inadequate here, and speech recognition is a promising alternative. The application of voice technology has become a competitive emerging industry, so research on speech recognition technology has wide application value and good development prospects.
Speech recognition technology mainly comprises three aspects: feature extraction, pattern matching criteria, and model training. It has also been fully adopted in vehicle networking; for example, in a networked car the driver only needs to press a push-to-talk button and dictate the destination to a service representative to navigate directly, which is safe and convenient. However, speech recognition still faces the following five problems:
(1) Recognition and understanding of natural language. Continuous speech must first be decomposed into units such as words and phonemes, and rules for understanding semantics must then be established.
(2) The large amount of speech information. Speech patterns differ not only between speakers but also for the same speaker; for example, the speech of the same speaker differs when speaking casually and when speaking carefully, and a person's way of speaking changes over time.
(3) The ambiguity of speech. Different words may sound similar when spoken; this is common in both English and Chinese.
(4) The acoustic characteristics of individual letters or words are affected by context, which changes stress, tone, volume, speaking rate, and so on.
(5) Environmental noise and interference seriously affect speech recognition and lower the recognition rate.
In recent decades, many experts and scholars have continuously researched these problems, so that speech recognition technology has developed and a wide variety of speech recognition systems have been built on it. Current applications of speech recognition systems include voice dialing in telephone communication, voice control in automobiles, personal digital assistants (PDAs), intelligent toys, household remote control, industrial control and the medical field. People keep studying speech recognition technology in the hope that one day humans and machines will converse as freely as humans do with each other, thereby realizing industrial automation and intelligence. With the development of science and technology, the gradual deepening of research into speech recognition theory, the maturing of its theoretical system and the progress of digital signal processing, speech recognition will in the next twenty years gradually enter industry, household appliances, communications, automotive electronics, medical care and all kinds of electronic equipment. It is safe to say that speech recognition will become a key technology of the future information industry. Undeniably, however, it still has a long way to go: true commercialization requires breakthroughs in many respects and also depends on the development of other related disciplines.
Summary of the invention
The present invention is a speech recognition system whose main purpose is to provide an efficient, stable and practical speech recognition system with a high recognition rate.
To achieve the above object, the present invention uses MATLAB as the implementation tool in combination with a greeter humanoid robot platform. A complete speech recognition system is built: the user issues voice commands through a microphone, the input speech signal is processed and recognized, and the result drives the actions of the greeter robot. Evaluation shows that the system reaches the expected targets and is a speech recognition system with strong recognition capability, high accuracy and good robustness.
The present invention is achieved by the following technical solution. A speech recognition system for a humanoid robot comprises a speech input module, a preprocessing module, a feature extraction module, a training module, a recognition module, a recognition decision module and a threshold comparison module. The output of the speech input module is connected to the input of the preprocessing module; the output of the preprocessing module is connected to the input of the feature extraction module; the output of the feature extraction module is connected to the inputs of the training module and the recognition module respectively; and the training module is connected to the recognition module. The output of the recognition module is connected to the input of the recognition decision module, and the output of the recognition decision module is connected to the input of the threshold comparison module.
The speech input module is used to input the original speech signal.
The preprocessing module comprises, connected in sequence, a pre-filtering unit, a sampling and quantization unit, a pre-emphasis unit, a windowing unit and an endpoint detection unit.
The pre-filtering unit removes high-frequency noise from the original speech signal.
The sampling and quantization unit samples the denoised analog signal according to the Nyquist sampling theorem and quantizes it to obtain a digital signal.
The pre-emphasis unit boosts the high-frequency part so that the spectrum of the signal becomes flat, which facilitates parameter analysis.
The windowing unit truncates the signal into finite segments for processing.
The endpoint detection unit detects the start and end points of each speech segment, removes unwanted silent sections, and extracts the effective speech signal segments.
The endpoint detection unit adopts a method that combines the double-threshold energy method with an artificial neural network.
The feature extraction module adopts a composite feature parameter extraction algorithm based on the wavelet transform; the extracted features are linear prediction cepstrum coefficient parameters based on the wavelet transform and Mel frequency cepstral coefficients based on the wavelet transform.
The training module uses the Baum-Welch algorithm (an expectation-maximization algorithm) as the training method for the hidden Markov model (HMM).
The recognition decision module obtains output probabilities by the Viterbi algorithm.
The threshold comparison module compares the obtained output probability value with a set threshold: if it is higher than the threshold, the recognition result is output; otherwise, the result is discarded.
Working process of the invention: the speech signal enters the speech input module from a microphone and is preprocessed by the preprocessing module, where preprocessing comprises pre-filtering, sampling and quantization, pre-emphasis, windowing and endpoint detection. Feature parameters are then extracted from the preprocessed signal; during training, the extracted parameter sequences are saved to build a speech parameter template library, which constitutes the training module. During recognition, speech is input from the microphone and preprocessed, feature parameters are extracted, probability calculation and matching are performed between the extracted feature parameters and the established speech parameter template library, and the matching result is passed through the threshold comparison module to obtain the final recognition result.
In the present invention, a threshold comparison is performed after the probability calculation: if the probability is above the threshold, the result is considered a correct recognition result; otherwise this result is discarded and, after the system plays a "pardon" prompt, the voice command is re-entered. The threshold is an empirical value obtained through many experiments in the specific laboratory environment.
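The accept-or-reject rule above can be sketched as follows. This is a minimal illustration in Python (the patent's own implementation tool is MATLAB); the command names, scores and threshold value are invented for the example, since the patent only states that the threshold is an empirical laboratory value.

```python
def decide(scores, threshold):
    """Pick the best-scoring command, but only accept it when its
    score clears the threshold; otherwise reject so the robot can
    prompt "pardon" and ask for the command again.

    `scores` maps command names to HMM output (log-)probabilities.
    All names and values here are illustrative."""
    best_cmd = max(scores, key=scores.get)
    if scores[best_cmd] >= threshold:
        return best_cmd          # accepted recognition result
    return None                  # rejected: re-enter the voice command

# Hypothetical log-probability scores for two candidate commands.
print(decide({"forward": -12.3, "stop": -25.1}, threshold=-20.0))  # forward
print(decide({"forward": -42.3, "stop": -45.1}, threshold=-20.0))  # None
```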
Description of drawings
Fig. 1 is a block diagram of the speech recognition system;
Fig. 2 is a block diagram of speech signal preprocessing;
Fig. 3 is a block diagram of the DWTM computation process.
Embodiment
For a better understanding of the present invention, embodiments of the invention are described in detail below in conjunction with the accompanying drawings. The embodiments are implemented on the premise of the technical solution of the present invention, and detailed implementations and specific operation processes are given, but the protection scope of the present invention is not limited to the following embodiments.
As shown in Fig. 1, the block diagram of the system of the present invention, the speech recognition system comprises a speech input module, a preprocessing module, a feature extraction module, a training module, a recognition module, a recognition decision module and a threshold comparison module. The output of the speech input module is connected to the input of the preprocessing module; the output of the preprocessing module is connected to the input of the feature extraction module; the output of the feature extraction module is connected to the inputs of the training module and the recognition module respectively; and the training module is connected to the recognition module. The output of the recognition module is connected to the input of the recognition decision module, and the output of the recognition decision module is connected to the input of the threshold comparison module.
The speech input module is used to input the original speech signal.
The preprocessing module comprises, connected in sequence as shown in Fig. 2, a pre-filtering unit, a sampling and quantization unit, a pre-emphasis unit, a windowing unit and an endpoint detection unit.
In preprocessing, the speech signal is first pre-filtered to prevent aliasing interference; the pre-filter is in fact a bandpass filter with upper and lower cutoff frequencies f_H and f_L. The signal then undergoes A/D conversion: an analog speech signal is continuous and cannot be processed by a computer, so the first step of speech signal processing is to convert the analog signal into a digital signal through the two steps of sampling and quantization. After the speech is captured by the microphone, the A/D conversion turns the analog signal into a digital one (speech collected by a computer or other digital recorder has already been digitized, so the user generally does not need to digitize it again). According to the Nyquist sampling theorem, f_s > 2 f_max; here the signal is sampled at 8000 Hz and divided into frames of 200 samples, with 50% overlap between consecutive frames. The quantized signal is then pre-emphasized: the average power spectrum of the speech signal is affected by glottal excitation and mouth-nose radiation, and the high end falls off at about 6 dB/octave above 800 Hz, so the speech signal must be pre-emphasized. The purpose of pre-emphasis is to boost the high-frequency part and flatten the spectrum of the signal to facilitate spectral or channel parameter analysis; the pre-emphasis digital filter is H(z) = 1 - uz^-1, with u = 0.97. Finally the signal is windowed: owing to the motion of the human vocal organs, the speech signal is a typical non-stationary signal whose characteristics vary with time, but this physical motion is much slower than the acoustic vibration, so within a period of about 10-20 ms the speech signal can be assumed stationary, and its spectral characteristics and some physical feature parameters can be regarded as approximately constant. A Hamming window is adopted in the present invention.
The pre-filtering unit removes the high-frequency noise from the original speech signal; it removes unnecessary components and prepares the signal for subsequent processing, guaranteeing signal quality and processing speed.
The sampling and quantization unit samples the denoised analog signal according to the Nyquist sampling theorem and quantizes it to obtain a digital signal. Since the original speech signal is an analog signal mixed with high-frequency noise, it must be stripped of high-frequency noise and digitized; according to the Nyquist sampling theorem, sampling and quantization at 8000 Hz yield the digitized speech signal.
The pre-emphasis unit boosts the high-frequency part so that the spectrum of the digital signal becomes flat, facilitating parameter analysis. The average power of the speech signal is affected by glottal excitation and mouth-nose radiation, and the high end falls off; the speech signal is therefore pre-emphasized to boost the high-frequency part and flatten the spectrum. The pre-emphasis digital filter is:
H(z) = 1 - μz^-1
where μ is close to 1 and is taken as 0.97 here.
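The pre-emphasis filter H(z) = 1 - μz^-1 corresponds to the one-line difference equation y[n] = x[n] - μ·x[n-1]. A minimal Python sketch (the patent itself uses MATLAB) with the stated μ = 0.97 and an invented toy signal:

```python
def preemphasis(x, mu=0.97):
    """First-order pre-emphasis: y[n] = x[n] - mu*x[n-1].
    Boosts high frequencies and flattens the spectrum; mu = 0.97
    as stated in the patent."""
    return [x[0]] + [x[n] - mu * x[n - 1] for n in range(1, len(x))]

signal = [1.0, 1.0, 1.0, 1.0]   # a flat (DC-like) toy signal
print(preemphasis(signal))      # DC content is attenuated to ~0.03 after n=0
```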
The windowing unit truncates the digital signal into finite segments. Because the speech signal is non-stationary, its characteristics vary with time; within a period of 10-20 ms, however, the speech signal can be regarded as stationary and its spectral characteristics as approximately constant. The speech signal is therefore windowed and divided into short segments, each called an analysis frame. The digitized speech signal is divided into frames of 200 samples, with 50% overlap between consecutive frames. A Hamming window is applied to the speech signal; the Hamming window function is:
w(n) = 0.54 - 0.46 cos(2πn / (N - 1)),  0 ≤ n ≤ N - 1
where N is the number of samples per frame; in this system N = 200.
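The framing and Hamming windowing described above can be sketched in Python (the patent uses MATLAB). Frame length 200 and 50% overlap follow the text; the input signal is a dummy example:

```python
import math

def hamming(N):
    """Hamming window w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)), n = 0..N-1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def frame_signal(x, frame_len=200, overlap=0.5):
    """Split x into frames of frame_len samples with 50% overlap
    (8000 Hz sampling and 200-sample frames = 25 ms per frame, as in
    the patent) and apply a Hamming window to each frame."""
    step = int(frame_len * (1 - overlap))
    w = hamming(frame_len)
    frames = []
    for start in range(0, len(x) - frame_len + 1, step):
        frames.append([x[start + n] * w[n] for n in range(frame_len)])
    return frames

x = [1.0] * 1000                 # 1000 dummy samples
print(len(frame_signal(x)))      # 9 frames: starts at 0, 100, ..., 800
```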
The purpose of speech endpoint detection is to determine the start and end points of the speech within a signal segment that contains speech. It is a crucial step in a speech recognition system: only when the endpoints of the speech signal are judged accurately can speech processing be performed correctly. Effective endpoint detection not only minimizes processing time but also excludes the noise of silent sections, guaranteeing processing quality. A good endpoint detection method can remedy problems such as unsatisfactory detection and low recognition rates in speech recognition software and provide a reliable basis for recognition. It should be robust: it should distinguish background noise, non-speech sounds and the voices of non-target speakers from normal dialogue, reducing the endpoint errors these sounds cause and the false interruptions that result. High-precision endpoint detection ensures that the signal fed to the recognizer is an effective, complete speech signal, making recognition faster and more accurate. The energy method is commonly used, but in practical applications a high signal-to-noise ratio cannot be guaranteed, so endpoint detection becomes inaccurate, the detected speech is incomplete, and recognition suffers. In the present invention, the short-time energy and short-time average zero-crossing rate of the windowed signal are first calculated to preliminarily detect unvoiced sounds, voiced sounds and the individual speech segments; the result is then further examined by a multilayer perceptron neural network, and each spectrum is finally smoothed with a median filter. Experiments prove that this hybrid method achieves a good endpoint detection effect.
The endpoint detection unit detects the start and end points of speech segments, removes unwanted silent sections, and extracts the effective speech signal segments. It is a crucial step in the speech recognition system: only when the endpoints of the speech signal are judged accurately can speech processing be performed correctly.
The endpoint detection unit adopts a method combining the double-threshold energy method with an artificial neural network. The energy method, i.e. the short-time energy method combined with the short-time average zero-crossing rate, can reach high accuracy only under high signal-to-noise ratio (SNR) conditions.
The short-time energy detection algorithm is suitable for detecting voiced sounds; its formula is:
E_n = Σ_{m=-∞}^{∞} [x(m)w(n-m)]^2 = Σ_{m=n-N+1}^{n} [x(m)w(n-m)]^2
The short-time average zero-crossing rate is suitable for detecting unvoiced sounds; its formula is:
Z_n = Σ_{m=-∞}^{∞} |sgn[x(m)] - sgn[x(m-1)]| w(n-m) = |sgn[x(n)] - sgn[x(n-1)]| * w(n)
where * denotes convolution, and
sgn[x(n)] = 1 if x(n) ≥ 0, and -1 if x(n) < 0.
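A minimal Python sketch of the two detectors (the patent implementation is in MATLAB). The toy frames are invented to show why short-time energy separates voiced frames and the zero-crossing count separates unvoiced ones:

```python
def sgn(v):
    """Sign function as defined above: +1 for v >= 0, -1 otherwise."""
    return 1 if v >= 0 else -1

def short_time_energy(frame):
    """E_n: sum of squared samples of the (already windowed) frame."""
    return sum(s * s for s in frame)

def zero_crossing_count(frame):
    """Z_n: number of sign changes, from |sgn(x[m]) - sgn(x[m-1])| / 2."""
    return sum(abs(sgn(frame[m]) - sgn(frame[m - 1])) // 2
               for m in range(1, len(frame)))

voiced = [0.5, 0.6, 0.55, 0.62]        # large amplitude, no sign change
unvoiced = [0.01, -0.02, 0.01, -0.01]  # small amplitude, many sign changes
print(short_time_energy(voiced) > short_time_energy(unvoiced))  # True
print(zero_crossing_count(unvoiced))                            # 3
```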
The double-threshold energy method sets two thresholds on the basis of the energy method: a smaller one, used to detect the boundary between silent sections and speech segments, and a larger one, used to detect the intensity of the speech signal. Here the two thresholds are set loosely to guarantee that the speech signal is complete, and on this basis the signal is fed into the artificial neural network detector.
The artificial neural network detector is a multilayer perceptron. Its advantage for endpoint detection is that it takes the correlation between speech frames into account and minimizes the error probability; its main drawback is that with this algorithm alone it is difficult to find speech features that clearly distinguish voiced from unvoiced sounds, which is exactly where it complements the double-threshold energy method. The interconnection pattern of the many processing units (neurons) of an artificial neural network reflects the structure of the network and determines its capability. Each neuron (say neuron j) receives the information transmitted by other neurons (say neuron i); the total input is:
I_j = Σ_{i=1}^{n} w_ij x_i - θ_j
where w_ij is the connection weight from neuron i to neuron j, x_i is the output of neuron i, and θ_j is the threshold of neuron j. The output of neuron j is o_j = f(I_j), where the function f(·) is called the activation function; the sigmoid function is chosen:
f(x) = 1 / (1 + e^-x)
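The neuron equations above amount to a weighted sum minus a threshold, passed through the sigmoid. A minimal Python sketch (the patent uses MATLAB) with hypothetical weights and inputs:

```python
import math

def neuron_output(inputs, weights, theta):
    """One MLP neuron: I_j = sum_i(w_ij * x_i) - theta_j, passed
    through the sigmoid activation f(x) = 1 / (1 + e^-x)."""
    I = sum(w * x for w, x in zip(weights, inputs)) - theta
    return 1.0 / (1.0 + math.exp(-I))

# With zero net input (0.5 + 0.5 - 1.0 = 0) the sigmoid gives 0.5.
print(neuron_output([1.0, 1.0], [0.5, 0.5], theta=1.0))  # 0.5
```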
A multilayer perceptron (MLP) neural network is a feedforward neural network with multiple layers of neurons and an error back-propagation structure. It usually consists of three parts: a group of perceptual units forming the input layer; one or more hidden layers of computing nodes; and an output layer of computing nodes. The input layer perceives external information, the output layer gives the classification result, and the hidden layers process the input information. Each hidden or output neuron performs two kinds of computation: computing the function signal appearing at its output, and computing an estimate of the gradient vector, which must be propagated backwards through the network. Back-propagation (the BP algorithm) is therefore adopted as the standard training algorithm for the MLP network. The BP algorithm learns in a supervised fashion, and the learning process consists of a forward-propagation phase and a back-propagation phase. First a desired output is set for each input pattern; then a training sample is fed to the network and propagated from the input layer through the hidden layers to the output layer. The difference between the actual output and the desired output is the error, and according to this error the connection weights are corrected layer by layer from the output layer back toward the hidden layers. Through this process the actual outputs gradually approach their respective desired values, and the connection weights w_ij between the layers are finally determined.
The amplitude of the speech signal is large relative to the dynamic range of the background noise, so within speech segments many random events occur and the average information, i.e. the entropy, is large; moreover, the energy of the background noise sections is distributed more steadily across the frequencies, which in terms of information content means their average information, i.e. the spectral entropy, is larger. Therefore, for the signal processed by the double-threshold energy method, its amplitude entropy and spectral entropy are calculated and used as the inputs of the MLP neural network; after propagation through the layers of the network, the detection result is obtained.
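The patent does not specify the exact entropy estimator, so the following Python sketch uses a common histogram-based Shannon entropy as a stand-in for the amplitude/spectral entropy features described above; the signals and bin count are illustrative:

```python
import math
import random

def shannon_entropy(values, bins=10):
    """Histogram-based Shannon entropy (bits) of a sample sequence;
    one common estimator, assumed here since the patent gives none."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0        # avoid zero width for flat input
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

flat_noise = [0.1, -0.1, 0.1, -0.1] * 50                   # regular: low entropy
random.seed(0)
speech_like = [random.uniform(-1, 1) for _ in range(200)]  # spread out: high entropy
print(shannon_entropy(flat_noise) < shannon_entropy(speech_like))  # True
```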
The feature extraction module adopts a composite feature parameter extraction algorithm based on the wavelet transform; the extracted features are the linear prediction cepstrum coefficient parameters based on the wavelet transform and the Mel frequency cepstral coefficients based on the wavelet transform.
Speech recognition is a matching process. First, a model is established according to the characteristics of speech: the input speech signal is analyzed, the required features are extracted, and on this basis the templates needed for recognition are built. During recognition, according to the overall model of speech recognition, the features of the input speech signal are compared with the sound templates stored in the computer and, following a certain search and matching strategy, the templates that best match the input speech are found to obtain the recognition result; this is the principle of speech recognition. Common methods for extracting speech feature parameters include LPCC (linear prediction cepstrum coefficient) parameters, MFCC (Mel frequency cepstral coefficient) parameters, and parameters based on their first-order differences. The shortcoming of the LPCC parameters is that they do not exploit human auditory perception; the MFCC parameters do exploit human auditory properties, but both only reflect the static characteristics of speech and cannot reflect its dynamics, which is why first-order difference parameters have been proposed to describe the dynamic characteristics of speech. The speech signal is non-stationary, and the Fourier transform is a global analysis method that cannot reflect the local properties of speech; the present invention therefore introduces the discrete wavelet transform, replacing the Fourier transform when extracting the LPCC and MFCC parameters, so that the advantages of the wavelet transform improve the recognition performance of the speech recognition system. To further improve the recognition rate, the first-order difference parameters are also calculated, and the combinations ΔDWTL+DWTL and ΔDWTM+DWTM are used as the feature parameters of the speech signal, where DWTL and DWTM are the wavelet-transform-based LPCC and MFCC parameters respectively, and ΔDWTL and ΔDWTM are their difference parameters.
Feature parameter extraction is the process of obtaining, from the speech signal, a group of parameters that describe its characteristics. The proposed algorithm is based on the linear prediction cepstrum coefficient parameters (LPCC) and the Mel frequency cepstral coefficients (MFCC), replacing the discrete Fourier transform (DFT) with the discrete wavelet transform. The wavelet transform is local in both the time and frequency domains, and its time-frequency window adapts to different frequencies, so it can accurately reflect the instantaneous variations of a non-stationary signal. Wavelet analysis has different resolutions at different positions of the time-frequency plane and is a multiresolution analysis method: it has higher frequency resolution and lower time resolution at low frequencies, and higher time resolution and lower frequency resolution at high frequencies, so it can characterize local features of the signal in both the time and frequency domains.
The linear prediction cepstrum coefficient (LPCC) parameters are feature parameters that characterize the speaker's individuality. The model system function is:
H(z) = 1 / (1 - Σ_{k=1}^{p} a_k z^-k)
and its cepstrum is defined by:
lg H(z) = Ĥ(z) = Σ_{n=1}^{∞} ĥ(n) z^-n
Combining the two formulas above and differentiating both sides with respect to z gives:
(1 - Σ_{k=1}^{p} a_k z^-k) Σ_{n=1}^{∞} n ĥ(n) z^{-n+1} = Σ_{k=1}^{p} k a_k z^{-k+1}
Equating the constant terms and the coefficients of each power of z^-1 on both sides yields the recurrence relation between ĥ(n) and a_k, from which the cepstrum is obtained from the prediction coefficients a_k:
ĥ(1) = a_1
ĥ(n) = a_n + Σ_{k=1}^{n-1} (k/n) ĥ(k) a_{n-k},  1 < n ≤ p
This ĥ(n) sequence constitutes the LPCC parameters.
Calculation steps of the wavelet-transform-based linear prediction cepstrum coefficient parameters (abbreviated DWTL):
(1) the original speech signal is first preprocessed;
(2) each frame signal undergoes wavelet packet decomposition and the wavelet packet coefficients are calculated;
(3) the cepstrum D_n of the wavelet packet coefficients obtained in the previous step is computed;
(4) the D_n are merged into a new vector [D_1 D_2 … D_n], which is the DWTL parameter.
The most important characteristic of the Mel frequency cepstral coefficients (MFCC) is that they exploit the auditory principle of the human ear and the decorrelating property of the cepstrum. The Mel frequency has a nonlinear correspondence with the Hz frequency, and this relation is used to convert the spectrum of the speech signal to the perceptual frequency domain:
f_mel = 2595 lg(1 + f_Hz / 700)
where f_mel is the perceptual frequency in Mel and f_Hz is the actual frequency in Hz.
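The Hz-to-Mel mapping is a direct transcription of the formula above; a minimal Python sketch (the patent's tooling is MATLAB):

```python
import math

def hz_to_mel(f_hz):
    """Perceptual Mel frequency: f_mel = 2595 * lg(1 + f_hz / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

print(round(hz_to_mel(700.0), 2))   # 781.17, i.e. 2595 * lg(2)
print(round(hz_to_mel(1000.0)))     # 1000: the scale pins 1000 Hz near 1000 Mel
```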
The computation of the Mel frequency cepstrum parameters based on the wavelet transform (the DWTM parameters) is shown in Fig. 3.
DWTM parameter calculation steps:
(1), the original speech signal is first preprocessed;
(2), every preprocessed frame is passed through the discrete wavelet transform, and the transformed signal is decomposed by the wavelet packet transform to calculate the wavelet packet coefficients;
(3), the wavelet coefficients decomposed on each frequency band are filtered again;
(4), the energy value S(m) of the coefficients obtained in the previous step is computed;
(5), the log spectrum S(m) is transformed to the cepstral domain by the discrete cosine transform.
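Step (5) can be sketched as a DCT-II of the log filter-bank energies — a minimal pure-Python version; real implementations use a fast transform:

```python
import math

def dct_cepstrum(log_energies, n_ceps):
    """DCT of the log spectrum S(m): transforms the filter-bank
    log energies to the cepstral domain; c(0), the DC term carrying
    overall energy, is discarded as in the text."""
    M = len(log_energies)
    return [sum(log_energies[m] * math.cos(math.pi * n * (m + 0.5) / M)
                for m in range(M))
            for n in range(1, n_ceps + 1)]
```

A perfectly flat log spectrum has no ripple, so all cepstral coefficients (after dropping c(0)) come out zero.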
Both the DWTM and DWTL parameters are of order 13; after removing the DC component c(0), 12 orders remain, and the filter bank has 24 channels. ΔDWTL+DWTL and ΔDWTM+DWTM are used as the characteristic parameters of the speech signal. This effective combination of dynamic and static features improves the recognition rate of the system. Here DWTL and DWTM are the wavelet-transform-based LPCC and MFCC parameters respectively, and ΔDWTL and ΔDWTM are the first-order difference parameters of the DWTL and DWTM parameters. The hybrid parameter is a 24×24 matrix:
ML = | ΔDWTL   ΔDWTM |
     | DWTL    DWTM  |
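The assembly of the hybrid matrix can be sketched as follows — a simple two-point difference stands in for the Δ parameters, since the source does not specify the delta regression window, and reading the 24×24 shape as 12 frames of 12 coefficients per block is our assumption:

```python
def delta(frames):
    """First-order difference parameters of a per-frame feature
    sequence; the first frame's delta is defined as zero."""
    d = [[0.0] * len(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        d.append([c - p for c, p in zip(cur, prev)])
    return d

def hybrid_matrix(dwtl, dwtm):
    """Stack the blocks [dDWTL dDWTM; DWTL DWTM] into one feature
    matrix, combining the dynamic (delta) and static parameters."""
    d_l, d_m = delta(dwtl), delta(dwtm)
    return ([dl + dm for dl, dm in zip(d_l, d_m)] +
            [sl + sm for sl, sm in zip(dwtl, dwtm)])
```

With two frames of two coefficients per block this produces a 4×4 matrix, scaling to 24×24 for the sizes in the text.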
Commonly used recognition decision methods include the dynamic time warping algorithm and the Viterbi algorithm; the most representative probabilistic method is the HMM, and the algorithm most closely associated with the HMM is the Viterbi algorithm. The Viterbi algorithm is a dynamic programming algorithm used to find the most probable sequence of hidden states (the Viterbi path) that produced a sequence of observed events; the related forward algorithm computes the probability that a sequence of observed events occurs. During recognition, a probability is computed for each candidate, and the candidate with the highest probability is taken as the recognition result. Sometimes, however, the highest-probability candidate is not the correct result. Therefore, in the present invention, a threshold comparison is performed after the probabilities are computed: a result above the threshold is accepted as correct; otherwise the result is discarded and, after the prompt voice "pardon", the user re-enters the voice command. The threshold is an empirical value obtained through many experiments under the specific laboratory environment.
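The Viterbi decoding and the patent's extra threshold step can be sketched as follows — a minimal log-space version over a toy HMM; the dictionaries and names are illustrative, not the patent's interfaces:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Viterbi decoding in log space: the most probable hidden state
    path for an observation sequence, and that path's probability."""
    lg = math.log
    V = [{s: lg(start_p[s]) + lg(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda r: V[t - 1][r] + lg(trans_p[r][s]))
            V[t][s] = V[t - 1][prev] + lg(trans_p[prev][s]) + lg(emit_p[s][obs[t]])
            back[t][s] = prev
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path)), math.exp(V[-1][last])

def decide(best_p, threshold):
    # the patent's extra step: accept only when the best score clears
    # an empirically chosen threshold, otherwise ask the user to repeat
    return "accept" if best_p >= threshold else "pardon"
```

On a two-state example the best path's probability factors exactly as start × emission × transition × emission, which makes the result easy to check by hand.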
The training module uses the Baum-Welch algorithm (an expectation-maximization procedure) as the training method for the Hidden Markov Model (HMM).
The recognition decision module obtains the output probability by the Viterbi algorithm.
The threshold comparison module compares the obtained output probability value with the set threshold; if it is above the threshold, the recognition result is output; otherwise the recognition result is discarded.
The speech recognition system comprises preprocessing, endpoint detection, characteristic parameter extraction, recognition decision and threshold comparison. To build a complete speech recognition system, a template base must first be established according to the needs of the practical application. In the present embodiment, a template base was built from 6 speakers, each repeating every command 10 times. The voice commands of the template base include: direction commands, such as "left", "right", "forward", "stop"; greeting commands, such as "hello", "thanks"; and question-and-answer commands, such as "what is your name?", "where are you from?", "which functions do you have?". Each command goes through the following process: recording the template voice, preprocessing, detecting the speech endpoints, extracting the speech characteristic parameters, recognition decision, and threshold comparison.
1), voice recording: the analog voice signal is digitized; the voice is sampled at a sampling frequency of 8000 Hz and then quantized;
2), preprocessing comprises: pre-filtering, which is in fact a bandpass filter whose purpose is to prevent aliasing interference; pre-emphasis, whose purpose is to boost the high-frequency part and flatten the spectrum of the signal for spectrum analysis or channel parameter analysis, using the pre-emphasis digital filter H(z) = 1 - uz^{-1} with u = 0.97; and windowing — the speech signal is non-stationary, but over a period of 10~20 ms it can be assumed stationary, so within such a frame its spectral characteristics and some physical feature parameters can be regarded as approximately constant;
3), endpoint detection, whose purpose is to determine the starting point and end point of the voice within a segment of signal containing speech; in the present embodiment, the endpoints are first detected preliminarily by the energy method and then refined by a multilayer perceptron neural network;
4), characteristic parameter extraction: the DWTL and DWTM parameters are computed, then the first-order difference parameters ΔDWTL and ΔDWTM, and they are combined into a 24×24 matrix as the speech characteristic parameter;
5), template base construction: a parameter sequence is built from the extracted characteristic parameters.
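Steps 2) and 3) above can be sketched as follows — a minimal pure-Python illustration; the Hamming window and the two energy thresholds are illustrative choices, since the text fixes only the pre-emphasis filter:

```python
import math

def pre_emphasis(x, u=0.97):
    """Step 2): H(z) = 1 - u*z^-1 boosts the high-frequency part so
    the spectrum of the frame becomes flatter for parameter analysis."""
    return [x[0]] + [x[n] - u * x[n - 1] for n in range(1, len(x))]

def hamming(N):
    # analysis window applied to each 10-20 ms frame before the transform
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def detect_endpoints(frame_energies, low, high):
    """Step 3), preliminary energy-based detection (double threshold):
    speech must cross the high threshold; its start/end are then pushed
    outwards to where the energy first drops below the low threshold."""
    peak = next((i for i, e in enumerate(frame_energies) if e >= high), None)
    if peak is None:
        return None                      # no speech found in the segment
    start, end = peak, peak
    while start > 0 and frame_energies[start - 1] >= low:
        start -= 1
    while end + 1 < len(frame_energies) and frame_energies[end + 1] >= low:
        end += 1
    return start, end
```

In the full system the endpoints found this way would then be refined by the multilayer perceptron mentioned in step 3).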
Each voice command goes through steps 1)~5); iterating over all commands finally builds a complete voice template base. After the template base has been established, testing proceeds as follows:
Step 1), recording the test speech: a voice command, such as "right", is spoken into the microphone; the analog voice signal is converted into a digital voice signal by A/D conversion, i.e. digitized, and then sampled and quantized.
Step 2), preprocessing, as in step 2) above.
Step 3), endpoint detection, as in step 3) above.
Step 4), characteristic parameter extraction, as in step 4) above.
Step 5), recognition decision: a probability is computed for each template from the extracted parameters; the template with the maximum probability is the candidate result.
Step 6), the maximum probability computed in step 5) is compared with the threshold: a maximum probability greater than or equal to the threshold is output as the recognition result; if it is less than the threshold, the candidate result is discarded and the user is prompted to input the voice command again.
The same process is applied to the other test voice commands until all of the test speech has been tested. Deficiencies of the system found during testing are addressed by adjusting parameters and testing again. Experiments show that the present invention recognizes voice commands well: for example, when "left" is spoken into the microphone, the robot turns left after processing and recognition. Other commands and questions are likewise recognized effectively.
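The decision in steps 5) and 6) can be sketched as follows — a hypothetical helper in which the scores dictionary stands in for the per-template probabilities produced by the recognition module:

```python
def recognize(scores, threshold):
    """Pick the template with the maximum probability (step 5), then
    accept it only if that probability reaches the threshold (step 6);
    None means the caller should prompt "pardon" and re-record."""
    best = max(scores, key=scores.get)
    if scores[best] >= threshold:
        return best
    return None
```

A below-threshold maximum is deliberately rejected rather than returned, which is the behavior the threshold comparison module adds over a plain argmax decision.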

Claims (8)

1. A speech recognition system, comprising a voice input module (1), a preprocessing module (2), a feature extraction module (3), a training module (4) and a recognition module (5), wherein the output terminal of the voice input module (1) is connected with the input terminal of the preprocessing module (2), the output terminal of the preprocessing module (2) is connected with the input terminal of the feature extraction module (3), the output terminal of the feature extraction module (3) is connected with the input terminals of the training module (4) and the recognition module (5) respectively, and the training module (4) is connected with the recognition module (5); characterized by further comprising a recognition decision module (6) and a threshold comparison module (7), wherein the output terminal of the recognition module (5) is connected with the input terminal of the recognition decision module (6), and the output terminal of the recognition decision module (6) is connected with the input terminal of the threshold comparison module (7).
2. The speech recognition system according to claim 1, characterized in that the voice input module (1) is used for inputting the original speech signal.
3. The speech recognition system according to claim 1, characterized in that the preprocessing module (2) comprises, connected in turn, a pre-filtering unit, a sampling and quantizing unit, a pre-emphasis unit, a windowing unit and an endpoint detection unit;
the pre-filtering unit is used for removing high-frequency noise from the original speech signal;
the sampling and quantizing unit samples and quantizes the denoised analog signal according to the Nyquist sampling theorem to obtain a digital signal;
the pre-emphasis unit is used for boosting the high-frequency part so that the spectrum of the signal becomes flat, facilitating parameter analysis;
the windowing unit is used for limiting the signal to finite frames for processing;
the endpoint detection unit detects the starting point and end point of the speech segment, removes unwanted silent segments, and extracts the effective speech signal segment.
4. The speech recognition system according to claim 3, characterized in that the endpoint detection unit adopts a method combining the double-threshold energy method with an artificial neural network.
5. The speech recognition system according to claim 1, characterized in that the feature extraction module (3) adopts a composite feature parameter extraction algorithm based on the wavelet transform, the extracted parameters being the wavelet-transform-based linear prediction cepstrum coefficient parameters and the wavelet-transform-based Mel frequency cepstral coefficients.
6. The speech recognition system according to claim 1, characterized in that the training module (4) uses the Baum-Welch algorithm as the training method for the Hidden Markov Model (HMM).
7. The speech recognition system according to claim 1, characterized in that the recognition decision module (6) obtains the output probability by the Viterbi algorithm.
8. The speech recognition system according to claim 1, characterized in that the threshold comparison module (7) is used for comparing the obtained output probability value with the set threshold; if it is above the threshold, the recognition result is output; otherwise the recognition result is discarded.
CN 201210475180 2012-11-20 2012-11-20 Speech recognition system of humanoid robot Pending CN103065629A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201210475180 CN103065629A (en) 2012-11-20 2012-11-20 Speech recognition system of humanoid robot


Publications (1)

Publication Number Publication Date
CN103065629A true CN103065629A (en) 2013-04-24

Family

ID=48108229

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201210475180 Pending CN103065629A (en) 2012-11-20 2012-11-20 Speech recognition system of humanoid robot

Country Status (1)

Country Link
CN (1) CN103065629A (en)

Cited By (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514877A (en) * 2013-10-12 2014-01-15 新疆美特智能安全工程股份有限公司 Vibration signal characteristic parameter extracting method
CN103514879A (en) * 2013-09-18 2014-01-15 广东欧珀移动通信有限公司 Local voice recognition method based on BP neural network
CN104123934A (en) * 2014-07-23 2014-10-29 泰亿格电子(上海)有限公司 Speech composition recognition method and system
CN104679729A (en) * 2015-02-13 2015-06-03 广州市讯飞樽鸿信息技术有限公司 Recorded message effective processing method and system
CN104952446A (en) * 2014-03-28 2015-09-30 苏州美谷视典软件科技有限公司 Digital building presentation system based on voice interaction
CN105632493A (en) * 2016-02-05 2016-06-01 深圳前海勇艺达机器人有限公司 Method for controlling and wakening robot through voice
CN105845127A (en) * 2015-01-13 2016-08-10 阿里巴巴集团控股有限公司 Voice recognition method and system
CN105913840A (en) * 2016-06-20 2016-08-31 西可通信技术设备(河源)有限公司 Speech recognition device and mobile terminal
CN106054602A (en) * 2016-05-31 2016-10-26 中国人民解放军理工大学 Fuzzy adaptive robot system capable of recognizing voice demand and working method thereof
WO2017000786A1 (en) * 2015-06-30 2017-01-05 芋头科技(杭州)有限公司 System and method for training robot via voice
CN106313113A (en) * 2015-06-30 2017-01-11 芋头科技(杭州)有限公司 System and method for training robot
CN106328126A (en) * 2016-10-20 2017-01-11 北京云知声信息技术有限公司 Far-field speech recognition processing method and device
CN106373562A (en) * 2016-08-31 2017-02-01 黄钰 Robot voice recognition method based on natural language processing
CN106448676A (en) * 2016-10-26 2017-02-22 安徽省云逸智能科技有限公司 Robot speech recognition system based on natural language processing
CN106448656A (en) * 2016-10-26 2017-02-22 安徽省云逸智能科技有限公司 Robot speech recognition method based on natural language processing
CN106531152A (en) * 2016-10-26 2017-03-22 安徽省云逸智能科技有限公司 HTK-based continuous speech recognition system
CN106782550A (en) * 2016-11-28 2017-05-31 黑龙江八农垦大学 A kind of automatic speech recognition system based on dsp chip
CN106887226A (en) * 2017-04-07 2017-06-23 天津中科先进技术研究院有限公司 Speech recognition algorithm based on artificial intelligence recognition
CN106997243A (en) * 2017-03-28 2017-08-01 北京光年无限科技有限公司 Speech scene monitoring method and device based on intelligent robot
CN107680583A (en) * 2017-09-27 2018-02-09 安徽硕威智能科技有限公司 A kind of speech recognition system and method
CN107742516A (en) * 2017-09-29 2018-02-27 上海与德通讯技术有限公司 Intelligent identification Method, robot and computer-readable recording medium
CN107765557A (en) * 2016-08-23 2018-03-06 美的智慧家居科技有限公司 Intelligent home control system and method
CN107791255A (en) * 2017-09-15 2018-03-13 北京石油化工学院 One kind is helped the elderly robot and speech control system
CN108172242A (en) * 2018-01-08 2018-06-15 深圳市芯中芯科技有限公司 A kind of improved blue-tooth intelligence cloud speaker interactive voice end-point detecting method
CN108288465A (en) * 2018-01-29 2018-07-17 中译语通科技股份有限公司 Intelligent sound cuts the method for axis, information data processing terminal, computer program
CN108320746A (en) * 2018-02-09 2018-07-24 杭州智仁建筑工程有限公司 A kind of intelligent domestic system
CN108510979A (en) * 2017-02-27 2018-09-07 芋头科技(杭州)有限公司 A kind of training method and audio recognition method of mixed frequency acoustics identification model
CN108550394A (en) * 2018-03-12 2018-09-18 广州势必可赢网络科技有限公司 Disease diagnosis method and device based on voiceprint recognition
CN108665889A (en) * 2018-04-20 2018-10-16 百度在线网络技术(北京)有限公司 The Method of Speech Endpoint Detection, device, equipment and storage medium
CN108922561A (en) * 2018-06-04 2018-11-30 平安科技(深圳)有限公司 Speech differentiation method, apparatus, computer equipment and storage medium
CN109036470A (en) * 2018-06-04 2018-12-18 平安科技(深圳)有限公司 Speech differentiation method, apparatus, computer equipment and storage medium
CN110281247A (en) * 2019-06-10 2019-09-27 旗瀚科技有限公司 A kind of man-machine interactive system and method for disabled aiding robot of supporting parents
WO2019229639A1 (en) * 2018-05-31 2019-12-05 Chittora Anshu A method and a system for analysis of voice signals of an individual
CN111583962A (en) * 2020-05-12 2020-08-25 南京农业大学 Sheep rumination behavior monitoring method based on acoustic analysis
CN112149606A (en) * 2020-10-02 2020-12-29 深圳市中安视达科技有限公司 Intelligent control method and system for medical operation microscope and readable storage medium
CN112562646A (en) * 2020-12-09 2021-03-26 江苏科技大学 Robot voice recognition method
CN113393865A (en) * 2020-03-13 2021-09-14 阿里巴巴集团控股有限公司 Power consumption control, mode configuration and VAD method, apparatus and storage medium
CN115862636A (en) * 2022-11-19 2023-03-28 杭州珍林网络技术有限公司 Internet man-machine verification method based on voice recognition technology



Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20130424