CN103065629A - Speech recognition system of humanoid robot - Google Patents

Speech recognition system of humanoid robot

Info

Publication number
CN103065629A
CN103065629A (application CN201210475180A)
Authority
CN
China
Prior art keywords
module
signal
speech recognition
recognition system
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 201210475180
Other languages
Chinese (zh)
Inventor
刘治
林俊潜
徐淑琼
章云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN 201210475180 priority Critical patent/CN103065629A/en
Publication of CN103065629A publication Critical patent/CN103065629A/en
Pending legal-status Critical Current

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a speech recognition system comprising a speech input module, a preprocessing module, a feature extraction module, a training module, a recognition module, a recognition decision module and a threshold comparison module. The output of the speech input module is connected to the input of the preprocessing module; the output of the preprocessing module is connected to the input of the feature extraction module; the output of the feature extraction module is connected to the inputs of the training module and the recognition module; and the training module is connected to the recognition module. The output of the recognition module is connected to the input of the recognition decision module, and the output of the recognition decision module is connected to the input of the threshold comparison module. The system combines a hidden Markov model (HMM) with wavelet transform and neural network techniques, makes its final decision by threshold comparison, and thereby improves the recognition rate.

Description

Speech recognition system for a humanoid robot
Technical field
The present invention is a speech recognition system for a humanoid robot. It can be used in intelligent robots, and also in intelligent systems, intelligent equipment, human-computer interaction devices and the like.
Background technology
Communicating with machines by voice, so that a machine understands what you say, is something people have long dreamed of: to have all kinds of machines understand human language and act on spoken commands, thereby achieving human-machine communication. With the development of science and technology, speech recognition technology has appeared and this ideal is gradually being realized, although fully achieving it still requires sustained human effort. Speech recognition technology enables a machine to identify and understand a speech signal and convert it into the corresponding text or command.
Speech recognition has been a very active research field in recent years. Its applications are extensive and include voice input systems, voice control systems, voice dialing systems and intelligent home appliances. In the near future, speech recognition may become an important means of human-computer interaction, supplementing or even replacing traditional input devices such as the keyboard and mouse for text entry and operational control on personal computers. In application scenarios such as handheld PDAs, intelligent appliances and industrial field control, speech recognition has even greater development potential. Especially in palmtop systems such as PDAs and mobile phones, the keyboard greatly hinders miniaturization, yet these systems increasingly tend toward intelligence and informatization: they must not only display large amounts of text and graphics but also provide convenient text input. Traditional keyboard input is inadequate here, and speech recognition is a promising alternative. The application of voice technology has become a competitive emerging industry, so research on speech recognition technology has wide application value and good development prospects.
Speech recognition technology mainly comprises three aspects: feature extraction, pattern matching criteria, and model training. It has also been fully adopted in vehicle networking; for example, in a networked car the driver only needs to press a push-to-talk button and dictate the destination to a service representative to navigate directly, which is safe and convenient. However, speech recognition still faces the following five problems:
(1) Recognition and understanding of natural language. Continuous speech must first be decomposed into units such as words and phonemes, and rules for understanding semantics must then be established.
(2) The large amount of speech information. Speech patterns differ not only between speakers but also for the same speaker; for example, the speech of the same speaker differs when speaking casually and when speaking carefully, and a person's way of speaking changes over time.
(3) The ambiguity of speech. Different words may sound similar when spoken; this is common in both English and Chinese.
(4) The acoustic characteristics of individual letters or words are affected by context, which changes stress, tone, volume, speaking rate, and so on.
(5) Environmental noise and interference seriously affect speech recognition and lower the recognition rate.
In recent decades, many experts and scholars have continuously researched these problems, so that speech recognition technology has developed and a wide variety of speech recognition systems have been built on it. Current applications of speech recognition systems include voice dialing in telephone communication, voice control in automobiles, personal digital assistants (PDAs), intelligent toys, household remote control, industrial control and the medical field. People keep studying speech recognition technology in the hope that one day humans and machines will converse as freely as humans do with each other, thereby realizing industrial automation and intelligence. With the development of science and technology, the gradual deepening of research into speech recognition theory, the maturing of its theoretical system and the progress of digital signal processing, speech recognition will in the next twenty years gradually enter industry, household appliances, communications, automotive electronics, medical care and all kinds of electronic equipment. It is safe to say that speech recognition will become a key technology of the future information industry. Undeniably, however, it still has a long way to go: true commercialization requires breakthroughs in many respects and also depends on the development of other related disciplines.
Summary of the invention
The present invention is a speech recognition system whose main purpose is to provide an efficient, stable and practical speech recognition system with a high recognition rate.
To achieve the above object, the present invention uses MATLAB as the implementation tool in combination with a greeter humanoid robot platform. A complete speech recognition system is built: the user issues voice commands through a microphone, the input speech signal is processed and recognized, and the result drives the actions of the greeter robot. Evaluation shows that the system reaches the expected targets and is a speech recognition system with strong recognition capability, high accuracy and good robustness.
The present invention is achieved by the following technical solution. A speech recognition system for a humanoid robot comprises a speech input module, a preprocessing module, a feature extraction module, a training module, a recognition module, a recognition decision module and a threshold comparison module. The output of the speech input module is connected to the input of the preprocessing module; the output of the preprocessing module is connected to the input of the feature extraction module; the output of the feature extraction module is connected to the inputs of the training module and the recognition module respectively; and the training module is connected to the recognition module. The output of the recognition module is connected to the input of the recognition decision module, and the output of the recognition decision module is connected to the input of the threshold comparison module.
The speech input module is used to input the original speech signal.
The preprocessing module comprises, connected in sequence, a pre-filtering unit, a sampling and quantization unit, a pre-emphasis unit, a windowing unit and an endpoint detection unit.
The pre-filtering unit removes high-frequency noise from the original speech signal.
The sampling and quantization unit samples the denoised analog signal according to the Nyquist sampling theorem and quantizes it to obtain a digital signal.
The pre-emphasis unit boosts the high-frequency part so that the spectrum of the signal becomes flat, which facilitates parameter analysis.
The windowing unit truncates the signal into finite segments for processing.
The endpoint detection unit detects the start and end points of each speech segment, removes unwanted silent sections, and extracts the effective speech signal segments.
The endpoint detection unit adopts a method that combines the double-threshold energy method with an artificial neural network.
The feature extraction module adopts a composite feature parameter extraction algorithm based on the wavelet transform; the extracted features are linear prediction cepstrum coefficient parameters based on the wavelet transform and Mel frequency cepstral coefficients based on the wavelet transform.
The training module uses the Baum-Welch algorithm (an expectation-maximization algorithm) as the training method for the hidden Markov model (HMM).
The recognition decision module obtains output probabilities by the Viterbi algorithm.
The threshold comparison module compares the obtained output probability value with a set threshold: if it is higher than the threshold, the recognition result is output; otherwise, the result is discarded.
Working process of the invention: the speech signal enters the speech input module from a microphone and is preprocessed by the preprocessing module, where preprocessing comprises pre-filtering, sampling and quantization, pre-emphasis, windowing and endpoint detection. Feature parameters are then extracted from the preprocessed signal; during training, the extracted parameter sequences are saved to build a speech parameter template library, which constitutes the training module. During recognition, speech is input from the microphone and preprocessed, feature parameters are extracted, probability calculation and matching are performed between the extracted feature parameters and the established speech parameter template library, and the matching result is passed through the threshold comparison module to obtain the final recognition result.
In the present invention, a threshold comparison is performed after the probability calculation: if the probability is above the threshold, the result is considered a correct recognition result; otherwise this result is discarded and, after the system plays a "pardon" prompt, the voice command is re-entered. The threshold is an empirical value obtained through many experiments in the specific laboratory environment.
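The accept-or-reject rule above can be sketched as follows. This is a minimal illustration in Python (the patent's own implementation tool is MATLAB); the command names, scores and threshold value are invented for the example, since the patent only states that the threshold is an empirical laboratory value.

```python
def decide(scores, threshold):
    """Pick the best-scoring command, but only accept it when its
    score clears the threshold; otherwise reject so the robot can
    prompt "pardon" and ask for the command again.

    `scores` maps command names to HMM output (log-)probabilities.
    All names and values here are illustrative."""
    best_cmd = max(scores, key=scores.get)
    if scores[best_cmd] >= threshold:
        return best_cmd          # accepted recognition result
    return None                  # rejected: re-enter the voice command

# Hypothetical log-probability scores for two candidate commands.
print(decide({"forward": -12.3, "stop": -25.1}, threshold=-20.0))  # forward
print(decide({"forward": -42.3, "stop": -45.1}, threshold=-20.0))  # None
```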
Description of drawings
Fig. 1 is a block diagram of the speech recognition system;
Fig. 2 is a block diagram of speech signal preprocessing;
Fig. 3 is a block diagram of the DWTM computation process.
Embodiment
For a better understanding of the present invention, embodiments of the invention are described in detail below in conjunction with the accompanying drawings. The embodiments are implemented on the premise of the technical solution of the present invention, and detailed implementations and specific operation processes are given, but the protection scope of the present invention is not limited to the following embodiments.
As shown in Fig. 1, the block diagram of the system of the present invention, the speech recognition system comprises a speech input module, a preprocessing module, a feature extraction module, a training module, a recognition module, a recognition decision module and a threshold comparison module. The output of the speech input module is connected to the input of the preprocessing module; the output of the preprocessing module is connected to the input of the feature extraction module; the output of the feature extraction module is connected to the inputs of the training module and the recognition module respectively; and the training module is connected to the recognition module. The output of the recognition module is connected to the input of the recognition decision module, and the output of the recognition decision module is connected to the input of the threshold comparison module.
The speech input module is used to input the original speech signal.
The preprocessing module comprises, connected in sequence as shown in Fig. 2, a pre-filtering unit, a sampling and quantization unit, a pre-emphasis unit, a windowing unit and an endpoint detection unit.
In preprocessing, the speech signal is first pre-filtered to prevent aliasing interference; the pre-filter is in fact a bandpass filter with upper and lower cutoff frequencies f_H and f_L. The signal then undergoes A/D conversion: an analog speech signal is continuous and cannot be processed by a computer, so the first step of speech signal processing is to convert the analog signal into a digital signal through the two steps of sampling and quantization. After the speech is captured by the microphone, the A/D conversion turns the analog signal into a digital one (speech collected by a computer or other digital recorder has already been digitized, so the user generally does not need to digitize it again). According to the Nyquist sampling theorem, f_s > 2 f_max; here the signal is sampled at 8000 Hz and divided into frames of 200 samples, with 50% overlap between consecutive frames. The quantized signal is then pre-emphasized: the average power spectrum of the speech signal is affected by glottal excitation and mouth-nose radiation, and the high end falls off at about 6 dB/octave above 800 Hz, so the speech signal must be pre-emphasized. The purpose of pre-emphasis is to boost the high-frequency part and flatten the spectrum of the signal to facilitate spectral or channel parameter analysis; the pre-emphasis digital filter is H(z) = 1 - uz^-1, with u = 0.97. Finally the signal is windowed: owing to the motion of the human vocal organs, the speech signal is a typical non-stationary signal whose characteristics vary with time, but this physical motion is much slower than the acoustic vibration, so within a period of about 10-20 ms the speech signal can be assumed stationary, and its spectral characteristics and some physical feature parameters can be regarded as approximately constant. A Hamming window is adopted in the present invention.
The pre-filtering unit removes the high-frequency noise from the original speech signal; it removes unnecessary components and prepares the signal for subsequent processing, guaranteeing signal quality and processing speed.
The sampling and quantization unit samples the denoised analog signal according to the Nyquist sampling theorem and quantizes it to obtain a digital signal. Since the original speech signal is an analog signal mixed with high-frequency noise, it must be stripped of high-frequency noise and digitized; according to the Nyquist sampling theorem, sampling and quantization at 8000 Hz yield the digitized speech signal.
The pre-emphasis unit boosts the high-frequency part so that the spectrum of the digital signal becomes flat, facilitating parameter analysis. The average power of the speech signal is affected by glottal excitation and mouth-nose radiation, and the high end falls off; the speech signal is therefore pre-emphasized to boost the high-frequency part and flatten the spectrum. The pre-emphasis digital filter is:
H(z) = 1 - μz^-1
where μ is close to 1 and is taken as 0.97 here.
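The pre-emphasis filter H(z) = 1 - μz^-1 corresponds to the one-line difference equation y[n] = x[n] - μ·x[n-1]. A minimal Python sketch (the patent itself uses MATLAB) with the stated μ = 0.97 and an invented toy signal:

```python
def preemphasis(x, mu=0.97):
    """First-order pre-emphasis: y[n] = x[n] - mu*x[n-1].
    Boosts high frequencies and flattens the spectrum; mu = 0.97
    as stated in the patent."""
    return [x[0]] + [x[n] - mu * x[n - 1] for n in range(1, len(x))]

signal = [1.0, 1.0, 1.0, 1.0]   # a flat (DC-like) toy signal
print(preemphasis(signal))      # DC content is attenuated to ~0.03 after n=0
```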
The windowing unit truncates the digital signal into finite segments. Because the speech signal is non-stationary, its characteristics vary with time; within a period of 10-20 ms, however, the speech signal can be regarded as stationary and its spectral characteristics as approximately constant. The speech signal is therefore windowed and divided into short segments, each called an analysis frame. The digitized speech signal is divided into frames of 200 samples, with 50% overlap between consecutive frames. A Hamming window is applied to the speech signal; the Hamming window function is:
w(n) = 0.54 - 0.46 cos(2πn / (N - 1)),  0 ≤ n ≤ N - 1
where N is the number of samples per frame; in this system N = 200.
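The framing and Hamming windowing described above can be sketched in Python (the patent uses MATLAB). Frame length 200 and 50% overlap follow the text; the input signal is a dummy example:

```python
import math

def hamming(N):
    """Hamming window w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)), n = 0..N-1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def frame_signal(x, frame_len=200, overlap=0.5):
    """Split x into frames of frame_len samples with 50% overlap
    (8000 Hz sampling and 200-sample frames = 25 ms per frame, as in
    the patent) and apply a Hamming window to each frame."""
    step = int(frame_len * (1 - overlap))
    w = hamming(frame_len)
    frames = []
    for start in range(0, len(x) - frame_len + 1, step):
        frames.append([x[start + n] * w[n] for n in range(frame_len)])
    return frames

x = [1.0] * 1000                 # 1000 dummy samples
print(len(frame_signal(x)))      # 9 frames: starts at 0, 100, ..., 800
```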
The purpose of speech endpoint detection is to determine the start and end points of the speech within a signal segment that contains speech. It is a crucial step in a speech recognition system: only when the endpoints of the speech signal are judged accurately can speech processing be performed correctly. Effective endpoint detection not only minimizes processing time but also excludes the noise of silent sections, guaranteeing processing quality. A good endpoint detection method can remedy problems such as unsatisfactory detection and low recognition rates in speech recognition software and provide a reliable basis for recognition. It should be robust: it should distinguish background noise, non-speech sounds and the voices of non-target speakers from normal dialogue, reducing the endpoint errors these sounds cause and the false interruptions that result. High-precision endpoint detection ensures that the signal fed to the recognizer is an effective, complete speech signal, making recognition faster and more accurate. The energy method is commonly used, but in practical applications a high signal-to-noise ratio cannot be guaranteed, so endpoint detection becomes inaccurate, the detected speech is incomplete, and recognition suffers. In the present invention, the short-time energy and short-time average zero-crossing rate of the windowed signal are first calculated to preliminarily detect unvoiced sounds, voiced sounds and the individual speech segments; the result is then further examined by a multilayer perceptron neural network, and each spectrum is finally smoothed with a median filter. Experiments prove that this hybrid method achieves a good endpoint detection effect.
The endpoint detection unit detects the start and end points of speech segments, removes unwanted silent sections, and extracts the effective speech signal segments. It is a crucial step in the speech recognition system: only when the endpoints of the speech signal are judged accurately can speech processing be performed correctly.
The endpoint detection unit adopts a method combining the double-threshold energy method with an artificial neural network. The energy method, i.e. the short-time energy method combined with the short-time average zero-crossing rate, can reach high accuracy only under high signal-to-noise ratio (SNR) conditions.
The short-time energy detection algorithm is suitable for detecting voiced sounds; its formula is:
E_n = Σ_{m=-∞}^{∞} [x(m)w(n-m)]^2 = Σ_{m=n-N+1}^{n} [x(m)w(n-m)]^2
The short-time average zero-crossing rate is suitable for detecting unvoiced sounds; its formula is:
Z_n = Σ_{m=-∞}^{∞} |sgn[x(m)] - sgn[x(m-1)]| w(n-m) = |sgn[x(n)] - sgn[x(n-1)]| * w(n)
where * denotes convolution, and
sgn[x(n)] = 1 if x(n) ≥ 0, and -1 if x(n) < 0.
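A minimal Python sketch of the two detectors (the patent implementation is in MATLAB). The toy frames are invented to show why short-time energy separates voiced frames and the zero-crossing count separates unvoiced ones:

```python
def sgn(v):
    """Sign function as defined above: +1 for v >= 0, -1 otherwise."""
    return 1 if v >= 0 else -1

def short_time_energy(frame):
    """E_n: sum of squared samples of the (already windowed) frame."""
    return sum(s * s for s in frame)

def zero_crossing_count(frame):
    """Z_n: number of sign changes, from |sgn(x[m]) - sgn(x[m-1])| / 2."""
    return sum(abs(sgn(frame[m]) - sgn(frame[m - 1])) // 2
               for m in range(1, len(frame)))

voiced = [0.5, 0.6, 0.55, 0.62]        # large amplitude, no sign change
unvoiced = [0.01, -0.02, 0.01, -0.01]  # small amplitude, many sign changes
print(short_time_energy(voiced) > short_time_energy(unvoiced))  # True
print(zero_crossing_count(unvoiced))                            # 3
```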
The double-threshold energy method sets two thresholds on the basis of the energy method: a smaller one, used to detect the boundary between silent sections and speech segments, and a larger one, used to detect the intensity of the speech signal. Here the two thresholds are set loosely to guarantee that the speech signal is complete, and on this basis the signal is fed into the artificial neural network detector.
The artificial neural network detector is a multilayer perceptron. Its advantage for endpoint detection is that it takes the correlation between speech frames into account and minimizes the error probability; its main drawback is that with this algorithm alone it is difficult to find speech features that clearly distinguish voiced from unvoiced sounds, which is exactly where it complements the double-threshold energy method. The interconnection pattern of the many processing units (neurons) of an artificial neural network reflects the structure of the network and determines its capability. Each neuron (say neuron j) receives the information transmitted by other neurons (say neuron i); the total input is:
I_j = Σ_{i=1}^{n} w_ij x_i - θ_j
where w_ij is the connection weight from neuron i to neuron j, x_i is the output of neuron i, and θ_j is the threshold of neuron j. The output of neuron j is o_j = f(I_j), where the function f(·) is called the activation function; the sigmoid function is chosen:
f(x) = 1 / (1 + e^-x)
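The neuron equations above amount to a weighted sum minus a threshold, passed through the sigmoid. A minimal Python sketch (the patent uses MATLAB) with hypothetical weights and inputs:

```python
import math

def neuron_output(inputs, weights, theta):
    """One MLP neuron: I_j = sum_i(w_ij * x_i) - theta_j, passed
    through the sigmoid activation f(x) = 1 / (1 + e^-x)."""
    I = sum(w * x for w, x in zip(weights, inputs)) - theta
    return 1.0 / (1.0 + math.exp(-I))

# With zero net input (0.5 + 0.5 - 1.0 = 0) the sigmoid gives 0.5.
print(neuron_output([1.0, 1.0], [0.5, 0.5], theta=1.0))  # 0.5
```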
A multilayer perceptron (MLP) neural network is a feedforward neural network with multiple layers of neurons and an error back-propagation structure. It usually consists of three parts: a group of perceptual units forming the input layer; one or more hidden layers of computing nodes; and an output layer of computing nodes. The input layer perceives external information, the output layer gives the classification result, and the hidden layers process the input information. Each hidden or output neuron performs two kinds of computation: computing the function signal appearing at its output, and computing an estimate of the gradient vector, which must be propagated backwards through the network. Back-propagation (the BP algorithm) is therefore adopted as the standard training algorithm for the MLP network. The BP algorithm learns in a supervised fashion, and the learning process consists of a forward-propagation phase and a back-propagation phase. First a desired output is set for each input pattern; then a training sample is fed to the network and propagated from the input layer through the hidden layers to the output layer. The difference between the actual output and the desired output is the error, and according to this error the connection weights are corrected layer by layer from the output layer back toward the hidden layers. Through this process the actual outputs gradually approach their respective desired values, and the connection weights w_ij between the layers are finally determined.
The amplitude of the speech signal is large relative to the dynamic range of the background noise, so within speech segments many random events occur and the average information, i.e. the entropy, is large; moreover, the energy of the background noise sections is distributed more steadily across the frequencies, which in terms of information content means their average information, i.e. the spectral entropy, is larger. Therefore, for the signal processed by the double-threshold energy method, its amplitude entropy and spectral entropy are calculated and used as the inputs of the MLP neural network; after propagation through the layers of the network, the detection result is obtained.
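The patent does not specify the exact entropy estimator, so the following Python sketch uses a common histogram-based Shannon entropy as a stand-in for the amplitude/spectral entropy features described above; the signals and bin count are illustrative:

```python
import math
import random

def shannon_entropy(values, bins=10):
    """Histogram-based Shannon entropy (bits) of a sample sequence;
    one common estimator, assumed here since the patent gives none."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0        # avoid zero width for flat input
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

flat_noise = [0.1, -0.1, 0.1, -0.1] * 50                   # regular: low entropy
random.seed(0)
speech_like = [random.uniform(-1, 1) for _ in range(200)]  # spread out: high entropy
print(shannon_entropy(flat_noise) < shannon_entropy(speech_like))  # True
```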
The feature extraction module adopts a composite feature parameter extraction algorithm based on the wavelet transform; the extracted features are the linear prediction cepstrum coefficient parameters based on the wavelet transform and the Mel frequency cepstral coefficients based on the wavelet transform.
Speech recognition is a matching process. First, a model is established according to the characteristics of speech: the input speech signal is analyzed, the required features are extracted, and on this basis the templates needed for recognition are built. During recognition, according to the overall model of speech recognition, the features of the input speech signal are compared with the sound templates stored in the computer and, following a certain search and matching strategy, the templates that best match the input speech are found to obtain the recognition result; this is the principle of speech recognition. Common methods for extracting speech feature parameters include LPCC (linear prediction cepstrum coefficient) parameters, MFCC (Mel frequency cepstral coefficient) parameters, and parameters based on their first-order differences. The shortcoming of the LPCC parameters is that they do not exploit human auditory perception; the MFCC parameters do exploit human auditory properties, but both only reflect the static characteristics of speech and cannot reflect its dynamics, which is why first-order difference parameters have been proposed to describe the dynamic characteristics of speech. The speech signal is non-stationary, and the Fourier transform is a global analysis method that cannot reflect the local properties of speech; the present invention therefore introduces the discrete wavelet transform, replacing the Fourier transform when extracting the LPCC and MFCC parameters, so that the advantages of the wavelet transform improve the recognition performance of the speech recognition system. To further improve the recognition rate, the first-order difference parameters are also calculated, and the combinations ΔDWTL+DWTL and ΔDWTM+DWTM are used as the feature parameters of the speech signal, where DWTL and DWTM are the wavelet-transform-based LPCC and MFCC parameters respectively, and ΔDWTL and ΔDWTM are their difference parameters.
Feature parameter extraction is the process of obtaining, from the speech signal, a group of parameters that describe its characteristics. The proposed algorithm is based on the linear prediction cepstrum coefficient parameters (LPCC) and the Mel frequency cepstral coefficients (MFCC), replacing the discrete Fourier transform (DFT) with the discrete wavelet transform. The wavelet transform is local in both the time and frequency domains, and its time-frequency window adapts to different frequencies, so it can accurately reflect the instantaneous variations of a non-stationary signal. Wavelet analysis has different resolutions at different positions of the time-frequency plane and is a multiresolution analysis method: it has higher frequency resolution and lower time resolution at low frequencies, and higher time resolution and lower frequency resolution at high frequencies, so it can characterize local features of the signal in both the time and frequency domains.
The linear prediction cepstrum coefficient (LPCC) parameters are feature parameters that characterize the speaker's individuality. The model system function is:
H(z) = 1 / (1 - Σ_{k=1}^{p} a_k z^-k)
and its cepstrum is defined by:
lg H(z) = Ĥ(z) = Σ_{n=1}^{∞} ĥ(n) z^-n
Combining the two formulas above and differentiating both sides with respect to z gives:
(1 - Σ_{k=1}^{p} a_k z^-k) Σ_{n=1}^{∞} n ĥ(n) z^{-n+1} = Σ_{k=1}^{p} k a_k z^{-k+1}
Equating the constant terms and the coefficients of each power of z^-1 on both sides yields the recurrence relation between ĥ(n) and a_k, from which the cepstrum is obtained from the prediction coefficients a_k:
ĥ(1) = a_1
ĥ(n) = a_n + Σ_{k=1}^{n-1} (k/n) ĥ(k) a_{n-k},  1 < n ≤ p
This ĥ(n) sequence constitutes the LPCC parameters.
Calculation steps of the wavelet-transform-based linear prediction cepstrum coefficient parameters (abbreviated DWTL):
(1) the original speech signal is first preprocessed;
(2) each frame signal undergoes wavelet packet decomposition and the wavelet packet coefficients are calculated;
(3) the cepstrum D_n of the wavelet packet coefficients obtained in the previous step is computed;
(4) the D_n are merged into a new vector [D_1 D_2 … D_n], which is the DWTL parameter.
The most important characteristic of the Mel frequency cepstral coefficients (MFCC) is that they exploit the auditory principle of the human ear and the decorrelating property of the cepstrum. The Mel frequency has a nonlinear correspondence with the Hz frequency, and this relation is used to convert the spectrum of the speech signal to the perceptual frequency domain:
f_mel = 2595 lg(1 + f_Hz / 700)
where f_mel is the perceptual frequency in Mel and f_Hz is the actual frequency in Hz.
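The Hz-to-Mel mapping is a direct transcription of the formula above; a minimal Python sketch (the patent's tooling is MATLAB):

```python
import math

def hz_to_mel(f_hz):
    """Perceptual Mel frequency: f_mel = 2595 * lg(1 + f_hz / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

print(round(hz_to_mel(700.0), 2))   # 781.17, i.e. 2595 * lg(2)
print(round(hz_to_mel(1000.0)))     # 1000: the scale pins 1000 Hz near 1000 Mel
```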
The computation of the Mel frequency cepstrum parameters based on the wavelet transform (the DWTM parameters) is shown in Fig. 3.
DWTM parameter calculation steps:
(1), the original speech signal is first preprocessed;
(2), every preprocessed frame is passed through the discrete wavelet transform, and the transformed signal is decomposed by the wavelet packet transform to calculate the wavelet packet coefficients;
(3), the wavelet coefficients decomposed on each frequency band are filtered again;
(4), the energy value S(m) of the coefficients obtained in the previous step is computed;
(5), the log spectrum S(m) is transformed to the cepstral domain by the discrete cosine transform.
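Step (5) can be sketched as a DCT-II of the log filter-bank energies — a minimal pure-Python version; real implementations use a fast transform:

```python
import math

def dct_cepstrum(log_energies, n_ceps):
    """DCT of the log spectrum S(m): transforms the filter-bank
    log energies to the cepstral domain; c(0), the DC term carrying
    overall energy, is discarded as in the text."""
    M = len(log_energies)
    return [sum(log_energies[m] * math.cos(math.pi * n * (m + 0.5) / M)
                for m in range(M))
            for n in range(1, n_ceps + 1)]
```

A perfectly flat log spectrum has no ripple, so all cepstral coefficients (after dropping c(0)) come out zero.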
Both the DWTM and DWTL parameters are of order 13; after removing the DC component c(0), 12 orders remain, and the filter bank has 24 channels. ΔDWTL+DWTL and ΔDWTM+DWTM are used as the characteristic parameters of the speech signal. This effective combination of dynamic and static features improves the recognition rate of the system. Here DWTL and DWTM are the wavelet-transform-based LPCC and MFCC parameters respectively, and ΔDWTL and ΔDWTM are the first-order difference parameters of the DWTL and DWTM parameters. The hybrid parameter is a 24×24 matrix:
ML = | ΔDWTL   ΔDWTM |
     | DWTL    DWTM  |
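The assembly of the hybrid matrix can be sketched as follows — a simple two-point difference stands in for the Δ parameters, since the source does not specify the delta regression window, and reading the 24×24 shape as 12 frames of 12 coefficients per block is our assumption:

```python
def delta(frames):
    """First-order difference parameters of a per-frame feature
    sequence; the first frame's delta is defined as zero."""
    d = [[0.0] * len(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        d.append([c - p for c, p in zip(cur, prev)])
    return d

def hybrid_matrix(dwtl, dwtm):
    """Stack the blocks [dDWTL dDWTM; DWTL DWTM] into one feature
    matrix, combining the dynamic (delta) and static parameters."""
    d_l, d_m = delta(dwtl), delta(dwtm)
    return ([dl + dm for dl, dm in zip(d_l, d_m)] +
            [sl + sm for sl, sm in zip(dwtl, dwtm)])
```

With two frames of two coefficients per block this produces a 4×4 matrix, scaling to 24×24 for the sizes in the text.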
Commonly used recognition decision methods include the dynamic time warping algorithm and the Viterbi algorithm; the most representative probabilistic method is the HMM, and the algorithm most closely associated with the HMM is the Viterbi algorithm. The Viterbi algorithm is a dynamic programming algorithm used to find the most probable sequence of hidden states (the Viterbi path) that produced a sequence of observed events; the related forward algorithm computes the probability that a sequence of observed events occurs. During recognition, a probability is computed for each candidate, and the candidate with the highest probability is taken as the recognition result. Sometimes, however, the highest-probability candidate is not the correct result. Therefore, in the present invention, a threshold comparison is performed after the probabilities are computed: a result above the threshold is accepted as correct; otherwise the result is discarded and, after the prompt voice "pardon", the user re-enters the voice command. The threshold is an empirical value obtained through many experiments under the specific laboratory environment.
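The Viterbi decoding and the patent's extra threshold step can be sketched as follows — a minimal log-space version over a toy HMM; the dictionaries and names are illustrative, not the patent's interfaces:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Viterbi decoding in log space: the most probable hidden state
    path for an observation sequence, and that path's probability."""
    lg = math.log
    V = [{s: lg(start_p[s]) + lg(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda r: V[t - 1][r] + lg(trans_p[r][s]))
            V[t][s] = V[t - 1][prev] + lg(trans_p[prev][s]) + lg(emit_p[s][obs[t]])
            back[t][s] = prev
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path)), math.exp(V[-1][last])

def decide(best_p, threshold):
    # the patent's extra step: accept only when the best score clears
    # an empirically chosen threshold, otherwise ask the user to repeat
    return "accept" if best_p >= threshold else "pardon"
```

On a two-state example the best path's probability factors exactly as start × emission × transition × emission, which makes the result easy to check by hand.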
The training module uses the Baum-Welch algorithm (an expectation-maximization procedure) as the training method for the Hidden Markov Model (HMM).
The recognition decision module obtains the output probability by the Viterbi algorithm.
The threshold comparison module compares the obtained output probability value with the set threshold; if it is above the threshold, the recognition result is output; otherwise the recognition result is discarded.
The speech recognition system comprises preprocessing, endpoint detection, characteristic parameter extraction, recognition decision and threshold comparison. To build a complete speech recognition system, a template base must first be established according to the needs of the practical application. In the present embodiment, a template base was built from 6 speakers, each repeating every command 10 times. The voice commands of the template base include: direction commands, such as "left", "right", "forward", "stop"; greeting commands, such as "hello", "thanks"; and question-and-answer commands, such as "what is your name?", "where are you from?", "which functions do you have?". Each command goes through the following process: recording the template voice, preprocessing, detecting the speech endpoints, extracting the speech characteristic parameters, recognition decision, and threshold comparison.
1), voice recording: the analog voice signal is digitized; the voice is sampled at a sampling frequency of 8000 Hz and then quantized;
2), preprocessing comprises: pre-filtering, which is in fact a bandpass filter whose purpose is to prevent aliasing interference; pre-emphasis, whose purpose is to boost the high-frequency part and flatten the spectrum of the signal for spectrum analysis or channel parameter analysis, using the pre-emphasis digital filter H(z) = 1 - uz^{-1} with u = 0.97; and windowing — the speech signal is non-stationary, but over a period of 10~20 ms it can be assumed stationary, so within such a frame its spectral characteristics and some physical feature parameters can be regarded as approximately constant;
3), endpoint detection, whose purpose is to determine the starting point and end point of the voice within a segment of signal containing speech; in the present embodiment, the endpoints are first detected preliminarily by the energy method and then refined by a multilayer perceptron neural network;
4), characteristic parameter extraction: the DWTL and DWTM parameters are computed, then the first-order difference parameters ΔDWTL and ΔDWTM, and they are combined into a 24×24 matrix as the speech characteristic parameter;
5), template base construction: a parameter sequence is built from the extracted characteristic parameters.
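Steps 2) and 3) above can be sketched as follows — a minimal pure-Python illustration; the Hamming window and the two energy thresholds are illustrative choices, since the text fixes only the pre-emphasis filter:

```python
import math

def pre_emphasis(x, u=0.97):
    """Step 2): H(z) = 1 - u*z^-1 boosts the high-frequency part so
    the spectrum of the frame becomes flatter for parameter analysis."""
    return [x[0]] + [x[n] - u * x[n - 1] for n in range(1, len(x))]

def hamming(N):
    # analysis window applied to each 10-20 ms frame before the transform
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]

def detect_endpoints(frame_energies, low, high):
    """Step 3), preliminary energy-based detection (double threshold):
    speech must cross the high threshold; its start/end are then pushed
    outwards to where the energy first drops below the low threshold."""
    peak = next((i for i, e in enumerate(frame_energies) if e >= high), None)
    if peak is None:
        return None                      # no speech found in the segment
    start, end = peak, peak
    while start > 0 and frame_energies[start - 1] >= low:
        start -= 1
    while end + 1 < len(frame_energies) and frame_energies[end + 1] >= low:
        end += 1
    return start, end
```

In the full system the endpoints found this way would then be refined by the multilayer perceptron mentioned in step 3).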
Each voice command goes through steps 1)~5); iterating over all commands finally builds a complete voice template base. After the template base has been established, testing proceeds as follows:
Step 1), recording the test speech: a voice command, such as "right", is spoken into the microphone; the analog voice signal is converted into a digital voice signal by A/D conversion, i.e. digitized, and then sampled and quantized.
Step 2), preprocessing, as in step 2) above.
Step 3), endpoint detection, as in step 3) above.
Step 4), characteristic parameter extraction, as in step 4) above.
Step 5), recognition decision: a probability is computed for each template from the extracted parameters; the template with the maximum probability is the candidate result.
Step 6), the maximum probability computed in step 5) is compared with the threshold: a maximum probability greater than or equal to the threshold is output as the recognition result; if it is less than the threshold, the candidate result is discarded and the user is prompted to input the voice command again.
The same process is applied to the other test voice commands until all of the test speech has been tested. Deficiencies of the system found during testing are addressed by adjusting parameters and testing again. Experiments show that the present invention recognizes voice commands well: for example, when "left" is spoken into the microphone, the robot turns left after processing and recognition. Other commands and questions are likewise recognized effectively.
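The decision in steps 5) and 6) can be sketched as follows — a hypothetical helper in which the scores dictionary stands in for the per-template probabilities produced by the recognition module:

```python
def recognize(scores, threshold):
    """Pick the template with the maximum probability (step 5), then
    accept it only if that probability reaches the threshold (step 6);
    None means the caller should prompt "pardon" and re-record."""
    best = max(scores, key=scores.get)
    if scores[best] >= threshold:
        return best
    return None
```

A below-threshold maximum is deliberately rejected rather than returned, which is the behavior the threshold comparison module adds over a plain argmax decision.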

Claims (8)

1. A speech recognition system, comprising a voice input module (1), a preprocessing module (2), a feature extraction module (3), a training module (4) and a recognition module (5), wherein the output terminal of the voice input module (1) is connected with the input terminal of the preprocessing module (2), the output terminal of the preprocessing module (2) is connected with the input terminal of the feature extraction module (3), the output terminal of the feature extraction module (3) is connected with the input terminals of the training module (4) and the recognition module (5) respectively, and the training module (4) is connected with the recognition module (5); characterized by further comprising a recognition decision module (6) and a threshold comparison module (7), wherein the output terminal of the recognition module (5) is connected with the input terminal of the recognition decision module (6), and the output terminal of the recognition decision module (6) is connected with the input terminal of the threshold comparison module (7).
2. The speech recognition system according to claim 1, characterized in that the voice input module (1) is used for inputting the original speech signal.
3. The speech recognition system according to claim 1, characterized in that the preprocessing module (2) comprises, connected in turn, a pre-filtering unit, a sampling and quantizing unit, a pre-emphasis unit, a windowing unit and an endpoint detection unit;
the pre-filtering unit is used for removing high-frequency noise from the original speech signal;
the sampling and quantizing unit samples and quantizes the denoised analog signal according to the Nyquist sampling theorem to obtain a digital signal;
the pre-emphasis unit is used for boosting the high-frequency part so that the spectrum of the signal becomes flat, facilitating parameter analysis;
the windowing unit is used for limiting the signal to finite frames for processing;
the endpoint detection unit detects the starting point and end point of the speech segment, removes unwanted silent segments, and extracts the effective speech signal segment.
4. The speech recognition system according to claim 3, characterized in that the endpoint detection unit adopts a method combining the double-threshold energy method with an artificial neural network.
5. The speech recognition system according to claim 1, characterized in that the feature extraction module (3) adopts a composite feature parameter extraction algorithm based on the wavelet transform, the extracted parameters being the wavelet-transform-based linear prediction cepstrum coefficient parameters and the wavelet-transform-based Mel frequency cepstral coefficients.
6. The speech recognition system according to claim 1, characterized in that the training module (4) uses the Baum-Welch algorithm as the training method for the Hidden Markov Model (HMM).
7. The speech recognition system according to claim 1, characterized in that the recognition decision module (6) obtains the output probability by the Viterbi algorithm.
8. The speech recognition system according to claim 1, characterized in that the threshold comparison module (7) is used for comparing the obtained output probability value with the set threshold; if it is above the threshold, the recognition result is output; otherwise the recognition result is discarded.
CN 201210475180 2012-11-20 2012-11-20 Speech recognition system of humanoid robot Pending CN103065629A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201210475180 CN103065629A (en) 2012-11-20 2012-11-20 Speech recognition system of humanoid robot


Publications (1)

Publication Number Publication Date
CN103065629A true CN103065629A (en) 2013-04-24

Family

ID=48108229

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201210475180 Pending CN103065629A (en) 2012-11-20 2012-11-20 Speech recognition system of humanoid robot

Country Status (1)

Country Link
CN (1) CN103065629A (en)

Cited By (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514877A (en) * 2013-10-12 2014-01-15 新疆美特智能安全工程股份有限公司 Vibration signal characteristic parameter extracting method
CN103514879A (en) * 2013-09-18 2014-01-15 广东欧珀移动通信有限公司 Local voice recognition method based on BP neural network
CN104123934A (en) * 2014-07-23 2014-10-29 泰亿格电子(上海)有限公司 Speech composition recognition method and system
CN104679729A (en) * 2015-02-13 2015-06-03 广州市讯飞樽鸿信息技术有限公司 Recorded message effective processing method and system
CN104952446A (en) * 2014-03-28 2015-09-30 苏州美谷视典软件科技有限公司 Digital building presentation system based on voice interaction
CN105632493A (en) * 2016-02-05 2016-06-01 深圳前海勇艺达机器人有限公司 Method for controlling and wakening robot through voice
CN105845127A (en) * 2015-01-13 2016-08-10 阿里巴巴集团控股有限公司 Voice recognition method and system
CN105913840A (en) * 2016-06-20 2016-08-31 西可通信技术设备(河源)有限公司 Speech recognition device and mobile terminal
CN106054602A (en) * 2016-05-31 2016-10-26 中国人民解放军理工大学 Fuzzy adaptive robot system capable of recognizing voice demand and working method thereof
WO2017000786A1 (en) * 2015-06-30 2017-01-05 芋头科技(杭州)有限公司 System and method for training robot via voice
CN106313113A (en) * 2015-06-30 2017-01-11 芋头科技(杭州)有限公司 System and method for training robot
CN106328126A (en) * 2016-10-20 2017-01-11 北京云知声信息技术有限公司 Far-field speech recognition processing method and device
CN106373562A (en) * 2016-08-31 2017-02-01 黄钰 Robot voice recognition method based on natural language processing
CN106448676A (en) * 2016-10-26 2017-02-22 安徽省云逸智能科技有限公司 Robot speech recognition system based on natural language processing
CN106448656A (en) * 2016-10-26 2017-02-22 安徽省云逸智能科技有限公司 Robot speech recognition method based on natural language processing
CN106531152A (en) * 2016-10-26 2017-03-22 安徽省云逸智能科技有限公司 HTK-based continuous speech recognition system
CN106782550A (en) * 2016-11-28 2017-05-31 黑龙江八农垦大学 A kind of automatic speech recognition system based on dsp chip
CN106887226A (en) * 2017-04-07 2017-06-23 天津中科先进技术研究院有限公司 Speech recognition algorithm based on artificial intelligence recognition
CN106997243A (en) * 2017-03-28 2017-08-01 北京光年无限科技有限公司 Speech scene monitoring method and device based on intelligent robot
CN107680583A (en) * 2017-09-27 2018-02-09 安徽硕威智能科技有限公司 A kind of speech recognition system and method
CN107742516A (en) * 2017-09-29 2018-02-27 上海与德通讯技术有限公司 Intelligent identification Method, robot and computer-readable recording medium
CN107765557A (en) * 2016-08-23 2018-03-06 美的智慧家居科技有限公司 Intelligent home control system and method
CN107791255A (en) * 2017-09-15 2018-03-13 北京石油化工学院 One kind is helped the elderly robot and speech control system
CN108172242A (en) * 2018-01-08 2018-06-15 深圳市芯中芯科技有限公司 A kind of improved blue-tooth intelligence cloud speaker interactive voice end-point detecting method
CN108288465A (en) * 2018-01-29 2018-07-17 中译语通科技股份有限公司 Intelligent sound cuts the method for axis, information data processing terminal, computer program
CN108320746A (en) * 2018-02-09 2018-07-24 杭州智仁建筑工程有限公司 A kind of intelligent domestic system
CN108510979A (en) * 2017-02-27 2018-09-07 芋头科技(杭州)有限公司 A kind of training method and audio recognition method of mixed frequency acoustics identification model
CN108550394A (en) * 2018-03-12 2018-09-18 广州势必可赢网络科技有限公司 Disease diagnosis method and device based on voiceprint recognition
CN108665889A (en) * 2018-04-20 2018-10-16 百度在线网络技术(北京)有限公司 The Method of Speech Endpoint Detection, device, equipment and storage medium
CN108922561A (en) * 2018-06-04 2018-11-30 平安科技(深圳)有限公司 Speech differentiation method, apparatus, computer equipment and storage medium
CN109036470A (en) * 2018-06-04 2018-12-18 平安科技(深圳)有限公司 Speech differentiation method, apparatus, computer equipment and storage medium
CN110281247A (en) * 2019-06-10 2019-09-27 旗瀚科技有限公司 A kind of man-machine interactive system and method for disabled aiding robot of supporting parents
WO2019229639A1 (en) * 2018-05-31 2019-12-05 Chittora Anshu A method and a system for analysis of voice signals of an individual
CN111583962A (en) * 2020-05-12 2020-08-25 南京农业大学 Sheep rumination behavior monitoring method based on acoustic analysis
CN112149606A (en) * 2020-10-02 2020-12-29 深圳市中安视达科技有限公司 Intelligent control method and system for medical operation microscope and readable storage medium
CN112562646A (en) * 2020-12-09 2021-03-26 江苏科技大学 Robot voice recognition method
CN113393865A (en) * 2020-03-13 2021-09-14 阿里巴巴集团控股有限公司 Power consumption control, mode configuration and VAD method, apparatus and storage medium
CN115862636A (en) * 2022-11-19 2023-03-28 杭州珍林网络技术有限公司 Internet man-machine verification method based on voice recognition technology



Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20130424