CN1264887A - Non-particular human speech recognition and prompt method based on special speech recognition chip - Google Patents
- Publication number
- CN1264887A (application CN00105548A)
- Authority
- CN
- China
- Prior art keywords
- parameter
- speech recognition
- model
- voice
- recognition
- Prior art date
- Legal status: Granted
Landscapes
- Machine Translation (AREA)
- Telephonic Communication Services (AREA)
Abstract
A non-particular-person (speaker-independent) speech recognition and prompt method comprises pre-training of speaker-independent speech recognition, extraction of speech recognition parameters, recognition of speaker-independent voice commands, speaker-adaptive learning, and voice prompting. Its advantages are a simple method, a high recognition rate, and good stability and robustness. It can be used in toy control, voice dialing, intelligent electric appliances, etc.
Description
The invention belongs to the field of voice technology, and in particular relates to small-vocabulary, speaker-dependent and speaker-independent speech recognition methods implemented on 8-bit or 16-bit single-chip MCU microcontrollers. It is particularly suitable for speech recognition chips built around 8-bit microcontrollers.
Special-purpose chips for speaker-dependent speech recognition have developed rapidly abroad in recent years. Many foreign voice-technology and semiconductor companies have invested large amounts of manpower and material resources in developing such chips, and have patented their own recognition methods. The recognition performance of these chips varies. The usual speech recognition process is shown in Figure 1: the input speech signal is first sampled by an A/D converter; spectrum-shaping, windowing, and pre-emphasis processing raise the high-frequency components; characteristic parameters are extracted in real time, the extracted parameters being linear prediction cepstrum coefficients (LPCC) or Mel-frequency cepstrum coefficients (MFCC); endpoint detection then extracts the effective speech parameters, which are used for recognition-template training or recognition-template matching, and the best recognition result is output. The hardware of such a chip generally comprises an 8-bit or 16-bit single-chip MCU microcontroller together with automatic gain control (AGC), an audio preamplifier, a low-pass filter, analog-to-digital (A/D) and digital-to-analog (D/A) converters, an audio power amplifier, a speech synthesizer, random access memory (RAM), read-only memory (ROM), and pulse-width modulation (PWM), carrying out speech recognition and speech synthesis generally as shown in Figure 2. The RSC-164 series of speech recognition chips produced by the U.S. company Sensory is at present among the special-purpose chips with the best recognition performance available in the world. These chips have been used in various mobile phones and cordless phones. As speech recognition technology improves, such chips will be widely used in all kinds of household appliances and control systems, forming an information home-appliance industry, a rapidly developing and rising high-tech industry with great potential. Philips and Korea's Samsung have released mobile phones with speaker-dependent voice-controlled dialing; the number of recognizable names is 10-20, and they have no speaker-independent recognition capability. As yet there is no chip-based speaker-independent Chinese speech recognition method, and speaker-independent English recognition methods can only recognize a tiny vocabulary, such as "yes" and "no".
The object of the invention is to overcome the shortcomings of the prior art by proposing a speaker-independent speech recognition and voice prompt method based on a special speech recognition chip, which can realize high-precision speech recognition on cheap 8-bit or 16-bit single-chip MCU microcontrollers, and which has the characteristics of low method complexity, high recognition accuracy, and good robustness. In particular, its Chinese digit recognition performance reaches or even surpasses the current international state of the art.
The present invention proposes a speaker-independent speech recognition and voice prompt method based on a special speech recognition chip, comprising A/D sampling; spectrum-shaping, windowing, and pre-emphasis processing; characteristic parameter extraction; endpoint detection; speech-recognition-template training; speech-recognition-template matching; recognition-result output; and speech synthesis. It is characterized in that it specifically comprises the following steps:
A. Pre-training for speaker-independent speech recognition:
The training process requires a large speech database and is completed on a PC; the trained templates are then stored in the chip. The training method comprises: adopting a polynomial-based classification technique; representing the parameters of the recognition model by polynomial coefficients; approximating the posterior probability by a polynomial expansion; and obtaining the model parameters by optimized solution of a system of linear equations;
B. Extraction of speech recognition parameters:
(1) The input voice signal is sampled by an A/D converter into original digital speech, and level-gain control is adopted to guarantee high sampling precision;
(2) The said original digital speech signal is subjected to spectrum shaping and frame-splitting windowing, to guarantee the quasi-stationarity of the framed speech;
(3) Speech features are extracted from the said framed speech; the principal characteristic parameters are linear prediction cepstrum coefficients (LPCC), which are stored for subsequent dynamic segmentation and template extraction;
(4) Endpoint detection is carried out using the zero-crossing rate and short-time energy features of the speech signal, removing the speech frames of silent regions to guarantee the validity of each frame's speech features;
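The B-step front end above (pre-emphasis, framed windowing, and LPC-derived cepstrum coefficients) can be sketched as follows. The frame size, hop, LPC order, and Hamming window are illustrative assumptions, not values fixed by the claims; the pre-emphasis filter 1 - 0.95z^-1 is the one named later in the embodiment.

```python
import math

def preemphasis(x, a=0.95):
    # Spectrum shaping: H(z) = 1 - 0.95 z^-1 raises high-frequency components.
    return [x[0]] + [x[n] - a * x[n - 1] for n in range(1, len(x))]

def frames(x, size=256, step=128):
    # Split into quasi-stationary frames and apply a Hamming window.
    win = [0.54 - 0.46 * math.cos(2 * math.pi * n / (size - 1)) for n in range(size)]
    return [[x[s + n] * win[n] for n in range(size)]
            for s in range(0, len(x) - size + 1, step)]

def lpc(frame, order=10):
    # Autocorrelation method + Levinson-Durbin; returns prediction
    # coefficients alpha_k in  x(n) ~ sum_k alpha_k * x(n - k).
    r = [sum(frame[n] * frame[n - k] for n in range(k, len(frame)))
         for k in range(order + 1)]
    a = [1.0] + [0.0] * order
    err = r[0] if r[0] > 0 else 1e-9
    for i in range(1, order + 1):
        k = -(r[i] + sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return [-v for v in a[1:]]

def lpcc(alpha, n_ceps=12):
    # Standard LPC-to-cepstrum recursion:
    # c_m = alpha_m + sum_{k} (k/m) c_k alpha_{m-k}.
    p = len(alpha)
    c = []
    for m in range(1, n_ceps + 1):
        cm = alpha[m - 1] if m <= p else 0.0
        for k in range(max(1, m - p), m):
            cm += (k / m) * c[k - 1] * alpha[m - k - 1]
        c.append(cm)
    return c
```

The cepstrum recursion avoids any FFT, which matters on the 8-bit MCU targets the patent describes.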
C. Recognition of speaker-independent voice commands:
The recognition process adopts a two-stage structure, divided into coarse recognition and fine recognition. Commands that are not easily confused are decided by coarse recognition alone; easily confused commands are recognized by a more refined model;
D. Speaker-adaptive learning for speaker-independent speech recognition:
When a speaker has an accent or speaks non-standardly, the recognition system may misjudge, so a speaker-adaptation method is used to adjust the recognition templates; the said adaptation method uses maximum a posteriori probability, progressively revising the recognition-template parameters by iteration;
E. Voice prompting:
Voice prompting uses speech synthesis and speech coding/decoding technology; but considering the limited system resources, the system overhead should be reduced as much as possible. The synthesis-model parameters are analyzed and extracted on a computer and then stored in the chip, so the speech-analysis parameter extraction method can be very complicated, thereby guaranteeing high-quality synthetic speech; however, the synthesis-model parameters to be stored should be as few as possible, and the synthesis method as simple as possible. The speech synthesis model of the present invention is a multi-pulse speech synthesis model.
The level-gain control in the said speech feature extraction may comprise: judging the sampling precision of the input speech signal and, if it is not high enough, adjusting the amplification of the speech by adaptive level control to improve the sampling precision. The said endpoint detection method searches for the silent segments according to preset endpoint thresholds and determines the start and end points of the speech. The said cepstrum parameters are calculated from the linear prediction (LPC) model of the speech.
The recognition-model training process in the said pre-training method may be: establish a database of the voice commands to be recognized, then extract the speech characteristic parameters, the extraction process being the same as above. Through an iterative learning process, the recognition parameters of the polynomial classification model are extracted. The learning process adopts a suboptimal strategy, adjusting the parameters of the polynomial classification model one step at a time until all required model parameters have been calculated. The whole training process is completed on a computer, and the resulting model parameters are finally stored in the speech recognition chip as the recognition model; this is where the method differs from speaker-dependent speech recognition;
The recognition process in the said voice-command recognition method may be: calculate the output of each polynomial classification model, and take the model with the maximum output probability as the recognition result. Recognition uses the two-stage coarse/fine scheme; the difference between the stages is that the coarse model has fewer parameters and is fast, while the fine model has more parameters. Fine recognition improves the recognition rate for easily confused commands.
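A minimal sketch of this two-stage decision, under the assumption that each class score is the inner product of a per-class parameter vector with the polynomial feature vector (the fine model being simply a larger model of the same form), might look like:

```python
def scores(A, x):
    # A: list of per-class parameter vectors; x: polynomial feature
    # vector X(V). Returns one score f_i(V) = a_i . x per class.
    return [sum(ai * xi for ai, xi in zip(a, x)) for a in A]

def recognize(A_coarse, A_fine, x_coarse, x_fine, confusable, margin=0.1):
    # Coarse pass: cheap model, accepted when the result is clear-cut.
    s = scores(A_coarse, x_coarse)
    best = max(range(len(s)), key=lambda i: s[i])
    ranked = sorted(s, reverse=True)
    if best not in confusable and ranked[0] - ranked[1] >= margin:
        return best
    # Fine pass: the larger model re-scores easily confused commands.
    s_fine = scores(A_fine, x_fine)
    return max(range(len(s_fine)), key=lambda i: s_fine[i])
```

The `margin` acceptance rule and the `confusable` set are our stand-ins for the patent's confidence measure, which is described later in the embodiment.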
The adaptation in the said voice-command recognition method uses model-adaptive adjustment; for misrecognized voice commands, the recognition rate improves markedly after adaptive learning. The adaptation process may be: input the speech data to be adapted to, and use an adaptive method based on maximum a posteriori probability to adjust the speech recognition parameters iteratively, so that maximal discriminability between the models is maintained.
The speech synthesis method used in the said voice prompting may specifically comprise the following steps:
(1) Use a multi-pulse speech synthesis model, extracting the LPC parameters and excitation parameters of the synthesis model on a PC by an optimization method.
(2) Quantize the LPC parameters by vector quantization with 10 bits; the number of excitation pulses of the LPC model is 25; a first-order pitch-period loop is adopted; these parameters are scalar-quantized with 189 bits.
(3) To guarantee smooth synthetic speech, perform linear interpolation between frames.
The present invention has the following characteristics:
(1) The present invention provides small-to-medium-vocabulary speaker-independent speech recognition methods based on a special speech recognition chip. These methods have the characteristics of low complexity, high recognition accuracy, and good robustness.
(2) Recognition parameters and coding parameters are shared, which significantly reduces the demand on system resources while guaranteeing very high coding quality.
(3) Because an 8-bit MCU or 16-bit DSP is adopted as the core, with 10-bit linear A/D and D/A, the chip has outstanding features such as small volume, light weight, low power consumption, and low cost, and has great application value in fields such as communications, industrial control, intelligent home appliances, intelligent toys, and automotive electronics.
(4) The number of voice commands recognized by the present invention is 10 on an 8-bit core and 30 on a 16-bit chip. The recognition rate is above 95% on 8-bit chips and above 98% on 16-bit chips.
Brief description of the drawings:
Fig. 1 is a schematic block diagram of the usual speech recognition process.
Fig. 2 is a schematic diagram of the hardware composition of a general speech chip.
Fig. 3 is a schematic diagram of the overall method of the embodiment of the invention.
Fig. 4 is a block diagram of the endpoint detection method of the present embodiment.
Fig. 5 is an overall flow chart of the speaker-independent training process of the present embodiment.
Fig. 6 is a flow chart of the speaker-independent isolated-word recognizer of the present embodiment.
Fig. 7 is a detailed flow chart of the recognition decision process of the present embodiment.
An embodiment of the speaker-independent speech recognition and voice prompt method based on a special speech recognition chip proposed by the present invention is described in detail below in conjunction with the figures:
The entire method of the embodiment of the invention is composed as shown in Figure 3. The whole process can be divided into: (1) A/D sampling, followed by pre-emphasis of the sampled speech to raise the energy of the high-frequency signal, and windowed framing; (2) extraction of the speech characteristic parameters (including the endpoint-detection parameters and the recognition-model parameters); (3) endpoint detection to determine the effective speech parameters; (4) dynamic segmentation of the effective speech characteristic parameters, to reduce the template storage space; (5) speech recognition by template comparison through pattern matching, and output of the recognition result. Each step is specified in detail as follows.
1. Speech recognition feature extraction:
(1) The voice signal is first low-pass filtered and then sampled by a 10-bit linear A/D converter into original digital speech; 10-bit A/D is adopted in order to reduce the cost of the chip. Because the A/D precision is low, the gain-controlled amplifier is controlled according to the energy and overload condition of the input signal, so as to make full use of the dynamic range of the 10-bit A/D and obtain the highest possible sampling precision.
(2) The original digital speech signal is subjected to spectrum shaping and frame-splitting windowing to guarantee the quasi-stationarity of the framed speech. The pre-emphasis filter is taken as 1 - 0.95z^-1, and the lift level in the zero-crossing-rate calculation is taken as 4.
(3) Speech features are extracted from the framed speech; they comprise the LPCC cepstrum coefficients, energy, zero-crossing rate, etc., and are stored for the subsequent dynamic segmentation. One very important step, the calculation of the correlation values, must be completed in real time. Since the 8-bit single-chip microcomputer has only 8-bit unsigned multiplication, the correlation values are computed as follows: each 8-bit signed sample s(n) is converted into the unsigned number α(n) = s(n) + 128; the product can then clearly be accumulated in three bytes without overflow (the frame length being no greater than 256).
2. Endpoint detection:
(1) To guarantee the validity of each frame's speech features and eliminate irrelevant noise, endpoint detection and decision must be carried out. The endpoint detection method of the present invention is divided into two steps. First a preliminary decision on the endpoint is made according to the speech signal energy: once the energy exceeds a certain determined value, a preliminary starting point is declared; then, searching onward from this starting point, the voiced frames of larger speech energy are sought to locate the voiced segment. If voiced frames exist, the endpoint decision is basically correct, and the search proceeds forward and backward from the voiced frames to the silent frames to find the start frame of the speech. The search result is output. The endpoint detection block diagram is shown in Figure 4. The basic method is: ZERO_RATE_TH is a threshold on the zero-crossing rate, and ACTIVE_LEVEL, INACTIVE_LEVEL, and ON_LEVEL are thresholds on the energy.
(2) The initial state of the system is the silent state. In the silent state, when the zero-crossing rate exceeds the threshold ZERO_RATE_TH or the energy exceeds the threshold ACTIVE_LEVEL, the system enters the active state; if the energy exceeds the threshold ON_LEVEL, it enters the voiced state directly. This frame is marked as the front endpoint of the speech.
(3) In the active state, if the energy exceeds the threshold ON_LEVEL, the system enters the voiced state; if for several consecutive frames (set by the constant CONST_DURATION) the energy fails to exceed the threshold ON_LEVEL, the system returns to the silent state.
(4) In the voiced state, if the energy falls below the threshold INACTIVE_LEVEL, the system enters the inactive state. This frame is marked as the rear endpoint of the speech.
(5) In the inactive state, if for several consecutive frames (set by the constant CONST_DURATION) the energy remains below the threshold INACTIVE_LEVEL, the speech is finished; otherwise the system returns to the voiced state.
The actual parameter values are as follows: ZERO_RATE_TH is taken as 0.4; ACTIVE_LEVEL is set according to the background noise; INACTIVE_LEVEL is taken as 4 times ACTIVE_LEVEL; ON_LEVEL is taken as 8 times ACTIVE_LEVEL; and CONST_DURATION is set to 20 frames.
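The endpoint-detection state machine and thresholds above can be sketched as follows. The hangover reading — that the return to silence (and the end of speech) is declared when the energy stays below the relevant threshold for CONST_DURATION consecutive frames — is our interpretation of the text, and the per-frame `(energy, zero_crossing_rate)` input format is an assumption:

```python
def detect_endpoints(frames, active_level,
                     zero_rate_th=0.4, const_duration=20):
    # frames: list of (energy, zero_crossing_rate) per frame.
    # Thresholds per the embodiment: INACTIVE = 4x, ON = 8x ACTIVE_LEVEL.
    inactive_level = 4 * active_level
    on_level = 8 * active_level
    state, start, end, hang = "silent", None, None, 0
    for i, (energy, zcr) in enumerate(frames):
        if state == "silent":
            if energy > on_level:
                state, start = "voiced", i            # front endpoint
            elif energy > active_level or zcr > zero_rate_th:
                state, start, hang = "active", i, 0
        elif state == "active":
            if energy > on_level:
                state = "voiced"
            else:
                hang += 1
                if hang >= const_duration:            # false alarm: revert
                    state, start = "silent", None
        elif state == "voiced":
            if energy < inactive_level:
                state, hang, end = "inactive", 0, i   # rear endpoint
        elif state == "inactive":
            if energy < inactive_level:
                hang += 1
                if hang >= const_duration:            # speech finished
                    return start, end
            else:
                state = "voiced"
    return (start, len(frames) - 1) if state in ("voiced", "inactive") else None
```

A command surrounded by silence yields one `(start, end)` frame pair; returning `None` corresponds to no speech being detected.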
3. Dynamic segmentation and weighted averaging of speech features:
(1) The input speech features are dynamically segmented and weight-averaged, raising the proportion of unvoiced-consonant characteristic parameters in recognition and extracting the most important template parameters from the speech features. Speech-feature segmentation is one of the cores of this system's recognition method.
(2) Dynamic segmentation computes the normalized Euclidean distance of the speech characteristic parameters between different frames; when the variation exceeds a certain threshold, that point is deemed an important boundary of the speech features. The speech features within each segment are weight-averaged and kept as the new speech characteristic parameters, and the previous speech features are discarded. By greatly reducing the model parameters, this not only saves storage space but also reduces the computational complexity and improves the system's computation speed.
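A minimal sketch of this segmentation, assuming a plain per-segment mean stands in for the "weighted average" (the patent does not state the weights):

```python
def dynamic_segment(feats, threshold):
    # feats: list of per-frame feature vectors. A segment boundary is
    # declared where the normalized Euclidean distance between adjacent
    # frames exceeds `threshold`; each segment is then averaged into a
    # single vector, shrinking the template.
    def ndist(a, b):
        d = sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
        return d / len(a)                      # normalize by dimension

    segments, current = [], [feats[0]]
    for prev, cur in zip(feats, feats[1:]):
        if ndist(prev, cur) > threshold:
            segments.append(current)
            current = []
        current.append(cur)
    segments.append(current)
    # One averaged vector per segment replaces that segment's frames.
    return [[sum(col) / len(seg) for col in zip(*seg)] for seg in segments]
```

Short unvoiced-consonant segments survive as their own averaged vectors instead of being swamped by long vowels, which is the stated motivation for the step.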
4. Training of the speaker-independent speech recognition templates:
The training of the speaker-independent template parameters is completed on a computer. First, the speech characteristic parameters are extracted; then a polynomial classification model is used, approximating the posterior probability by a polynomial expansion. The order of the polynomial model is related to model accuracy; a quadratic polynomial classification model already achieves very high recognition accuracy. The entire method is as follows:
Let

F(V) = (f_1(V), f_2(V), …, f_10(V))^T = A^T X(V)

where f_i(V) is the polynomial approximating function and X(V) is the polynomial eigenvector, composed of the cross-products between the different components of the speech characteristic vector. Based on an optimization method under the least-mean-square-error (MSE) criterion, the posterior probability is estimated with D(V):

A* = argmin_A E{ ||Y - A^T X||^2 }    (1)

where P is a probability vector and Y = (0, 0, …, 0, 1, 0, …, 0) is the approximating vector of P: only the component of the class corresponding to V is 1, and the other components are 0. The solution satisfying equation (1) is:

E{X X^T} A* = E{X Y^T}    (2)

The training flow of the speaker-independent speech recognition system is shown in Figure 5 and is described in detail as follows:
(1) The polynomial eigenvector X(V) is computed from the input speech characteristic vectors,
where V_ik is the k-th dimensional component of V_i.
(2) The polynomial eigenvectors are divided into K classes, where K is the number of recognition words. Ω is the classifier training set. C_i denotes the i-th class, i = 1, …, K. {X_Ci} denotes all polynomial features of the speech belonging to class i.
(3) To improve training efficiency, the relevant first-order statistic E(X) and second-order statistic E(XX^T) are computed in advance.
(4) Based on the minimum-mean-square-error criterion, a suboptimal optimization method is adopted: each time, the model parameter with the highest discriminability in the polynomial classification model is adjusted, until the accuracy requirement of the model is satisfied; the actually used feature components are selected from the high-dimensional polynomial eigenvector X to compose the classifier training feature vector X*.
(5) Finally, the whole set of polynomial classification model parameters is re-optimized using formula (2), and the system training is finished.
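Formula (2) amounts to a linear least-squares problem: accumulate the moments E{XX^T} and E{XY^T} over the training set, then solve the linear system for each class column of A. The self-contained sketch below uses plain Gaussian elimination (with a tiny ridge for numerical safety) as a stand-in for the patent's suboptimal iterative scheme; the data layout is an assumption:

```python
def train_polynomial_classifier(X_list, labels, num_classes):
    # X_list: polynomial feature vectors X(V); labels: class index per
    # sample. Solves E{XX^T} A = E{XY^T} (formula (2)); Y is one-hot.
    d = len(X_list[0])
    XX = [[0.0] * d for _ in range(d)]
    XY = [[0.0] * num_classes for _ in range(d)]
    for x, lab in zip(X_list, labels):
        for i in range(d):
            for j in range(d):
                XX[i][j] += x[i] * x[j]
            XY[i][lab] += x[i]
    n = len(X_list)
    XX = [[v / n for v in row] for row in XX]
    XY = [[v / n for v in row] for row in XY]
    return [solve(XX, [XY[i][c] for i in range(d)]) for c in range(num_classes)]

def solve(M, b):
    # Gaussian elimination with partial pivoting and a small ridge.
    d = len(b)
    a = [row[:] + [bv] for row, bv in zip(M, b)]
    for i in range(d):
        a[i][i] += 1e-8
    for i in range(d):
        p = max(range(i, d), key=lambda r: abs(a[r][i]))
        a[i], a[p] = a[p], a[i]
        for r in range(i + 1, d):
            f = a[r][i] / a[i][i]
            for c in range(i, d + 1):
                a[r][c] -= f * a[i][c]
    x = [0.0] * d
    for i in range(d - 1, -1, -1):
        x[i] = (a[i][d] - sum(a[i][c] * x[c] for c in range(i + 1, d))) / a[i][i]
    return x
```

Each returned vector is one class's parameter column of A*, so a class score at recognition time is just its inner product with X(V).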
5. Speaker-independent speech recognition:
The flow of speaker-independent speech recognition is shown in Figure 6. The detailed steps are as follows:
(1) Speech recognition features are extracted from the input speech signal; the method is the same as above.
(2) The polynomial eigenvector X(V) is computed.
(3) The output probability value of each polynomial model is calculated,
where α_i is the i-th component of the polynomial classification model parameters A = [α_1, α_2, …, α_K]^T.
(4) The decision by formula (4) selects the result with the maximum output probability. To improve recognition speed and accuracy, the recognition decision is divided into two processes, coarse recognition and fine recognition; the detailed flow is shown in Figure 7. The coarse model has fewer parameters (about 300), so coarse recognition is fast. Speech with a poor confidence measure and easily confused speech must undergo fine recognition; the fine model has more parameters, about 100 more than the coarse model, and is trained in the same way as the coarse model. Coarse recognition is performed first, and its top-3 candidates are sent to the confidence-measure computing module. When the confidence of the recognition result is low or easily confused speech is present, the top-3 coarse candidates are sent to the fine recognition module for further fine recognition, and the fine recognition result is then sent to the confidence-measure module for a further confidence decision. If the fine recognition result still fails the confidence requirement, the system rejects it and prompts the user to re-enter the speech.
(5) The confidence-measure computation is more complicated: the likelihood ratio of the top-choice recognition probability to the average probability of the top-3 candidates, combined with the likelihood ratio of the top-choice probability to the second-choice probability, forms a comprehensive confidence estimate. If this likelihood ratio is less than a certain threshold (about 3; different values can be set according to different environmental noise), the confidence is considered low.
6. Adaptation of the speaker-independent speech recognition model:
(1) The adaptive process is: the speaker performs supervised learning on misrecognized speech, and by adjusting the parameters of the recognition polynomial model in real time, the degree of discrimination between the models is increased. If one adaptation does not achieve the desired result, adaptive learning can be repeated until a satisfactory recognition result is obtained.
(2) The adaptation method revises the recognition templates iteratively; it works on the recognition features, and the other related templates can be adjusted at the same time as the erroneous template is corrected. The adjustment step size α should be less than 0.01, otherwise over-adjustment easily occurs. The adaptive adjustment method is as follows:
where A_{k+1} is the model parameter after updating, A_k is the model parameter before updating, α is the adjustment step size with a value of about 10^-3, and X is the polynomial eigenvector. English digit recognition models were trained on the TI-digit database; the English digit recognition rate for some Chinese speakers' pronunciations was very low (78%), but after adaptive adjustment the recognition rate improved significantly, reaching above 99%.
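The update formula itself survives only as variable descriptions in the text, so the rule below is a hypothetical corrective update consistent with the stated ingredients (new parameters A_{k+1} from old A_k, step size α ≈ 10^-3, polynomial eigenvector X); it is not the patent's exact iteration:

```python
def adapt(A, x, correct, wrong, alpha=1e-3):
    # Hypothetical corrective update on a misrecognized utterance:
    # nudge the correct class's parameters toward the feature vector x
    # and the wrongly chosen class's parameters away from it, widening
    # the discrimination between the two models. Returns a new A.
    A = [row[:] for row in A]
    for i in range(len(x)):
        A[correct][i] += alpha * x[i]
        A[wrong][i] -= alpha * x[i]
    return A
```

Repeated calls on the same utterance mimic the patent's "repeat the adaptive learning until a satisfactory recognition result is obtained"; the small α keeps the templates of the other commands essentially intact.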
7. Voice prompt processing:
(1) A multi-pulse-excited LPC speech synthesis model is adopted. The model parameters are processed, edited, and compressed on a computer in advance, then stored in the ROM of the chip. The LPC analysis frame length is 20 milliseconds. The LPC parameters are vector-quantized with 10 bits; the pitch period is quantized with 5 bits and the pitch predictor coefficient with 3 bits; the number of excitation pulses is 25; each pulse position is quantized with 4 bits; the pulse of maximum amplitude is quantized with 6 bits in the log domain, and the amplitudes of the remaining pulses with 3 bits each in the log domain.
(2) To reduce the number of bits for quantizing the multi-pulse position parameters, the multi-pulse parameter estimation method is improved: the minimum spacing between pulses is limited, and pulse positions may only appear at points that are multiples of 3; the maximum spacing between pulses must not exceed 48. The maximum-spacing constraint cannot be fully satisfied within the pulse-extraction optimization itself; after each pulse-extraction optimization is finished, the 5 pulses of minimum amplitude are removed and re-inserted between pairs of pulses whose spacing is greater than 48. This process is repeated until the pulse-spacing requirement is satisfied.
(3) The parameter decoding process uses table look-up. To guarantee smooth synthetic speech, inter-frame linear interpolation is performed during decoding: the LPC parameters of the first 1/3 and the last 1/3 of each speech frame are linearly interpolated between frames.
(4) To further improve the subjective quality of the speech synthesis, a perceptual weighting filter is used for post-filtering.
Based on the above method, the present embodiment has developed small-to-medium-vocabulary speaker-dependent and speaker-independent speech recognition methods based on a speech recognition chip. The speech recognition chip usually comprises: an audio preamplifier, automatic gain control (AGC), an analog-to-digital (A/D) converter, a digital-to-analog (D/A) converter, an MCU core (8051), a pulse-width modulator (PWM), random access memory (RAM), read-only memory (ROM), and flash memory (FLASH). The ROM stores the speech synthesis method, the speech coding method, the speech recognition training method and the speech recognition method, as well as the prompt speech. The speech recognition templates and the prompt speech are stored in FLASH.
Claims (6)
1. A speaker-independent speech recognition and voice prompt method based on a special speech recognition chip, comprising A/D sampling; spectrum-shaping, windowing, and pre-emphasis processing; characteristic parameter extraction; endpoint detection; speech-recognition-template training; speech-recognition-template matching; recognition-result output; and voice prompting, characterized in that it specifically comprises the following steps:
A. Pre-training for speaker-independent speech recognition:
The training process requires a large speech database and is completed on a PC; the trained templates are then stored in the chip. The training method comprises: adopting a polynomial-based classification technique; representing the parameters of the recognition model by polynomial coefficients; approximating the posterior probability by a polynomial expansion; and obtaining the model parameters by optimized solution of a system of linear equations;
B. Speech-recognition parameter extraction:
(1) the input speech signal is A/D sampled into raw digital speech, with level-gain control applied to ensure high sampling precision;
(2) said raw digital speech signal undergoes spectrum shaping and frame-splitting with windowing, to ensure the quasi-stationarity of each frame;
(3) speech features are extracted from said framed speech; the principal feature parameters are linear prediction cepstral coefficients (LPCC), which are stored for subsequent dynamic segmentation and template extraction;
(4) endpoint detection is performed using the zero-crossing rate and short-time energy of the speech signal, removing silent frames to ensure the validity of each frame's features;
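Steps (1)–(4) of the front end can be sketched as follows. The pre-emphasis constant, LPC order, number of cepstral coefficients, and energy/ZCR thresholds are illustrative assumptions, not values from the patent.

```python
import numpy as np

def preemphasize(x, alpha=0.95):
    """Spectrum-shaping pre-emphasis: y[n] = x[n] - alpha * x[n-1]."""
    x = np.asarray(x, dtype=float)
    return np.append(x[0], x[1:] - alpha * x[:-1])

def levinson_durbin(r, order):
    """Solve the LPC normal equations from autocorrelations r[0..order]."""
    a = np.zeros(order + 1)                      # a[1..order]: predictor coefficients
    e = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / e
        new_a = a.copy()
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        e *= (1.0 - k * k)
    return a[1:]

def lpcc(frame, order=10, n_ceps=12):
    """LPCC of one frame: Hamming window, LPC, then the cepstral recursion
    c_n = a_n + sum_{k=1}^{n-1} (k/n) c_k a_{n-k}."""
    w = np.asarray(frame, dtype=float) * np.hamming(len(frame))
    r = np.array([w[:len(w) - k] @ w[k:] for k in range(order + 1)])
    a = levinson_durbin(r, order)
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= order else 0.0
        for k in range(1, n):
            if 1 <= n - k <= order:
                acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]

def is_speech_frame(frame, energy_th=1e-3, zcr_th=0.4):
    """Endpoint decision: voiced frames have high short-time energy,
    unvoiced frames have a high zero-crossing rate with some energy."""
    frame = np.asarray(frame, dtype=float)
    energy = float(frame @ frame) / len(frame)
    zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
    return energy > energy_th or (zcr > zcr_th and energy > energy_th / 10)
```
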
C. Speaker-independent voice-command recognition:
The recognition process uses a two-stage structure: coarse recognition followed by fine recognition. Commands that are not easily confused are decided by coarse recognition alone, while easily confused commands are further discriminated by a more detailed model, improving both the average recognition speed and the recognition accuracy;
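The two-stage control flow can be sketched as below. The score-margin test for "easily confused" commands, and the callable-model interface, are assumptions for illustration; the patent does not specify how confusability is decided.

```python
def two_stage_recognize(features, coarse_models, fine_models, margin=0.1):
    """Coarse pass: score every command with the small (fast) models.
    Only when the top two coarse scores are within `margin` -- i.e. the
    commands are easily confused -- are the larger fine models consulted
    to re-score just those two candidates."""
    scores = {cmd: m(features) for cmd, m in coarse_models.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if scores[best] - scores[runner_up] < margin:
        fine = {c: fine_models[c](features) for c in (best, runner_up)}
        best = max(fine, key=fine.get)
    return best
```

Because the fine models run only on the small confusable subset, average recognition time stays close to the coarse-only case.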
D. Speaker-adaptive learning for speaker-independent speech recognition:
When a speaker has an accent or speaks non-standardly, the recognition system may misjudge; a speaker-adaptation method is therefore used to adjust the recognition templates. Said adaptation method uses maximum a posteriori probability estimation, iteratively revising the recognition-template parameters;
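A minimal MAP-style update for a template mean is sketched below, assuming Gaussian-mean templates and a prior weight `tau`; both are illustrative choices, not the patent's parameterization. Repeating the update over successive adaptation batches gives the progressive, iterative revision the claim describes.

```python
import numpy as np

def map_adapt_mean(template_mean, adapt_frames, tau=10.0):
    """One MAP update of a template mean: a count-weighted blend of the
    prior (original template) and the adaptation data's sample mean.
    Larger tau trusts the original speaker-independent template more."""
    data = np.asarray(adapt_frames, dtype=float)
    n = data.shape[0]
    prior = np.asarray(template_mean, dtype=float)
    return (tau * prior + n * data.mean(axis=0)) / (tau + n)

def adapt_iteratively(template_mean, batches, tau=10.0):
    """Progressive revision: apply the MAP update batch by batch."""
    m = np.asarray(template_mean, dtype=float)
    for batch in batches:
        m = map_adapt_mean(m, batch, tau)
    return m
```
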
E. Voice prompting:
Voice prompting uses speech synthesis and speech coding/decoding technology. The synthesis-model parameters are analyzed and extracted on a computer and then stored in the chip for use in synthesis; the parameter analysis and extraction method can therefore be very complex, guaranteeing high-quality synthesized speech, while the stored synthesis-model parameters should be as few as possible and the synthesis method as simple as possible. The synthesis model is a multi-pulse speech synthesis model.
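The amplitude/position estimation of a multi-pulse model can be sketched as a greedy analysis-by-synthesis search, shown below without the perceptual weighting a real coder would use; the impulse response, pulse count, and search strategy are illustrative assumptions.

```python
import numpy as np

def multipulse_excitation(target, h, n_pulses):
    """Greedy multi-pulse analysis: choose pulse positions and amplitudes
    one at a time to minimize the residual error between the target and
    the excitation filtered through the synthesis-filter impulse response h."""
    residual = np.asarray(target, dtype=float).copy()
    h = np.asarray(h, dtype=float)
    L = len(residual)
    pulses = []
    for _ in range(n_pulses):
        best = None
        for pos in range(L):
            hh = np.zeros(L)                       # contribution of a unit pulse at pos
            n = min(len(h), L - pos)
            hh[pos:pos + n] = h[:n]
            denom = hh @ hh
            if denom == 0.0:
                continue
            amp = (residual @ hh) / denom          # optimal amplitude for this position
            err = residual @ residual - amp * (residual @ hh)
            if best is None or err < best[0]:
                best = (err, pos, amp, hh)
        _, pos, amp, hh = best
        residual = residual - amp * hh             # subtract this pulse's contribution
        pulses.append((pos, amp))
    return pulses, residual
```

Only the few (position, amplitude) pairs plus the filter parameters need be stored per frame, matching the claim's goal of minimal on-chip storage.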
2. The speaker-independent speech recognition and voice prompt method of claim 1, characterized in that the level-gain control in said feature extraction comprises: judging the sampling precision of the input speech signal and, if it is insufficient, adjusting the amplification of the speech by adaptive level control to improve the sampling precision; said endpoint-detection method searches out silent segments against preset endpoint thresholds to determine the start and end points of the speech; and said cepstral parameters are computed from the linear prediction (LPC) model of the speech.
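The adaptive level control of claim 2 can be sketched as a frame-by-frame gain loop; the target RMS, step factor, and gain limits below are illustrative assumptions, not values from the patent.

```python
import numpy as np

def adaptive_level_control(frames, target_rms=0.25, step=1.1,
                           min_gain=1.0, max_gain=64.0):
    """Frame-by-frame adaptive gain: raise the gain while the sampled
    level is below target (low precision), lower it when the level is
    well above target (risk of clipping)."""
    gain = min_gain
    out = []
    for frame in frames:
        frame = np.asarray(frame, dtype=float)
        rms = np.sqrt(np.mean((gain * frame) ** 2))
        if rms < target_rms and gain * step <= max_gain:
            gain *= step                        # signal too quiet: amplify more
        elif rms > 2 * target_rms and gain / step >= min_gain:
            gain /= step                        # signal too hot: back off
        out.append(gain * frame)
    return out, gain
```
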
3. The speaker-independent speech recognition and voice prompt method of claim 1, characterized in that the recognition-model training in said pre-training method proceeds as follows: a database of the voice commands to be recognized is built, and feature parameters are extracted from the speech by the same procedure as above; the identification parameters of the polynomial classification model are then extracted through an iterative learning process; the learning process adopts a suboptimal strategy, adjusting the parameters of the polynomial classification model at each iteration until the desired model parameters are obtained; the whole training process is performed on a computer, and the resulting model parameters are finally stored in the dedicated speech recognition chip as the recognition model; this is what distinguishes the method from speaker-dependent speech recognition.
4. The speaker-independent speech recognition and voice prompt method of claim 1, characterized in that the recognition stage of said voice-command recognition method is: the output of each polynomial classification model is computed, and the model with the maximum output probability is taken as the recognition result; recognition uses a two-stage structure of coarse and fine recognition, the difference being that the coarse models have fewer parameters and recognize quickly, while the fine models have more parameters; fine recognition improves the recognition rate for easily confused commands.
5. The speaker-independent speech recognition and voice prompt method of claim 1, characterized in that the adaptation in said voice-command recognition uses a model-adaptation technique; for misrecognized voice commands, the recognition rate improves markedly after adaptive learning. The adaptation process is: speech data for adaptation are input, and an approach based on maximum a posteriori probability iteratively adjusts the speech-recognition parameters, so that the models remain maximally discriminable from one another.
6. The speaker-independent speech recognition and voice prompt method of claim 1, characterized in that said voice prompting adopts an improved multi-pulse speech synthesis method, comprising: an estimation method for the multi-pulse amplitudes and positions; and an interpolation method for the inter-frame model parameters.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB001055488A CN1141696C (en) | 2000-03-31 | 2000-03-31 | Non-particular human speech recognition and prompt method based on special speech recognition chip |
Publications (2)
Publication Number | Publication Date |
---|---|
CN1264887A true CN1264887A (en) | 2000-08-30 |
CN1141696C CN1141696C (en) | 2004-03-10 |
Family
ID=4577765
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNB001055488A Expired - Fee Related CN1141696C (en) | 2000-03-31 | 2000-03-31 | Non-particular human speech recognition and prompt method based on special speech recognition chip |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN1141696C (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101067929B (en) * | 2007-06-05 | 2011-04-20 | 南京大学 | Method for enhancing and extracting phonetic resonance hump trace utilizing formant |
CN102314877A (en) * | 2010-07-08 | 2012-01-11 | 盛乐信息技术(上海)有限公司 | Voiceprint identification method for character content prompt |
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1302454C (en) * | 2003-07-11 | 2007-02-28 | 中国科学院声学研究所 | Method for rebuilding probability weighted average deletion characteristic data of speech recognition |
CN1296886C (en) * | 2003-11-28 | 2007-01-24 | 国际商业机器公司 | Speech recognition system and method |
CN1741131B (en) * | 2004-08-27 | 2010-04-14 | 中国科学院自动化研究所 | Method and apparatus for identifying non-particular person isolating word voice |
CN1300763C (en) * | 2004-09-29 | 2007-02-14 | 上海交通大学 | Automatic sound identifying treating method for embedded sound identifying system |
CN1787070B (en) * | 2005-12-09 | 2011-03-16 | 北京凌声芯语音科技有限公司 | On-chip system for language learner |
CN101339765B (en) * | 2007-07-04 | 2011-04-13 | 黎自奋 | National language single tone recognizing method |
CN103578470A (en) * | 2012-08-09 | 2014-02-12 | 安徽科大讯飞信息科技股份有限公司 | Telephone recording data processing method and system |
CN103578470B (en) * | 2012-08-09 | 2019-10-18 | 科大讯飞股份有限公司 | A kind of processing method and system of telephonograph data |
US10431212B2 (en) | 2013-06-26 | 2019-10-01 | Cirrus Logic, Inc. | Speech recognition |
US11335338B2 (en) | 2013-06-26 | 2022-05-17 | Cirrus Logic, Inc. | Speech recognition |
CN104252860B (en) * | 2013-06-26 | 2019-07-23 | 思睿逻辑国际半导体有限公司 | Speech recognition |
CN104252860A (en) * | 2013-06-26 | 2014-12-31 | 沃福森微电子股份有限公司 | Speech recognition |
CN104990553A (en) * | 2014-12-23 | 2015-10-21 | 上海安悦四维信息技术有限公司 | Hand-held vehicle terminal C-Pad intelligent navigation system and working method thereof |
CN108062866B (en) * | 2015-01-29 | 2020-12-22 | 四川蜀天信息技术有限公司 | Navigation system, automobile and working method for judging road traffic capacity according to images |
CN108062866A (en) * | 2015-01-29 | 2018-05-22 | 邹玉华 | Navigation system, automobile and the method for work of road passage capability are judged according to image |
CN104835495B (en) * | 2015-05-30 | 2018-05-08 | 宁波摩米创新工场电子科技有限公司 | A kind of high definition speech recognition system based on low-pass filtering |
CN104835495A (en) * | 2015-05-30 | 2015-08-12 | 宁波摩米创新工场电子科技有限公司 | High-definition voice recognition system based on low pass filter |
CN105895078A (en) * | 2015-11-26 | 2016-08-24 | 乐视致新电子科技(天津)有限公司 | Speech recognition method used for dynamically selecting speech model and device |
CN106205606A (en) * | 2016-08-15 | 2016-12-07 | 南京邮电大学 | A kind of dynamic positioning and monitoring method based on speech recognition and system |
CN106356055A (en) * | 2016-09-09 | 2017-01-25 | 华南理工大学 | System and method for synthesizing variable-frequency voice on basis of sinusoidal models |
CN106356055B (en) * | 2016-09-09 | 2019-12-10 | 华南理工大学 | variable frequency speech synthesis system and method based on sine model |
CN106898355A (en) * | 2017-01-17 | 2017-06-27 | 清华大学 | A kind of method for distinguishing speek person based on two modelings |
CN106898355B (en) * | 2017-01-17 | 2020-04-14 | 北京华控智加科技有限公司 | Speaker identification method based on secondary modeling |
CN108172242A (en) * | 2018-01-08 | 2018-06-15 | 深圳市芯中芯科技有限公司 | A kind of improved blue-tooth intelligence cloud speaker interactive voice end-point detecting method |
CN111385462A (en) * | 2018-12-28 | 2020-07-07 | 上海寒武纪信息科技有限公司 | Signal processing device, signal processing method and related product |
WO2020244153A1 (en) * | 2019-06-05 | 2020-12-10 | 平安科技(深圳)有限公司 | Conference voice data processing method and apparatus, computer device and storage medium |
CN112307253A (en) * | 2020-10-30 | 2021-02-02 | 上海明略人工智能(集团)有限公司 | Method and system for automatically generating voice file based on preset recording title |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN1141696C (en) | Non-particular human speech recognition and prompt method based on special speech recognition chip | |
CN1123862C (en) | Speech recognition special-purpose chip based speaker-dependent speech recognition and speech playback method | |
CN1185626C (en) | System and method for modifying speech signals | |
CN103971685B (en) | Method and system for recognizing voice commands | |
CN1188831C (en) | System and method for voice recognition with a plurality of voice recognition engines | |
CN100350453C (en) | Method and apparatus for robust speech classification | |
CN1013525B (en) | Real-time phonetic recognition method and device with or without function of identifying a person | |
KR100766761B1 (en) | Method and apparatus for constructing voice templates for a speaker-independent voice recognition system | |
CN101727901B (en) | Method for recognizing Chinese-English bilingual voice of embedded system | |
CN101030369A (en) | Built-in speech discriminating method based on sub-word hidden Markov model | |
CN1750124A (en) | Bandwidth extension of band limited audio signals | |
CN1160450A (en) | System for recognizing spoken sounds from continuous speech and method of using same | |
CN106782521A (en) | A kind of speech recognition system | |
CN1920947A (en) | Voice/music detector for audio frequency coding with low bit ratio | |
CN1300049A (en) | Method and apparatus for identifying speech sound of chinese language common speech | |
CN1924994A (en) | Embedded language synthetic method and system | |
CN105679306B (en) | The method and system of fundamental frequency frame are predicted in speech synthesis | |
CN1624766A (en) | Method for noise robust classification in speech coding | |
WO2000046791A1 (en) | Voice recognition rejection scheme | |
US8219391B2 (en) | Speech analyzing system with speech codebook | |
CN1787070A (en) | Chip upper system for language learner | |
CN1300763C (en) | Automatic sound identifying treating method for embedded sound identifying system | |
CN1201284C (en) | Rapid decoding method for voice identifying system | |
CN101819772B (en) | Phonetic segmentation-based isolate word recognition method | |
CN1628337A (en) | Speech recognizing method and device thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C06 | Publication | ||
PB01 | Publication | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
C19 | Lapse of patent right due to non-payment of the annual fee | ||
CF01 | Termination of patent right due to non-payment of annual fee |