CN1490787A - Phonetic recognition method based on phonetic intensification - Google Patents
- Publication number
- CN1490787A (application numbers CNA031570739A, CN03157073A)
- Authority
- CN
- China
- Prior art keywords
- voice
- comb filter
- training
- recognition method
- filter
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Electrically Operated Instructional Devices (AREA)
Abstract
The present invention discloses a speech recognition method based on speech enhancement, comprising the steps of: (1) training a hidden Markov model with training data; (2) recognizing test data with the trained hidden Markov model; wherein both the training data in step (1) and the test data in step (2) are subjected to speech enhancement. Because the method enhances the fundamental tone and its harmonics in both the training data and the test data, the mismatch between the enhanced test speech and the model is reduced to a minimum and the accuracy of speech recognition is improved.
Description
Technical field
The present invention relates to speech recognition technology in the field of computer applications, and more particularly to a speech recognition method based on speech enhancement.
Background technology
When a person produces a voiced sound, the vocal cords vibrate, and the frequency of this vibration is called the fundamental frequency. The fundamental frequency is one of the most important parameters of a speech signal. Estimating the pitch period from windowed short-time speech frames is an important step in many fields, such as speech coding and decoding, speech recognition, speaker verification and identification, and aids for people with physiological impairments. To describe the fundamental tone, the notions of pure tone and complex tone are introduced here. A pure tone is a sound wave consisting of a single sinusoidal oscillation; a complex tone is a sound composed of several sinusoids, in which the greatest common divisor of the component frequencies is called the fundamental frequency, and the corresponding sound-wave component is called the fundamental tone. Sinusoidal components whose frequencies are integer multiples of the fundamental frequency are called harmonics (or overtones). Both musical sounds and the voiced sounds in speech can be regarded approximately as complex tones containing many harmonic components.
Most noise encountered in practice is broadband noise. A comb filter can therefore be used to enhance the fundamental tone and its harmonic components in speech while keeping the other frequency components unchanged, thereby achieving speech enhancement.
Research on the fundamental tone began relatively early both in China and abroad. Some researchers have proposed separating speech by tracking the fundamental frequency, or directly enhancing the voiced parts of speech (see document [1]: Yao Tianren. Digital Speech Processing. Wuhan: Huazhong University of Science and Technology Press, 1999). The approach can be broadly divided into the following steps:
1) Obtain the fundamental frequency of every frame of the speech signal with a pitch detection algorithm. Pitch detection has been studied since the 1970s, beginning with the autocorrelation algorithm of Rabiner L.R.
2) According to the fundamental frequency, determine the delay parameter of a comb filter so that the filter peaks correspond to the fundamental tone and harmonic frequencies of the speech signal; after filtering, the speech is enhanced.
By adjusting the delay parameter of the comb filter according to the fundamental frequency of the speech, the fundamental tone and its harmonics are enhanced while the other frequency components remain unchanged, so the noise is relatively weakened and speech enhancement is achieved. However, because this method enhances only the voiced parts of the speech, it changes the relative energies of the unvoiced and voiced sounds.
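The two background steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the fundamental frequency f_b of the frame is already known from a pitch detector and that the delay is an integer number of samples.

```python
import numpy as np

def comb_enhance(frame, f_s, f_b, a=0.5):
    """Enhance the fundamental tone and its harmonics of one voiced frame
    with the simplest FIR comb filter y(i) = x(i) + a*x(i-D)."""
    D = int(round(f_s / f_b))    # delay = one pitch period in samples
    y = frame.astype(float)      # copy; first D samples have no echo yet
    y[D:] += a * frame[:-D]      # add the echo delayed by D samples
    return y

# A 160-sample frame of a 160 Hz tone sampled at 16 kHz: D = 100, and the
# delayed copy is exactly in phase, so samples from index D on gain 1 + a.
f_s, f_b = 16000, 160.0
frame = np.sin(2 * np.pi * f_b * np.arange(160) / f_s)
out = comb_enhance(frame, f_s, f_b, a=0.5)
```

At the fundamental and its harmonics the delayed copy adds constructively, which is exactly the comb-filter peak described in step 2).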
An existing speech recognition method trains a hidden Markov model (HMM) with training data and then recognizes test data with the trained model. If speech enhancement is applied directly to the test data alone, the change in the relative energies of unvoiced and voiced sounds in the enhanced speech causes a mismatch between it and the recognition model, which lowers the recognition accuracy.
Summary of the invention
The object of the present invention is to overcome the shortcomings of existing speech recognition methods by applying speech enhancement techniques to speech recognition, thereby providing a speech recognition method based on speech enhancement.
To achieve the above object, the speech recognition method based on speech enhancement provided by the invention comprises the steps of:
(1) training a hidden Markov model with training data;
(2) recognizing test data with the trained hidden Markov model;
wherein both the training data in step (1) and the test data in step (2) are subjected to speech enhancement.
The speech enhancement is performed by comb filtering with a comb filter. The comb filter is an FIR comb filter or an IIR comb filter, and its enhancement factor lies between 1.3 and 1.7.
Because the speech recognition method of the invention enhances the fundamental tone and its harmonics in both the training data and the test data, the mismatch between the enhanced test speech and the model is reduced as far as possible, and the accuracy of speech recognition is improved.
Description of drawings
Fig. 1 shows the amplitude-frequency response and the zeros of the FIR comb filter transfer function;
Fig. 2 shows the amplitude-frequency response and the pole-zero plot of the IIR comb filter transfer function;
Fig. 3 is a schematic diagram of the periodic extension of the speech data in IIR comb filter speech enhancement;
Fig. 4 compares the spectrograms of a speech segment, where (a) is the spectrogram of the noisy speech and (b) is the spectrogram of the same speech after speech enhancement.
Embodiment
The invention is described in further detail below with reference to the drawings and a specific embodiment.
In this embodiment, both the training data and the test data to be recognized are comb-filtered with a comb filter to enhance the speech. The two classes of comb filters are introduced first.
1) FIR comb filter
The simplest comb filter can be regarded as the superposition of a signal and its reflected echo:
y(i) = x(i) + a·x(i−D)    (1)
where a is the attenuation coefficient with |a| ≤ 1, and D is the delay of the reflected signal. The transfer function of the FIR comb filter is
H(z) = 1 + a·z^(−D)    (2)
Its amplitude-frequency response is
|H(ω)| = |1 + a·e^(−jωD)| = √(1 + 2a·cos(ωD) + a²)    (3)
where ω is the angular frequency.
If the signal sampling rate is f_s, the filter presents peaks at integer multiples of the fundamental frequency f_1 = f_s/D; that is, |H(ω)| reaches its maximum 1 + a when ω = 2kπ/D. When ω = (2k+1)π/D, the transfer function has a zero, corresponding to the minimum 1 − a of |H(ω)|. Fig. 1 shows the amplitude-frequency response and zeros of the FIR comb filter transfer function.
2) IIR comb filter
The transfer function of the IIR comb filter is
H(z) = (1 − b·z^(−D)) / (1 − a·z^(−D)),  0 < b < a < 1    (4)
The amplitude-frequency response and pole-zero plot of this filter are shown in Fig. 2. Compared with the FIR comb filter, the troughs of its amplitude-frequency response are flatter and its peaks are sharper. When ω_k = 2πk/D the response takes its maximum
|H|_max = (1 − b)/(1 − a)    (5)
and when ω_k = (2k+1)π/D it takes its minimum
|H|_min = (1 + b)/(1 + a)    (6)
where k = 0, 1, …, D−1.
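Formulas (4)–(6) can likewise be checked numerically; a = 0.7, b = 0.5, D = 100 are illustrative assumptions:

```python
import numpy as np

a, b, D = 0.7, 0.5, 100                 # illustrative, 0 < b < a < 1
omega = np.linspace(0.0, np.pi, 100001)
zinv = np.exp(-1j * omega * D)          # z^-D evaluated on the unit circle
H = np.abs((1 - b * zinv) / (1 - a * zinv))

peak = H.max()     # (1-b)/(1-a) at w = 2*pi*k/D
trough = H.min()   # (1+b)/(1+a) at w = (2k+1)*pi/D
```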
When the two classes of comb filters described above are used for speech enhancement, the IIR comb filter has a better amplitude-frequency response characteristic, but its edge effect makes the filtering more complicated; the amplitude-frequency response of the FIR comb filter is poorer, but its edge effect is easy to eliminate. The processing steps for speech enhancement with the IIR comb filter are first introduced in detail:
The peak of the IIR comb filter amplitude-frequency response is given by formula (5); it determines the enhancement factor of the fundamental tone and its harmonics. It is easy to see from Fig. 2 that, in the amplitude-frequency response, most of the region outside the peaks is slightly less than 1, close to the minimum given by formula (6). To keep this part of the signal unchanged, the filter can be multiplied by the compensation coefficient (1 + a)/(1 + b), which gives
H(z) = (1 + a)/(1 + b) · (1 − b·z^(−D)) / (1 − a·z^(−D))    (7)
The delay D in the formula is obtained from
D = f_s / f_b    (8)
where f_s is the signal sampling rate and f_b is the fundamental frequency of the current frame.
Correspondingly, the enhancement factor m is
m = (1 + a)(1 − b) / ((1 + b)(1 − a))    (9)
In actual filtering, because of the edge effect, the output reaches a steady state only after a certain delay. Experiments show that at a sampling rate f_s = 16 kHz and fundamental frequency f_b = 160 Hz, the output becomes stable only after 6000 to 8000 samples, whereas every frame of filtering data in the experiments contains only 160 samples. The frame must therefore first be periodically extended. Define T_d as the cycle length of the extension:
T_d = ceil(160/T_b)·T_b    (10)
where T_b = f_s/f_b is the pitch period, and ceil(A) is the MATLAB function returning the smallest integer not less than A, which guarantees T_d ≥ 160. The frame is then extended over several cycles to obtain a data sequence of about 8000 samples, which is filtered; the first 160 samples of the last extension cycle of the output sequence are taken as the result (see Fig. 3). This completes one filtering pass. The data are processed frame by frame, finally yielding the enhanced speech.
Filtering with the FIR comb filter is simpler than with the IIR filter described above: no periodic extension is needed, but to eliminate the edge effect of the filter, a tail of the previous frame equal in length to the filter must be kept each time. The delay D is still determined by formula (8), and the enhancement factor is the filter's peak gain, m = 1 + a.
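The FIR variant can be sketched in the same spirit (illustrative a = 0.5): the carried-over tail of the previous frame supplies the D samples the comb filter looks back at, so there is no edge effect at the start of the frame.

```python
import numpy as np

def fir_comb_frame(frame, prev_tail, D, a=0.5):
    """y(i) = x(i) + a*x(i-D); prev_tail is the last D samples of the
    previous frame, removing the edge effect at the frame boundary."""
    x = np.concatenate((prev_tail, frame))
    y = x[D:] + a * x[:-D]       # same length as the frame
    return y, frame[-D:]         # output, plus the tail for the next frame

# Two consecutive 160-sample frames of a tone with an 80-sample period
# (D = 80): from sample D of the stream on, every output is (1 + a)*input.
sig = np.sin(2 * np.pi * np.arange(320) / 80)
y1, tail = fir_comb_frame(sig[:160], np.zeros(80), D=80)
y2, _ = fir_comb_frame(sig[160:], tail, D=80)
```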
Fig. 4 compares spectrograms before and after speech enhancement; the difference is evident, and the noise in the enhanced speech is clearly suppressed. The enhancement here was performed with the IIR comb filter. It should be emphasized that, to preserve the information of the unvoiced sounds, the enhancement factor m of the comb filter should be limited to a certain range in practice; experiments show that a value of m between 1.3 and 1.7 is appropriate (depending on the signal-to-noise ratio).
Speech enhancement can be realized by the above method, but the enhancement changes the relative energies of the unvoiced and voiced sounds in the speech, which affects recognition accuracy. To compensate for this, the present invention also comb-filters the training data before HMM (hidden Markov model) training, so that the relative unvoiced/voiced energies of the training data and the test data are similar, thereby reducing the mismatch between the recognized speech and the model and the negative effect of comb filtering. The hidden Markov model obtained in this way is called a speech-enhanced hidden Markov model (SE-HMM, Speech Enhanced Hidden Markov Model).
To illustrate the effect of the method provided by the invention, two models, HMM and SE-HMM, were trained in the experiments below, with 39-dimensional feature vectors and 7 Gaussian mixture densities. The HMM training data were taken from the 863 database: 79 male and 79 female speakers, 650 sentences each. The SE-HMM was trained on the result of comb-filtering the same training data. The test data were out-of-set: 650 sentences per speaker from 2 speakers. Recognition used a full-syllable (404-syllable) network.
First, the adverse effect of speech enhancement itself on recognition is verified on clean speech. The test data are the speech of the 2 speakers; Table 1 gives the experimental results.
Table 1. Comparison of clean-speech recognition results
The first row is the result of recognizing the original clean speech with the HMM (m = 1.0 means no comb-filter enhancement); the second row is the result of recognizing the comb-filtered speech with the SE-HMM (m = 1.3). The experiment shows that although the enhancement changes the original relative unvoiced/voiced energies, measures such as enhancing the voiced parts of the training speech keep its adverse effect very low. It can be expected that, under noisy conditions, recognition after enhancement will be better than before.
Model | Correctness (Word Corr.) | Accuracy (Word Acc.) |
---|---|---|
m=1.0, HMM | 78.12% | 75.83% |
m=1.3, SE-HMM | 77.49% | 75.70% |
Next comes the recognition of noisy speech. The background noise was recorded in the laboratory in advance and includes computer fans, air conditioning, and other noise from outside the window. Speech and noise were mixed in different proportions to obtain speech at different signal-to-noise ratios. Table 2 summarizes the relation between the signal-to-noise ratio SNR, the enhancement factor m, and the recognition rate. Each cell of the experimental results contains two figures: the former is the correctness and the latter the accuracy (Corr./Acc.). The results show that m is related to the SNR: the lower the SNR, the larger the appropriate m. At SNR = 13, m = 1.3 is suitable; at SNR = 6.5, m = 1.5. Correctness and accuracy improve by about 5% and 7% respectively. The first row of the table is the recognition result before enhancement (using the HMM); the bold part is the better result after enhancement (using the SE-HMM). Details are given in Table 2.
Table 2. Comparison of noisy-speech recognition results
Enhancement factor m | Model | SNR=13 | SNR=6.5 |
---|---|---|---|
1.0 | HMM | 50.47% / 30.27% | 28.70% / 3.78% |
1.3 | SE-HMM | 54.51% / 36.69% | 32.23% / 11.21% |
1.5 | SE-HMM | 53.10% / 34.99% | 33.10% / 11.20% |
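The speech/noise mixing used to build these test sets can be sketched as follows (an illustration, not the authors' tooling): the noise is scaled so that the mixture reaches a requested SNR in dB.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that 10*log10(P_speech / P_noise) equals snr_db,
    then return the mixture speech + scaled noise."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Random stand-ins for real recordings, mixed at SNR = 13 dB as in Table 2.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
mixed = mix_at_snr(speech, noise, 13.0)
```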
The above experiments show that the speech recognition method based on speech enhancement is feasible and effective. It adapts to a wide range of noises, places no requirements on the characteristics of the noise (such as stationarity), and can improve the recognition rate to a certain extent.
Claims (4)
1. A speech recognition method based on speech enhancement, comprising the steps of:
(1) training a hidden Markov model with training data;
(2) recognizing test data with the trained hidden Markov model;
characterized in that both the training data in step (1) and the test data in step (2) are subjected to speech enhancement.
2. The speech recognition method according to claim 1, characterized in that the speech enhancement is comb filtering with a comb filter.
3. The speech recognition method according to claim 2, characterized in that the comb filter is an FIR comb filter or an IIR comb filter.
4. The speech recognition method according to claim 2 or 3, characterized in that the enhancement factor of the comb filter lies between 1.3 and 1.7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB031570739A CN1212602C (en) | 2003-09-12 | 2003-09-12 | Phonetic recognition method based on phonetic intensification |
Publications (2)
Publication Number | Publication Date |
---|---|
CN1490787A true CN1490787A (en) | 2004-04-21 |
CN1212602C CN1212602C (en) | 2005-07-27 |
Family
ID=34156986
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNB031570739A Expired - Fee Related CN1212602C (en) | 2003-09-12 | 2003-09-12 | Phonetic recognition method based on phonetic intensification |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN1212602C (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107851444A (en) * | 2015-07-24 | 2018-03-27 | 声音对象技术股份有限公司 | For acoustic signal to be decomposed into the method and system, target voice and its use of target voice |
CN110447239A (en) * | 2017-03-24 | 2019-11-12 | 雅马哈株式会社 | Sound pick up equipment and sound pick-up method |
CN110447239B (en) * | 2017-03-24 | 2021-12-03 | 雅马哈株式会社 | Sound pickup device and sound pickup method |
US11749262B2 (en) | 2019-01-10 | 2023-09-05 | Tencent Technology (Shenzhen) Company Limited | Keyword detection method and related apparatus |
CN111292748A (en) * | 2020-02-07 | 2020-06-16 | 普强时代(珠海横琴)信息技术有限公司 | Voice input system capable of adapting to various frequencies |
CN111292748B (en) * | 2020-02-07 | 2023-07-28 | 普强时代(珠海横琴)信息技术有限公司 | Voice input system adaptable to multiple frequencies |
WO2021217750A1 (en) * | 2020-04-30 | 2021-11-04 | 锐迪科微电子科技(上海)有限公司 | Method and system for eliminating channel difference in voice interaction, electronic device, and medium |
Also Published As
Publication number | Publication date |
---|---|
CN1212602C (en) | 2005-07-27 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
C19 | Lapse of patent right due to non-payment of the annual fee | ||
CF01 | Termination of patent right due to non-payment of annual fee |