CN1400583A - Phonetic recognizing system and method of sensing phonetic characteristics - Google Patents

Phonetic recognizing system and method of sensing phonetic characteristics Download PDF

Info

Publication number
CN1400583A
CN1400583A CN01124051A
Authority
CN
China
Prior art keywords
language
frequency spectrum
vector
input
vectors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN01124051A
Other languages
Chinese (zh)
Inventor
卜令楷
阙志达
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WELBOTECK CO
Original Assignee
WELBOTECK CO
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WELBOTECK CO filed Critical WELBOTECK CO
Priority to CN01124051A priority Critical patent/CN1400583A/en
Publication of CN1400583A publication Critical patent/CN1400583A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

A complete system and method for accurate and robust speech recognition. Three perceptual processing techniques are applied to the Fourier spectrum of speech to obtain a clear perceptual spectrum, which is projected onto a set of reference spectral vectors and fed to a speech recognizer for accurate identification. The invention comprises a perceptual speech processor for processing an input speech spectral vector to produce a perceptual spectrum; a memory device for storing a plurality of reference spectral vectors; and a phonetic feature mapper, coupled to the processor and the memory, for mapping the perceptual spectrum onto the reference spectral vectors.

Description

Perceptual phonetic feature speech recognition system and method
Technical field
The present invention relates generally to automatic speech recognition systems, and more specifically to a perceptual speech processing stage and a vowel-based, invariant phonetic feature regime for achieving accurate and robust automatic speech recognition.
Background art
Automatic speech recognition (ASR) systems have been under development for more than thirty years, and considerable progress has been made. Two significant problems remain. The robustness problem concerns adverse conditions in the speaking environment, such as background noise, signal distortion, and variability in individual pronunciation. The accuracy problem concerns misrecognition of the input speech. Addressing these problems generally demands costly hardware and space, and is therefore often impractical.
For the robustness problem, many existing approaches use electronic and mechanical devices to filter out noise or to improve signal gain, but such systems suffer from computational complexity (for example, composite noise-model spectra) and from the limited effectiveness of detector placement (for example, noise-cancelling microphones). Compared with simple, mechanically oriented noise rejection, human speech perception is relatively robust and achieves high recognition accuracy even in poor environments. For example, for input SNR below 20 dB the recognition accuracy of conventional ASR systems degrades considerably, whereas humans easily recognize speech at SNRs as low as 0 dB. Noise-induced signal distortion only occasionally causes serious human misrecognition (unless the amplitude of the signal itself is too low), and speaker-to-speaker variability (at least for native speakers) generally poses no significant perceptual problem. Many attempts have therefore been made to imitate human speech perception in speech recognition systems, mainly in two forms. The first imitates the functionality of the human auditory system (for example, cochlear and basilar-membrane models), but such systems are complicated by the many feedback paths between the nervous system and the incompletely understood auditory nuclei, so these attempts are sound in theory but limited in practice. The second uses artificial neural networks (ANN) to extract speech features, to process the dynamic, nonlinear speech signal, or to combine with statistical recognizers; ANN systems, however, have the drawback of enormous computational requirements, which makes large-vocabulary systems impractical.
All ASR systems require a spectral analysis model to parameterize the speech signal so that it can be compared with reference spectra for speech recognition. Linear predictive coding (LPC) performs spectral analysis on frames of speech under the so-called all-pole modelling constraint: the spectral representation, generally written X_n(e^{jω}), is constrained to the form σ/A(e^{jω}), where A(e^{jω}) is the p-th order polynomial in the z-transform
A(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + ... + a_p z^{-p}.
The output of the LPC spectral analyzer is a vector of coefficients (the LPC parameters) that parameterize the spectrum of the all-pole model best matching the signal spectrum over the time period of the speech frame. Existing speech recognition systems generally use LPC with the all-pole constraint. However, the pole positions of an all-pole spectrum are usually affected by noise appearing in the valley regions of the spectrum, and when this noise is significant it can degrade the signal markedly.
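Purely as an illustration of the all-pole analysis described above, the following is a minimal numpy sketch of LPC via the autocorrelation method; the frame content, the model order p = 10, and the helper names are the editor's assumptions, not part of the patent.

```python
# A minimal sketch of LPC analysis via the autocorrelation method, for
# illustration only; frame length, order p, and the random test frame
# are arbitrary assumptions, not values taken from the patent.
import numpy as np

def lpc_coefficients(frame: np.ndarray, p: int) -> np.ndarray:
    """Return [1, a_1, ..., a_p] for the all-pole model A(z)."""
    # Autocorrelation of the (windowed) frame, lags 0..p.
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + p]
    # Solve the normal equations R a = -r[1:p+1] (Toeplitz system).
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    a = np.linalg.solve(R, -r[1:p + 1])
    return np.concatenate(([1.0], a))

def all_pole_spectrum(a: np.ndarray, n_freqs: int = 256, sigma: float = 1.0):
    """Evaluate sigma / |A(e^{j w})| on a grid of frequencies."""
    w = np.linspace(0, np.pi, n_freqs)
    A = np.array([np.sum(a * np.exp(-1j * w_k * np.arange(len(a)))) for w_k in w])
    return sigma / np.abs(A)

frame = np.hamming(240) * np.random.randn(240)   # stand-in for a speech frame
a = lpc_coefficients(frame, p=10)
spectrum = all_pole_spectrum(a)
```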
Mandarin contains tens of thousands of distinct characters, each pronounced as a monosyllable, which provides a unique basis for an ASR system. Mandarin (and indeed the other Chinese dialects) is, however, a tonal language in which each syllable carries one of four lexical tones or a neutral tone. There are 408 base syllables and, counting the tonal variations, 1345 distinct tonal syllables in total. The number of unique characters is therefore roughly tens of times the number of pronunciations, producing many homophones that can only be resolved from context. Each base syllable consists of a consonant (initial) phoneme (21 in total) and a vowel (final) phoneme (37 in total). Conventional ASR systems first detect the consonant phoneme, the vowel phoneme, and the tone with different processing techniques; then, to improve recognition accuracy, a group of higher-likelihood candidate syllables is selected and checked against the context for the final choice. Most speech recognition systems known in the art rely chiefly on vowel recognition, because vowels are found to have a substantially higher signal-to-noise ratio than consonants. Accurate vowel recognition therefore enables accurate speech recognition.
Summary of the invention
The present invention is a complete system and method for accurate and robust speech recognition. Three perceptual processing techniques are applied to the Fourier spectrum of speech to obtain a clear perceptual spectrum, and the perceptual spectrum is accurately identified by projecting it onto a set of reference spectral vectors and feeding the result to a speech recognizer. The invention comprises a perceptual speech processor for perceptually processing an input speech spectral vector to produce a perceptual spectrum; a memory device for storing a plurality of reference spectral vectors; and a phonetic feature mapper, coupled to the perceptual speech processor and the memory device, for mapping the perceptual spectrum onto the plurality of reference spectral vectors.
Brief Description Of Drawings
Fig. 1 is a block diagram showing the steps and elements of the speech recognition system according to the present invention;
Fig. 2 illustrates a masking tone and the masking curve produced by the masker;
Fig. 3 is a frequency-domain plot of the minimum audible field (MAF) curve and the equal-loudness contours;
Fig. 4 is a graph showing the relationship between the frequency (Hertz) scale and the Mel scale;
Fig. 5 is a flow chart showing the sequence of perceptual processing steps used to produce the perceptual spectrum according to the present invention;
Fig. 6(a) is the Fourier spectrum of the Mandarin vowel "i" according to the present invention, (b) shows the result of the masking effect, (c) shows the result of the MAF processing, and (d) shows the result of the Mel-scale resampling;
Fig. 7 is an experimental plot of recognition rate versus signal-to-noise ratio (SNR) according to the present invention;
Fig. 8 is a schematic diagram illustrating an embodiment of the masking winner-take-all (WTA) circuit 800 according to the present invention;
Fig. 9 illustrates the piecewise-linear resistor PWLn used according to the present invention to produce a current that depends on the applied voltage;
Fig. 10 is a plot of the current output of the masker according to the present invention;
Fig. 11 is a plot illustrating envelope extraction by the node voltages corresponding to the different PWLs according to the present invention;
Fig. 12 is a schematic of the overall structure of a single masking WTA cell according to a specific embodiment of the present invention;
Fig. 13 is a spectrogram illustrating the difference between the stationary vowel "i" and the non-stationary vowel "ai" according to the present invention;
Fig. 14 shows the spectrum of the non-stationary vowel "ai" on the Mel frequency scale according to the present invention;
Fig. 15(a) shows the projection similarity, proportional to the projection of the input vector x along the direction of a reference vector c(k) with predetermined weights, and Fig. 15(b) shows the case of the spectrally similar reference vowels "i" and "iu";
Fig. 16(a) is a vector diagram illustrating the projection similarity, and Figs. 16(b) and 16(c) illustrate the relative projection similarity according to the present invention;
Fig. 17 is a phonetic feature profile of the Mandarin vowel "ai" according to the present invention;
Fig. 18(a) shows, for the vowel "i" (dark points) and the vowel "iu" (light points), the projection similarity onto a(8) (vertical axis) versus the projection similarity onto a(6) (horizontal axis);
Fig. 18(b) compares the discriminability of the projection similarity alone (without the relative projection similarity) with that of the phonetic feature scheme of the present invention for the same reference spectra;
Fig. 19 is a plot of the "iu" phonetic feature versus the "i" phonetic feature, with λ as a parameter, according to the present invention;
Fig. 20 is a plot of recognition rate versus SNR for an experiment according to the present invention in which white noise was added to the input speech signal but not to any training set;
Fig. 21 is a plot of recognition rate versus SNR for an experiment according to the present invention using the projection similarities of the nine Mandarin reference vowels as input, tested on three noisy utterances;
Fig. 22 is a plot, according to the present invention, of outside recognition rate (%) (using different speakers) versus inside recognition rate (%) (using a single speaker); and
Fig. 23 is a plot, according to the present invention, of noisy-speech recognition rate (%) (environmental noise) versus inside recognition rate (%) (near-ideal listening conditions).
Detailed description of the embodiments
The basic concepts of the present invention derive from the psychology and physiology of human speech production and perception. More specifically, the human perception of noise and of sound, and the robustness to speaker variability, are at least in part functions of the physiology of human speech perception. The invention exploits a perceptual spectrum on the psychological side of speech recognition and a phonetic feature regime on the physiological side; combined, these yield an automatic speech recognition system that is simultaneously robust and accurate. Fig. 1 is a block diagram of the preferred embodiment of the invention, showing the steps and elements of the speech recognition system. The sampled speech 101 is input to a fast Fourier transform (FFT) analyzer 111, which outputs the Fourier spectrum 102 of the sampled speech; the Fourier spectrum is then input to a perceptual speech processor 112, which outputs a perceptual spectrum 103; the perceptual spectrum is then input to a phonetic feature mapper 113, which outputs phonetic features 104; and the phonetic features are then input to a continuous HMM recognizer 114. The perceptual speech processor comprises a masking operator 121, a minimum audible field (MAF) curve normalizer 122, and a Mel-scale resampler 123. The phonetic feature mapper 113 comprises a projection similarity generator 131 and a relative projection similarity generator 132, whose outputs feed a selector 133 that selects between them according to the spectral character of the input spectral vector (that is, whether more than one reference spectral vector has a high projection similarity, as described more fully below).
An automatic speech recognition system samples the speech signal and computes, by the discrete Fourier transform (DFT), the component amplitudes at the sample points of the speech spectrum. The parameterization of the speech waveform produced by a speaker rests on the fact that any wave can be represented by a combination of simple sine and cosine waves; the optimal combination is given by the inverse Fourier transform
g(t) = ∫_{-∞}^{∞} G(f) e^{i2πft} df,
where the Fourier coefficients are obtained by the Fourier transform
G(f) = ∫_{-∞}^{∞} g(t) e^{-i2πft} dt,
which gives, at each frequency f, the relative strength (amplitude) of the corresponding component of the wave, that is, the spectrum of the wave in frequency space. Because a vector likewise has components that can be represented by sine and cosine functions, a speech signal can be described by a spectral vector. For practical computation the discrete Fourier transform is used:
G(n/(Nτ)) = Σ_{k=0}^{N-1} τ · g(kτ) · e^{-i2πkn/N},
where k is the index of each sample value, τ is the interval between readings, and N is the total number of readings (the sample size). The sampled speech 101 is produced by "sampling" the speech waveform, that is, by taking enough points of the waveform that the FFT yields sufficiently accurate amplitude measurements. The fast Fourier transform (FFT) analyzer 111 applies the discrete Fourier transform through a computational shortcut to produce the Fourier spectrum 102 of the wave efficiently; the shortcut exploits the recurring quantities that arise from the periodicity of the trigonometric functions, allowing a result computed once to be reused in other calculations and thereby reducing the total number of calculations required.
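As an illustration of this DFT-based analysis stage (the FFT analyzer 111), a minimal numpy sketch follows; the sampling rate, frame length, window choice, and synthetic test frame are assumptions made for the example, not values specified in the patent.

```python
# A minimal sketch of the FFT analysis stage (element 111): take one short
# frame of sampled speech and compute its magnitude spectrum with the DFT.
# The sample rate, frame length, and synthetic test tone are illustrative
# assumptions, not values from the patent.
import numpy as np

fs = 8000                      # sampling rate (Hz), assumed
frame_len = 256                # samples per analysis frame, assumed
t = np.arange(frame_len) / fs

# Stand-in for one frame of sampled speech (a 1 kHz tone plus noise).
frame = np.sin(2 * np.pi * 1000 * t) + 0.05 * np.random.randn(frame_len)

windowed = frame * np.hamming(frame_len)        # reduce spectral leakage
spectrum = np.fft.rfft(windowed)                # FFT via the DFT shortcut
freqs = np.fft.rfftfreq(frame_len, d=1 / fs)    # frequency of each bin
magnitude_db = 20 * np.log10(np.abs(spectrum) + 1e-12)
```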
The masking effect used in the masking operator 121 is the observed phenomenon that some sounds become inaudible when other, louder sounds occur close to them in time or in frequency. Masking can be measured from human subjective responses. Fig. 2 shows the masking curve (solid line 201) produced by a 1 kHz, 80 dB pure tone (circle 200). Any signal below the solid line 201 is inaudible, and signals with frequencies near the masking tone are restricted more severely, the restriction being greater toward higher frequencies. Fig. 3 shows the minimum audible field (MAF) curve, below which a sound is too weak to be perceived (dashed line 300), together with the equal-loudness contours 301, 302, 303, 304, and 305. To translate objective sound amplitudes into human subjective loudness, the amplitude of each frequency component of the signal is renormalized against the MAF curve as follows:
L (dB) = M (dB) - MAF,
where L and M are respectively the loudness and the amplitude of the frequency component of the sound signal and MAF is the value of the MAF curve at that frequency. In another embodiment of the invention, the amplitude of a given frequency component is renormalized against the equal-loudness contours 301, etc. To describe human subjective pitch sensation, the frequency scale is adjusted to a perceptual frequency scale called the Mel scale; on the Mel scale, the low-frequency bands are weighted more heavily than the high-frequency bands. Fig. 4 shows the relationship between the Hertz (frequency) scale and the Mel scale, given by
Mel = 2595 × log10(1 + f/700),
where f is the signal frequency.
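The Mel mapping above can be captured in a pair of small helper functions; this is an editor's illustration in Python, and the function names are not from the patent.

```python
# Helpers matching the Mel-scale mapping given above; purely illustrative.
import numpy as np

def hz_to_mel(f_hz):
    """Map frequency in Hz to the perceptual Mel scale."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz) / 700.0)

def mel_to_hz(m):
    """Inverse mapping, useful for choosing Mel-spaced resampling points."""
    return 700.0 * (10.0 ** (np.asarray(m) / 2595.0) - 1.0)

# Example: 1 kHz maps to roughly 1000 Mel.
print(hz_to_mel(1000.0))
```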
In a specific embodiment of the invention, the sequence of perceptual processing steps used to produce the perceptual spectrum is shown in the flow chart of Fig. 5. Step 501 produces the FFT result, which is input to step 502; step 502 removes all frequency components of the sound signal that are masked, according to the maskers in the previous and the current frame, by neighbouring louder sounds. Step 503 renormalizes the amplitude of each frequency component of the sound signal against the MAF curve, and step 504 resamples the frequency components onto the Mel scale. The order of the steps is chosen for computational efficiency and need not follow the same sequence as the auditory pathway; those skilled in the art will appreciate that any ordering of steps 501, 502, 503, and 504 falls within the intended scope of the invention. The results of steps 501, 502, 503, and 504 are shown in Fig. 6, where (a) is the Fourier spectrum of the Mandarin vowel "i", (b) is the result of the masking effect of step 502, (c) is the result of the MAF processing of step 503, and (d) is the result of the Mel-scale resampling of step 504. Fig. 6(b) shows that the masking effect removes most of the frequency components between 400 Hz and 2 kHz, significantly reducing the amount of information to be processed and removing a significant amount of background noise. Fig. 6(c) shows that the low- and high-frequency components are strongly attenuated, and Fig. 6(d) shows the perceptual spectrum of the example vowel "i" according to the preferred embodiment of the invention. In another embodiment, the low-frequency components, which carry the most vowel information, are sampled more finely than the other frequencies. The final perceptual spectrum retains only the envelope of the spectrum, so that the important information about the shape of the vocal tract is conveyed. Pitch information is also advantageously removed, because it is not essential for vowel recognition. Step 502, the masking effect, differs from the existing all-pole spectral model: the all-pole model produces smooth, concave valleys in the spectrum, whereas the present invention produces sharp edges. When the spectrum is contaminated by noise, the pole positions of an all-pole spectrum are generally affected by the noise appearing in the valley regions. In the present invention, the noise in most of the valley regions is removed by the masker, so a cleaner signal is obtained.
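The four steps of Fig. 5 can be sketched in software as follows. This is a rough illustration only, under the editor's assumptions: the masking-spread constants, the placeholder MAF curve, and the number of Mel points are invented for the example, and the patent itself realizes the masking stage with the analog WTA circuit described below.

```python
# A rough sketch of the perceptual-spectrum pipeline of Fig. 5:
# masking (step 502), MAF renormalization (step 503), Mel resampling (step 504).
import numpy as np

def apply_masking(mag_db, spread_db_per_bin=8.0):
    """Zero out components that fall below a crude masking curve built by
    spreading each peak toward its neighbours; the slower upward decay makes
    masking extend further toward high frequencies, as observed."""
    masker = mag_db.copy()
    for i in range(1, len(masker)):                      # spread toward higher bins
        masker[i] = max(masker[i], masker[i - 1] - spread_db_per_bin)
    for i in range(len(masker) - 2, -1, -1):             # spread toward lower bins, faster decay
        masker[i] = max(masker[i], masker[i + 1] - 3 * spread_db_per_bin)
    return np.where(mag_db >= masker, mag_db, -np.inf)   # masked bins removed

def maf_normalize(mag_db, freqs, maf_curve):
    """Subtract the minimum-audible-field value at each frequency (L = M - MAF)."""
    return mag_db - maf_curve(freqs)

def mel_resample(loudness_db, freqs, n_bands=64):
    """Resample the loudness spectrum at Mel-spaced frequency points."""
    mel = 2595.0 * np.log10(1.0 + freqs / 700.0)
    mel_points = np.linspace(mel[0], mel[-1], n_bands)
    finite = np.isfinite(loudness_db)
    return np.interp(mel_points, mel[finite], loudness_db[finite])

# Example wiring with the FFT sketch above (flat placeholder MAF curve):
# perceptual = mel_resample(
#     maf_normalize(apply_masking(magnitude_db), freqs,
#                   maf_curve=lambda f: np.zeros_like(f)),
#     freqs)
```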
Fig. 7 plots measured recognition rate against signal-to-noise ratio (SNR). Compared with the FFT spectral-envelope curve (SE), the perceptual-spectrum curve (PS) achieves a high recognition rate at significantly lower SNR. The masking effect alone, and the MAF renormalization combined with masking, also raise the recognition rate significantly and reduce noise relative to SE.
Noise masking is the phenomenon whereby a weaker tone becomes inaudible when a louder tone occurs nearby in time or frequency. Auditory nerve fibres are known to be arranged in the order of their resonance frequencies (tonotopic organization), and the perceptual masking effect corresponds to the inhibition of nearby frequency components by laterally adjacent, activated auditory nerve fibres. The activity of a neuron depends on its own input and on the inhibitory and excitatory influences of its neighbours: a neuron with a strong output inhibits its lateral neighbours through its synaptic connections. Suppose neuron i receives the strongest input stimulus; neuron i then inhibits its neighbours most strongly and excites itself most strongly. Because the other neurons in the region cannot compete with neuron i (they are "silenced"), only neuron i produces an output. The surviving neuron i is called the "winner" of a so-called winner-take-all (WTA) neural network. Such a network reasonably extends only over a localized region, because the interaction becomes weak for more distant neurons. A "global" model of a WTA network is a circuit with n neurons, each represented by two nMOS transistors, all coupled at a common node. When the input stimuli are applied as currents to the transistors in parallel, the voltage level of the node is determined by the transistor (neuron) with the largest input current. In equilibrium, the bias current flows through the winner neuron and effectively suppresses the output currents of all other neurons. By separating the transistors with series resistors and biasing each transistor, the currents can be localized.
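For readers who prefer a software picture of this lateral-inhibition idea, a toy winner-take-all sketch follows; the update rule, neighbourhood size, and inhibition constant are the editor's assumptions and do not model the transistor circuit of Figs. 8 to 12.

```python
# A purely illustrative software analogue of localized winner-take-all
# lateral inhibition; constants are arbitrary, not circuit values.
import numpy as np

def local_wta(inputs, neighbourhood=3, inhibition=0.5, n_iter=20):
    """Iteratively let strong units suppress weaker units within a window."""
    act = np.asarray(inputs, dtype=float).copy()
    for _ in range(n_iter):
        suppressed = act.copy()
        for i in range(len(act)):
            lo, hi = max(0, i - neighbourhood), min(len(act), i + neighbourhood + 1)
            strongest = act[lo:hi].max()
            if act[i] < strongest:                 # a louder neighbour wins
                suppressed[i] -= inhibition * (strongest - act[i])
        act = np.clip(suppressed, 0.0, None)       # activity cannot go negative
    return act

spectrum = np.array([1.0, 2.0, 8.0, 3.0, 1.5, 0.5, 6.0, 2.0])
print(local_wta(spectrum))   # local peaks survive, masked neighbours tend to zero
```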
Fig. 8 illustrates one embodiment of the winner-take-all masking circuit 800 according to the invention. A current source I_k feeds an input current into the nMOS transistor pair T_1k, T_2k, producing the transistor voltage V_k and the node voltage V_Ck. Piecewise-linear resistors PWLn are coupled in series between the nodes 801, 802, 803, and these nodes are coupled to diode-connected nMOS transistors T_3k. The piecewise-linear resistor PWLn produces a current that depends on the applied voltage as shown in Fig. 9, giving the asymmetric suppression characteristic of the observed masking effect (see Fig. 2). The experiments were carried out with SPICE simulations of the unit cells (neuron/transistor pairs). Fig. 10 is a plot of the current output of the masker according to the invention, generated by applying a simple tone of 700 nA to neuron number 30 and 100 nA to the other cells; the asymmetry of the observed masking effect is reproduced. A speech spectrum input to the invention produces winning spectral components (high output current) that not only suppress the neighbouring spectral components but also absorb the neighbouring bias currents, thereby increasing the output current of the "winner" and increasing the effectiveness of formant extraction. A "formant" is a defining feature (a peak in the sound spectrum); the more pronounced the formants, the better the speech recognition. Moreover, the components are sharply resolved, each being a harmonic of the fundamental frequency. The information used to distinguish different phonemes is carried in the envelope of the speech spectrum, and the masking WTA system of the invention further extracts the spectral envelope of the input speech. The node voltage V_Ck in Fig. 8 presents a smoothed spectral envelope of the input currents I_k. If a given neuron corresponds to a spectral valley, its output current is suppressed by its neighbouring peaks, but its node voltage still rises (as described above), so that an envelope corresponding to the input spectrum is realized as a smooth node-voltage profile. Fig. 11 illustrates the envelope extraction: the solid curves are the node voltages corresponding to the different PWLs, and the dashed curve is the case with no resistance.
Fig. 12 is a conceptual schematic of a single masking WTA cell according to a specific embodiment of the invention. It comprises three nMOS transistors M1, M2, and M3, a piecewise-linear resistor PWL R, a voltage buffer, a MOS capacitor M5, and two current mirrors MI1 and MI2. In a programming phase, the input voltage is stored on the MOS capacitor M5; M4 converts the voltage into a current, which is input via the current mirror MI1. In operation, the voltage output is buffered by a unity-gain buffer and then coupled to the output bus. The output current is copied by the current mirror MI2 and delivered to the current output bus, and is then converted to a voltage by the grounded piecewise-linear resistor PWL R. PWL R has a resistance that is sensitive to the direction of current flow (Fig. 9), matching the perceptual masking curve (Fig. 2); the ratio of its leftward to rightward resistance can reach 100. The two nMOS transistors M1 and M2 act as passive resistors for the two current directions, with a comparator COMP switching between M1 and M2 according to the sign of the voltage drop (the resistances are adjusted by the gate voltages). This embodiment of the invention was realized, together with supporting circuitry (for stability, signal gain, and leakage prevention), in the UMC(TM) 0.5-micron double-poly double-metal CMOS process. The voltage output yields the spectral envelope and the current output yields the spectral formants. Using the masking WTA circuit of the invention, the formants of the vowel "ai" can be clearly identified from the spectrum, even when increasing noise is present in the input signal.
In a preferred embodiment, the masking WTA network of the invention is advantageously integrated with the other elements of an ASR system as an analog parallel-processing system. For example, a band-pass filter bank coupled upstream provides the excitation input to the masking WTA network.
The phonetic feature mapper 113 (Fig. 1) comprises the projection similarity generator 131 and the relative projection similarity generator 132, which feed the phonetic feature generator 133; the latter produces the phonetic features used for speech recognition according to the preferred embodiment of the invention. The phonetic feature extraction is based on the physiology of human speech (in contrast with the perceptual spectrum described above, which is based on the psychology of human hearing). When a human speaks, air is expelled from the lungs to excite the vocal cords, and the articulators then shape the pressure wave according to the intended sound. For some vowels, the shape of the vocal tract remains unchanged throughout articulation, so that the spectral shape is stationary in time. For other vowels, articulation begins with one vocal-tract shape, which changes gradually and then settles into another shape. For stationary vowels, the spectral shape determines the identity of the phoneme, and these shapes are used as the reference spectra in the phonetic feature mapping. Non-stationary vowels, by contrast, generally consist of two or three reference vowel segments and the transitions between them. Fig. 13 shows the spectra of the stationary vowel "i" and the non-stationary vowel "ai", illustrating the difference. Fig. 14 shows the spectrum of the non-stationary vowel "ai" on the Mel frequency scale: the initial portion resembles the spectrum of the vowel "a", moves through a spectrum resembling the vowel "e", and finally settles into a spectrum resembling the vowel "i". The preferred embodiment of the invention uses nine stationary vowels as reference vowels, forming the basis of all 37 Mandarin vowels. Table 1 lists the 37 Mandarin vowel phonemes and the nine reference phonemes. The spectra of the nine reference phonemes are represented by c(i), i = 1, 2, ..., 9, each a 64-dimensional vector (the wave components of the inverse Fourier transform) computed by averaging over all frames of the given reference vowel in a training set.
To reduce the dimensionality of the data fed to the CHMM recognizer 114, in one embodiment of the invention the phonetic feature mapper 113 produces nine features from the 64-dimensional spectral vector. The phonetic feature mapper 113 first computes the similarity of the input spectrum to each of the nine reference spectral vectors, and then computes another set of 72 relative similarities between the input spectrum and the 72 ordered pairs of reference spectral vectors. The final set of nine phonetic features is obtained by combining these similarities. Unlike a conventional classification scheme, which assigns the input spectrum to one of the reference spectra, the present invention quantitatively normalizes the shape of the input spectrum with respect to the nine reference spectra (and thereby normalizes the shape of the vocal tract). The phonetic feature mapping of the invention is thus a method of feature extraction (dimensionality reduction) via similarity measures. The preferred embodiment uses two projection-based similarity measures: the projection similarity and the relative projection similarity.
Fig. 15(a) shows the projection similarity, which is proportional to the projection of the input vector x along the direction of the reference vector c^(k) with predetermined weights, and is given by
a^(k) = Σ_{i=1}^{64} w_i^(k) · x_i · c_i^(k) / ||c^(k)||, k = 1, ..., 9,
where
||c^(k)|| = sqrt( Σ_{i=1}^{64} (c_i^(k))^2 )
and the weighting factors are given by
w_i^(k) = ( c_i^(k) / σ_i^(k) ) / Σ_{i=1}^{64} ( c_i^(k) / σ_i^(k) ), i = 1, 2, ..., 64, k = 1, 2, ..., 9,
where σ_i^(k) is the standard deviation of dimension i over the population of the k-th reference vowel. In the weighting factor w_i^(k), the term σ_i^(k) acts as a constant that gives all dimensions of all nine reference vectors the same variance, and the term c_i^(k) emphasizes the spectral components with larger amplitudes. The set of weights corresponding to each reference vector is normalized.
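The projection similarity formula above translates directly into a few lines of numpy; the random reference spectra and deviations below stand in for the averaged training-set vowels and are purely illustrative.

```python
# A sketch of the projection similarity a^(k) as defined above.
import numpy as np

def projection_similarity(x, c, sigma):
    """x: (64,) input spectrum; c: (9, 64) reference spectra;
    sigma: (9, 64) per-dimension standard deviations of each reference vowel."""
    w = c / sigma
    w = w / w.sum(axis=1, keepdims=True)          # normalized weights w_i^(k)
    norms = np.linalg.norm(c, axis=1)             # ||c^(k)||
    return np.sum(w * x[None, :] * c, axis=1) / norms   # a^(k), k = 1..9

rng = np.random.default_rng(0)
c = np.abs(rng.normal(size=(9, 64)))              # stand-in reference spectra
sigma = 0.1 + np.abs(rng.normal(size=(9, 64)))    # stand-in deviations
x = np.abs(rng.normal(size=64))                   # stand-in input spectrum
a = projection_similarity(x, c, sigma)            # nine projection similarities
```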
For many cases the projection similarity above suffices for accurate speech recognition. However, Fig. 15(b) shows the case of the spectrally similar reference vowels "i" and "iu": the projection similarities of an input vector onto these similar reference vowels will both be large, and a speech input spectrally close to either of the similar phonemes therefore needs further discrimination for accurate recognition. The "relative projection similarity" extracts only the decisive spectral components and therefore achieves better discrimination. For ease of illustration, Fig. 16 shows vector diagrams of the relative projection similarity for two-dimensional vectors; of course, vectors of any dimension fall within the intended scope of the invention. The input vector x lies close to two similar reference vectors c^(k) and c^(l), slightly closer to c^(k), but the difference between the two projections is small, as shown in Fig. 16(a). The difference between c^(k) and c^(l), represented by c^(k) - c^(l), is decisive for the classification of the input speech vector x. Figs. 16(b) and 16(c) show that the projection of x - c^(l) onto c^(k) - c^(l) is greater than the projection of x - c^(k) onto c^(l) - c^(k), and that the difference between these two projections is more significant than the difference between the separate projections of x onto c^(k) and onto c^(l). Using this observation, the statistically weighted projection of the input vector x onto c^(k) relative to c^(l) is
q^(k,l) = Σ_{i=1}^{64} v_i^(k,l) · (x_i - c_i^(l)) · (c_i^(k) - c_i^(l)) / ||c^(k) - c^(l)||, k = 1, ..., 9, l ≠ k,
where
||c^(k) - c^(l)|| = sqrt( Σ_{i=1}^{64} (c_i^(k) - c_i^(l))^2 ).
The normalized weighting factors are given by
v_i^(k,l) = ( |c_i^(k) - c_i^(l)| / sqrt( (σ_i^(k))^2 + (σ_i^(l))^2 ) ) / Σ_{i=1}^{64} ( |c_i^(k) - c_i^(l)| / sqrt( (σ_i^(k))^2 + (σ_i^(l))^2 ) ), i = 1, ..., 64; k = 1, ..., 9, l ≠ k.
These weighting factors emphasize the components in which the two reference vectors differ most and equalize the differences across all dimensions. When q^(k,l) is negative, in order to control the dynamic range while preserving the cues needed to recognize the input vector, the negative q^(k,l) is set to a small positive value and positive values of q^(k,l) are left unchanged (a one-sided ramp function). The relative projection similarity of x onto c^(k) with respect to c^(l) is then defined as
r^(k,l) = q^(k,l) / ( q^(k,l) + q^(l,k) ), k = 1, ..., 9, l ≠ k.
There are therefore 8 × 9 = 72 relative projection similarities which, together with the 9 projection similarities, define the phonetic features of the preferred embodiment of the invention.
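A corresponding sketch of the relative projection similarity follows, reusing the shapes from the previous snippet; the small floor used for negative q^(k,l) is an assumed constant, since the text only specifies "a small positive value".

```python
# A sketch of the relative projection similarity r^(k,l) defined above.
import numpy as np

def relative_projection_similarity(x, c, sigma, floor=1e-3):
    K = c.shape[0]
    q = np.zeros((K, K))
    r = np.zeros((K, K))
    for k in range(K):
        for l in range(K):
            if l == k:
                continue
            d = c[k] - c[l]                                   # decisive direction
            v = np.abs(d) / np.sqrt(sigma[k] ** 2 + sigma[l] ** 2)
            v = v / v.sum()                                   # normalized weights v_i^(k,l)
            q[k, l] = np.sum(v * (x - c[l]) * d) / np.linalg.norm(d)
            q[k, l] = max(q[k, l], floor)                     # one-sided ramp
    for k in range(K):
        for l in range(K):
            if l != k:
                r[k, l] = q[k, l] / (q[k, l] + q[l, k])
    return r                                                  # 9x9, diagonal unused

rng = np.random.default_rng(0)
c = np.abs(rng.normal(size=(9, 64)))
sigma = 0.1 + np.abs(rng.normal(size=(9, 64)))
x = np.abs(rng.normal(size=64))
r = relative_projection_similarity(x, c, sigma)
```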
In one preferred embodiment of the invention, the projection similarity and the relative projection similarity are combined in a hierarchical classification for recognizing speech: a first coarse classification selects the candidates whose projection similarity a^(k) (the extent of x along c^(k)) is large, and these candidates are then screened further using the pairwise relative projection similarities. However, if the first coarse classification is not tuned properly, a good candidate may fail to be selected.
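A toy version of this two-stage (coarse, then pairwise) selection might look as follows; the number of coarse candidates kept is an assumed tuning parameter.

```python
# A small sketch of the hierarchical classification described above.
import numpy as np

def hierarchical_classify(a, r, keep=3):
    """a: (9,) projection similarities; r: (9, 9) relative projection
    similarities. Returns the index of the winning reference vowel."""
    candidates = list(np.argsort(a)[-keep:])          # coarse pass: largest a^(k)
    best = candidates[0]
    for k in candidates[1:]:                          # fine pass: pairwise screening
        if r[k, best] > r[best, k]:
            best = k
    return best
```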
In the preferred embodiment of the invention, the projection similarity and the relative projection similarity are integrated by the phonetic feature mapping using the following scheme: (a) the relative projection similarity is used for any two reference vectors with large projection similarities; and (b) otherwise, the projection similarity is used alone. This produces not only more accurate speech recognition but also a more efficient computation. The phonetic features are defined by
p^(k) = (1/λ) a^(k) + (1/λ) Σ_{l=1, l≠k}^{9} ( r^(k,l) p^(l) - r^(l,k) p^(k) ), k = 1, 2, ..., 9,
where λ is a scaling factor that controls the degree of cross-coupling, or lateral inhibition. The solution of these equations for two reference vectors (for simplicity of illustration) is
p^(k) / p^(l) = ( λ a^(k) + (a^(k) + a^(l)) r^(k,l) ) / ( λ a^(l) + (a^(k) + a^(l)) r^(l,k) ).
When a^(k) and a^(l) are both large and of comparable magnitude, suppose x is perceptually closer, in the Euclidean-norm sense, to c^(k); the distance between x and c^(k) is then smaller, so r^(k,l) > r^(l,k). If λ is relatively small, p^(k)/p^(l) approaches r^(k,l)/r^(l,k), that is, the ratio is determined by the relative projection similarities r^(k,l) and r^(l,k). When only one of a^(k) and a^(l) is large, say a^(k), then r^(k,l) and r^(l,k) approach 1 and 0 respectively, and
p^(k) / p^(l) ≈ ( (λ + 1) a^(k) + a^(l) ) / ( λ a^(l) ),
which is determined by a^(k) and a^(l). In the third and last possible case, where a^(k) and a^(l) are both small,
p^(k) ∝ λ a^(k) + (a^(k) + a^(l)) r^(k,l) and p^(l) ∝ λ a^(l) + (a^(k) + a^(l)) r^(l,k);
because a^(k) and a^(l) are both small and r^(k,l) and r^(l,k) are less than 1, p^(k) and p^(l) are also small and can be neglected. Defining
r^(k,k) = λ + Σ_{l=1, l≠k}^{9} r^(l,k), k = 1, 2, ..., 9,
the equations for p^(k) above can be written in matrix form as R·p = a, where the matrix R has diagonal entries r^(k,k) and off-diagonal entries -r^(k,l); the phonetic features p^(k), k = 1, 2, ..., 9, are solved for by multiplying both sides by the inverse of this matrix.
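Given the nine projection similarities a^(k) and the matrix of relative projection similarities r^(k,l), the phonetic features follow from one linear solve; the value of λ below is an arbitrary illustrative choice, not a value prescribed by the patent.

```python
# A sketch of solving R p = a for the phonetic features, with
# R_kk = lambda + sum_{l != k} r^(l,k) and R_kl = -r^(k,l).
import numpy as np

def phonetic_features(a, r, lam=0.5):
    """a: (9,) projection similarities; r: (9, 9) relative projection
    similarities (diagonal ignored); lam: scaling factor lambda."""
    K = len(a)
    R = -r.copy()
    np.fill_diagonal(R, 0.0)
    for k in range(K):
        R[k, k] = lam + sum(r[l, k] for l in range(K) if l != k)
    return np.linalg.solve(R, a)      # p = R^{-1} a

# Example wiring with the outputs of the two previous snippets:
# p = phonetic_features(a, r, lam=0.5)
```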
Fig. 17 shows the phonetic feature profile of the Mandarin vowel "ai": at the beginning the largest phonetic feature is "a", which then gives way to the vowel "e", and finally "i" becomes the largest. After 450 ms a phonetic feature "u" becomes visible, though it is quite short and not prominent. The invention achieves significant discriminability by decomposing speech into the nine basic vowels. By using the relative projection similarity to enhance the discrimination between similar vowels, even higher recognition accuracy can be achieved. Fig. 18(a) shows, for the vowel "i" (dark points) and the vowel "iu" (light points), the projection similarity onto a(8) ("iu", vertical axis) versus the projection similarity onto a(6) ("i", horizontal axis). With the projection similarity alone, the discriminability is small because the two different vowels lie very close together, as shown in Fig. 18(a). However, when the phonetic features of the invention are used for "i" (p(6), dark shading) and "iu" (p(8), light shading), the discriminability is greatly improved, as can be seen from the clear separation of the vowels shown in Fig. 18(b).
Humans perceive speech in part by recognizing it piecewise. The invention incorporates such partial recognition because, as just described, vowels are decomposed into segments of the nine reference vowels. Moreover, when listening, humans ignore much irrelevant information; the nine reference vowels of the invention likewise serve to discard much irrelevant information. The invention thus embodies characteristics of human speech perception to achieve better speech recognition.
The discriminability of the phonetic features p^(k) of the invention is controlled by the value chosen for the scaling factor λ. As the equation for p^(k) above shows, if λ is large, the sum of the relative projection similarities r^(k,l) is overwhelmed by λ. Fig. 19 plots the "iu" phonetic feature (p(8)) against the "i" phonetic feature (p(6)) with λ as a parameter, larger values being drawn in darker gray. A smaller λ spreads the distribution away from the diagonal (which represents no discriminability), making the two vowels more distinguishable and thereby improving recognition accuracy. However, too small a value of λ introduces randomness that is difficult to model with multidimensional Gaussian functions in the continuous HMM (CHMM) recognizer 114 (Fig. 1), leading to poor recognition accuracy. The invention therefore advantageously chooses the value of the scaling factor λ to optimize discriminability while limiting randomness.
The continuous hidden Markov model recognizer 114 (Fig. 1) uses a statistical method to characterize the spectral properties of the frames of speech, on the premise that a speech signal can be characterized as a parametric stochastic process and that the parameters of that process can be determined in a precise manner. In an observable Markov model, each state corresponds to a deterministically observable event (for example, whether a day is rainy or sunny), and the output of the model is the sequence of states at each moment in time (for example, the number of rainy days), each state corresponding to an observable event. A hidden Markov model, by contrast, is a doubly embedded stochastic process (for example, someone tossing one or more coins behind a curtain): the underlying process is not directly observable (it is hidden behind the curtain) and can only be inferred through another stochastic process (the coin tosses) that produces the sequence of observations. For discrete-symbol observations, an HMM is therefore characterized by (a) the number of states in the model, (b) the number of distinct observation symbols per state (the alphabet size), (c) the state-transition probability distribution, (d) the observation-symbol probability distribution, and (e) the initial-state distribution. The invention uses an isolated-word recognizer: a system for recognizing V isolated words, each word modelled by a distinct HMM, with a training set of K utterances of each word (spoken by one or more speakers), where each utterance constitutes an observation sequence of some representation of the word's features. For each word v in the vocabulary, the HMM parameters (c), (d), and (e) above must be estimated to optimize the fit to the training set of the v-th word. The invention recognizes each unknown word from the observation sequence obtained through the perceptual spectrum and phonetic feature analysis of the speech: the model likelihood is computed for every possible model, and the word with the highest model likelihood is selected. The probability computation is generally performed using the maximum-likelihood path (the Viterbi algorithm). For a detailed treatment of HMMs, see Rabiner & Juang, Fundamentals of Speech Recognition, pp. 321-389, Prentice-Hall Signal Processing Series, 1993.
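A compact, self-contained sketch of isolated-word recognition with per-word HMMs (Viterbi scoring of a sequence of the nine phonetic features under diagonal-Gaussian emissions) is given below; the model topology, parameters, vocabulary, and test sequence are toy values made up for the example, not trained models from the patent.

```python
# Isolated-word scoring with per-word HMMs: Viterbi log-likelihood with
# diagonal-Gaussian emissions over 9-dimensional phonetic-feature vectors.
import numpy as np

def log_gauss(x, mean, var):
    """Log density of a diagonal Gaussian, evaluated per state."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var, axis=-1)

def viterbi_loglik(obs, log_pi, log_A, means, variances):
    """obs: (T, 9) phonetic-feature sequence; returns best-path log-likelihood."""
    delta = log_pi + log_gauss(obs[0], means, variances)
    for t in range(1, len(obs)):
        delta = np.max(delta[:, None] + log_A, axis=0) + log_gauss(obs[t], means, variances)
    return delta.max()

def recognize(obs, word_models):
    """Pick the vocabulary word whose HMM gives the highest likelihood."""
    scores = {w: viterbi_loglik(obs, *m) for w, m in word_models.items()}
    return max(scores, key=scores.get)

# Toy 3-state left-to-right models for two words (parameters made up).
rng = np.random.default_rng(1)
def toy_model(shift):
    log_pi = np.log(np.array([1.0, 1e-6, 1e-6]))
    A = np.array([[0.6, 0.4, 0.0], [0.0, 0.6, 0.4], [0.0, 0.0, 1.0]])
    means = rng.normal(loc=shift, size=(3, 9))
    variances = np.full((3, 9), 0.5)
    return np.log(np.maximum(A, 1e-12)), log_pi, means, variances

models = {}
for word, shift in [("yi", 0.0), ("ai", 1.0)]:
    log_A, log_pi, means, variances = toy_model(shift)
    models[word] = (log_pi, log_A, means, variances)

obs = rng.normal(loc=1.0, size=(20, 9))     # a stand-in phonetic-feature sequence
print(recognize(obs, models))               # likely "ai", given the shifted means
```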
Because of the perceptual speech processor 112 and the phonetic feature mapper 113 of the invention, the phonetic features 104 fed to the continuous HMM recognizer 114 are superior to those of conventional ASR systems, producing more robust and more accurate speech recognition. Fig. 20 plots recognition rate against SNR for an experiment in which white noise was added to the input speech but not to any training set. Fig. 20(a) shows the results when recognition requires the correct word to be the top candidate, and Fig. 20(b) the results for the top three candidates (because of the many homophones, some speech must be further disambiguated from context). The upper-left region of the plot corresponds to the best recognition performance. The curve labelled PF(PS) represents phonetic features computed on the perceptual spectrum (in other words, the present invention) and reaches farthest toward the upper left. PF(SE) represents phonetic features computed on the FFT spectral envelope (that is, the phonetic feature processing without the perceptual spectrum processing) and is the next best. MCEP represents an existing parameterization of the speech spectrum known as Mel-scale cepstral coefficients, which is more affected by noise than the system of the invention. CEP represents plain cepstral coefficients without the Mel-scale conversion and lies further to the right than MCEP, confirming the value of the Mel scale. REF (reflection coefficients) and LPC (linear predictive coding) are other existing speech recognition methods, with still poorer results. The invention thus achieves both accuracy and robustness in speech recognition. Fig. 21 plots recognition rate against SNR for another experiment, tested on three noisy utterances, using the nine Mandarin reference vowels and the projection similarities as the input to the continuous HMM 114, which yields improved recognition accuracy. PF(PS), representing the invention, again gives the best results. PRJS(PS) represents the projection similarity computed on the perceptual spectrum (that is, the invention without the phonetic feature processing), and PS is the perceptual spectrum alone (that is, without the projection similarity computation or the phonetic feature processing). The invention not only achieves more robust and accurate speech recognition but also higher computational efficiency than conventional methods, because the parameterization of the speech spectrum is reduced from the typical 64 dimensions to 9. The phonetic feature mapping is also less affected by noise, in part because it emphasizes the decisive spectral components and ignores distortions caused by noise.
To demonstrate that the invention effectively improves speech recognition, Fig. 22 plots the outside recognition rate (%) (using different speakers) against the inside recognition rate (%) (using a single speaker). Points toward the upper-right corner indicate the best robustness and accuracy; PF(PS) again shows the best results of all the methods compared. Fig. 23 plots the noisy-speech recognition rate (%) (environmental noise) against the inside recognition rate (%) (near-ideal listening conditions). Points toward the upper-right corner indicate the best robustness and accuracy, and PF(PS) once more shows the best results compared with the other existing speech recognition methods.
Although particular embodiments have been fully described above, various modifications, alternative constructions, and equivalents may be used. For example, although the examples presented herein use Mandarin Chinese, the technical ideas of the invention apply to any syllabic language. Moreover, any processing technology, whether analog, digital, numerical, or hardware, may advantageously be used. The above description and illustrations should therefore not be taken as limiting the scope of the invention, which is defined by the appended claims.

Claims (16)

1. A speech processing system for processing an input speech spectral vector, comprising:
a perceptual speech processor for perceptually processing the input speech spectral vector to produce a perceptual spectrum;
a memory device for storing a plurality of reference spectral vectors; and
a phonetic feature mapper, coupled to the perceptual speech processor and the memory device, for mapping the perceptual spectrum onto the plurality of reference spectral vectors.
2. The speech processing system of claim 1, wherein the perceptual speech processor comprises:
a masking operator for noise-masking the input speech spectral vector to produce a masked input speech spectral vector;
a minimum audible field (MAF) curve normalizer, coupled to the masking operator, for renormalizing the masked input speech spectral vector against the minimum audible field to produce a renormalized masked input speech spectral vector; and
a Mel-scale resampler, coupled to the MAF curve normalizer, for converting the renormalized masked input speech spectral vector to the Mel scale.
3. The speech processing system of claim 1, wherein the phonetic feature mapper comprises:
a projection similarity generator, coupled to the memory device, for producing a plurality of projection similarity calculations of the input spectral vector onto the plurality of reference spectral vectors;
a relative projection similarity generator, coupled to the memory device, for producing a plurality of relative projection similarity calculations of the input spectral vector onto the plurality of reference spectral vectors; and
a selector, coupled to the projection similarity generator and the relative projection similarity generator, for selecting a projection similarity from between the calculations of the projection similarity generator and the calculations of the relative projection similarity generator, according to the relative values of the projection similarities and relative projection similarities corresponding to the input speech spectral vector on the plurality of reference spectral vectors.
4. The speech processing system of claim 3, wherein the plurality of reference spectral vectors consists of a plurality of stationary vowels.
5. The speech processing system of claim 4, wherein the plurality of stationary vowels consists of nine stationary Mandarin vowels.
6. A speech recognition system for recognizing a sampled speech spectral vector, comprising:
a fast Fourier transform analyzer for producing a Fourier transform of the sampled speech spectral vector;
a perceptual speech processor, coupled to the fast Fourier transform analyzer, for processing the Fourier transform to produce a perceptual spectrum;
a memory device for storing a plurality of reference spectral vectors;
a phonetic feature mapper, coupled to the perceptual speech processor and the memory device, for mapping the perceptual spectrum onto the plurality of reference spectral vectors and thereby selecting at least one reference vector having the greatest similarity to the perceptual spectrum; and
a continuous HMM recognizer, coupled to the phonetic feature mapper, for recognizing the at least one reference vector.
7. The speech recognition system of claim 6, wherein the plurality of reference spectral vectors consists of a plurality of stationary vowels.
8. The speech recognition system of claim 7, wherein the plurality of stationary vowels consists of nine stationary Mandarin vowels.
9. A speech processing method for processing an input speech spectral vector, comprising the steps of:
perceptually processing the input speech spectral vector to produce a perceptual spectrum;
storing a plurality of reference spectral vectors; and
mapping the perceptual spectrum onto the plurality of reference spectral vectors.
10. The speech processing method of claim 9, wherein the perceptual processing step further comprises the steps of:
noise-masking the input speech spectral vector to produce a masked input speech spectral vector;
renormalizing the masked input speech spectral vector against the minimum audible field to produce a renormalized masked input speech spectral vector; and
converting the renormalized masked input speech spectral vector to the Mel scale.
11. The speech processing method of claim 9, wherein the mapping step further comprises the steps of:
producing a plurality of projection similarity calculations of the input spectral vector onto the plurality of reference spectral vectors;
producing a plurality of relative projection similarity calculations of the input spectral vector onto the plurality of reference spectral vectors; and
selecting a projection similarity from between the projection similarity calculations and the relative projection similarity calculations, according to the relative values of the projection similarities and relative projection similarities corresponding to the input speech spectral vector on the plurality of reference spectral vectors.
12. The speech processing method of claim 11, wherein the plurality of reference spectral vectors consists of a plurality of stationary vowels.
13. The speech processing method of claim 12, wherein the plurality of stationary vowels consists of nine stationary Mandarin vowels.
14. A speech recognition method for a sampled input speech spectral vector, comprising the steps of:
producing, with a fast Fourier transform analyzer, a Fourier transform of the sampled input speech spectral vector;
processing the Fourier transform to produce a perceptual spectrum;
storing a plurality of reference spectral vectors;
mapping the perceptual spectrum onto the plurality of reference spectral vectors;
selecting at least one reference vector having the greatest similarity to the perceptual spectrum; and
recognizing the at least one reference vector using a continuous HMM recognizer.
15. The speech recognition method of claim 14, wherein the plurality of reference spectral vectors consists of a plurality of stationary vowels.
16. The speech recognition method of claim 15, wherein the plurality of stationary vowels consists of nine stationary Mandarin vowels.
CN01124051A 2001-08-08 2001-08-08 Phonetic recognizing system and method of sensing phonetic characteristics Pending CN1400583A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN01124051A CN1400583A (en) 2001-08-08 2001-08-08 Phonetic recognizing system and method of sensing phonetic characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN01124051A CN1400583A (en) 2001-08-08 2001-08-08 Phonetic recognizing system and method of sensing phonetic characteristics

Publications (1)

Publication Number Publication Date
CN1400583A true CN1400583A (en) 2003-03-05

Family

ID=4665467

Family Applications (1)

Application Number Title Priority Date Filing Date
CN01124051A Pending CN1400583A (en) 2001-08-08 2001-08-08 Phonetic recognizing system and method of sensing phonetic characteristics

Country Status (1)

Country Link
CN (1) CN1400583A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104503758A (en) * 2014-12-24 2015-04-08 天脉聚源(北京)科技有限公司 Method and device for generating dynamic music haloes
CN109935226A (en) * 2017-12-15 2019-06-25 上海擎语信息科技有限公司 A kind of far field speech recognition enhancing system and method based on deep neural network
CN112614507A (en) * 2020-12-09 2021-04-06 腾讯音乐娱乐科技(深圳)有限公司 Method and apparatus for detecting noise


Similar Documents

Publication Publication Date Title
Sroka et al. Human and machine consonant recognition
US20020128827A1 (en) Perceptual phonetic feature speech recognition system and method
Ghitza Temporal non-place information in the auditory-nerve firing patterns as a front-end for speech recognition in a noisy environment
Lee et al. Tone recognition of isolated Cantonese syllables
Lech et al. Amplitude-frequency analysis of emotional speech using transfer learning and classification of spectrogram images
Cole et al. Speaker-independent recognition of spoken English letters
CN109979436A (en) A kind of BP neural network speech recognition system and method based on frequency spectrum adaptive method
Chelali et al. Text dependant speaker recognition using MFCC, LPC and DWT
Piotrowska et al. Machine learning-based analysis of English lateral allophones
Mehta et al. Comparative study of MFCC and LPC for Marathi isolated word recognition system
Rajan et al. Two-pitch tracking in co-channel speech using modified group delay functions
Chang et al. Automatic phonetic transcription of spontaneous speech (american English).
Wiśniewski et al. Automatic detection of disorders in a continuous speech with the hidden Markov models approach
Jeon et al. Speech analysis in a model of the central auditory system
Gaudani et al. Comparative study of robust feature extraction techniques for ASR for limited resource Hindi language
CN1400583A (en) Phonetic recognizing system and method of sensing phonetic characteristics
Ekpenyong et al. Unsupervised mining of under-resourced speech corpora for tone features classification
Kondhalkar et al. A novel algorithm for speech recognition using tonal frequency cepstral coefficients based on human cochlea frequency map
Yousfi et al. Isolated Iqlab checking rules based on speech recognition system
Ma et al. Statistical formant descriptors with linear predictive coefficients for accent classification
Kamarudin et al. Analysis on Mel Frequency Cepstral Coefficients and Linear Predictive Cepstral Coefficients as Feature Extraction on Automatic Accents Identification
Zouhir et al. Robust speaker recognition based on biologically inspired features
Muthusamy et al. A review of research in automatic language identification
Chakraborty et al. Speech recognition of isolated words using a new speech database in sylheti
Albaraq ARABIC SPEAKER RECOGNITION SYSTEM USING GAUSSIAN MIXTURE MODEL AND EM ALGORITHM.

Legal Events

Date Code Title Description
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication