EP1195744B1 - Geräuschrobuste Spracherkennung (Noise-robust speech recognition)


Info

Publication number
EP1195744B1
Authority
EP
European Patent Office
Prior art keywords
feature vector
noise
section
speaker
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
EP01308268A
Other languages
English (en)
French (fr)
Other versions
EP1195744A3 (de)
EP1195744A2 (de)
Inventor
Kiyoshi Yajima, c/o Pioneer Corporation
Soichi Toyama, c/o Pioneer Corporation
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Pioneer Corp
Original Assignee
Pioneer Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Pioneer Corp
Publication of EP1195744A2
Publication of EP1195744A3
Application granted
Publication of EP1195744B1
Anticipated expiration
Current legal status: Expired - Lifetime


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering
    • G10L 15/00 Speech recognition
    • G10L 15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L 15/08 Speech classification or search
    • G10L 15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L 15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/142 Hidden Markov Models [HMMs]
    • G10L 15/144 Training of HMMs

Definitions

  • The present invention relates to a voice recognition system, and more particularly to a speaker-adaptive voice recognition system that is robust to noise.
  • A system such as that shown in Fig. 9 is well known as a speaker-adaptive voice recognition system.
  • This voice recognition system is provided with a previously prepared standard acoustic model 100 of an unspecified speaker. A speaker adaptive acoustic model 200 is prepared by using the standard acoustic model 100 together with a feature vector of an input signal Sc generated from a voice uttered by a specified speaker, and voice recognition is conducted by adapting the system to the voice of that speaker.
  • The standard vector Va corresponding to a designated text (sentence or syllable) Tx is supplied from the standard acoustic model 100 to a path search section 4 and a speaker adaptation section 5, and the input signal Sc is then obtained by having the specified speaker actually utter the designated text Tx.
  • After an additive noise reduction section 1 removes the additive noise included in the input signal Sc, a feature vector generation section 2 generates a feature vector series Vcf representing the feature quantity of the input signal Sc. A multiplicative noise reduction section 3 then removes the multiplicative noise from the feature vector series Vcf, generating the feature vector series Vc from which both the additive noise and the multiplicative noise have been removed.
  • The feature vector series Vc is supplied to the path search section 4 and the speaker adaptation section 5.
  • The path search section 4 compares the feature vector series Vc with the standard vector Va, and finds the appearance probability of the feature vector series Vc for each syllable and the state transition probability from one syllable to another. When the speaker adaptation section 5 then compensates the standard vector Va according to the appearance probability and the state transition probability, the speaker adaptive acoustic model 200, adapted to the features of the voice (input signal) proper to the specified speaker, is prepared.
  • The speaker adaptive acoustic model 200 is thus adapted to the input signal generated from the voice uttered by the specified speaker. When the specified speaker thereafter utters arbitrary speech, the feature vector of the input signal generated from the uttered voice is collated with the adaptive vectors of the speaker adaptive acoustic model 200, and recognition is conducted by taking as the recognition result the model that gives the highest likelihood.
  • The additive noise reduction section 1 removes the additive noise by the spectrum subtraction method, and the multiplicative noise reduction section 3 removes the multiplicative noise by the CMN (cepstrum mean normalization) method; a speaker adaptive acoustic model 200 that is not influenced by the noise is thereby prepared.
  • Specifically, the additive noise reduction section 1 first finds the spectrum of the input signal Sc, and then subtracts the spectrum of the additive noise from it.
  • The multiplicative noise reduction section 3 first finds the time average of the cepstrum of the input signal Sc, and then subtracts that time average from the cepstrum of the input signal Sc.
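For illustration only (the patent itself gives no code), here is a minimal numpy sketch of these two front-end steps: spectral subtraction of an additive-noise estimate in the spectrum domain, followed by cepstrum mean normalization in the cepstrum domain. The noise estimate (averaging a few leading noise-only frames), the spectral floor, and all function names are assumptions of this sketch, not the patent's.

```python
import numpy as np

def spectral_subtraction(frames, noise_mag, floor=1e-3):
    """Subtract an estimated additive-noise magnitude spectrum from each frame.

    frames: (num_frames, frame_len) windowed time-domain frames.
    noise_mag: (frame_len // 2 + 1,) estimated noise magnitude spectrum.
    Returns cleaned magnitude spectra, clamped to a small spectral floor.
    """
    mag = np.abs(np.fft.rfft(frames, axis=1))
    return np.maximum(mag - noise_mag, floor * mag)

def cepstrum(mag_spec):
    """Real cepstrum: inverse FFT of the log magnitude spectrum."""
    return np.fft.irfft(np.log(mag_spec), axis=1)

def cmn(ceps):
    """Cepstrum mean normalization: subtracting the per-dimension time average
    cancels a stationary multiplicative (channel) distortion."""
    return ceps - ceps.mean(axis=0, keepdims=True)

# Usage with stand-in data; the first 5 frames are assumed to be noise-only.
frames = np.random.randn(30, 320)
noise_mag = np.abs(np.fft.rfft(frames[:5], axis=1)).mean(axis=0)
c_series = cmn(cepstrum(spectral_subtraction(frames, noise_mag)))  # noise-free cepstral features
```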
  • EP-A-0886263 discloses an environmentally compensated processing system with first and second feature vectors and the employment of noise and distortion parameters based thereon.
  • An object of the present invention is to provide a speaker-adaptive voice recognition system that is robust to noise and attains an increased voice recognition rate.
  • According to the present invention, there is provided a speech recognition system comprising: a standard acoustic model having standard vectors prepared in advance from an unspecified speaker for each syllable or phoneme unit used in recognition; a first feature vector generation section which reduces the noise in an input signal generated from an uttered voice corresponding to a designated text, to generate a first feature vector representing a feature quantity of the voice; a second feature vector generation section which generates a second feature vector from the input signal with the noise retained; and a processing section which compares the first feature vector with the standard vector to determine a path search result, matches the second feature vector to the standard vector according to the path search result, and generates a speaker adaptive acoustic model characterizing the voice of the speaker (see claim 1).
  • The noise may include additive noise and multiplicative noise.
  • The first feature vector generation section may include an additive noise reduction section for reducing the additive noise from the input signal to generate an additive-noise reduced signal.
  • The additive noise reduction section may apply a transformation to the input signal to generate a first spectrum, and subtract an additive noise spectrum corresponding to the additive noise from the first spectrum.
  • The first feature vector generation section may include a cepstrum calculator for applying cepstrum calculation to the additive-noise reduced signal.
  • The first feature vector generation section may include a multiplicative noise reduction section for reducing the multiplicative noise by subtracting the multiplicative noise from the first feature vector.
  • The first feature vector may contain a plurality of time-series first feature vectors; and the multiplicative noise reduction section calculates a time average of the time-series first feature vectors for estimating the multiplicative noise.
  • The second feature vector generation section may apply at least cepstrum calculation to the second spectrum to generate the second feature vector.
  • The first feature vector generation section generates the first feature vector with the additive noise of the environment surrounding the speaker and the multiplicative noise, such as the transmission noise of the present speech recognition system itself, removed.
  • The second feature vector generation section generates the second feature vector retaining the additive noise of the environment surrounding the speaker and the features of the multiplicative noise such as the transmission noise of the present voice recognition system itself.
  • The preparation section generates the adaptive vector by compensating the standard vector according to the first feature vector, which does not include the noise, and the second feature vector, which does.
  • The adaptive vector is used to update the speaker adaptive acoustic model so that it is adapted to the voice of the speaker.
  • Since the standard vector in the standard acoustic model is compensated in this way, a speaker adaptive acoustic model corresponding to the actual utterance environment can be prepared, and a voice recognition system that is robust to noise and has a higher voice recognition rate can be realized.
  • Because the second feature vector generation section generates its feature vector without removing the additive noise or the multiplicative noise, and that feature vector is used for the speaker adaptation, the feature information of the original voice is not removed and an adequate speaker adaptive acoustic model can be generated.
  • FIG. 1 is a block diagram showing the structure of a voice recognition system according to an embodiment of the present invention.
  • The voice recognition system comprises a standard acoustic model (hereinafter referred to as the [standard voice HMM]) 300 of an unspecified speaker, previously prepared by using Hidden Markov models (HMM), and a speaker adaptation acoustic model (hereinafter referred to as the [adaptive voice HMM]) 400 prepared by the speaker adaptation.
  • The number of states per syllable of the standard voice HMM 300 is defined as one. Further, the standard voice HMM 300 has an appearance probability distribution for each syllable, and the average vector of the appearance probability distribution serves as the standard vector.
  • The standard voice HMM 300 has an M-dimensional standard vector [a_{n,M}] for each syllable. That is, when the standard voice HMM 300 is prepared, voice data generated from voices uttered by one or more speakers (unspecified speakers) in a silent environment is framed at predetermined intervals. The framed voice data is successively subjected to the cepstrum operation to generate feature vector series in the cepstrum domain over a plurality of frames for each syllable. Averaging these feature vector series over the frames yields the standard voice HMM 300, composed of the standard vector [a_{n,M}] for each syllable.
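A minimal sketch of this preparation step, assuming (as stated above) one state per syllable, so that each syllable's standard vector is simply the mean of the cepstral features of its frames. The frame labelling and all names are illustrative assumptions.

```python
import numpy as np

def build_standard_vectors(features, states):
    """Average M-dimensional cepstral features over all frames that carry the
    same state number n, giving one standard vector [a_{n,M}] per syllable.

    features: (num_frames, M) cepstral feature vectors from silent-environment speech.
    states:   (num_frames,) state number n (syllable label) of each frame.
    Returns {n: (M,) mean vector a_n}.
    """
    return {n: features[states == n].mean(axis=0) for n in np.unique(states)}

# Usage with toy labels: frames 0-5 belong to state 10 ([KO]), 6-11 to state 46 ([N]).
feats = np.random.randn(12, 16)
labels = np.array([10] * 6 + [46] * 6)
standard_hmm = build_standard_vectors(feats, labels)  # {10: a_10, 46: a_46}
```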
  • The variable n of the standard vector [a_{n,M}] expresses the state number identifying each syllable, and the variable M expresses the dimension of the vector.
  • The remaining syllables are likewise characterized as M-dimensional standard vectors [a_{n,M}] distinguished by the state number n.
  • The standard voice HMM 300 is supplied with the designated text Tx of a previously determined sentence or syllable sequence, and the standard vectors [a_{n,M}] corresponding to the syllables composing the designated text Tx are supplied to the path search section 10 and the speaker adaptation section 11 in the order in which the syllables appear.
  • That is, the standard vectors corresponding to the state numbers n = 10, 46, 22, 17, 44 expressing [KO], [N], [NI], [CHI], [WA], namely [a_{10,1}, a_{10,2}, a_{10,3}, ..., a_{10,M}], [a_{46,1}, a_{46,2}, a_{46,3}, ..., a_{46,M}], [a_{22,1}, a_{22,2}, a_{22,3}, ..., a_{22,M}], [a_{17,1}, a_{17,2}, a_{17,3}, ..., a_{17,M}], and [a_{44,1}, a_{44,2}, a_{44,3}, ..., a_{44,M}], are supplied to the path search section 10 and the speaker adaptation section 11 in that order.
  • The voice recognition system of the present invention is provided with a framing section 6, an additive noise reduction section 7, a feature vector generation section 8, a multiplicative noise reduction section 9, and a feature vector generation section 12.
  • The additive noise reduction section 7 successively applies a Fourier transform to each framed input signal Scf to generate a spectrum for each frame, removes the additive noise included in each spectrum in the spectrum domain, and outputs the resulting spectrum.
  • The feature vector generation section 8 applies the cepstrum operation to the additive-noise-free spectrum of each frame to generate the feature vector series [c_{i,M}]' in the cepstrum domain, where the variable i expresses the frame order (number) and the variable M expresses the dimension.
  • The multiplicative noise reduction section 9 removes the multiplicative noise from the feature vector series [c_{i,M}]' by the CMN method. That is, the feature vectors [c_{i,M}]' obtained for the individual frames i by the feature vector generation section 8 are time-averaged for each dimension, and the resulting M-dimensional time average [c̄_M] is subtracted from each feature vector [c_{i,M}]' to generate the feature vector series [c_{i,M}] from which the multiplicative noise has been removed. The feature vector series [c_{i,M}] thus generated is supplied to the path search section 10.
  • The feature vector generation section 12 generates a spectrum for each frame by successively Fourier-transforming each framed input signal Scf output from the framing section 6. Each spectrum is then subjected to the cepstrum operation to generate the feature vector series [s_{i,M}] in the cepstrum domain, which is supplied to the speaker adaptation section 11. In this connection, the variable i of the feature vector series [s_{i,M}] expresses the frame order and the variable M expresses the dimension; no noise reduction is applied on this path.
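A sketch of the two parallel feature paths just described, under the same illustrative assumptions as the earlier front-end sketch: the unprocessed series [s_{i,M}] keeps the noise for speaker adaptation, while the noise-reduced series [c_{i,M}] would additionally pass through spectral subtraction and CMN before path search.

```python
import numpy as np

def frame_signal(signal, frame_len, hop):
    """Split a 1-D signal into overlapping frames (the role of framing section 6)."""
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n)])

def raw_cepstral_features(frames, eps=1e-8):
    """Feature series [s_{i,M}] with the noise deliberately retained:
    Fourier transform, log magnitude, inverse transform; no subtraction steps."""
    mag = np.abs(np.fft.rfft(frames, axis=1)) + eps
    return np.fft.irfft(np.log(mag), axis=1)

# Usage: both series come from the same framed input signal Scf.
sc = np.random.randn(16000)               # stand-in for the input signal Sc
scf = frame_signal(sc, frame_len=320, hop=160)
s_series = raw_cepstral_features(scf)     # [s_{i,M}] -> speaker adaptation section 11
# c_series = cmn(cepstrum(spectral_subtraction(scf, noise_mag)))  # -> path search
```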
  • The designated text Tx, the standard vectors [a_{n,M}], and the feature vector series [c_{i,M}] are supplied to the path search section 10.
  • The designated text Tx, the standard vectors [a_{n,M}], and the feature vector series [s_{i,M}] are supplied to the speaker adaptation section 11.
  • The path search section 10 compares the standard vectors [a_{n,M}] with the feature vector series [c_{i,M}] and judges, frame by frame, which syllable of the designated text Tx the feature vector series [c_{i,M}] corresponds to.
  • The path search result Dv is supplied to the speaker adaptation section 11.
  • The speaker adaptation section 11 divides the feature vector series [s_{i,M}] from the feature vector generation section 12 among the syllables according to the path search result Dv. The average is then taken for each dimension over the feature vectors [s_{i,M}] assigned to each syllable, yielding the average feature vector [s̄_{n,M}] for each syllable.
  • The speaker adaptation section 11 then finds the difference vector [d_{n,M}] between the standard vector [a_{n,M}] of each syllable of the designated text Tx and the corresponding average feature vector [s̄_{n,M}]. Averaging these difference vectors [d_{n,M}] yields the M-dimensional movement vector [m_M], which expresses the feature of the specified speaker. The adaptive vectors [x_{n,M}] for all syllables are then generated by adding the movement vector [m_M] to the standard vectors [a_{n,M}] of all syllables of the standard voice HMM 300, and the adaptive voice HMM 400 is updated with these adaptive vectors [x_{n,M}].
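A sketch of this adaptation rule (difference vectors, movement vector, and the update x_n = a_n + m), assuming the one-state-per-syllable model described above; all names are illustrative.

```python
import numpy as np

def speaker_adapt(standard, s_series, alignment):
    """Turn standard vectors [a_{n,M}] into adaptive vectors [x_{n,M}].

    standard:  {n: (M,) standard vector a_n} for every syllable in the model.
    s_series:  (num_frames, M) noise-retaining feature series [s_{i,M}].
    alignment: (num_frames,) state number n assigned to each frame by the
               path search result Dv.
    """
    # Average feature vector s̄_n for each syllable that appears in the alignment.
    s_bar = {n: s_series[alignment == n].mean(axis=0) for n in np.unique(alignment)}
    # Difference vectors d_n = s̄_n - a_n, averaged into one movement vector m.
    m = np.mean([s_bar[n] - standard[n] for n in s_bar], axis=0)
    # x_n = a_n + m for ALL syllables, whether or not they occur in the text.
    return {n: a_n + m for n, a_n in standard.items()}

# Usage: 30 aligned frames of [KONNICHIWA] against a 50-syllable model.
model = {n: np.random.randn(16) for n in range(50)}
s_series = np.random.randn(30, 16)
dv = np.repeat([10, 46, 22, 17, 44], 6)            # toy path search result
adaptive_hmm = speaker_adapt(model, s_series, dv)  # adaptive voice HMM vectors
```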
  • Suppose the input signal Sc of the Japanese word [KONNICHIWA] uttered by the speaker is divided into 30 frames by the framing section 6 and input.
  • The standard voice HMM 300 is, as shown in Fig. 2, prepared as the standard vectors [a_{n,M}] of the unspecified speaker corresponding to each of a plurality of syllables, and each syllable is identified by its state number n.
  • Before the speaker adaptation, the adaptive voice HMM 400 is set to the same content (default setting) as the standard vectors [a_{n,M}] of the standard voice HMM 300, as shown in Fig. 2.
  • The designated text Tx of Japanese [KONNICHIWA] is supplied to the standard voice HMM 300.
  • Thereby, for example, the standard vector [a_{46,1}, a_{46,2}, a_{46,3}, ..., a_{46,M}] corresponding to the state number n = 46 expressing the syllable [N] is read out, and likewise for the other syllables of the text.
  • The framing section 6 divides the input signal Sc into 30 frames according to the lapse of time and outputs the divided input signal Sc.
  • The feature vector generation section 12 generates the feature vectors [s_{1,1}, s_{1,2}, s_{1,3}, ..., s_{1,M}] to [s_{30,1}, s_{30,2}, s_{30,3}, ..., s_{30,M}] of the framed input signal Scf in frame order, and supplies them to the speaker adaptation section 11.
  • The parallel processing path includes the additive noise reduction section 7, the feature vector generation section 8, and the multiplicative noise reduction section 9.
  • The path search section 10 compares the feature vector series [c_{i,M}] of the 30 frames with the standard vectors [a_{n,M}] corresponding to each syllable of the designated text Tx, using the Viterbi algorithm or the forward-backward algorithm, and finds which syllable the feature vector series [c_{i,M}] corresponds to at each frame.
  • Each frame number i of the 30 frames is thereby associated with a state number n expressing a syllable of [KONNICHIWA], and the result of this association is supplied to the speaker adaptation section 11 as the path search result Dv.
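A compact dynamic-programming sketch of such a forced alignment. As simplifications not taken from the patent, it uses a squared-Euclidean frame score in place of the HMM appearance probability and a strict left-to-right topology in which each frame either stays in the current syllable or advances to the next one.

```python
import numpy as np

def viterbi_align(c_series, state_vectors, state_seq):
    """Assign each frame to one state of state_seq, monotonically left to right,
    minimizing the total distance between frames and standard vectors.

    c_series:      (T, M) noise-reduced feature series [c_{i,M}].
    state_vectors: {n: (M,) standard vector a_n}.
    state_seq:     state numbers of the designated text in order, e.g. [10, 46, 22, 17, 44].
    Returns a (T,) array holding the state number assigned to each frame.
    """
    T, S = len(c_series), len(state_seq)
    assert T >= S, "need at least one frame per state"
    cost = np.array([[np.sum((c - state_vectors[n]) ** 2) for n in state_seq]
                     for c in c_series])          # local frame/state distances
    D = np.full((T, S), np.inf)                   # cumulative cost
    D[0, 0] = cost[0, 0]
    back = np.zeros((T, S), dtype=int)            # 0 = stayed, 1 = advanced
    for t in range(1, T):
        for s in range(S):
            stay = D[t - 1, s]
            advance = D[t - 1, s - 1] if s > 0 else np.inf
            back[t, s] = 0 if stay <= advance else 1
            D[t, s] = cost[t, s] + min(stay, advance)
    path, s = [], S - 1                           # trace back from the last state
    for t in range(T - 1, -1, -1):
        path.append(state_seq[s])
        s -= back[t, s]
    return np.array(path[::-1])

# Usage: dv = viterbi_align(c_series, standard_hmm, [10, 46, 22, 17, 44])
```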
  • The speaker adaptation section 11 associates the feature vectors [s_{1,1}, s_{1,2}, s_{1,3}, ..., s_{1,M}] to [s_{30,1}, s_{30,2}, s_{30,3}, ..., s_{30,M}] with the standard vectors [a_{10,1}, a_{10,2}, a_{10,3}, ..., a_{10,M}], [a_{46,1}, a_{46,2}, a_{46,3}, ..., a_{46,M}], [a_{22,1}, a_{22,2}, a_{22,3}, ..., a_{22,M}], [a_{17,1}, a_{17,2}, a_{17,3}, ..., a_{17,M}], and [a_{44,1}, a_{44,2}, a_{44,3}, ..., a_{44,M}], according to the path search result Dv.
  • For example, the standard vector [a_{17,1}, a_{17,2}, a_{17,3}, ..., a_{17,M}] is associated with the feature vector [s_{15,1}, s_{15,2}, s_{15,3}, ..., s_{15,M}].
  • The speaker adaptation section 11 divides the feature vectors [s_{1,1}, s_{1,2}, s_{1,3}, ..., s_{1,M}] to [s_{30,1}, s_{30,2}, s_{30,3}, ..., s_{30,M}] of the 30 frames shown in Fig. 6 among the syllables [KO], [N], [NI], [CHI], [WA].
  • The average feature vector [s̄_{n,M}] for each of the syllables [KO], [N], [NI], [CHI], [WA] is generated by averaging the feature vectors assigned to that syllable.
  • For example, for the syllable [KO], the average s̄_{n,m} of the six elements s_{1,m} to s_{6,m} is obtained for each dimension m up to M, and the M-dimensional average feature vector [s̄_{n,1}, s̄_{n,2}, s̄_{n,3}, ..., s̄_{n,M}] corresponding to the syllable [KO] is composed of these M elements s̄_{n,1} to s̄_{n,M}.
  • Since the state number expressing [KO] is n = 10, the M-dimensional average feature vector corresponding to the syllable [KO] is [s̄_{10,1}, s̄_{10,2}, s̄_{10,3}, ..., s̄_{10,M}].
  • The average feature vector [s̄_{46,1}, ..., s̄_{46,M}] corresponding to the remaining syllable [N], the average feature vector [s̄_{22,1}, ..., s̄_{22,M}] corresponding to the syllable [NI], the average feature vector [s̄_{17,1}, ..., s̄_{17,M}] corresponding to the syllable [CHI], and the average feature vector [s̄_{44,1}, ..., s̄_{44,M}] corresponding to the syllable [WA] are obtained in the same manner.
  • The movement vector [m_M] = [m_1, m_2, ..., m_M], obtained by averaging the difference vectors [d_{n,M}], expresses the feature of the specified speaker.
  • The adaptive vector [x_{n,M}], having the features proper to the speaker, is obtained by adding the movement vector [m_M] to the standard vectors [a_{n,M}] of all syllables:
  • [x_{n,M}] = [a_{n,M}] + [m_M]
  • The speaker adaptation processing is completed by updating the adaptive voice HMM 400 with the adaptive vectors [x_{n,M}] thus obtained.
  • In this example, the adaptive voice HMM 400 undergoes speaker adaptation according to the designated text Tx of [KONNICHIWA]. When the speaker adaptation is conducted with designated texts Tx including other syllables, all syllables in the adaptive voice HMM 400 can likewise be speaker-adapted.
  • For recognition, the framing section 6 divides the input signal Sc into frames of a predetermined length (for example, 10 to 20 msec) in the same manner as above, outputs the framed input signal Scf of each frame according to the lapse of time, and supplies it to the additive noise reduction section 13.
  • The additive noise reduction section 13, in the same manner as the additive noise reduction section 7 described above, applies a Fourier transform to each framed input signal Scf to generate a spectrum for each frame, removes the additive noise included in each spectrum in the spectrum domain, and outputs the spectra to the feature vector generation section 14.
  • The feature vector generation section 14, in the same manner as the feature vector generation section 8 described above, applies the cepstrum operation to the additive-noise-free spectrum of each frame, generates the feature vector series [y_{i,M}]' in the cepstrum domain, and outputs it to the multiplicative noise reduction section 15.
  • The multiplicative noise reduction section 15, in the same manner as the multiplicative noise reduction section 9 described above, removes the multiplicative noise from the feature vector series [y_{i,M}]' by the CMN method, and supplies the M-dimensional feature vector series [y_{i,M}], free of multiplicative noise, to the recognition section 16.
  • The variable i of the feature vector series [y_{i,M}] expresses the frame number.
  • The recognition section 16 collates the feature vector series [y_{i,M}] with the adaptive vectors [x_{n,M}] of the speaker-adapted adaptive voice HMM 400, and outputs as the recognition result the adaptive voice HMM 400 that gives the highest likelihood.
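A sketch of this collation step, reusing viterbi_align from the alignment sketch above and scoring each candidate text by its total alignment distance, with the lowest distance standing in for the highest likelihood (a simplification, not the patent's formulation; the candidate texts and their state sequences are assumed given).

```python
import numpy as np

def recognize(y_series, adaptive_hmm, candidates):
    """Collate the feature series [y_{i,M}] with the adaptive vectors [x_{n,M}]
    and return the candidate text that matches best.

    candidates: {text: [state numbers]}, e.g. {"KONNICHIWA": [10, 46, 22, 17, 44]}.
    """
    def align_cost(states):
        path = viterbi_align(y_series, adaptive_hmm, states)  # from earlier sketch
        return sum(np.sum((y - adaptive_hmm[n]) ** 2)
                   for y, n in zip(y_series, path))
    return min(candidates, key=lambda text: align_cost(candidates[text]))
```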
  • As described above, when the speaker adaptation is conducted, the additive noise reduction section 7, the feature vector generation section 8, and the multiplicative noise reduction section 9 generate the feature vector series [c_{i,M}] from which the additive noise and the multiplicative noise have been removed.
  • The feature vector generation section 12 generates the feature vector series [s_{i,M}] from the framed input signal Scf with the additive noise and the multiplicative noise retained.
  • The path search section 10 and the speaker adaptation section 11 generate the adaptive vectors [x_{n,M}] from the feature vector series [c_{i,M}], the feature vector series [s_{i,M}], and the standard vectors [a_{n,M}].
  • The adaptive voice HMM 400 is updated with the speaker-adapted adaptive vectors [x_{n,M}].
  • The feature vector series [s_{i,M}], which retains the features of the noise (additive noise) of the environment surrounding the specified speaker and of the transmission noise (multiplicative noise) of the present voice recognition system itself, is used for the speaker adaptation. Therefore, an adaptive voice HMM 400 corresponding to the actual utterance environment can be generated, yielding a voice recognition system that is robust to noise and has a high voice recognition rate.
  • If a feature vector from which the additive noise and the multiplicative noise had been removed were used instead, part of the feature information of the utterance proper to the speaker, which the speaker adaptation is meant to compensate for, would be lost, and an adequate speaker adaptive acoustic model could not be prepared.
  • In the present system, the feature vector generation section 12 generates the feature vector series [s_{i,M}] without removing the additive noise or the multiplicative noise.
  • Because the feature vector series [s_{i,M}] is used for the speaker adaptation, the feature information of the utterance proper to the speaker is not lost, and an adequate speaker adaptive acoustic model can be prepared, increasing the voice recognition rate.
  • In the above embodiment, the adaptive voice HMM 400 is prepared on the basis of syllables such as the Japanese [AIUEO].
  • An adaptive voice HMM 400 based on phonemes can also be prepared.
  • Moreover, the speaker adaptation method of the present invention can be applied to various other speaker adaptation methods in which the standard vectors [a_{n,M}] are associated with the feature vector series [s_{i,M}] or [c_{i,M}] for the speaker adaptation, and a speaker adaptive acoustic model can be generated accordingly.
  • As described above, in the voice recognition system of the present invention, when the speaker adaptation is conducted, a feature vector from which the additive noise or the multiplicative noise has been removed and a feature vector retaining the features of that noise are both generated, and the standard vector is compensated according to the feature vector not including the noise and the feature vector including it. Because the speaker adaptive acoustic model is thereby adapted to the utterance proper to the speaker, a speaker adaptive acoustic model adapted to the actual utterance environment can be generated.
  • Further, since the feature vector used for the speaker adaptation has the additive noise and the multiplicative noise retained, the feature information of the utterance proper to the speaker, which the speaker adaptation compensates for, is not lost. Therefore, an adequate speaker adaptive acoustic model can be generated.


Claims (8)

  1. A speech recognition system, comprising:
    a standard acoustic model (500) for recognizing each syllable or phoneme unit for speech recognition, having standard vectors prepared in advance from an unspecified speaker;
    a first feature vector generation section (8) which reduces noise in an input signal generated from an uttered voice and corresponding to a designated text, to generate a first feature vector representing a feature quantity of the uttered voice;
    a second feature vector generation section (12) which generates a second feature vector from the input signal containing the noise; and
    an adaptation-vector generating processing section (11) which compares the first feature vector with a standard vector to determine a path search result, matches the second feature vector to the standard vector according to the path search result, and generates a speaker-adapted acoustic model (400) characterizing the uttered voice.
  2. The speech recognition system according to claim 1, wherein the noise comprises additive noise and multiplicative noise.
  3. The speech recognition system according to claim 2, wherein the first feature vector generation section (8) comprises an additive noise reduction section (7) which reduces the additive noise in the input signal to generate an additive-noise-reduced signal.
  4. The speech recognition system according to claim 3, wherein the additive noise reduction section (7) applies a transformation to the input signal to generate a first spectrum and subtracts an additive noise spectrum corresponding to the additive noise from the first spectrum.
  5. The speech recognition system according to claim 3, wherein the first feature vector generation section (8) comprises a cepstrum calculator for applying a cepstrum calculation to the additive-noise-reduced signal.
  6. The speech recognition system according to claim 5, wherein the first feature vector generation section (8) comprises a multiplicative noise reduction section (9) which reduces the multiplicative noise by subtracting the multiplicative noise from the first feature vector.
  7. The speech recognition system according to claim 6, wherein the first feature vector comprises a plurality of time-series first feature vectors, and the multiplicative noise reduction section (9) calculates a time average of the time-series first feature vectors to estimate the multiplicative noise.
  8. The speech recognition system according to claim 1, wherein the second feature vector generation section (12) applies at least a cepstrum calculation to the input signal to calculate the second feature vector.
EP01308268A 2000-09-29 2001-09-27 Geräuschrobuste Spracherkennung Expired - Lifetime EP1195744B1 (de)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2000298536 2000-09-29
JP2000298536A JP4169921B2 (ja) 2000-09-29 2000-09-29 Speech recognition system

Publications (3)

Publication Number Publication Date
EP1195744A2 (de) 2002-04-10
EP1195744A3 (de) 2003-01-22
EP1195744B1 (de) 2005-11-16

Family

ID=18780481

Family Applications (1)

Application Number Title Priority Date Filing Date
EP01308268A Expired - Lifetime EP1195744B1 (de) 2000-09-29 2001-09-27 Geräuschrobuste Spracherkennung

Country Status (5)

Country Link
US (1) US7065488B2 (de)
EP (1) EP1195744B1 (de)
JP (1) JP4169921B2 (de)
CN (1) CN1236421C (de)
DE (1) DE60114968T2 (de)

Families Citing this family (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002123285A (ja) * 2000-10-13 2002-04-26 Sony Corp Speaker adaptation apparatus and speaker adaptation method, recording medium, and speech recognition apparatus
GB2384901B (en) * 2002-02-04 2004-04-21 Zentian Ltd Speech recognition circuit using parallel processors
DE102004008225B4 (de) * 2004-02-19 2006-02-16 Infineon Technologies Ag Method and device for determining feature vectors from a signal for pattern recognition, method and device for pattern recognition, and computer-readable storage media
US7895039B2 (en) 2005-02-04 2011-02-22 Vocollect, Inc. Methods and systems for optimizing model adaptation for a speech recognition system
US7865362B2 (en) 2005-02-04 2011-01-04 Vocollect, Inc. Method and system for considering information about an expected response when performing speech recognition
US7827032B2 (en) 2005-02-04 2010-11-02 Vocollect, Inc. Methods and systems for adapting a model for a speech recognition system
US8200495B2 (en) * 2005-02-04 2012-06-12 Vocollect, Inc. Methods and systems for considering information about an expected response when performing speech recognition
US7949533B2 (en) 2005-02-04 2011-05-24 Vococollect, Inc. Methods and systems for assessing and improving the performance of a speech recognition system
JP4332129B2 (ja) * 2005-04-20 2009-09-16 富士通株式会社 Document classification program, document classification method, and document classification apparatus
US20070112567A1 (en) * 2005-11-07 2007-05-17 Scanscout, Inc. Techiques for model optimization for statistical pattern recognition
WO2007105409A1 (ja) * 2006-02-27 2007-09-20 Nec Corporation Standard pattern adaptation apparatus, standard pattern adaptation method, and standard pattern adaptation program
JP4245617B2 (ja) * 2006-04-06 2009-03-25 株式会社東芝 Feature quantity correction apparatus, feature quantity correction method, and feature quantity correction program
JP4316583B2 (ja) * 2006-04-07 2009-08-19 株式会社東芝 Feature quantity correction apparatus, feature quantity correction method, and feature quantity correction program
US20080109391A1 (en) * 2006-11-07 2008-05-08 Scanscout, Inc. Classifying content based on mood
US8577996B2 (en) * 2007-09-18 2013-11-05 Tremor Video, Inc. Method and apparatus for tracing users of online video web sites
US8549550B2 (en) 2008-09-17 2013-10-01 Tubemogul, Inc. Method and apparatus for passively monitoring online video viewing and viewer behavior
US9612995B2 (en) 2008-09-17 2017-04-04 Adobe Systems Incorporated Video viewer targeting based on preference similarity
US20110093783A1 (en) * 2009-10-16 2011-04-21 Charles Parra Method and system for linking media components
EP2502195A2 (de) * 2009-11-20 2012-09-26 Tadashi Yonezaki Verfahren und vorrichtung zur optimierung einer zuweisung von werbeinhalten
KR101060183B1 (ko) * 2009-12-11 2011-08-30 한국과학기술연구원 Embedded auditory system and voice signal processing method
US20110150270A1 (en) * 2009-12-22 2011-06-23 Carpenter Michael D Postal processing including voice training
JP5494468B2 (ja) * 2010-12-27 2014-05-14 富士通株式会社 State detection device, state detection method, and program for state detection
US8914290B2 (en) 2011-05-20 2014-12-16 Vocollect, Inc. Systems and methods for dynamically improving user intelligibility of synthesized speech in a work environment
CN102760436B (zh) * 2012-08-09 2014-06-11 河南省烟草公司开封市公司 Voice lexicon screening method
US9978395B2 (en) 2013-03-15 2018-05-22 Vocollect, Inc. Method and system for mitigating delay in receiving audio stream during production of sound from audio stream
US10714121B2 (en) 2016-07-27 2020-07-14 Vocollect, Inc. Distinguishing user speech from background speech in speech-dense environments
KR102637339 2018-08-31 2024-02-16 삼성전자주식회사 Method and apparatus for personalizing a speech recognition model
CN110197670B (zh) * 2019-06-04 2022-06-07 大众问问(北京)信息科技有限公司 Audio noise reduction method and apparatus, and electronic device
CN112863483B (zh) * 2021-01-05 2022-11-08 杭州一知智能科技有限公司 Speech synthesis device supporting multi-speaker styles, language switching, and controllable prosody

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5721808A (en) * 1995-03-06 1998-02-24 Nippon Telegraph And Telephone Corporation Method for the composition of noise-resistant hidden markov models for speech recognition and speech recognizer using the same
JP2780676B2 (ja) * 1995-06-23 1998-07-30 日本電気株式会社 Speech recognition apparatus and speech recognition method
JP3001037B2 (ja) * 1995-12-13 2000-01-17 日本電気株式会社 Speech recognition apparatus
US6026359A (en) * 1996-09-20 2000-02-15 Nippon Telegraph And Telephone Corporation Scheme for model adaptation in pattern recognition based on Taylor expansion
US5924065A (en) * 1997-06-16 1999-07-13 Digital Equipment Corporation Environmently compensated speech processing
EP0997003A2 (de) * 1997-07-01 2000-05-03 Partran APS Method and circuit for noise reduction in speech signals
US6658385B1 (en) * 1999-03-12 2003-12-02 Texas Instruments Incorporated Method for transforming HMMs for speaker-independent recognition in a noisy environment
US6529866B1 (en) * 1999-11-24 2003-03-04 The United States Of America As Represented By The Secretary Of The Navy Speech recognition system and associated methods
US7089182B2 (en) * 2000-04-18 2006-08-08 Matsushita Electric Industrial Co., Ltd. Method and apparatus for feature domain joint channel and additive noise compensation
US7003455B1 (en) * 2000-10-16 2006-02-21 Microsoft Corporation Method of noise reduction using correction and scaling vectors with partitioning of the acoustic space in the domain of noisy speech

Also Published As

Publication number Publication date
JP2002108383A (ja) 2002-04-10
DE60114968D1 (de) 2005-12-22
EP1195744A3 (de) 2003-01-22
DE60114968T2 (de) 2006-07-27
CN1346125A (zh) 2002-04-24
JP4169921B2 (ja) 2008-10-22
EP1195744A2 (de) 2002-04-10
US20020042712A1 (en) 2002-04-11
CN1236421C (zh) 2006-01-11
US7065488B2 (en) 2006-06-20


Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A2

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE TR

AX Request for extension of the european patent

Free format text: AL;LT;LV;MK;RO;SI

PUAL Search report despatched

Free format text: ORIGINAL CODE: 0009013

AK Designated contracting states

Kind code of ref document: A3

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE TR

AX Request for extension of the european patent

Free format text: AL;LT;LV;MK;RO;SI

17P Request for examination filed

Effective date: 20030708

AKX Designation fees paid

Designated state(s): DE FR GB

17Q First examination report despatched

Effective date: 20031114

GRAP Despatch of communication of intention to grant a patent

Free format text: ORIGINAL CODE: EPIDOSNIGR1

GRAS Grant fee paid

Free format text: ORIGINAL CODE: EPIDOSNIGR3

GRAA (expected) grant

Free format text: ORIGINAL CODE: 0009210

AK Designated contracting states

Kind code of ref document: B1

Designated state(s): DE FR GB

REG Reference to a national code

Ref country code: GB

Ref legal event code: FG4D

REF Corresponds to:

Ref document number: 60114968

Country of ref document: DE

Date of ref document: 20051222

Kind code of ref document: P

ET Fr: translation filed
PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: FR

Payment date: 20060908

Year of fee payment: 6

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: DE

Payment date: 20060922

Year of fee payment: 6

PLBE No opposition filed within time limit

Free format text: ORIGINAL CODE: 0009261

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: GB

Payment date: 20060927

Year of fee payment: 6

26N No opposition filed

Effective date: 20060817

GBPC Gb: european patent ceased through non-payment of renewal fee

Effective date: 20070927

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: DE

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20080401

REG Reference to a national code

Ref country code: FR

Ref legal event code: ST

Effective date: 20080531

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: FR

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20071001

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: GB

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20070927