WO2010109725A1 - Voice processing apparatus, voice processing method, and voice processing program - Google Patents

Voice processing apparatus, voice processing method, and voice processing program Download PDF

Info

Publication number
WO2010109725A1
Authority
WO
WIPO (PCT)
Prior art keywords
distribution
feature
speech
acoustic model
noise
Prior art date
Application number
PCT/JP2009/069580
Other languages
French (fr)
Japanese (ja)
Inventor
Yusuke Shinohara
Masami Akamine
Original Assignee
Toshiba Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toshiba Corporation
Publication of WO2010109725A1 publication Critical patent/WO2010109725A1/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/20 - Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech

Definitions

  • the present invention relates to a voice processing device, a voice processing method, and a voice processing program.
  • the feature enhancement method is a technique for estimating, from noisy speech features extracted from speech on which noise is superimposed in a noisy environment (hereinafter referred to as noisy speech), the features of speech in a noise-free environment such as a soundproofed room (hereinafter referred to as clean speech). By using the clean speech features estimated by the feature enhancement method, speech recognition performance under noise can be improved.
  • Non-Patent Document 1 discloses a conventional speech recognition apparatus.
  • a conventional speech recognition apparatus includes a feature extraction unit, a first acoustic model storage unit, a probability calculation unit, a distribution storage unit, a mixed distribution generation unit, a feature enhancement unit, a second acoustic model storage unit, and a decoding unit.
  • the feature extraction unit extracts a noisy speech feature from each frame of the input noisy speech.
  • the first acoustic model storage unit stores a first acoustic model representing a standard phonemic pattern in a noisy environment.
  • the probability calculation unit collates the noisy speech feature sequence with the first acoustic model, and calculates a probability of staying in each distribution of the first acoustic model in each frame (distribution posterior probability).
  • the distribution storage unit stores a set of basis distributions. Each of the basis distributions is a combined Gaussian distribution of clean speech features and noisy speech features.
  • the mixed distribution generation unit generates a mixed distribution by mixing the base distribution with the distribution posterior probability for each frame. This mixed distribution represents a combined distribution of clean speech characteristics and noisy speech features in the frame.
  • the feature enhancement unit estimates the clean speech feature from the noisy speech feature using the mixture distribution in each frame.
  • the second acoustic model storage unit stores a second acoustic model representing a standard phonemic pattern in a clean environment.
  • the decoding unit collates the sequence of clean speech features estimated by the feature enhancement unit with the second acoustic model, and outputs an optimal word string.
  • since the speech recognition apparatus of Non-Patent Document 1 performs feature enhancement using joint Gaussian distributions learned in advance, its speech recognition performance deteriorates under noise that differs from the noise present during learning.
  • to address this, a joint Gaussian distribution of clean speech features and noisy speech features could be synthesized dynamically from a Gaussian distribution of clean speech features each time the noise changes.
  • however, since an ordinary acoustic model has several thousand to several tens of thousands of Gaussian distributions, an enormous amount of computation is required to synthesize the joint Gaussian distributions dynamically, which is not practical.
  • the present invention has been made in view of the above, and its purpose is to provide a voice processing device, a voice processing method, and a voice processing program capable of achieving high speech recognition performance with a small amount of computation even in an environment where the noise changes.
  • the present invention comprises: a feature extraction unit that extracts a first speech feature from each frame of first speech on which noise is superimposed in a noisy environment and calculates a sequence of the first speech features; a noise estimation unit that estimates the noise superimposed on the first speech; a first distribution storage unit that stores a set of first basis distributions representing a distribution of second speech features of second speech in a noise-free environment; a distribution synthesis unit that synthesizes, based on the noise, from each of the first basis distributions a second basis distribution representing a joint distribution of the first speech feature and the second speech feature; a first acoustic model storage unit that stores a first acoustic model representing a standard pattern of phonemes in a noisy environment; a probability calculation unit that collates the sequence of the first speech features with the first acoustic model and calculates, for each frame, a state posterior probability that is a probability of staying in each state of the first acoustic model; a mixture weight storage unit that stores, for each state of the first acoustic model, a mixture weight corresponding to each of the second basis distributions; a mixture weight fusion unit that fuses the mixture weights using the state posterior probabilities to calculate fused mixture weights; a mixture distribution generation unit that mixes, for each frame, the second basis distributions with the fused mixture weights to generate a mixture distribution that is a joint distribution of the first speech feature and the second speech feature in the frame; and a feature enhancement unit that estimates, for each frame, the second speech feature from the first speech feature using the mixture distribution.
  • FIG. 1 is a block diagram showing the configuration of the speech processing apparatus according to the present embodiment.
  • the speech recognition apparatus includes a feature extraction unit 1, a noise estimation unit 2, a first distribution storage unit 3, a first distribution storage control unit 4, a second distribution storage unit 5, a distribution synthesis unit 6, a first acoustic model storage unit 7, a first acoustic model storage control unit 8, a probability calculation unit 9, a mixture weight storage unit 10, a mixture weight storage control unit 11, a mixture weight fusion unit 12, a mixture distribution generation unit 13, a feature enhancement unit 14, a second acoustic model storage unit 15, a second acoustic model storage control unit 16, and a decoding unit 17.
  • the feature extraction unit 1 extracts features from each frame of the input noisy speech, and calculates a sequence of noisy speech features.
  • a frame is a short segment cut out of the input speech signal; frames are cut out sequentially while shifting the extraction window little by little.
  • a vector having a mel frequency cepstrum coefficient (MFCC) as an element can be used as a feature.
  • the feature dimension is d.
  • a sequence of noisy speech features is calculated by extracting features from each of the sequentially extracted frames.
  • the noise estimation unit 2 estimates noise superimposed on the input noisy speech. For example, it is possible to select a section that does not include speech and includes only noise using a voice section detector, and perform noise estimation using this section. More specifically, in the section consisting only of noise, the above features are extracted from each frame, and the average / covariance is obtained from the obtained set of features. This mean / covariance defines a Gaussian distribution of noise features.
  • the first distribution storage unit 3 stores a set of first basis distributions.
  • a d-dimensional Gaussian distribution is used as the first basis distribution.
  • Each basis distribution represents a distribution of clean speech features. A method of calculating the first basis distribution set will be described in detail later.
  • the first distribution storage control unit 4 performs control such that the first distribution storage unit 3 stores the first set of basis distributions.
  • the second distribution storage unit 5 stores a set of second basis distributions.
  • a 2 ⁇ d-dimensional Gaussian distribution is used as the second basis distribution.
  • Each basis distribution represents a combined Gaussian distribution of clean speech features and noisy speech features.
  • the distribution synthesis unit 6 synthesizes, based on the noise estimated by the noise estimation unit 2, a second basis distribution from each of the first basis distributions stored in the first distribution storage unit 3, and stores them in the second distribution storage unit 5. That is, a joint Gaussian distribution of clean speech features and noisy speech features is synthesized from a Gaussian distribution of noise features and a Gaussian distribution of clean speech features.
  • for the synthesis of the distributions, for example, the Vector Taylor Series (VTS) method or the unscented transform can be used.
  • the first acoustic model storage unit 7 stores a first acoustic model representing a standard phonemic pattern in a noisy environment. More specifically, the acoustic model is a hidden Markov model, and the output distribution in each state and the transition probability between states are stored. The first acoustic model is created in advance from learning data consisting of a set of noisy speech features.
  • the first acoustic model storage control unit 8 performs control such that the first acoustic model storage unit 7 stores the first acoustic model.
  • the probability calculation unit 9 collates the sequence of noisy speech features calculated by the feature extraction unit 1 with the first acoustic model stored in the first acoustic model storage unit 7, and calculates, for each frame, the probability of staying in each state of the first acoustic model (state posterior probability).
  • state posterior probability can be calculated by using a forward backward algorithm.
  • the state posterior probability can be calculated from the N best candidate list. The method for calculating the state posterior probability using the N best candidate list is described in detail in Non-Patent Document 1, for example.
  • the mixing weight storage unit 10 stores the mixing weight corresponding to each of the second basis distributions for each state of the first acoustic model.
  • when the number of states is L and the number of basis distributions is K, L × K mixture weight values are stored.
  • for each state of the first acoustic model, the mixture distribution generated by mixing the K basis distributions stored in the first distribution storage unit 3 with the mixture weights corresponding to that state represents the distribution of clean speech features in that state.
  • similarly, for each state, the mixture distribution generated by mixing the K basis distributions stored in the second distribution storage unit 5 with the mixture weights corresponding to that state represents the joint distribution of clean and noisy speech features in that state. The method for calculating the mixture weights will be described in detail later.
  • the mixing weight storage control unit 11 controls the mixing weight storage unit 10 to store the mixing weight.
  • the mixture weight fusion unit 12 fuses the mixture weights stored in the mixture weight storage unit 10 using the state posterior probabilities calculated by the probability calculation unit 9, and calculates fused mixture weights. Specifically, the fusion of the mixture weights is performed according to Equation (1).
  • here, γ(t, j) is the state posterior probability of staying in state j at frame t, w(j, k) is the mixture weight of the k-th basis distribution in state j, Σ_j denotes the sum over j, and v(t, k) is the fused mixture weight of the k-th basis distribution at frame t.
  • the mixture distribution generation unit 13 mixes, for each frame, the second basis distributions acquired from the second distribution storage unit 5 with the fused mixture weights calculated by the mixture weight fusion unit 12 to generate a mixture distribution.
  • the mixture distribution is a Gaussian mixture distribution.
  • the generated mixture distribution represents a combined distribution of clean speech characteristics and noisy speech features in the frame.
  • the feature enhancement unit 14 estimates, for each frame, the clean speech feature from the noisy speech feature using the mixture distribution. Details of feature enhancement methods that use a mixture distribution representing the joint distribution of clean and noisy speech features are disclosed in, for example, Non-Patent Document 1.
  • the second acoustic model storage unit 15 stores a second acoustic model representing a standard phonemic pattern in a clean environment. More specifically, the acoustic model is a hidden Markov model, and the output distribution in each state and the transition probability between states are stored.
  • the second acoustic model is created in advance using learning data composed of a set of clean speech features. Preferably, it is created in advance using learning data consisting of a set of clean speech features processed by the feature enhancement unit 14. That is, an acoustic model is created using a set of clean speech features obtained by processing the set of noisy speech features used for learning the first acoustic model by the feature enhancement unit 14 as learning data.
  • the second acoustic model storage control unit 16 controls the second acoustic model storage unit 15 to store the second acoustic model.
  • the decoding unit 17 collates the sequence of clean speech features estimated by the feature enhancement unit 14 with the second acoustic model stored in the second acoustic model storage unit 15 and outputs an optimum word string.
  • a Viterbi algorithm is used for collation.
  • a first set of basis distributions and a mixture weight are calculated using an EM algorithm so as to maximize the likelihood of learning data including a given set of clean speech features.
  • in the following, a method of creating learning data consisting of a set of clean speech features is described first, and then the likelihood maximization method using the EM algorithm is described.
  • each clean speech feature is associated with one of the states of the first acoustic model.
  • specifically, given a sequence of clean speech features extracted from a single utterance, its transcription (the spoken word sequence), and the first acoustic model, each clean speech feature in the sequence can be associated with a state of the acoustic model using the Viterbi algorithm. Alternatively, a soft (fuzzy) association may be performed using the forward-backward algorithm.
  • learning data that is a set of clean speech features associated with any state of the acoustic model can be created.
  • let D denote the set of learning data and Dj the subset of learning data associated with the j-th state.
  • let x_i denote the i-th learning sample (a clean speech feature).
  • let θ_k denote the mean and covariance parameters of the k-th basis Gaussian, collected as θ = {θ_1, ..., θ_K}.
  • let w_jk denote the mixture weight of the k-th basis distribution in state j, collected over k as w_j = {w_j1, ..., w_jK} and over all j as w = {w_1, ..., w_L}.
  • here, K is the number of basis distributions and L is the number of states.
  • the (log) likelihood L(θ, w) of the learning data is defined as in Equation (2).
  • ⁇ and w are calculated using an EM algorithm so as to maximize this likelihood.
  • in the E step, the posterior probability that each learning sample belongs to each basis distribution is calculated based on the current values of θ and w.
  • in the M step, θ and w are updated so as to maximize the expected complete-data log likelihood under these posterior probabilities.
  • initial values of θ and w are required; for example, a Gaussian mixture model with K components can be trained on the entire learning data D, and the resulting set of Gaussians and mixture weights (denoted u) can be used, setting w_1 = ... = w_L = u. After initialization, the E and M steps are iterated until the increase in likelihood converges.
  • the maximum likelihood training method using the EM algorithm is described in detail in, for example, L. Rabiner and B.-H. Juang (translated by Sadaoki Furui), Fundamentals of Speech Recognition, NTT Advanced Technology, 1995.
  • the first basis distribution set and the mixture weight calculated as described above are stored in the first distribution storage unit 3 and the mixture weight storage unit 10, respectively.
  • FIG. 2 is a flowchart showing the operation of the speech processing apparatus according to this embodiment.
  • the feature extraction unit 1 extracts features from each frame of the input noisy speech and calculates a sequence of noisy speech features (step S1).
  • the noise estimation unit 2 performs noise estimation from the noisy speech feature sequence calculated by the feature extraction unit 1 (step S2).
  • the distribution synthesis unit 6 synthesizes the second basis distribution from each of the first basis distributions using the noise estimated by the noise estimation unit 2 and stores it in the second distribution storage unit 5 (Ste S3).
  • steps S4 and S5 are executed in parallel with steps S2 and S3. That is, the probability calculation unit 9 collates the sequence of noisy speech features calculated by the feature extraction unit 1 with the first acoustic model, and calculates, for each frame, the probability of staying in each state of the first acoustic model (state posterior probability) (step S4).
  • the mixture weight fusion unit 12 fuses, for each frame, the mixture weights acquired from the mixture weight storage unit 10 with the state posterior probabilities calculated by the probability calculation unit 9, and calculates the fused mixture weights (step S5).
  • in step S6, the mixture distribution generation unit 13 mixes the second basis distributions stored in the second distribution storage unit 5 with the fused mixture weights to generate a mixture distribution.
  • the feature emphasizing unit 14 calculates a clean speech feature from the noisy speech feature using the mixture distribution generated by the mixture distribution generation unit 13 for each frame (step S7).
  • the decoding unit 17 collates the sequence of clean speech features calculated by the feature enhancement unit 14 with the second acoustic model stored in the second acoustic model storage unit 15, outputs the optimum word string, and ends the speech recognition (speech processing) (step S8).
  • in this way, the correct speech content is recognized from the noisy speech.
  • the speech processing apparatus according to this embodiment uses only a small number of basis distributions instead of the large number of distributions used in the prior art, so the amount of computation required to synthesize the joint distributions of clean and noisy speech features is greatly reduced, and high speech recognition performance can be maintained with a small amount of computation even in an environment where the noise changes.
  • the voice processing apparatus of this embodiment includes a control device such as a CPU, a storage device, an external storage device such as a hard disk drive, a display device, and input devices such as a keyboard and a mouse, and has a hardware configuration using an ordinary computer.
  • the voice processing program executed by the voice processing apparatus is recorded on a computer-readable recording medium such as a CD-ROM, flexible disk (FD), CD-R, or DVD (Digital Versatile Disk) as a file in an installable or executable format, and is provided as a computer program product.
  • the voice processing program executed by the voice processing apparatus of the present embodiment may be provided by being stored on a computer connected to a network such as the Internet and downloaded via the network.
  • the voice processing program executed by the voice processing apparatus according to the present embodiment may be provided or distributed via a network such as the Internet.
  • the voice processing program of the present embodiment may be provided by being incorporated in advance in a ROM or the like.
  • the voice processing program executed by the voice processing apparatus has a module configuration including the above-described units (feature extraction unit, noise estimation unit, first distribution storage control unit, distribution synthesis unit, first acoustic model storage control unit, probability calculation unit, mixture weight storage control unit, mixture weight fusion unit, mixture distribution generation unit, feature enhancement unit, second acoustic model storage control unit, and decoding unit).
  • as actual hardware, a CPU (processor) reads the voice processing program from the storage medium and executes it, whereby the above units are loaded onto the main storage device and the feature extraction unit, noise estimation unit, first distribution storage control unit, distribution synthesis unit, first acoustic model storage control unit, probability calculation unit, mixture weight storage control unit, mixture weight fusion unit, mixture distribution generation unit, feature enhancement unit, second acoustic model storage control unit, and decoding unit are generated on the main storage device.
  • the present invention is not limited to the above-described embodiment as it is, and can be embodied by modifying the constituent elements without departing from the scope of the invention in the implementation stage.
  • various inventions can be formed by appropriately combining a plurality of constituent elements disclosed in the above embodiments. For example, some components may be deleted from all the components shown in the embodiment. Furthermore, constituent elements over different embodiments may be appropriately combined.
  • the speech processing apparatus, speech processing method, and speech processing program according to the present invention are useful when speech recognition is performed under noise.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A voice processing apparatus comprises: a feature extraction unit (1) that extracts first speech features; a noise estimation unit (2) that estimates noise; a first distribution storage unit (3) that stores a set of first basis distributions; a distribution synthesis unit (6) that synthesizes a second basis distribution from each first basis distribution based on the estimated noise; a first acoustic model storage unit (7) that stores a first acoustic model; a probability calculation unit (9) that calculates state posterior probabilities by collating the sequence of first speech features with the first acoustic model; a mixture weight storage unit (10) that stores mixture weights corresponding to the respective second basis distributions; a mixture weight fusion unit (12) that fuses the mixture weights using the state posterior probabilities to calculate fused mixture weights; a mixture distribution generation unit (13) that mixes the second basis distributions with the fused mixture weights to generate a mixture distribution; and a feature enhancement unit (14) that estimates second speech features from the first speech features using the mixture distribution.

Description

Voice processing apparatus, voice processing method, and voice processing program
The present invention relates to a voice processing apparatus, a voice processing method, and a voice processing program.
Many methods have conventionally been proposed for operating a speech recognition apparatus (voice processing apparatus) stably under noise. In particular, feature enhancement methods have been studied actively. A feature enhancement method is a technique for estimating, from noisy speech features extracted from speech on which noise is superimposed in a noisy environment (hereinafter referred to as noisy speech), the features of speech in a noise-free environment such as a soundproofed room (hereinafter referred to as clean speech). By using the clean speech features estimated by the feature enhancement method, speech recognition performance under noise can be improved.
For example, Non-Patent Document 1 discloses a conventional speech recognition apparatus. The conventional speech recognition apparatus includes a feature extraction unit, a first acoustic model storage unit, a probability calculation unit, a distribution storage unit, a mixture distribution generation unit, a feature enhancement unit, a second acoustic model storage unit, and a decoding unit. The feature extraction unit extracts a noisy speech feature from each frame of the input noisy speech. The first acoustic model storage unit stores a first acoustic model representing standard phoneme patterns in a noisy environment. The probability calculation unit collates the sequence of noisy speech features with the first acoustic model and calculates, for each frame, the probability of staying in each distribution of the first acoustic model (distribution posterior probability). The distribution storage unit stores a set of basis distributions, each of which is a joint Gaussian distribution of clean speech features and noisy speech features. The mixture distribution generation unit generates, for each frame, a mixture distribution by mixing the basis distributions with the distribution posterior probabilities; this mixture distribution represents the joint distribution of clean and noisy speech features in that frame. The feature enhancement unit estimates, in each frame, the clean speech feature from the noisy speech feature using the mixture distribution. The second acoustic model storage unit stores a second acoustic model representing standard phoneme patterns in a clean environment. The decoding unit collates the sequence of clean speech features estimated by the feature enhancement unit with the second acoustic model and outputs an optimum word string.
However, since the speech recognition apparatus of Non-Patent Document 1 performs feature enhancement using joint Gaussian distributions learned in advance, its speech recognition performance deteriorates under noise that differs from the noise present during learning. To solve this problem, the joint Gaussian distributions of clean and noisy speech features could be synthesized dynamically from the Gaussian distributions of clean speech features each time the noise changes. However, since an ordinary acoustic model has several thousand to several tens of thousands of Gaussian distributions, dynamically synthesizing the joint Gaussian distributions requires an enormous amount of computation and is not practical.
The present invention has been made in view of the above, and an object thereof is to provide a voice processing apparatus, a voice processing method, and a voice processing program capable of achieving high speech recognition performance with a small amount of computation even in an environment where the noise changes.
To solve the above problems and achieve the object, the present invention comprises: a feature extraction unit that extracts a first speech feature from each frame of first speech on which noise is superimposed in a noisy environment and calculates a sequence of the first speech features; a noise estimation unit that estimates the noise superimposed on the first speech; a first distribution storage unit that stores a set of first basis distributions representing the distribution of second speech features of second speech in a noise-free environment; a distribution synthesis unit that synthesizes, based on the noise, from each of the first basis distributions a second basis distribution representing the joint distribution of the first speech feature and the second speech feature; a first acoustic model storage unit that stores a first acoustic model representing standard phoneme patterns in a noisy environment; a probability calculation unit that collates the sequence of first speech features with the first acoustic model and calculates, for each frame, a state posterior probability, which is the probability of staying in each state of the first acoustic model; a mixture weight storage unit that stores, for each state of the first acoustic model, a mixture weight corresponding to each of the second basis distributions; a mixture weight fusion unit that fuses the mixture weights using the state posterior probabilities to calculate fused mixture weights; a mixture distribution generation unit that mixes, for each frame, the second basis distributions with the fused mixture weights to generate a mixture distribution, which is the joint distribution of the first speech feature and the second speech feature in that frame; and a feature enhancement unit that estimates, for each frame, the second speech feature from the first speech feature using the mixture distribution.
According to the present invention, high speech recognition performance can be maintained with a small amount of computation even in an environment where the noise changes.
FIG. 1 is a block diagram showing the configuration of the speech processing apparatus according to the present embodiment. FIG. 2 is a flowchart showing the operation of the speech processing apparatus according to the present embodiment.
Exemplary embodiments of a voice processing apparatus, a voice processing method, and a voice processing program according to the present invention are described in detail below with reference to the accompanying drawings.
FIG. 1 is a block diagram showing the configuration of the speech processing apparatus according to the present embodiment. The speech recognition apparatus includes a feature extraction unit 1, a noise estimation unit 2, a first distribution storage unit 3, a first distribution storage control unit 4, a second distribution storage unit 5, a distribution synthesis unit 6, a first acoustic model storage unit 7, a first acoustic model storage control unit 8, a probability calculation unit 9, a mixture weight storage unit 10, a mixture weight storage control unit 11, a mixture weight fusion unit 12, a mixture distribution generation unit 13, a feature enhancement unit 14, a second acoustic model storage unit 15, a second acoustic model storage control unit 16, and a decoding unit 17.
The feature extraction unit 1 extracts a feature from each frame of the input noisy speech and calculates a sequence of noisy speech features. A frame is a short segment cut out of the input speech signal; frames are cut out sequentially while shifting the extraction window little by little. As the feature, for example, a vector whose elements are mel-frequency cepstral coefficients (MFCCs) can be used. In the following, the feature dimension is denoted d. A sequence of noisy speech features is obtained by extracting a feature from each of the sequentially extracted frames.
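As an informal illustration only (not part of the patent text), the following Python sketch shows one way the framewise MFCC extraction described above could be implemented; the use of librosa, the 16 kHz sampling rate, and the 25 ms / 10 ms framing are all assumptions.

```python
# Hypothetical sketch of the feature extraction step (unit 1): framewise MFCC vectors.
# librosa is an assumed third-party dependency, not mentioned in the patent.
import librosa

def extract_noisy_features(wav_path, d=13, frame_len=0.025, frame_shift=0.010):
    signal, sr = librosa.load(wav_path, sr=16000)
    # MFCCs are computed on overlapping frames, shifted little by little as described above.
    mfcc = librosa.feature.mfcc(
        y=signal, sr=sr, n_mfcc=d,
        n_fft=int(frame_len * sr), hop_length=int(frame_shift * sr))
    return mfcc.T  # shape: (num_frames, d), one d-dimensional feature per frame
```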
The noise estimation unit 2 estimates the noise superimposed on the input noisy speech. For example, a voice activity detector can be used to select a segment that contains no speech and consists only of noise, and noise estimation can be performed on this segment. More specifically, in the noise-only segment, the above features are extracted from each frame, and the mean and covariance are computed from the resulting set of features. This mean and covariance define a Gaussian distribution of the noise features.
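A minimal sketch of this noise estimation step, assuming the noise-only frames have already been selected by a voice activity detector (the detector itself is not shown):

```python
# Hypothetical sketch of the noise estimation step (unit 2): fit a Gaussian to
# features taken from frames judged to contain noise only.
import numpy as np

def estimate_noise_gaussian(noise_only_features):
    """noise_only_features: array of shape (num_noise_frames, d)."""
    mu_n = noise_only_features.mean(axis=0)               # noise mean
    sigma_n = np.cov(noise_only_features, rowvar=False)   # noise covariance (d x d)
    return mu_n, sigma_n
```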
The first distribution storage unit 3 stores a set of first basis distributions. In this embodiment, d-dimensional Gaussian distributions are used as the first basis distributions. Each basis distribution represents a distribution of clean speech features. The method for calculating the set of first basis distributions is described in detail later.
The first distribution storage control unit 4 controls the first distribution storage unit 3 to store the set of first basis distributions.
The second distribution storage unit 5 stores a set of second basis distributions. In this embodiment, 2×d-dimensional Gaussian distributions are used as the second basis distributions. Each basis distribution represents a joint Gaussian distribution of a clean speech feature and a noisy speech feature.
The distribution synthesis unit 6 synthesizes, based on the noise estimated by the noise estimation unit 2, a second basis distribution from each of the first basis distributions stored in the first distribution storage unit 3, and stores them in the second distribution storage unit 5. That is, a joint Gaussian distribution of clean and noisy speech features is synthesized from the Gaussian distribution of the noise features and a Gaussian distribution of clean speech features. For the synthesis of the distributions, for example, the Vector Taylor Series (VTS) method or the unscented transform can be used.
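The following sketch illustrates a first-order Vector Taylor Series synthesis of one joint (clean, noisy) Gaussian from a clean-speech Gaussian and the noise Gaussian. It works in the log-spectral domain for simplicity; the patent does not fix the feature domain or the exact VTS formulation, so this is an assumed simplification rather than the disclosed method.

```python
# Simplified, hypothetical sketch of the distribution synthesis step (unit 6) using a
# first-order VTS expansion in the log-spectral domain, where
# noisy = clean + log(1 + exp(noise - clean)) elementwise.
import numpy as np

def synthesize_joint_gaussian(mu_x, sigma_x, mu_n, sigma_n):
    """Return the 2d-dimensional joint Gaussian of (clean x, noisy y) for one basis."""
    d = mu_x.shape[0]
    g = np.log1p(np.exp(mu_n - mu_x))                    # mismatch term at the expansion point
    mu_y = mu_x + g                                      # noisy mean
    G = np.diag(1.0 / (1.0 + np.exp(mu_n - mu_x)))       # Jacobian dy/dx at (mu_x, mu_n)
    sigma_y = G @ sigma_x @ G.T + (np.eye(d) - G) @ sigma_n @ (np.eye(d) - G).T
    sigma_xy = sigma_x @ G.T                             # cross-covariance between x and y
    mu_joint = np.concatenate([mu_x, mu_y])
    sigma_joint = np.block([[sigma_x, sigma_xy], [sigma_xy.T, sigma_y]])
    return mu_joint, sigma_joint
```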
The first acoustic model storage unit 7 stores a first acoustic model representing standard phoneme patterns in a noisy environment. More specifically, the acoustic model is a hidden Markov model, and the output distribution of each state and the transition probabilities between states are stored. The first acoustic model is created in advance from learning data consisting of a set of noisy speech features.
The first acoustic model storage control unit 8 controls the first acoustic model storage unit 7 to store the first acoustic model.
The probability calculation unit 9 collates the sequence of noisy speech features calculated by the feature extraction unit 1 with the first acoustic model stored in the first acoustic model storage unit 7, and calculates, for each frame, the probability of staying in each state of the first acoustic model (state posterior probability). The state posterior probabilities can be calculated, for example, using the forward-backward algorithm. Alternatively, they can be calculated from an N-best candidate list; a method for calculating state posterior probabilities from an N-best candidate list is described in detail in, for example, Non-Patent Document 1.
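A minimal sketch of the state posterior computation with the forward-backward algorithm; the observation likelihoods, transition matrix, and initial probabilities are assumed to be given, and the scaling (or log-domain arithmetic) that a practical implementation would need is omitted.

```python
# Hypothetical sketch of the state posterior computation (unit 9): obs_lik[t, j] is
# p(feature_t | state j), trans[i, j] the transition probability, init[j] the initial
# state probability.
import numpy as np

def state_posteriors(obs_lik, trans, init):
    T, L = obs_lik.shape
    alpha = np.zeros((T, L))
    beta = np.zeros((T, L))
    alpha[0] = init * obs_lik[0]
    for t in range(1, T):                        # forward pass
        alpha[t] = (alpha[t - 1] @ trans) * obs_lik[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):               # backward pass
        beta[t] = trans @ (obs_lik[t + 1] * beta[t + 1])
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)   # gamma[t, j] = p(state j | frame t)
```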
The mixture weight storage unit 10 stores, for each state of the first acoustic model, the mixture weight corresponding to each of the second basis distributions. When the number of states is L and the number of basis distributions is K, L × K values are stored. For each state of the first acoustic model, the mixture distribution generated by mixing the K basis distributions stored in the first distribution storage unit 3 with the mixture weights corresponding to that state represents the distribution of clean speech features in that state. Similarly, for each state, the mixture distribution generated by mixing the K basis distributions stored in the second distribution storage unit 5 with the mixture weights corresponding to that state represents the joint distribution of clean and noisy speech features in that state. The method for calculating the mixture weights is described in detail later.
The mixture weight storage control unit 11 controls the mixture weight storage unit 10 to store the mixture weights.
The mixture weight fusion unit 12 fuses the mixture weights stored in the mixture weight storage unit 10 using the state posterior probabilities calculated by the probability calculation unit 9, and calculates fused mixture weights. Specifically, the fusion of the mixture weights is performed according to Equation (1).
v(t, k) = Σ_j γ(t, j) · w(j, k)    …(1)
Here, γ(t, j) is the state posterior probability of staying in state j at frame t, w(j, k) is the mixture weight of the k-th basis distribution in state j, Σ_j denotes the sum over j, and v(t, k) is the fused mixture weight of the k-th basis distribution at frame t.
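Equation (1) is a simple posterior-weighted average over states, so the fusion step can be written as one matrix product; a minimal sketch:

```python
# Hypothetical sketch of the mixture weight fusion step (unit 12), i.e. Equation (1):
# v[t, k] = sum_j gamma[t, j] * w[j, k], computed for all frames at once.
import numpy as np

def fuse_mixture_weights(gamma, w):
    """gamma: (T, L) state posteriors; w: (L, K) per-state mixture weights."""
    return gamma @ w    # v: (T, K) fused mixture weights, one row per frame
```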
The mixture distribution generation unit 13 mixes, for each frame, the second basis distributions acquired from the second distribution storage unit 5 with the fused mixture weights calculated by the mixture weight fusion unit 12 to generate a mixture distribution. In this embodiment the basis distributions are Gaussian, so the mixture distribution is a Gaussian mixture. The generated mixture distribution represents the joint distribution of clean and noisy speech features in that frame.
The feature enhancement unit 14 estimates, for each frame, the clean speech feature from the noisy speech feature using the mixture distribution generated by the mixture distribution generation unit 13. Details of feature enhancement methods that use a mixture distribution representing the joint distribution of clean and noisy speech features are disclosed in, for example, Non-Patent Document 1.
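The patent defers to Non-Patent Document 1 for the enhancement itself; as an assumed stand-in, the following sketch uses the standard MMSE (conditional mean) estimator under the per-frame joint Gaussian mixture.

```python
# Hypothetical sketch of the feature enhancement step (unit 14): an MMSE estimate of the
# clean feature under the per-frame joint GMM. This is the common conditional-mean form,
# not necessarily the exact estimator of Non-Patent Document 1.
import numpy as np
from scipy.stats import multivariate_normal

def enhance_frame(y, v_t, joint_means, joint_covs):
    """y: noisy feature (d,); v_t: fused weights (K,);
    joint_means[k]: (2d,); joint_covs[k]: (2d, 2d)."""
    d = y.shape[0]
    post = np.zeros(len(v_t))
    cond_means = []
    for k, (m, S) in enumerate(zip(joint_means, joint_covs)):
        mu_x, mu_y = m[:d], m[d:]
        Sxy, Syy = S[:d, d:], S[d:, d:]
        post[k] = v_t[k] * multivariate_normal.pdf(y, mean=mu_y, cov=Syy)
        cond_means.append(mu_x + Sxy @ np.linalg.solve(Syy, y - mu_y))
    post /= post.sum()                       # component posteriors given the noisy frame
    return sum(p * cm for p, cm in zip(post, cond_means))   # MMSE clean-feature estimate
```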
The second acoustic model storage unit 15 stores a second acoustic model representing standard phoneme patterns in a clean environment. More specifically, the acoustic model is a hidden Markov model, and the output distribution of each state and the transition probabilities between states are stored. The second acoustic model is created in advance using learning data consisting of a set of clean speech features. Preferably, it is created in advance using learning data consisting of a set of clean speech features processed by the feature enhancement unit 14; that is, the acoustic model is created using, as learning data, the set of clean speech features obtained by processing with the feature enhancement unit 14 the set of noisy speech features used to train the first acoustic model. By applying the same feature enhancement processing during acoustic model training and during speech recognition, the problem of a feature mismatch between training and recognition can be avoided.
The second acoustic model storage control unit 16 controls the second acoustic model storage unit 15 to store the second acoustic model.
The decoding unit 17 collates the sequence of clean speech features estimated by the feature enhancement unit 14 with the second acoustic model stored in the second acoustic model storage unit 15, and outputs an optimum word string. The Viterbi algorithm is used for the collation.
Next, a method for calculating the set of first basis distributions stored in the first distribution storage unit 3 and the mixture weights stored in the mixture weight storage unit 10 is described.
In this method, the set of first basis distributions and the mixture weights are calculated using the EM algorithm so as to maximize the likelihood of learning data consisting of a given set of clean speech features, and are stored in the first distribution storage unit 3 and the mixture weight storage unit 10, respectively. In the following, a method of creating learning data consisting of a set of clean speech features is described first, and then the likelihood maximization method using the EM algorithm is described.
The procedure for creating the learning data is as follows. First, a set of clean speech features is prepared. Next, each clean speech feature is associated with one of the states of the first acoustic model. Specifically, given a sequence of clean speech features extracted from a single utterance, its transcription (the spoken word sequence), and the first acoustic model, each clean speech feature in the sequence can be associated with a state of the acoustic model using the Viterbi algorithm. Alternatively, a soft (fuzzy) association may be performed using the forward-backward algorithm. In this way, learning data consisting of clean speech features each associated with a state of the acoustic model can be created.
Next, the procedure for calculating the set of basis distributions and the mixture weights with the EM algorithm so as to maximize the likelihood of the learning data is described. Let D denote the set of learning data and Dj the subset associated with the j-th state. Let x_i denote the i-th learning sample (a clean speech feature). Let θ_k denote the parameters of the k-th basis distribution, collected as θ = {θ_1, ..., θ_K}; specifically, θ_k is the mean and covariance of the k-th Gaussian. Let w_jk denote the mixture weight of the k-th basis distribution in state j, collected over k as w_j = {w_j1, ..., w_jK} and over j as w = {w_1, ..., w_L}. Here, K is the number of basis distributions and L is the number of states. The (log) likelihood L(θ, w) of the learning data is then defined as in Equation (2).
L(θ, w) = Σ_{j=1..L} Σ_{x_i ∈ Dj} log { Σ_{k=1..K} w_jk · N(x_i; θ_k) }    …(2)
θ and w are calculated with the EM algorithm so as to maximize this likelihood. In the E step, the posterior probability that each learning sample belongs to each basis distribution is calculated based on the current values of θ and w. In the M step, θ and w are updated so as to maximize the expected complete-data log likelihood under these posterior probabilities. Initial values of θ and w are required; for example, a Gaussian mixture model with K components can be trained on the entire learning data D, and the resulting set of Gaussians and mixture weights (denoted u) can be used, with the L mixture weight vectors w_j all set to the same value, i.e., w_1 = ... = w_L = u. After initializing θ and w, the E and M steps are iterated until the increase in likelihood converges, yielding the θ and w that maximize the likelihood. The maximum likelihood training method using the EM algorithm is described in detail in, for example, L. Rabiner and B.-H. Juang (translated by Sadaoki Furui), Fundamentals of Speech Recognition, NTT Advanced Technology, 1995. The set of first basis distributions and the mixture weights calculated as described above are stored in the first distribution storage unit 3 and the mixture weight storage unit 10, respectively.
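As an informal illustration of this training procedure, the following sketch runs EM for the shared basis Gaussians with state-specific mixture weights; diagonal covariances and the simple random initialization are assumptions made only to keep the sketch short.

```python
# Hypothetical sketch of the EM training of the shared basis Gaussians and per-state
# mixture weights of Equation (2). Diagonal covariances are assumed for brevity.
import numpy as np

def log_gauss(x, mean, var):
    # x: (N, d); mean, var: (d,). Log density of N(mean, diag(var)) at each row of x.
    return -0.5 * (np.sum(np.log(2 * np.pi * var))
                   + np.sum((x - mean) ** 2 / var, axis=1))

def train_basis_and_weights(data_by_state, K, n_iter=20):
    """data_by_state[j]: array (N_j, d) of clean features aligned to state j."""
    L = len(data_by_state)
    all_x = np.vstack(data_by_state)
    d = all_x.shape[1]
    rng = np.random.default_rng(0)
    # Initialization: K means drawn from the pooled data, global variance, uniform weights.
    means = all_x[rng.choice(len(all_x), size=K, replace=False)]
    varis = np.tile(all_x.var(axis=0), (K, 1))
    w = np.full((L, K), 1.0 / K)
    for _ in range(n_iter):
        stats_gamma = np.zeros(K)
        stats_x = np.zeros((K, d))
        stats_xx = np.zeros((K, d))
        for j, xj in enumerate(data_by_state):
            # E step: responsibilities of the shared basis Gaussians for the data of state j.
            log_p = np.stack([log_gauss(xj, means[k], varis[k]) for k in range(K)], axis=1)
            log_p += np.log(w[j])
            log_p -= log_p.max(axis=1, keepdims=True)
            gamma = np.exp(log_p)
            gamma /= gamma.sum(axis=1, keepdims=True)
            # M step, state-specific part: mixture weights of state j.
            w[j] = gamma.sum(axis=0) / len(xj)
            # Accumulate statistics for the shared Gaussians over all states.
            stats_gamma += gamma.sum(axis=0)
            stats_x += gamma.T @ xj
            stats_xx += gamma.T @ (xj ** 2)
        # M step, shared part: update the basis Gaussians from the pooled statistics.
        means = stats_x / stats_gamma[:, None]
        varis = np.maximum(stats_xx / stats_gamma[:, None] - means ** 2, 1e-6)
    return means, varis, w
```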
Next, the operation of the speech processing apparatus according to this embodiment is described. FIG. 2 is a flowchart showing the operation of the speech processing apparatus according to this embodiment.
First, the feature extraction unit 1 extracts a feature from each frame of the input noisy speech and calculates a sequence of noisy speech features (step S1).
Next, the noise estimation unit 2 estimates the noise from the sequence of noisy speech features calculated by the feature extraction unit 1 (step S2). Next, the distribution synthesis unit 6 synthesizes a second basis distribution from each of the first basis distributions using the noise estimated by the noise estimation unit 2, and stores them in the second distribution storage unit 5 (step S3).
Steps S4 and S5 are executed in parallel with steps S2 and S3. That is, the probability calculation unit 9 collates the sequence of noisy speech features calculated by the feature extraction unit 1 with the first acoustic model and calculates, for each frame, the probability of staying in each state of the first acoustic model (state posterior probability) (step S4). Next, the mixture weight fusion unit 12 fuses, for each frame, the mixture weights acquired from the mixture weight storage unit 10 with the state posterior probabilities calculated by the probability calculation unit 9, and calculates the fused mixture weights (step S5).
In step S6, the mixture distribution generation unit 13 mixes the second basis distributions stored in the second distribution storage unit 5 with the fused mixture weights to generate a mixture distribution. Next, the feature enhancement unit 14 calculates, for each frame, the clean speech feature from the noisy speech feature using the mixture distribution generated by the mixture distribution generation unit 13 (step S7).
Finally, the decoding unit 17 collates the sequence of clean speech features calculated by the feature enhancement unit 14 with the second acoustic model stored in the second acoustic model storage unit 15, outputs the optimum word string, and ends the speech recognition (speech processing) (step S8). In this way, the correct speech content is recognized from the noisy speech.
As described above, the speech processing apparatus according to this embodiment uses only a small number of basis distributions instead of the large number of distributions used in the prior art. The amount of computation required to synthesize the joint distributions of clean and noisy speech features is therefore greatly reduced, and high speech recognition performance can be maintained with a small amount of computation even in an environment where the noise changes.
The speech processing apparatus of this embodiment includes a control device such as a CPU, a storage device, an external storage device, a display device, and input devices such as a keyboard and a mouse, and has a hardware configuration using an ordinary computer.
The speech processing program executed by the speech processing apparatus of this embodiment is recorded on a computer-readable recording medium such as a CD-ROM, flexible disk (FD), CD-R, or DVD (Digital Versatile Disk) as a file in an installable or executable format, and is provided as a computer program product.
The speech processing program executed by the speech processing apparatus of this embodiment may also be stored on a computer connected to a network such as the Internet and provided by being downloaded via the network. The speech processing program executed by the speech processing apparatus of this embodiment may also be provided or distributed via a network such as the Internet.
The speech processing program of this embodiment may also be provided by being incorporated in advance in a ROM or the like.
The speech processing program executed by the speech processing apparatus of this embodiment has a module configuration including the units described above (feature extraction unit, noise estimation unit, first distribution storage control unit, distribution synthesis unit, first acoustic model storage control unit, probability calculation unit, mixture weight storage control unit, mixture weight fusion unit, mixture distribution generation unit, feature enhancement unit, second acoustic model storage control unit, and decoding unit). As actual hardware, a CPU (processor) reads the speech processing program from the storage medium and executes it, whereby the above units are loaded onto the main storage device and the feature extraction unit, noise estimation unit, first distribution storage control unit, distribution synthesis unit, first acoustic model storage control unit, probability calculation unit, mixture weight storage control unit, mixture weight fusion unit, mixture distribution generation unit, feature enhancement unit, second acoustic model storage control unit, and decoding unit are generated on the main storage device.
The present invention is not limited to the above embodiment as it is; in the implementation stage, the constituent elements can be modified and embodied without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the constituent elements disclosed in the above embodiment. For example, some constituent elements may be deleted from all the constituent elements shown in the embodiment, and constituent elements of different embodiments may be combined as appropriate.
As described above, the voice processing apparatus, voice processing method, and voice processing program according to the present invention are useful when speech recognition is performed under noise.
Description of Symbols

 1 Feature extraction unit
 2 Noise estimation unit
 3 First distribution storage unit
 6 Distribution synthesis unit
 7 First acoustic model storage unit
 9 Probability calculation unit
 10 Mixture weight storage unit
 12 Mixture weight fusion unit
 13 Mixture distribution generation unit
 14 Feature enhancement unit

Claims (5)

  1.  A speech processing apparatus comprising:
      a feature extraction unit that extracts a first speech feature from each frame of first speech on which noise is superimposed in a noisy environment, and calculates a sequence of the first speech features;
      a noise estimation unit that estimates the noise superimposed on the first speech;
      a first distribution storage unit that stores a set of first basis distributions representing a distribution of second speech features of second speech in a noise-free environment;
      a distribution synthesis unit that, based on the noise, synthesizes from each of the first basis distributions a second basis distribution representing a combined distribution of the first speech feature and the second speech feature;
      a first acoustic model storage unit that stores a first acoustic model representing standard phoneme patterns in a noisy environment;
      a probability calculation unit that collates the sequence of the first speech features with the first acoustic model and calculates, for each frame, a state posterior probability that is the probability of staying in each state of the first acoustic model;
      a mixture weight storage unit that stores, for each state of the first acoustic model, a mixture weight corresponding to each of the second basis distributions;
      a mixture weight fusion unit that fuses the mixture weights using the state posterior probabilities to calculate fused mixture weights;
      a mixture distribution generation unit that, for each frame, mixes the second basis distributions with the fused mixture weights to generate a mixture distribution that is a combined distribution of the first speech feature and the second speech feature in the frame; and
      a feature enhancement unit that, for each frame, estimates the second speech feature from the first speech feature using the mixture distribution.
  2.  The speech processing apparatus according to claim 1, further comprising:
      a second acoustic model storage unit that stores a second acoustic model representing standard phoneme patterns in a noise-free environment; and
      a decoding unit that collates the second speech feature with the second acoustic model and outputs an optimal word sequence.
  3.  The speech processing apparatus according to claim 1, wherein the set of first basis distributions and the mixture weights are calculated by an EM algorithm so as to maximize the likelihood of training data, the training data consisting of a set of the second speech features each associated with one of the states of the first acoustic model.
  4.  A speech processing method executed by a speech processing apparatus, the method comprising:
      a feature extraction step in which a feature extraction unit extracts a first speech feature from each frame of first speech on which noise is superimposed in a noisy environment, and calculates a sequence of the first speech features;
      a noise estimation step in which a noise estimation unit estimates the noise superimposed on the first speech;
      a distribution synthesis step in which, based on the noise, a distribution synthesis unit synthesizes a second basis distribution representing a combined distribution of the first speech feature and the second speech feature from each of the first basis distributions in a first distribution storage unit that stores a set of first basis distributions representing a distribution of second speech features of second speech in a noise-free environment;
      a probability calculation step in which a probability calculation unit collates the sequence of the first speech features with the first acoustic model in a first acoustic model storage unit that stores a first acoustic model representing standard phoneme patterns in a noisy environment, and calculates, for each frame, a state posterior probability that is the probability of staying in each state of the first acoustic model;
      a mixture weight fusion step in which a mixture weight fusion unit fuses, using the state posterior probabilities, the mixture weights in a mixture weight storage unit that stores, for each state of the first acoustic model, a mixture weight corresponding to each of the second basis distributions, to calculate fused mixture weights;
      a mixture distribution generation step in which a mixture distribution generation unit, for each frame, mixes the second basis distributions with the fused mixture weights to generate a mixture distribution that is a combined distribution of the first speech feature and the second speech feature in the frame; and
      a feature enhancement step in which a feature enhancement unit, for each frame, estimates the second speech feature from the first speech feature using the mixture distribution.
  5.  A speech processing program for causing a computer to execute:
      a feature extraction step of extracting a first speech feature from each frame of first speech on which noise is superimposed in a noisy environment, and calculating a sequence of the first speech features;
      a noise estimation step of estimating the noise superimposed on the first speech;
      a distribution synthesis step of synthesizing, based on the noise, a second basis distribution representing a combined distribution of the first speech feature and the second speech feature from each of the first basis distributions in a first distribution storage unit that stores a set of first basis distributions representing a distribution of second speech features of second speech in a noise-free environment;
      a probability calculation step of collating the sequence of the first speech features with the first acoustic model in a first acoustic model storage unit that stores a first acoustic model representing standard phoneme patterns in a noisy environment, and calculating, for each frame, a state posterior probability that is the probability of staying in each state of the first acoustic model;
      a mixture weight fusion step of fusing, using the state posterior probabilities, the mixture weights in a mixture weight storage unit that stores, for each state of the first acoustic model, a mixture weight corresponding to each of the second basis distributions, to calculate fused mixture weights;
      a mixture distribution generation step of, for each frame, mixing the second basis distributions with the fused mixture weights to generate a mixture distribution that is a combined distribution of the first speech feature and the second speech feature in the frame; and
      a feature enhancement step of, for each frame, estimating the second speech feature from the first speech feature using the mixture distribution.
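By way of illustration only, the per-frame processing recited in claims 1, 4, and 5 (distribution synthesis, mixture weight fusion, mixture distribution generation, and feature enhancement) might be sketched in Python as follows. The sketch assumes Gaussian basis distributions over log-Mel features, a zeroth-order log-add approximation for combining a clean-speech basis distribution with the noise estimate, and an MMSE-style conditional-mean estimate in the feature enhancement step; none of these specific choices is recited in the claims, and all function and field names are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

def synthesize_joint_basis(mu_x, cov_x, mu_n, cov_n):
    """Distribution synthesis (sketch): combine one clean-speech basis Gaussian
    with the noise estimate into a joint (clean, noisy) Gaussian, using a
    zeroth-order log-add approximation in the log-Mel domain (an assumption)."""
    g = np.exp(mu_x) / (np.exp(mu_x) + np.exp(mu_n))   # sensitivity of the noisy feature to the clean feature
    G, I = np.diag(g), np.eye(len(g))
    mu_y = np.log(np.exp(mu_x) + np.exp(mu_n))         # approximate noisy-speech mean
    cov_yy = G @ cov_x @ G.T + (I - G) @ cov_n @ (I - G).T
    cov_xy = cov_x @ G.T                                # clean/noisy cross-covariance
    return {"mu_x": mu_x, "mu_y": mu_y, "cov_yy": cov_yy, "cov_xy": cov_xy}

def enhance_frame(y, gamma, state_weights, joint_bases):
    """Weight fusion, mixture generation, and feature enhancement for one frame.

    y             : noisy-speech feature vector of the frame
    gamma         : state posteriors from the noisy-environment acoustic model, shape (S,)
    state_weights : per-state mixture weights, shape (S, K)
    joint_bases   : K joint basis distributions as returned by synthesize_joint_basis
    """
    fused = gamma @ state_weights                       # fused mixture weights, shape (K,)
    lik = np.array([multivariate_normal.pdf(y, b["mu_y"], b["cov_yy"]) for b in joint_bases])
    resp = fused * lik
    resp /= resp.sum()                                  # responsibility of each basis for this frame
    x_hat = np.zeros_like(joint_bases[0]["mu_x"])
    for r, b in zip(resp, joint_bases):
        # Conditional mean of the clean feature given the noisy feature under basis b.
        cond = b["mu_x"] + b["cov_xy"] @ np.linalg.solve(b["cov_yy"], y - b["mu_y"])
        x_hat += r * cond
    return x_hat
```

Per basis distribution, the enhancement step reduces to the standard conditional mean of the clean feature given the noisy feature under a joint Gaussian; the fused mixture weights simply tilt the responsibilities toward the bases favored by whichever states of the first acoustic model are currently likely.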
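For claim 2, the decoding unit collates the enhanced features with the second (noise-free environment) acoustic model and outputs an optimal word sequence. As a toy stand-in only, since the claim does not specify the search, a bare Viterbi pass over HMM states could look like the following; a practical decoder would additionally use a lexicon and language model.

```python
import numpy as np

def viterbi_decode(log_obs, log_trans, log_init):
    """Toy Viterbi search over HMM states (illustrative stand-in for the decoding unit).

    log_obs   : per-frame log observation likelihoods, shape (T, S)
    log_trans : log transition matrix, shape (S, S), rows = from-state
    log_init  : log initial state probabilities, shape (S,)
    """
    T, S = log_obs.shape
    delta = log_init + log_obs[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans    # score of reaching each state from each predecessor
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_obs[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):               # trace the best predecessors backwards
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```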
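Claim 3 states that the first basis distributions and the per-state mixture weights are estimated jointly with the EM algorithm from clean-speech features labelled with states of the first acoustic model. The following compressed sketch of such a tied-mixture EM update is an illustration under simplified assumptions (random initialisation, fixed iteration count, no log-domain safeguards), not the procedure disclosed in the specification.

```python
import numpy as np
from scipy.stats import multivariate_normal

def train_tied_mixture(X, states, num_states, num_bases, num_iters=20, seed=0):
    """EM sketch: shared Gaussian basis distributions with per-state mixture
    weights, fitted to clean features X labelled with acoustic-model states."""
    rng = np.random.default_rng(seed)
    X, states = np.asarray(X), np.asarray(states)
    N, D = X.shape
    mu = X[rng.choice(N, num_bases, replace=False)]           # initial means picked from the data
    cov = np.array([np.cov(X.T) + 1e-3 * np.eye(D)] * num_bases)
    w = np.full((num_states, num_bases), 1.0 / num_bases)     # per-state mixture weights

    for _ in range(num_iters):
        # E-step: responsibility of each shared basis for each labelled sample.
        lik = np.stack([multivariate_normal.pdf(X, mu[k], cov[k]) for k in range(num_bases)], axis=1)
        resp = w[states] * lik
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: update the shared bases and the state-dependent weights.
        nk = resp.sum(axis=0)
        mu = (resp.T @ X) / nk[:, None]
        for k in range(num_bases):
            diff = X - mu[k]
            cov[k] = (resp[:, k, None] * diff).T @ diff / nk[k] + 1e-6 * np.eye(D)
        for s in range(num_states):
            mask = states == s
            if mask.any():
                w[s] = resp[mask].sum(axis=0) / mask.sum()
    return mu, cov, w
```

The design point worth noting is that the Gaussian bases are shared across all states while only the mixture weights are state-dependent, which is what allows a single basis set to be re-weighted frame by frame using the state posteriors.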
PCT/JP2009/069580 2009-03-26 2009-11-18 Voice processing apapratus, voice processing method, and voice processing program WO2010109725A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2009077325A JP2010230913A (en) 2009-03-26 2009-03-26 Voice processing apparatus, voice processing method, and voice processing program
JP2009-077325 2009-03-26

Publications (1)

Publication Number Publication Date
WO2010109725A1 (en)

Family

ID=42780427

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2009/069580 WO2010109725A1 (en) 2009-03-26 2009-11-18 Voice processing apapratus, voice processing method, and voice processing program

Country Status (2)

Country Link
JP (1) JP2010230913A (en)
WO (1) WO2010109725A1 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105788600B (en) * 2014-12-26 2019-07-26 联想(北京)有限公司 Method for recognizing sound-groove and electronic equipment


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004004509A (en) * 2001-12-20 2004-01-08 Matsushita Electric Ind Co Ltd Method, device and computer program for creating acoustic model
JP2007279349A (en) * 2006-04-06 2007-10-25 Toshiba Corp Feature amount compensation apparatus, method, and program
JP2007279444A (en) * 2006-04-07 2007-10-25 Toshiba Corp Feature amount compensation apparatus, method and program

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106384587A (en) * 2015-07-24 2017-02-08 科大讯飞股份有限公司 Voice recognition method and system thereof
CN106384587B (en) * 2015-07-24 2019-11-15 科大讯飞股份有限公司 A kind of audio recognition method and system
CN108511002A (en) * 2018-01-23 2018-09-07 努比亚技术有限公司 The recognition methods of hazard event voice signal, terminal and computer readable storage medium
CN108511002B (en) * 2018-01-23 2020-12-01 太仓鸿羽智能科技有限公司 Method for recognizing sound signal of dangerous event, terminal and computer readable storage medium

Also Published As

Publication number Publication date
JP2010230913A (en) 2010-10-14


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 09842335
    Country of ref document: EP
    Kind code of ref document: A1

NENP Non-entry into the national phase
    Ref country code: DE

122 Ep: pct application non-entry in european phase
    Ref document number: 09842335
    Country of ref document: EP
    Kind code of ref document: A1