JP2010072446A

JP2010072446A - Coarticulation feature extraction device, coarticulation feature extraction method and coarticulation feature extraction program

Info

Publication number: JP2010072446A
Application number: JP2008241072A
Authority: JP
Inventors: Tsuneo Nitta (新田恒雄)
Original assignee: Toyohashi University of Technology NUC
Current assignee: Toyohashi University of Technology NUC
Priority date: 2008-09-19
Filing date: 2008-09-19
Publication date: 2010-04-02
Anticipated expiration: 2028-09-19
Also published as: JP5300000B2

Abstract

PROBLEM TO BE SOLVED: To provide a coarticulation feature extraction device, a coarticulation feature extraction method, and a coarticulation feature extraction program that can handle unknown words and that achieve phoneme identification accuracy high enough to meet the requirements of voice interaction and voice search.

SOLUTION: In the coarticulation feature extraction device, speech received at an input section 201 is digitized by an A-D converter 202, and Fourier analysis and filtering in a feature analysis section 210 yield speech spectrum data. A coarticulation feature extraction section 220 then extracts a coarticulation feature sequence, which is time-series data of the coarticulation features. A coarticulation movement modification section 230 extracts a speed component and an acceleration component from the displacement component of the coarticulation feature sequence, and modifies the coarticulation movement by passing these components through a neural network. Based on the modified coarticulation movement, a word classification section 204 searches for the corresponding words, and voice recognition is performed.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to an articulatory feature extraction device, an articulatory feature extraction method, and an articulatory feature extraction program. More specifically, it relates to an articulatory feature extraction device, method, and program that identify, with high accuracy, the articulatory movements that accompany speech utterance.

Speech recognition technology is widely known as a voice-based user interface. Speech recognition has generally performed pattern recognition with phonemes, syllables, words, and the like as recognition units, based on the results of feature analysis such as the frequency spectrum. This rests on the conjecture that the human auditory nervous system has a spectrum-analysis capability and that higher-level language processing then takes place in the cerebrum. The speech recognition devices developed so far classify words directly from acoustic features. In contrast, recent brain research increasingly favors the hypothesis that humans perceive speech not as an acoustic signal but as articulatory movement (see Non-Patent Document 1).

An outline of standard speech recognition technology is described with reference to FIG. 15. FIG. 15 is a functional block diagram showing an example of the standard speech recognition technology installed in a speech recognition device. As shown in FIG. 15, an input unit 101, an A/D conversion unit 102, a feature analysis unit 103, a word classification unit 104, an output unit 105, and a storage unit 106 are provided as the functional blocks needed for speech recognition. The storage unit 106 stores a word pronunciation dictionary 107, a hidden Markov model (HMM) 108, a language model 109, and other data. In this speech recognition device, the set of words to be recognized is determined in advance, and the word string in the speech signal is searched for while referring to the language model 109 (a table of chain probabilities between words; the probability of three-word chains is normally used, and this is called a 3-gram, or trigram).
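The trigram table mentioned above can be illustrated with a minimal sketch. It assumes unsmoothed maximum-likelihood estimates from a toy corpus; the function names, the padding symbols `<s>` and `</s>`, and the example sentences are hypothetical and are not taken from the patent.

```python
from collections import Counter


def train_trigram(sentences):
    """Count trigrams and their two-word contexts over tokenized sentences."""
    tri, ctx = Counter(), Counter()
    for words in sentences:
        # Pad so that the first real word also has a two-word context.
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        for a, b, c in zip(padded, padded[1:], padded[2:]):
            tri[(a, b, c)] += 1
            ctx[(a, b)] += 1
    return tri, ctx


def trigram_prob(tri, ctx, a, b, c):
    """Chain probability P(c | a, b); 0.0 if the context was never seen."""
    return tri[(a, b, c)] / ctx[(a, b)] if ctx[(a, b)] else 0.0


corpus = [["open", "the", "door"],
          ["open", "the", "window"],
          ["close", "the", "door"]]
tri, ctx = train_trigram(corpus)
p_door = trigram_prob(tri, ctx, "open", "the", "door")
```

A practical language model would add smoothing or back-off so that unseen trigrams do not receive zero probability.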

The input unit 101 receives speech input from outside and converts it into an analog electrical signal. The A/D conversion unit 102 converts the analog signal received by the input unit 101 into a digital signal. The feature analysis unit 103 extracts predetermined feature quantities for speech recognition. The word classification unit 104 searches for the words contained in the speech, based on the feature quantities extracted by the feature analysis unit 103. The storage unit 106 stores the data needed when the word classification unit 104 searches for words, and is referred to by the word classification unit 104. The output unit 105 outputs the words found by the word classification unit 104.

The flow of word string determination based on the functional blocks of FIG. 15 is outlined below. Unknown speech input from the input unit 101 is discretized by the A/D conversion unit 102 and converted into a digital signal. Next, in the feature analysis unit 103, the digital signal is Fourier-analyzed and passed through band-pass filters (BPF) of about 24 channels to remove noise components, and the speech spectrum is thereby extracted. In recent standard speech recognition, the speech spectrum is often mapped onto the mel frequency scale to match auditory characteristics, and the mel cepstrum (Mel Frequency Cepstrum Coefficient; MFCC), obtained by applying the discrete cosine transform (DCT) to the logarithm of the spectrum, is used as the spectral feature of speech.
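The feature analysis chain described above (Fourier analysis, a bank of about 24 band-pass channels, logarithm, DCT) can be sketched as follows. This is an illustrative sketch only: the triangular filter shape, the mel formula 2595·log10(1 + f/700), the naive DFT, and all function names are common conventions assumed here, not details given in the patent.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (adequate for a short demo frame)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        spec.append(abs(s) ** 2)
    return spec


def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular band-pass filters spaced evenly on the mel scale (assumed shape)."""
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bins = [hz * n_fft / sample_rate for hz in edges_hz]  # fractional bin positions
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc(frame, sample_rate, num_filters=24, num_ceps=12):
    """Log mel filterbank energies followed by a DCT-II."""
    spec = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    # The small constant avoids log(0) for an empty channel.
    log_e = [math.log(sum(w * p for w, p in zip(row, spec)) + 1e-10) for row in bank]
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / num_filters)
                for j, e in enumerate(log_e))
            for c in range(num_ceps)]


# Usage: one 256-sample frame of a 440 Hz tone sampled at 16 kHz.
frame = [math.sin(2 * math.pi * 440.0 * t / 16000.0) for t in range(256)]
coeffs = mfcc(frame, 16000)
```

The 24 channels follow the figure quoted in the text; the number of cepstral coefficients (12) is an arbitrary choice for the example.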

Next, the word classification unit 104 searches for the words contained in the input speech, based on the spectrum obtained by the feature analysis unit 103. The word classification unit 104 first extracts the phoneme sequences that make up the words (these are stored in the word pronunciation dictionary 107). It then calculates the acoustic likelihood by referring to the HMM 108 prepared for each phoneme. The acoustic likelihood Lk of the input speech feature X for word k (or phoneme K) is calculated by Equation (1), and the value obtained by accumulating Lk along the state transitions of the HMM 108 is used.

Here, μ_k is the mean vector, and Σ_k⁻¹ and |Σ_k| are the inverse and the determinant of the covariance matrix, respectively. In practice, reading phoneme sequences one by one from the word pronunciation dictionary 107 is inefficient, so techniques such as the following are used: the phoneme sequences of all words to be recognized are compiled in advance into a single tree-structured graph, and the search proceeds while accumulating phoneme acoustic likelihoods on that graph.
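Equation (1) itself does not survive in this text. Given the symbols defined above (input feature X, mean vector μ_k, inverse covariance Σ_k⁻¹, determinant |Σ_k|), a Gaussian log-likelihood of the following standard form would be consistent with them. This is an assumed reconstruction rather than the patent's exact expression; in particular, the constant term and the dimension d of X are assumptions.

```latex
L_k = -\frac{1}{2}\,(X - \mu_k)^{\top}\,\Sigma_k^{-1}\,(X - \mu_k)
      \;-\; \frac{1}{2}\,\ln\lvert\Sigma_k\rvert
      \;-\; \frac{d}{2}\,\ln 2\pi
```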

During the word search in the word classification unit 104, a so-called beam search, which prunes paths with low cumulative likelihood, is generally applied to speed up processing. The language model 109 is consulted to decide which words to search. At the start of the search, all words that can begin a sentence are candidates; when that search finishes, the chain probabilities of the language model 109 are consulted to determine the words that can follow.
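The pruning step of a beam search can be sketched as below. The sketch assumes a score-relative beam (keep every path whose cumulative log-likelihood lies within a fixed margin of the best path). The patent does not state which pruning rule is used, and the function name and data layout are hypothetical.

```python
def prune_beam(hypotheses, beam_width):
    """Keep hypotheses whose cumulative log-likelihood is within
    `beam_width` of the best one; drop the rest.

    `hypotheses` is a list of (path, cumulative_log_likelihood) pairs.
    """
    best = max(score for _, score in hypotheses)
    return [(path, score) for path, score in hypotheses
            if score >= best - beam_width]


hyps = [(("a",), -10.0), (("b",), -12.5), (("c",), -30.0)]
kept = prune_beam(hyps, beam_width=5.0)
```

A wider beam keeps more paths, which reduces search errors at the cost of more computation.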

The cumulative likelihood used during the word search in the word classification unit 104 is obtained by weighted addition of the acoustic likelihood and the word chain likelihood (both are used as logarithms of probability values). The weighting coefficient is needed because two heterogeneous likelihoods are combined, namely the acoustic likelihood of the HMM 108 and the word chain likelihood obtained from a language corpus (the word chain likelihood is about one order of magnitude smaller in value); it is determined by simulation so that the two are balanced. At the end of the input speech, the word sequence that gives the maximum cumulative likelihood is output as the recognition result (see Non-Patent Documents 2 and 3).
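The weighted addition of the two log-likelihoods can be sketched as follows. The candidate words, their scores, and the weight values are invented for illustration; the patent only states that the weight is tuned by simulation.

```python
def cumulative_likelihood(acoustic_logp, lm_logp, lm_weight):
    """Weighted sum of acoustic and word-chain log-likelihoods."""
    return acoustic_logp + lm_weight * lm_logp


def best_candidate(candidates, lm_weight):
    """Return the candidate with the highest combined score.

    `candidates` maps a word sequence to (acoustic_logp, lm_logp).
    """
    return max(candidates,
               key=lambda w: cumulative_likelihood(*candidates[w], lm_weight))


# Hypothetical scores: "kika" fits the audio slightly better,
# but "kita" is far more plausible according to the language model.
candidates = {"kita": (-120.0, -9.0), "kika": (-118.0, -14.0)}
with_lm = best_candidate(candidates, lm_weight=1.0)
without_lm = best_candidate(candidates, lm_weight=0.0)
```

With the weight set to zero the decision rests on the acoustic score alone, which shows why the balance between the two terms matters.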

The words found through the above processing are output from the output unit 105 as the result of recognizing the words contained in the speech received by the input unit 101. In this way, a conventional standard speech recognition device achieves high recognition accuracy by combining the acoustic likelihood of the HMM 108 with the word chain likelihood of the language model 109.

The number of words that can be recognized depends on the scale of the language corpus stored in the word pronunciation dictionary. The larger the language corpus, the larger the number of recognizable words, but storage capacity and processing time limit how large the corpus can be. Against this background, a speech recognition device has been proposed that registers and uses, as a language model, words that have been input repeatedly a predetermined number of times, and that can thereby limit the size of the language model and shorten processing time while maintaining recognition accuracy (see, for example, Patent Document 1).

Patent Document 1: JP 2007-248529 A
Non-Patent Document 1: Makio Kashino, "On the Motor Theory of Speech Perception", Journal of the Acoustical Society of Japan, Vol. 62, No. 5, pp. 391-396 (2006)
Non-Patent Document 2: Akio Ando, Real-Time Speech Recognition, IEICE (2003), pp. 4-9, "1.3 Outline of Speech Recognition Technology"
Non-Patent Document 3: Kiyohiro Shikano, Katsunobu Ito, Tatsuya Kawahara, Kazuya Takeda, Mikio Yamamoto, Speech Recognition System, Ohmsha (2001), pp. 93-110, "Chapter 6: Large Vocabulary Continuous Speech Recognition Algorithm"

However, the speech recognition devices described above cannot recognize unknown words. Moreover, even a large-scale language corpus cannot cover every word.

A speech recognition device that can handle unknown words needs a means of recognizing phonemes with high accuracy. With current speech recognition devices, however, phoneme recognition performance without a language model, that is, when no word dictionary can be consulted, is only 60 to 80%. (Humans, by contrast, can hear phonemes with an accuracy of 98% or higher, and can therefore deal with unknown words efficiently, for example by asking the speaker to repeat them.) These factors are a major obstacle to introducing voice interfaces in applications such as voice dialogue and voice search, where recognition of unknown words is indispensable.

Meanwhile, methods of representing speech by articulatory features have long been proposed in the field of phonetics. The International Phonetic Alphabet (IPA) has been proposed as a standard notation. Distinctive features (DF), which classify phonemes on the basis of structural features related to articulation (voiced / unvoiced / continuant / semivowel / plosive / fricative / affricate / apical / nasal / high / low / anterior / posterior (position where the tongue is raised) / ...), have also long been proposed. In addition, many methods have been proposed for extracting articulatory features such as distinctive features directly from speech, including methods that use neural networks (see Non-Patent Document 4).

FIG. 16 shows the distinctive phonetic features (DPF) for Japanese phonemes. A distinctive phonetic feature is one way of representing articulatory features. In the figure, the vertical axis lists the distinctive features and the horizontal axis lists the individual phonemes, so the table shows how the vocal organs must act to produce each phoneme. In FIG. 16, nil (high/low) and nil (front/back) are features newly added so that a distinctive feature can be assigned to phonemes that are neither high nor low, and to phonemes that are neither anterior nor posterior (in terms of where the tongue is raised), respectively. Balancing the features across phonemes in this way is known to improve speech recognition performance.

However, speech recognition based on the extracted distinctive phonetic features has not in practice yielded a marked improvement in performance over conventional features based on the speech spectrum or cepstrum (see Non-Patent Document 5).

The present invention has been made to solve the problems described above. Its object is to provide an articulatory feature extraction device, an articulatory feature extraction method, and an articulatory feature extraction program that can handle unknown words and that have phoneme identification accuracy high enough to meet the demands of voice dialogue and voice search.

Non-Patent Document 4: Shuichi Itabashi (ed.), Speech Engineering, Morikita Publishing (1973), pp. 6-10, "2.1.1 Speech, Phonemes, Syllables" (Table 2.2, Japanese distinctive features)
Non-Patent Document 5: Takashi Fukuda, Tsuneo Nitta, "Orthogonalized Distinctive Phonetic Feature Extraction for Noise-robust Automatic Speech Recognition", IEICE Transactions (English edition), Vol. E87-D, No. 5, pp. 1110-1118 (2004-5)

To solve the problems described above, the articulatory feature extraction device of the invention according to claim 1 comprises: voice acquisition means for acquiring speech; articulatory feature extraction means for extracting articulatory features of the speech acquired by the voice acquisition means; articulatory movement correction means for converting an articulatory feature sequence, which is time-series data of the articulatory features extracted by the articulatory feature extraction means, into a motion trajectory and, based on the motion trajectory, correcting the articulatory movement, which is the movement of articulation represented by the articulatory feature sequence, so that the influence of coarticulation contained in the speech is eliminated and phonemes can be recognized; and storage control means for storing, in storage means, the corrected articulatory movement, which is the articulatory movement as corrected by the articulatory movement correction means.

The articulatory feature extraction device of the invention according to claim 2, in addition to the configuration of the invention according to claim 1, comprises component extraction means for extracting a velocity component and an acceleration component from the displacement component of the articulatory feature sequence, and is characterized in that the articulatory movement correction means corrects the articulatory movement into the corrected articulatory movement based on at least one of the displacement component and the velocity and acceleration components extracted by the component extraction means.
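Deriving velocity and acceleration components from a displacement sequence can be sketched with finite differences. The text here does not say how the components are computed, so the central-difference scheme (one-sided at the endpoints), the function names, and the sample values are assumptions for illustration.

```python
def delta(seq):
    """Time derivative by central differences; one-sided at the two endpoints."""
    n = len(seq)
    if n < 2:
        return [0.0] * n
    out = []
    for t in range(n):
        prev_t = max(t - 1, 0)
        next_t = min(t + 1, n - 1)
        out.append((seq[next_t] - seq[prev_t]) / (next_t - prev_t))
    return out


def motion_components(displacement):
    """Return (displacement, velocity, acceleration) for one feature dimension."""
    velocity = delta(displacement)
    acceleration = delta(velocity)
    return displacement, velocity, acceleration


# Hypothetical trajectory of one articulatory feature over five frames.
_, velocity, acceleration = motion_components([0.0, 1.0, 4.0, 9.0, 16.0])
```

In a full system the same operation would be applied independently to every dimension of the articulatory feature vector.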

The articulatory feature extraction device of the invention according to claim 3, in addition to the configuration of the invention according to claim 2, comprises pattern recognition means for recognizing, based on at least one of the displacement component, the velocity component, and the acceleration component, whether the transition of the displacement component observed along the time axis forms a concave pattern or a convex pattern, and is characterized in that the articulatory movement correction means corrects the articulatory movement into the corrected articulatory movement based on the pattern recognized by the pattern recognition means.
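One simple way to tell a convex pattern from a concave one is to look at the sign of the second difference of the displacement. The sketch below assumes that "convex" means a trajectory that rises and then falls, and "concave" one that falls and then rises; this reading, the averaging over the window, the tolerance, and the function name are all assumptions, since the criterion is not specified in this part of the text.

```python
def classify_pattern(displacement, tol=1e-6):
    """Label a short displacement window as 'convex', 'concave', or 'flat'
    from the mean of its second differences (a discrete acceleration)."""
    if len(displacement) < 3:
        return "flat"
    second = [displacement[t - 1] - 2.0 * displacement[t] + displacement[t + 1]
              for t in range(1, len(displacement) - 1)]
    mean_acc = sum(second) / len(second)
    if mean_acc < -tol:
        return "convex"   # bulges upward: rises, then falls
    if mean_acc > tol:
        return "concave"  # dips: falls, then rises
    return "flat"


peak = classify_pattern([0.1, 0.6, 0.9, 0.6, 0.1])
dip = classify_pattern([0.9, 0.4, 0.1, 0.4, 0.9])
```

Note that this naming is the opposite of the usual mathematical convention, in which a function with negative second derivative is called concave.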

The articulatory feature extraction device of the invention according to claim 4, in addition to the configuration of the invention according to claim 2 or 3, is characterized in that the articulatory movement correction means corrects the articulatory movement into the corrected articulatory movement by passing at least one of the displacement component, the velocity component, and the acceleration component through a neural network.

The articulatory feature extraction method of the invention according to claim 5 comprises: a voice acquisition step of acquiring speech; an articulatory feature extraction step of extracting articulatory features of the speech acquired in the voice acquisition step; an articulatory movement correction step of converting an articulatory feature sequence, which is time-series data of the articulatory features extracted in the articulatory feature extraction step, into a motion trajectory and, based on the motion trajectory, correcting the articulatory movement, which is the movement of articulation represented by the articulatory feature sequence, so that the influence of coarticulation contained in the speech is eliminated and phonemes can be recognized; and a storage control step of storing, in storage means, the corrected articulatory movement, which is the articulatory movement as corrected in the articulatory movement correction step.

The articulatory feature extraction method of the invention according to claim 6, in addition to the configuration of the invention according to claim 5, comprises a component extraction step of extracting a velocity component and an acceleration component from the displacement component of the articulatory feature sequence, and is characterized in that the articulatory movement correction step corrects the articulatory movement into the corrected articulatory movement based on at least one of the displacement component and the velocity and acceleration components extracted in the component extraction step.

The articulatory feature extraction method of the invention according to claim 7, in addition to the configuration of the invention according to claim 6, comprises a pattern recognition step of recognizing, based on at least one of the displacement component, the velocity component, and the acceleration component, whether the transition of the displacement component observed along the time axis forms a concave pattern or a convex pattern, and is characterized in that the articulatory movement correction step corrects the articulatory movement into the corrected articulatory movement based on the pattern recognized in the pattern recognition step.

The articulatory feature extraction method of the invention according to claim 8, in addition to the configuration of the invention according to claim 6 or 7, is characterized in that the articulatory movement correction step corrects the articulatory movement into the corrected articulatory movement by passing at least one of the displacement component, the velocity component, and the acceleration component through a neural network.

The articulatory feature extraction program of the invention according to claim 9 causes a computer to operate as each processing means of the articulatory feature extraction device according to any one of claims 1 to 4.

The articulatory feature extraction device of the invention according to claim 1 can extract, with high accuracy, the features of the articulatory movements that accompany speech utterance. This enables more accurate speech recognition than a conventional speech recognition device that recognizes speech from the speech spectrum.

In conventional speech recognition that uses the speech spectrum as the feature, the spectrum varies greatly with the speaker, the context of the utterance, ambient noise, and so on, so a large amount of speech data has been needed to design the HMMs used to obtain acoustic likelihoods. The HMMs have also required 10 or more mixture components, which makes a high-performance speech recognition device costly. In contrast, the articulatory feature extraction device of the present invention can extract the articulatory features in speech with high accuracy, so only a few mixture components per HMM are sufficient. In the conventional approach that uses the speech spectrum as the feature, various kinds of information other than linguistic information, such as external noise and coarticulation during utterance (the influence of preceding and following phonemes), are mixed into the spectrum, and as a result the number of variants of the phonemes and words to be classified grows explosively. In recent HMM-based speech recognition devices, when the speech spectrum is used directly as the input feature (what is widely used in practice is the mel cepstrum (Mel Frequency Cepstrum Coefficient; commonly called MFCC), obtained by mapping the speech spectrum onto the mel frequency scale to match auditory characteristics and applying the discrete cosine transform (DCT) to the logarithm of the spectrum), the variation of each vector element is represented by multiple normal distributions. Such a set of normal distributions is called a mixture distribution, and to cope with the variants described above, systems using 60 to 70 distributions have appeared in recent years. The need for such vast memory and computation can be said to result from trying to classify phonemes and words without identifying the variable hidden in the speech. By identifying the hidden variable as articulatory movement, the present invention makes it possible to keep the scale of the phoneme classifier and word classifier (here, the number of mixture components) small.
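The role of the number of mixture components can be made concrete with a small sketch of a one-dimensional Gaussian mixture. It shows how the log-likelihood of a single feature value is evaluated as a weighted sum of normal densities; the cost grows linearly with the number of components. The function name and all parameter values are hypothetical.

```python
import math


def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of scalar x under a mixture of normal distributions."""
    density = sum(
        w * math.exp(-((x - m) ** 2) / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
        for w, m, v in zip(weights, means, variances)
    )
    return math.log(density)


# One component: reduces to the ordinary normal log-density.
single = gmm_log_likelihood(0.0, [1.0], [0.0], [1.0])

# Three components: a broader, multi-modal model of the same feature.
mixed = gmm_log_likelihood(0.0, [0.5, 0.3, 0.2], [0.0, 2.0, -3.0], [1.0, 0.5, 2.0])
```

A model with 60 to 70 components per state repeats this computation 60 to 70 times for every feature dimension and every frame, which is the memory and computation burden that the paragraph above refers to.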

High-accuracy extraction of articulatory features also improves phoneme recognition performance dramatically, making it possible to deal with unknown words in the same way that humans do. For example, confirmation utterances can be synthesized from the recognized phoneme sequence, so that a dialogue can proceed smoothly.

Furthermore, because articulatory features in most cases correspond one-to-one with text (readings converted into kana sequences), spoken documents and text documents can be searched interchangeably from either speech or text (keyboard) input.

In the articulatory feature extraction device of the invention according to claim 2, in addition to the effects of the invention according to claim 1, the articulatory movement is corrected into the corrected articulatory movement based on at least one of the displacement component, the velocity component, and the acceleration component. The features of the articulatory movements that accompany speech utterance can therefore be extracted with high accuracy, independently of the speaker, the context of the utterance, ambient noise, and the like.

In the articulatory feature extraction device of the invention according to claim 3, in addition to the effects of the invention according to claim 2, the corrected articulatory movement is obtained based on the pattern of transition (concave or convex) observed when the displacement component is viewed along the time axis. The features of the articulatory movements that accompany speech utterance can therefore be extracted with high accuracy even when coarticulation causes a phoneme to differ from its state when uttered in isolation.

In the articulatory feature extraction device of the invention according to claim 4, in addition to the effects of the invention according to claim 2 or 3, the corrected articulatory movement can be obtained at high speed by using a neural network.

The articulatory feature extraction method of the invention according to claim 5 can extract, with high accuracy, the features of the articulatory movements that accompany speech utterance. This enables more accurate speech recognition than a conventional speech recognition device that recognizes speech from the speech spectrum.

In conventional speech recognition that uses the speech spectrum as the feature, the spectrum varies greatly with the speaker, the context of the utterance, ambient noise, and so on, so a large amount of speech data has been needed to design the hidden Markov models (HMMs) used to obtain acoustic likelihoods. The HMMs have also required 10 or more mixture components, which makes a high-performance speech recognition device costly. In contrast, the present invention can extract the articulatory features in speech with high accuracy, so only a few mixture components per HMM are sufficient. In the conventional approach that uses the speech spectrum as the feature, various kinds of information other than linguistic information, such as external noise and coarticulation during utterance (the influence of preceding and following phonemes), are mixed into the spectrum, and as a result the number of variants of the phonemes and words to be classified grows explosively. In recent HMM-based speech recognition devices, when the speech spectrum is used directly as the input feature (what is widely used in practice is the mel cepstrum (Mel Frequency Cepstrum Coefficient; commonly called MFCC), obtained by mapping the speech spectrum onto the mel frequency scale to match auditory characteristics and applying the discrete cosine transform (DCT) to the logarithm of the spectrum), the variation of each vector element is represented by multiple normal distributions. Such a set of normal distributions is called a mixture distribution, and to cope with the variants described above, systems using 60 to 70 distributions have appeared in recent years. The need for such vast memory and computation can be said to result from trying to classify phonemes and words without identifying the variable hidden in the speech. By identifying the hidden variable as articulatory movement, the present invention makes it possible to keep the scale of the phoneme classifier and word classifier (here, the number of mixture components) small.

In addition to the effect of the invention according to claim 5, in the articulation feature extraction method of the invention according to claim 6, the articulatory motion is corrected into a corrected articulatory motion based on at least one of a displacement component, a velocity component, and an acceleration component. The features of the articulatory actions accompanying a spoken utterance can therefore be extracted with high accuracy, independently of the speaker, the context of the utterance, ambient noise, and the like.

In addition to the effect of the invention according to claim 6, in the articulation feature extraction method of the invention according to claim 7, the corrected articulatory motion is obtained by a correction based on the transition pattern (concave pattern or convex pattern) of the displacement component observed along the time axis. Therefore, even when coarticulation causes a phoneme to differ from its state as an isolated sound, the features of the articulatory actions accompanying a spoken utterance can be extracted with high accuracy.

In addition to the effect of the invention according to claim 6 or 7, the articulation feature extraction method of the invention according to claim 8 can obtain the corrected articulatory motion at high speed by using a neural network.

The articulation feature extraction program of the invention according to claim 9 can cause a computer to operate as each processing means of the articulation feature extraction device according to any one of claims 1 to 4.

Embodiments of the articulation feature extraction device and the articulation feature extraction method of the present invention are described below with reference to the drawings. These drawings are used to explain technical features that the present invention may adopt. Unless specifically stated otherwise, the device configurations, flowcharts of the various processes, and the like shown in them are merely illustrative examples and are not intended to limit the invention.

First, the electrical configuration of the articulation feature extraction device 1 is described with reference to FIG. 1. FIG. 1 is a schematic diagram showing the electrical configuration of the articulation feature extraction device 1. As shown in FIG. 1, the articulation feature extraction device 1 comprises a central processing unit 11, an input device 12, an output device 13, a storage device 14, and an external storage device 15.

The central processing unit 11 performs processing such as numerical computation and control, and carries out computation and processing according to the procedures described in this embodiment. A CPU, for example, can be used. The input device 12 consists of a microphone, a keyboard, and the like, and receives speech uttered by the user and character strings entered by key. The output device 13 consists of a display, a speaker, and the like, and outputs the feature extraction result or information obtained by processing that result. The storage device 14 stores the processing procedure executed by the central processing unit 11 (the articulation feature extraction program) and the temporary data required for that processing. For example, a ROM (read-only memory) or a RAM (random-access memory) can be used. The external storage device 15 stores the feature analysis coefficient set used in the articulation feature extraction processing, the neural network weight coefficient set used in the articulation feature extraction processing, the coefficient set used in the articulatory motion correction processing, the models required for speech recognition processing, input speech data, analysis result data, and the like. For example, a hard disk drive (HDD) can be used. These components are electrically connected via a bus 22 so that they can exchange data with one another.

The hardware configuration of the articulation feature extraction device 1 of the present invention is not limited to the configuration shown in FIG. 1. For example, the device may also include a communication I/F for connecting to a communication network such as the Internet.

In this embodiment, the articulation feature extraction device 1 and the articulation feature extraction program are configured independently of other systems, but the present invention is not limited to this configuration. They may instead be incorporated as part of another device or as part of another program. In that case, input is provided indirectly through that other device or program.

Next, the data stored in the external storage device 15 is described. As shown in FIG. 1, the external storage device 15 is provided with a word pronunciation dictionary storage area 16 that stores a word pronunciation dictionary, a hidden Markov model storage area 17 that stores hidden Markov models, a language model storage area 18 that stores a language model, a coefficient storage area 19 that stores the coefficients used in each process, an input speech storage area 20 that stores input speech, a processing result storage area 21 that stores processed data, and other areas.

The word pronunciation dictionary storage area 16 stores the phoneme strings that make up words. The hidden Markov model storage area 17 stores the hidden Markov models that are referenced when the central processing unit 11 performs speech recognition. The language model storage area 18 stores the recognizable word models (language corpus). The coefficient storage area 19 stores the feature analysis coefficient set used in the articulation feature extraction processing, the neural network weight coefficient set used in the articulation feature extraction processing, the coefficient set used in the articulatory motion correction processing, and the like. The input speech storage area 20 stores speech data input via the input device 12. The processing result storage area 21 stores data obtained as a result of the various processes executed by the central processing unit 11. Details of these data are described later.

Next, the speech recognition processing executed by the articulation feature extraction device 1 of the present invention is described with reference to FIGS. 2 to 8. FIG. 2 is a functional block diagram showing the articulation feature extraction processing executed by the articulation feature extraction device 1. FIG. 3 is a block diagram showing the functional details of the feature analysis unit 210. FIG. 4 is a block diagram showing the functional details of the articulation feature extraction unit 220. FIG. 5 is an example of the local features in the time direction obtained from the local feature extraction unit 221. FIG. 6 is an example of the local features in the frequency direction obtained from the local feature extraction unit 221. FIG. 7 is an example of the articulation features obtained by the discriminative phoneme feature extraction unit 222. FIG. 8 is a block diagram showing the functional details of the articulatory motion correction unit 230. FIG. 9 is a flowchart showing the processing in the articulatory motion correction processing unit 232.

As shown in FIG. 2, the functional blocks required for the articulation feature extraction processing executed in the articulation feature extraction device 1 of the present invention are an input unit 201, an A/D conversion unit 202, a feature analysis unit 210, an articulation feature extraction unit 220, an articulatory motion correction unit 230, a word classification unit 204, an output unit 205, a storage unit 206, and a storage unit 207.

The storage unit 207 stores various coefficient sets 2071, which can be referenced by the feature analysis unit 210, the articulation feature extraction unit 220, and the articulatory motion correction unit 230. The storage unit 206 stores a pronunciation word dictionary 2061, hidden Markov models 2062, a language model 2063, and other data, which can be referenced by the word classification unit 204.

The input unit 201, the A/D conversion unit 202, the word classification unit 204, and the output unit 205 in FIG. 2 have the same functions as the corresponding parts of the conventional speech recognition device shown in FIG. 15, so their description is omitted or abbreviated.

The input unit 201 receives speech input from outside and converts it into an analog electrical signal. The A/D conversion unit 202 converts the analog signal received by the input unit 201 into a digital signal. The feature analysis unit 210 extracts the predetermined feature quantities required for speech recognition (see FIG. 3; details are given later). The articulation feature extraction unit 220 extracts time-series data of articulation features (hereinafter the "articulation feature sequence") from the time-series data of the feature quantities extracted by the feature analysis unit 210 (see FIG. 4; details are given later). The articulatory motion correction unit 230 converts the articulation feature sequence extracted by the articulation feature extraction unit 220 into a motion trajectory and then corrects that trajectory according to predetermined rules (see FIG. 8; details are given later).

The word classification unit 204 searches for the words contained in the speech based on the corrected articulatory motion obtained from the articulatory motion correction unit 230 (hereinafter the "corrected articulatory motion"). The storage unit 207 is referenced when the feature analysis unit 210, the articulation feature extraction unit 220, and the articulatory motion correction unit 230 execute their processing. The storage unit 206 is referenced when the word classification unit 204 searches for words. The output unit 205 outputs the words found by the word classification unit 204.

The flow of the speech recognition processing based on the functional blocks in FIG. 2 is as follows. Unknown speech input from the input unit 201 is discretized by the A/D conversion unit 202 and converted into a digital signal. The converted digital signal is then output to the feature analysis unit 210.

The functional details of the feature analysis unit 210 are described with reference to FIG. 3. As shown in FIG. 3, the feature analysis unit 210 comprises a Fourier transform unit 211 and a filter unit 212. In the feature analysis unit 210, the digital signal produced by the A/D conversion unit 202 first undergoes Fourier analysis in the Fourier transform unit 211 (using a Hamming window with a width of 24 to 32 msec). Next, in the filter unit 212, the result is passed through a bank of about 24 band-pass filter channels to remove noise components. This yields a speech spectrum sequence and a speech power sequence at intervals of 5 to 10 msec. The resulting speech spectrum sequence and speech power sequence are output to the articulation feature extraction unit 220.
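The analysis stage described above can be sketched roughly as follows. This is a minimal illustration and not the patented implementation: the function name `analyze_features`, the 16 kHz sampling rate, the 25 msec window, the 10 msec shift, and the equal-width pooling of FFT bins into 24 bands are all assumptions chosen to fall within the ranges given in the text. The text does not specify the shape or spacing of the band-pass filters.

```python
import numpy as np

def analyze_features(signal, sample_rate=16000, window_ms=25.0,
                     shift_ms=10.0, n_channels=24):
    """Return (log band spectrum [frames x channels], log power [frames]).

    All default values are illustrative assumptions, not values fixed
    by the source text.
    """
    signal = np.asarray(signal, dtype=float)
    win_len = int(round(sample_rate * window_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    if len(signal) < win_len:
        raise ValueError("signal is shorter than one analysis window")

    # Cut the signal into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - win_len) // shift
    frames = np.stack([signal[i * shift:i * shift + win_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(win_len)

    # Short-time power spectrum of each frame.
    power_spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2

    # ASSUMPTION: equal-width bands stand in for the band-pass filter bank.
    edges = np.linspace(0, power_spec.shape[1], n_channels + 1).astype(int)
    bands = np.stack([power_spec[:, lo:hi].sum(axis=1)
                      for lo, hi in zip(edges[:-1], edges[1:])], axis=1)

    eps = 1e-10  # avoids log(0) on silent frames
    return np.log(bands + eps), np.log(power_spec.sum(axis=1) + eps)
```

Each row of the first return value corresponds to one frame of the speech spectrum sequence, and the second return value plays the role of the speech power sequence.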

The functional details of the articulation feature extraction unit 220 are described with reference to FIG. 4. The articulation feature extraction unit 220 extracts the motion features involved in articulation. As shown in FIG. 4, the articulation feature extraction unit 220 comprises a local feature extraction unit 221 and a discriminative phoneme feature extraction unit 222.

The speech spectrum sequence obtained from the feature analysis unit 210 is first input to the local feature extraction unit 221. In the local feature extraction unit 221, a time-axis differential feature extraction unit 223 and a frequency-axis differential feature extraction unit 224 extract differential features in the time-axis direction and the frequency-axis direction. Separately, the time-axis differential feature of the speech power sequence is calculated. When these differential features (hereinafter "local features") are extracted, the linear regression operations given by equations (2) and (3) are used to suppress the influence of noise fluctuations and the like.

Here, x(i, t) denotes the speech spectrum sequence or the speech power sequence, i denotes the frequency channel (for the speech power sequence, i = 1), and t denotes time. Δ_t x(i, t) and Δ_f x(i, t) are the first-order derivatives of x(i, t) in the time direction and in the frequency direction, respectively.

In the equations, k denotes the position at which the linear regression is evaluated, and δ is the one-sided width. Specifically, for local feature extraction, δ = 1 and the linear regression uses three points: in the time direction, the three points t = -1, 0, +1 centered on the time of interest, and in the frequency direction, the three points i = -1, 0, +1 centered on the channel of interest. The linear regression coefficients are obtained from these points using equations (2) and (3), respectively. FIG. 5 and FIG. 6 show examples of the time-direction and frequency-direction local features calculated by the local feature extraction unit 221, obtained for an utterance of "artificial satellite" (jinkoese). The extracted local features are output to the discriminative phoneme feature extraction unit 222.
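Equations (2) and (3) are not reproduced in this text, so the following sketch assumes the standard least-squares slope over 2δ+1 points, that is, the sum of k·x(position + k) for k from -δ to δ divided by the sum of k². With δ = 1 this reduces to half the difference between the two neighbors. The function name `regression_delta`, the frames-by-channels array layout, and the edge handling by replicating boundary values are illustrative assumptions.

```python
import numpy as np

def regression_delta(x, axis, delta=1):
    """Least-squares slope over 2*delta+1 points along `axis`.

    ASSUMPTION: standard regression-slope form; boundary values are
    replicated so the output has the same shape as the input.
    """
    x = np.asarray(x, dtype=float)
    pad = [(0, 0)] * x.ndim
    pad[axis] = (delta, delta)
    padded = np.pad(x, pad, mode="edge")
    n = x.shape[axis]

    num = np.zeros_like(x)
    for k in range(1, delta + 1):
        ahead = np.take(padded, np.arange(delta + k, delta + k + n), axis=axis)
        behind = np.take(padded, np.arange(delta - k, delta - k + n), axis=axis)
        # The terms for +k and -k combine into k * (x[+k] - x[-k]).
        num += k * (ahead - behind)
    return num / (2.0 * sum(k * k for k in range(1, delta + 1)))

# For a spectrum array laid out as [time, channel]:
#   time-direction local feature:      regression_delta(spectrum, axis=0)
#   frequency-direction local feature: regression_delta(spectrum, axis=1)
```

Compared with a simple two-point difference, the regression form averages over the neighborhood, which is the noise-suppression property the text refers to.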

Instead of the local features described above, the speech spectrum, or a cepstrum obtained by orthogonalizing the speech spectrum (in practice, the mel cepstrum obtained by warping the frequency axis to the mel scale), may be used as the input data of the discriminative phoneme feature extraction unit 222, although performance is slightly lower.

Next, as shown in FIG. 4, the discriminative phoneme feature extraction unit 222 extracts the articulation feature sequence based on the local features extracted by the local feature extraction unit 221. The discriminative phoneme feature extraction unit 222 consists of a two-stage neural network (a first multilayer neural network 225 and a second multilayer neural network 226).

The neural network that makes up the discriminative phoneme feature extraction unit 222 is described in detail below. As shown in FIG. 4, it consists of two stages: the first multilayer neural network 225 as the initial stage and the second multilayer neural network 226 as the following stage. The first multilayer neural network 225 extracts the articulation feature sequence from the correlations among the local features obtained from the speech spectrum sequence and the speech power sequence. The second multilayer neural network 226 extracts a meaningful subspace from the interdependencies within the articulation feature sequence and produces a more accurate articulation feature sequence. FIG. 7 shows an example of the articulation feature extraction result calculated by the discriminative phoneme feature extraction unit 222, obtained for an utterance of "artificial satellite" (jinkoese).
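The two-stage arrangement can be sketched as two small multilayer perceptrons applied in sequence. Everything in this sketch is an assumption made for illustration: the layer sizes `N_LOCAL` and `N_HIDDEN`, the sigmoid units, the random untrained weights, and the choice to feed stage 2 only the current frame of stage-1 output. The output size of 15 follows the (1)...(15) labeling of the per-feature blocks in FIG. 8. A real extractor would use weights trained as described in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_layers(sizes, rng):
    """Create (weight, bias) pairs for consecutive layer sizes."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_forward(x, layers):
    """Forward pass through fully connected layers with sigmoid units."""
    h = x
    for weight, bias in layers:
        h = sigmoid(h @ weight + bias)
    return h

# Hypothetical sizes; only N_DPF = 15 is suggested by the figure labels.
N_LOCAL, N_HIDDEN, N_DPF = 75, 64, 15

rng = np.random.default_rng(0)
stage1 = init_layers([N_LOCAL, N_HIDDEN, N_DPF], rng)  # local features -> DPF
stage2 = init_layers([N_DPF, N_HIDDEN, N_DPF], rng)    # DPF -> refined DPF

def extract_dpf(local_features):
    """Map [frames x N_LOCAL] local features to [frames x N_DPF] outputs."""
    coarse = mlp_forward(local_features, stage1)
    return mlp_forward(coarse, stage2)
```

The second stage sees only articulation-feature values, so it can model dependencies among features that the first stage treats independently.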

Besides the two-stage configuration shown in FIG. 4, the neural network that produces the articulation feature sequence can also be realized as a single stage, at some cost in performance (see Non-Patent Document 5). Each neural network has a layered structure with one or two hidden layers in addition to the input and output layers (a multilayer neural network). A so-called recurrent neural network, which feeds the output layer or a hidden layer back to the input layer, may also be used. In terms of articulation feature extraction performance, the results from these neural network types do not differ greatly. These neural networks function as articulation feature extractors after their weight coefficients are trained as described in Non-Patent Document 6.
Masatoshi Sakawa and Masahiro Tanaka, Introduction to Neurocomputing, Morikita Publishing (1997). For multilayer neural networks, Chapter 2, "Hierarchical Networks and Learning Mechanisms" (pp. 13-48), describes how to compute the weight coefficients by error backpropagation. For recurrent neural networks, Chapter 4, "Recurrent Neural Networks" (pp. 83-96), likewise describes how to compute the weight coefficients.

The neural network of the discriminative phoneme feature extraction unit 222 is trained by applying the local feature data of speech to the input layer and supplying the corresponding articulation features to the output layer as the teacher signal.

The articulation feature sequence itself is a signal commanded from the brain to the articulatory organs, whereas the articulation feature sequence obtained from speech is the result of the articulatory actions performed in response to that command. It is therefore considered to be blurred by the sluggishness of the muscle movements of the speech organs. For this reason, the present invention introduces the articulatory motion correction unit 230 as a process that brings the result of the analog muscle movements of speech closer to the ideal articulation sequence (a binary discrete sequence).

The articulatory motion correction unit 230 is described with reference to FIG. 8. As shown in FIG. 8, the articulatory motion correction unit 230 comprises a velocity/acceleration component extraction unit 231 and an articulatory motion correction processing unit 232. The velocity/acceleration component extraction unit 231 obtains the velocity and acceleration from the articulation feature sequence (such as a discriminative phoneme feature sequence). The articulatory motion correction processing unit 232 corrects the motion of articulation represented by the articulation feature sequence (the "articulatory motion") based on the velocity and acceleration values obtained by the velocity/acceleration component extraction unit 231. Articulatory motion is defined by three quantities: articulatory displacement (the displacement component, that is, the amplitude of the articulation feature), articulatory velocity (the velocity component, that is, the time derivative of the articulation feature), and articulatory acceleration (the acceleration component, that is, the time derivative of the articulatory velocity and the second derivative of the articulatory displacement).

First, the processing in the velocity/acceleration component extraction unit 231 is described in detail with reference to FIG. 8. To obtain the velocity and acceleration components from the displacement component of the articulation feature sequence, x(i, t) in equation (2) is first replaced with the articulation feature sequence DPF(m, t). This yields the velocity component sequence VDPF(m, t). Here, m (= 1, 2, ..., M) is the articulation feature number, indicating a feature such as plosive or high (tongue height), and t (= 1, 2, ..., T) is the time.

Next, the velocity component sequence VDPF(m, t) obtained above is likewise substituted for x(i, t) in equation (2). This yields the acceleration component sequence ADPF(m, t). In FIG. 8, the blocks V/ADPF(1) ... (15) 233 in the velocity/acceleration component extraction unit 231 represent this calculation algorithm.
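The two substitutions described above amount to applying the same regression slope twice along the time axis. The sketch below assumes the three-point form (δ = 1), a [time, feature] array layout, and replication of the boundary frames. The function names `time_delta` and `velocity_and_acceleration` are illustrative and do not appear in the source.

```python
import numpy as np

def time_delta(seq):
    """Three-point regression slope along time (axis 0).

    ASSUMPTION: delta = 1, so the slope is (x[t+1] - x[t-1]) / 2, with
    the first and last frames replicated at the boundaries.
    """
    seq = np.asarray(seq, dtype=float)
    padded = np.pad(seq, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0

def velocity_and_acceleration(dpf):
    """Return (VDPF, ADPF) for a DPF array of shape [T, M]."""
    vdpf = time_delta(dpf)    # DPF substituted into the regression
    adpf = time_delta(vdpf)   # VDPF substituted into the same regression
    return vdpf, adpf
```

Because each pass replicates boundary frames, the first and last two frames of the acceleration are less reliable than the interior.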

Next, the processing in the articulatory motion correction processing unit 232 is described in detail with reference to FIG. 8. The articulatory motion correction processing unit 232 corrects the articulatory motion for each articulation feature m, using the velocity and acceleration components (VDPF(m, t) and ADPF(m, t)) obtained by the velocity/acceleration component extraction unit 231. In FIG. 8, the blocks MDPF(1) ... (15) 234 in the articulatory motion correction processing unit 232 represent this correction algorithm.

The specific processing in the articulatory motion correction unit 230 is described with reference to the flowchart in FIG. 9. This processing is based on the following assumption: "An articulatory motion is performed to realize an articulatory action (the lips close, the front of the tongue rises, and so on), and as a result an upward-convex motion is observed. When the articulation ends, a downward-convex motion is observed."

As shown in FIG. 9, in the articulatory motion correction unit 230, the velocity/acceleration component extraction unit 231 first calculates the acceleration component ADPF(m, t) from the articulation feature sequence DPF(m, t) (S11). Next, it is determined whether the calculated acceleration component ADPF(m, t) is positive, negative, or zero (S13, S15). The articulatory motion is then corrected according to the result (S17, S19, S21). When the acceleration component ADPF(m, t) is negative, the motion trajectory of the articulation feature sequence is passing through a peak: it approaches a local maximum (this time point is called the articulation point) and then moves away from it. When the acceleration component is positive, the motion trajectory is in a descending state, that is, the articulatory action has finished or preparation for the next articulatory action is under way, and the motion is one that moves away from the articulation point after the articulatory action has ended.

As shown in FIG. 9, when ADPF(m, t) is positive (S13: YES), the acceleration component ADPF(m, t) is substituted into equation (4) in order to suppress the articulatory action. This yields the suppression/enhancement function f(m, t) (S17). Equation (4) implements suppression using a sigmoid function of the kind often used in neural networks. Processing then proceeds to S23.

When ADPF(m, t) is zero (S13: NO, S15: YES), no correction is applied to the acceleration component ADPF(m, t). As a result, the suppression/enhancement function f(m, t) is set to 1 (see equation (5)) (S19). Processing then proceeds to S23.

When ADPF(m, t) is negative (S15: NO), the acceleration component ADPF(m, t) is substituted into equation (6) in order to enhance the articulatory action. This yields the suppression/enhancement function f(m, t) (S21). Like equation (4), equation (6) implements enhancement using a sigmoid function of the kind often used in neural networks. Processing then proceeds to S23.

Next, in S23, the articulatory feature series DPF(m, t) is multiplied by the calculated suppression/enhancement function f(m, t). This corrects the articulatory motion, and the process ends.

Thus, in the flowchart of FIG. 9, the suppression/enhancement function f(m, t) is calculated using a sigmoid function. The original articulatory feature series DPF(m, t) is then multiplied by the calculated value, which corrects the articulatory motion and yields the corrected articulatory motion DPF'.
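The flow S11 to S23 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented formulas: equations (4) to (6) are not reproduced in this section, so the function `gain` below is a hypothetical sigmoid-based stand-in, the second difference used for ADPF(m, t) is an assumed discretization, and the parameters `k_suppress` and `k_enhance` are invented for the example.

```python
import math

def acceleration(dpf, t):
    """Second difference of one feature trajectory at frame t (edges clamped).

    Assumed discretization of ADPF(m, t); the patent's own formula may differ.
    """
    prev_value = dpf[max(t - 1, 0)]
    next_value = dpf[min(t + 1, len(dpf) - 1)]
    return next_value - 2.0 * dpf[t] + prev_value

def gain(adpf, k_suppress=4.0, k_enhance=4.0):
    """Hypothetical stand-in for f(m, t); not the patent's equations (4) to (6)."""
    if adpf > 0:       # S13: YES -> suppress (S17), result lies in (0, 1)
        return 2.0 / (1.0 + math.exp(k_suppress * adpf))
    if adpf == 0:      # S15: YES -> leave unchanged (S19)
        return 1.0
    # S15: NO -> enhance (S21), result lies in (1, 2)
    return 2.0 / (1.0 + math.exp(k_enhance * adpf))

def correct_trajectory(dpf):
    """S23: multiply each frame of DPF(m, t) by f(m, t)."""
    return [dpf[t] * gain(acceleration(dpf, t)) for t in range(len(dpf))]

# One feature trajectory with a single peak at the middle frame.
corrected = correct_trajectory([0.0, 0.5, 1.0, 0.5, 0.0])
```

With this stand-in, the peak frame (negative acceleration) is scaled up, frames with zero acceleration pass through unchanged, and frames with positive acceleration are scaled down, which mirrors the three branches of FIG. 9.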

The articulatory motion correction processing of the articulatory motion correction unit 230 described with reference to FIGS. 8 and 9 is not limited to this embodiment and can also be realized by other methods. Variations of the articulatory motion correction unit are described below with reference to FIGS. 10 and 11. FIG. 10 is a block diagram showing the functional details of the articulatory motion correction unit 330. FIG. 11 is a block diagram showing the functional details of the articulatory motion correction unit 430.

First, the configuration of the articulatory motion correction unit 330, which uses neural networks, is described with reference to FIG. 10. In the articulatory motion correction unit 330 shown in FIG. 10, the articulatory motion correction processing unit 332 consists of neural networks NDPF(1) ... (15) 334, one provided for each articulatory feature. In the articulatory motion correction unit 330, the velocity/acceleration component extraction unit 331 first calculates the velocity component series VDPF(m, t) and the acceleration component series ADPF(m, t) from the articulatory feature series DPF(m, t). (In FIG. 10, V/ADPF(1) ... (15) 333 in the velocity/acceleration component extraction unit 331 represents this calculation algorithm.)

The articulatory feature series DPF(m, t), together with the calculated velocity component series VDPF(m, t) and acceleration component series ADPF(m, t), is then input to the neural networks NDPF(1) ... (15) 334 of the articulatory motion correction processing unit 332. The networks NDPF(1) ... (15) 334 correct the articulatory motion and output the corrected articulatory motion.
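The data flow of FIG. 10 for a single articulatory feature m can be sketched as below. Everything specific here is an assumption made for illustration: the central-difference formulas for VDPF and ADPF, the network shape (one tanh hidden layer and a sigmoid output), and the names `central_diff`, `motion_components`, `tiny_mlp`, and `correct_feature`. The weights are random, so the output only demonstrates the wiring, not a trained correction.

```python
import math
import random

def central_diff(x):
    """Central difference with clamped edges; an assumed stand-in for the
    velocity computation, which is not specified in this section."""
    n = len(x)
    return [(x[min(t + 1, n - 1)] - x[max(t - 1, 0)]) / 2.0 for t in range(n)]

def motion_components(dpf):
    """Return (VDPF, ADPF) for one feature trajectory DPF(m, .)."""
    velocity = central_diff(dpf)
    accel = central_diff(velocity)
    return velocity, accel

def tiny_mlp(inputs, w_hidden, w_out):
    """One hidden tanh layer followed by a sigmoid output in (0, 1)."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, inputs))) for row in w_hidden]
    z = sum(w * h for w, h in zip(w_out, hidden))
    return 1.0 / (1.0 + math.exp(-z))

def correct_feature(dpf, w_hidden, w_out):
    """Feed displacement, velocity and acceleration of each frame to the network."""
    velocity, accel = motion_components(dpf)
    return [tiny_mlp([dpf[t], velocity[t], accel[t]], w_hidden, w_out)
            for t in range(len(dpf))]

# Untrained, randomly initialised weights (hypothetical sizes: 3 inputs, 4 hidden units).
random.seed(0)
w_hidden = [[random.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(4)]
w_out = [random.uniform(-1.0, 1.0) for _ in range(4)]
corrected = correct_feature([0.0, 0.3, 0.9, 0.4, 0.1], w_hidden, w_out)
```

In the arrangement described above there would be fifteen such networks, one per articulatory feature, each trained separately.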

Next, the configuration of the articulatory motion correction unit 430, which uses an integrated neural network, is described with reference to FIG. 11. In the articulatory motion correction unit 430 shown in FIG. 11, the articulatory motion correction processing unit 432 is configured as a single integrated neural network NDPF 434 that incorporates constraints between articulatory features, in place of the independent per-feature neural networks shown in FIG. 10. The processing in the velocity/acceleration component extraction unit 431 and the data output to the articulatory motion correction processing unit 432 are the same as in FIG. 10, so their description is omitted.

As shown in FIG. 4, the corrected articulatory motion produced by the articulatory motion correction unit 230 (the same applies to the articulatory motion correction units 330 and 430) is passed to the word classification unit 204, which refers to the word pronunciation dictionary 2061, the HMM 2062, and the language model 2063 to identify the spoken word. The identified word is then output from the output unit 205. The calculation performed in word classification is the same as in the conventional method described in the background art. That is, substituting the articulatory feature DPF(m, t) for the input speech feature x(i, t) in equation (1) (the speech spectrum or MFCC in the conventional method) yields the acoustic likelihood of word k (or phoneme k).

As described above, the articulatory feature extraction device of the present invention provides a process that extracts the articulatory feature series (articulatory feature extraction unit 220) and a process that corrects the resulting series so that it approaches the original articulatory gesture (articulatory motion correction unit 230). This makes it possible to extract the features of the articulatory gestures accompanying a speech utterance with high accuracy, and therefore to perform more accurate speech recognition than a conventional speech recognition device that recognizes speech from the speech spectrum.

In conventional speech recognition based on the speech spectrum, the spectrum varies greatly with the speaker, the context of the utterance, ambient noise, and so on, so a large amount of speech data has been needed to design the hidden Markov models (HMMs) used to obtain the acoustic likelihood. In addition, ten or more HMM mixture components have been required, which makes a high-performance speech recognition device costly. In contrast, the articulatory feature extraction device of the present invention extracts the articulatory features in speech with high accuracy, so only a few HMM mixture components are needed. High-accuracy extraction of articulatory features also dramatically improves phoneme recognition performance and makes it possible to handle unknown words in the same way humans do. A confirmation utterance can therefore be synthesized from the phoneme sequence, which allows the dialogue to proceed smoothly.
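For orientation, the "number of mixtures" discussed above can be read against the standard Gaussian-mixture output density of an HMM state. The expression below is the textbook form, given in conventional notation chosen for this note; it is not the patent's equation (1), which is not reproduced in this section. Here j is an HMM state, K is the number of mixture components, and x_t is the feature vector at frame t (the spectrum or MFCC in the conventional method, the articulatory features DPF(m, t) in the present one).

```latex
b_j(\mathbf{x}_t) \;=\; \sum_{k=1}^{K} c_{jk}\,
  \mathcal{N}\!\left(\mathbf{x}_t;\ \boldsymbol{\mu}_{jk},\ \boldsymbol{\Sigma}_{jk}\right),
\qquad \sum_{k=1}^{K} c_{jk} = 1,\quad c_{jk} \ge 0 .
```

Each component adds a weight, a mean vector, and a covariance to every state, so lowering K from ten or more to a few reduces both the number of parameters to be trained and the cost of evaluating the likelihood per frame.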

In addition, articulatory features in most cases correspond one-to-one with text (the reading converted into a kana sequence), so spoken documents and text documents can both be searched, interchangeably, from either speech or text (keyboard) queries.

In addition, because the articulatory motion is corrected into the corrected articulatory motion on the basis of the displacement, velocity, and acceleration components, the features of the articulatory gestures accompanying a speech utterance can be extracted with high accuracy regardless of the speaker, the context of the utterance, ambient noise, and so on.

In addition, because the articulatory motion is corrected on the basis of the pattern of the motion trajectory (concave pattern or convex pattern), the features of the articulatory gestures accompanying a speech utterance can be extracted with high accuracy even when coarticulation causes a phoneme to differ from its form in isolation.
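A concave/convex decision of this kind can be sketched from the sign of the acceleration component, in line with the branches S13 and S15 of FIG. 9. The sketch assumes that "convex" denotes a peak (negative acceleration) and "concave" a trough (positive acceleration), and it uses a second difference with clamped edges as the acceleration; the function name `classify_pattern` and the label "flat" are introduced here for illustration only.

```python
def classify_pattern(dpf, t):
    """Label frame t of one feature trajectory from the sign of its second difference."""
    prev_value = dpf[max(t - 1, 0)]
    next_value = dpf[min(t + 1, len(dpf) - 1)]
    second_diff = next_value - 2.0 * dpf[t] + prev_value
    if second_diff < 0:
        return "convex"    # trajectory bulges upward: a peak, to be enhanced
    if second_diff > 0:
        return "concave"   # trajectory bulges downward: a trough, to be suppressed
    return "flat"          # zero acceleration: left unchanged

labels = [classify_pattern([0.0, 0.5, 1.0, 0.5, 0.0], t) for t in range(5)]
```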

In addition, because the processing that corrects the articulatory motion is executed by a neural network, the corrected articulatory motion can be obtained at high speed.

The input device 12 in FIG. 1 corresponds to the "voice acquisition means" of the present invention. The central processing unit 11 performing the processing of the articulatory feature extraction unit 220 in FIG. 2 corresponds to the "articulatory feature extraction means", and the central processing unit 11 performing the processing of the articulatory motion correction unit 230 corresponds to the "articulatory motion correcting means". The external storage device 15 in FIG. 1 corresponds to the "storage means", and the central processing unit 11 performing the processing that stores the corrected articulatory motion data in the storage means corresponds to the "storage control means".

Further, the central processing unit 11 performing the processing that extracts the velocity and acceleration components in the velocity/acceleration component extraction unit 231 in FIG. 8 corresponds to the "component extraction means" of the present invention, and the central processing unit 11 performing the processing of S13 and S15 in FIG. 9 corresponds to the "pattern recognition means" of the present invention.
<Experimental example>

An experimental example using the articulatory feature extraction device described above is explained below with reference to the drawings. First, examples of the articulatory features extracted from an utterance before and after articulatory motion correction are described with reference to FIGS. 12 and 13. FIG. 12 shows an example of the articulatory features extracted before articulatory motion correction. FIG. 13 shows an example of the articulatory features extracted after articulatory motion correction. In this example, discriminative phoneme features are used as the articulatory features, but a similar effect is expected with other articulatory feature representations (for example, the articulatory features in the International Phonetic Alphabet (IPA) chart).

An example of the articulatory features extracted before articulatory motion correction is described with reference to FIG. 12. FIG. 12 shows the articulatory features extracted for the Japanese utterance 人工衛星 ("artificial satellite"). In this example, the neural networks in the discriminative phoneme feature extraction unit 222 (first multilayer neural network 225 and second multilayer neural network 226; see FIG. 4) take as input data spanning three frames: the local features at time t together with the local features at frames t-3 and t+3. Likewise, the output of the neural networks in the discriminative phoneme feature extraction unit 222 (see FIG. 4) is arranged to give the articulatory feature series for times (t-3, t, t+3), that is, DPF(m, t-3), DPF(m, t), and DPF(m, t+3) (m: articulatory feature index, m = 1, 2, ..., 15), so that an articulatory feature series including the preceding and following context is obtained. Of these, FIG. 12 shows the transition of the central articulatory feature series DPF(m, t).

In FIG. 12, the discriminative features are arranged along the vertical axis and the individual phonemes along the horizontal axis. The portion labeled "silB" (silence of beginning part) in the top row is a silent interval, and the portions labeled "jinkoese" are the intervals in which the respective sounds are uttered. The dotted lines in FIG. 12 indicate the ideal, correct articulatory features, and the solid lines indicate the articulatory features actually calculated (as extracted by the discriminative phoneme feature extraction unit 222; see FIG. 4).

As FIG. 12 shows, comparing the solid lines (calculated data) with the dotted lines (ideal data) reveals a large gap between the two in both the silent interval and the utterance intervals, as well as large fluctuations in the solid lines.

Next, an example of the articulatory features extracted after articulatory motion correction is described with reference to FIG. 13. Like FIG. 12, FIG. 13 shows the discriminative phoneme features extracted for the utterance 人工衛星 ("artificial satellite"). In this example, the articulatory motion correction unit 330 described with reference to FIG. 10, which uses a neural network trained for each articulatory feature, is used as the articulatory motion correction unit. Seven frames of DPF(m, t), VDPF(m, t), and ADPF(m, t) are used as the input to the neural networks NDPF(1) ... (15) 334. In general, too few frames give little effect and too many tend to over-smooth the result, so a value between 3 and 9 should be chosen depending on the articulatory feature (preferably shorter for plosives and longer for vowels, for example).
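Assembling such a multi-frame input can be sketched as follows. The sketch assumes a window centered on frame t with clamping at the ends of the utterance, and an odd window width; the names `window`, `network_input`, `FRAME_WIDTH`, and `DEFAULT_WIDTH`, as well as the particular mapping of feature classes to widths, are hypothetical and only illustrate the stated range of 3 to 9 frames.

```python
# Hypothetical per-class window widths within the stated range of 3 to 9 frames.
FRAME_WIDTH = {"plosive": 3, "vowel": 9}
DEFAULT_WIDTH = 7  # the width used in the experiment described above

def window(series, t, width):
    """Return `width` consecutive values centered on frame t (edges clamped).

    `width` is assumed to be odd so that the window is symmetric around t.
    """
    half = width // 2
    last = len(series) - 1
    return [series[min(max(i, 0), last)] for i in range(t - half, t + half + 1)]

def network_input(dpf, vdpf, adpf, t, feature_class=None):
    """Concatenate the windows of displacement, velocity and acceleration."""
    width = FRAME_WIDTH.get(feature_class, DEFAULT_WIDTH)
    return window(dpf, t, width) + window(vdpf, t, width) + window(adpf, t, width)

series = [float(i) for i in range(10)]
x = network_input(series, series, series, 5)   # 3 * 7 = 21 input values
```

With the default width of seven frames, each per-feature network receives 21 values per frame; a plosive-like feature with width 3 would receive 9, and a vowel-like feature with width 9 would receive 27.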

As FIG. 13 shows, the solid lines (calculated data) and the dotted lines (ideal data) agree closely, so the calculated result is very close to the actual utterance. Compared with FIG. 12, the solid lines are also smoother, and the noise that occurred in the silent interval is suppressed. DPF(m, t) was confirmed to take values that follow the actual utterance. These results show that articulatory motion correction greatly improves the articulatory feature series.

Next, the results of calculating the articulatory feature extraction rate are described with reference to FIG. 14. FIG. 14 is a graph showing the calculated articulatory feature accuracy rate. The accuracy rate was obtained using a Japanese newspaper read-speech corpus (male speech data from about 100 speakers) by calculating the rate of correct results when speech recognition is performed on the articulatory features obtained with and without articulatory motion correction and under different correction conditions.

To evaluate the effect of the correction conditions, articulatory motion correction was performed under three conditions: method 1, simple correction processing; method 2, neural network processing for each articulatory feature (see FIG. 10); and method 3, integrated neural network processing (see FIG. 11). The accuracy rate was then calculated for speech recognition performed on the resulting articulatory features.

As FIG. 14 shows, without articulatory motion correction ("no correction" in the figure) the extraction performance is just under 90%. With articulatory motion correction ("method 1", "method 2", and "method 3" in the figure), the accuracy rate is markedly higher than without correction.

FIG. 14 also shows that the accuracy rate improves in the order method 1 (92%), method 2 (93%), method 3 (94%). However, the amount of computation required for the correction also grows in the order method 1, method 2, method 3, so the method should be selected according to the purpose.

Next, the relationship between the number of HMM (phoneme recognizer) mixture components required when the articulatory feature extraction device described above is used and the recognition accuracy is described with reference to Table 1. Table 1 shows the relationship between the number of HMM mixture components and the recognition rate when articulatory features are extracted. It compares the case in which MFCCs are input directly to an HMM-based phoneme recognizer with the case in which articulatory features are input. In Table 1, "articulatory feature (uncorrected)" corresponds to the case in which the discriminative phoneme features (DPF) are extracted by the neural network, and "articulatory feature (corrected)" corresponds to the case in which the articulatory gesture correction according to the present invention is additionally applied.

The results in Table 1 show that when MFCCs are input directly, as in the conventional method, a large number of mixture components is needed to raise the recognition accuracy. When articulatory features are input, relatively high performance is obtained even with a single mixture component. When articulatory features with corrected articulatory gestures are input, the performance is higher still. This shows that the scale of the phoneme classifier or word classifier (here, the number of mixture components) can be kept small.

In the embodiment, a neural network is used as the means for correcting the articulatory motion, and the articulatory motion is corrected by passing the articulatory feature (displacement) component, the velocity component, and the acceleration component through it. The present invention is not limited to this. For example, although this application uses HMMs for phoneme and word recognition, statistical pattern classification means such as an HMM can also be introduced as the articulatory feature correction means in place of the neural network. In that case, the HMM or similar model is used as a model of articulatory features rather than of phonemes or words. The key point is that a "means for correcting" is provided for the articulatory motion produced by the "means for extracting articulatory features", and that the correction is achieved by passing the articulatory feature (displacement) component, velocity component, and acceleration component that express the articulatory motion through this articulatory motion correcting means.

FIG. 1 is a schematic diagram showing the electrical configuration of the articulatory feature extraction device 1.
FIG. 2 is a functional block diagram showing the articulatory feature extraction processing executed by the articulatory feature extraction device 1.
FIG. 3 is a block diagram showing the functional details of the feature analysis unit 210.
FIG. 4 is a block diagram showing the functional details of the articulatory feature extraction unit 220.
FIG. 5 is an example of local features in the time direction obtained from the local feature extraction unit 221.
FIG. 6 is an example of local features in the frequency direction obtained from the local feature extraction unit 221.
FIG. 7 is an example of the articulatory features obtained by the discriminative phoneme feature extraction unit 222.
FIG. 8 is a block diagram showing the functional details of the articulatory motion correction unit 230.
FIG. 9 is a flowchart showing the processing in the articulatory motion correction processing unit 232.
FIG. 10 is a block diagram showing the functional details of the articulatory motion correction unit 330.
FIG. 11 is a block diagram showing the functional details of the articulatory motion correction unit 430.
FIG. 12 is an example of the articulatory features extracted before articulatory motion correction.
FIG. 13 is an example of the articulatory features extracted after articulatory motion correction.
FIG. 14 is a graph showing the calculated articulatory feature extraction rate.
FIG. 15 is a functional block diagram showing the speech recognition processing in a conventional speech recognition device.
FIG. 16 shows the discriminative phoneme features.

Explanation of symbols

1 Articulatory feature extraction device
11 Central processing unit
12 Input device
13 Output device
14 Storage device
15 External storage device
201 Input unit
202 Conversion unit
204 Word search unit
205 Output unit
206 Storage unit
207 Storage unit
210 Feature analysis unit
212 Filter unit
220 Articulatory feature extraction unit
230 Articulatory motion correction unit
330 Articulatory motion correction unit
430 Articulatory motion correction unit

Claims

1. An articulatory feature extraction device comprising:
voice acquisition means for acquiring speech;
articulatory feature extraction means for extracting articulatory features of the speech acquired by the voice acquisition means;
articulatory motion correcting means for converting an articulatory feature series, which is time-series data of the articulatory features extracted by the articulatory feature extraction means, into a motion trajectory and, based on the motion trajectory, correcting the articulatory motion represented by the articulatory feature series so that it can be recognized with the effect of coarticulation contained in the speech removed; and
storage control means for storing, in storage means, the corrected articulatory motion, which is the articulatory motion corrected by the articulatory motion correcting means.

2. The articulatory feature extraction device according to claim 1, further comprising component extraction means for extracting a velocity component and an acceleration component from the displacement component of the articulatory feature series,
wherein the articulatory motion correcting means corrects the articulatory motion into the corrected articulatory motion based on at least one of the displacement component and the velocity and acceleration components extracted by the component extraction means.

3. The articulatory feature extraction device according to claim 2, further comprising pattern recognition means for recognizing, based on at least one of the displacement component, the velocity component, and the acceleration component, whether the transition of the displacement component observed along the time axis is a concave pattern or a convex pattern,
wherein the articulatory motion correcting means corrects the articulatory motion into the corrected articulatory motion based on the pattern recognized by the pattern recognition means.

4. The articulatory feature extraction device according to claim 2 or 3,
wherein the articulatory motion correcting means corrects the articulatory motion into the corrected articulatory motion by passing at least one of the displacement component, the velocity component, and the acceleration component through a neural network.

5. An articulatory feature extraction method comprising:
a voice acquisition step of acquiring speech;
an articulatory feature extraction step of extracting articulatory features of the speech acquired in the voice acquisition step;
an articulatory motion correcting step of converting an articulatory feature series, which is time-series data of the articulatory features extracted in the articulatory feature extraction step, into a motion trajectory and, based on the motion trajectory, correcting the articulatory motion represented by the articulatory feature series so that it can be recognized with the effect of coarticulation contained in the speech removed; and
a storage control step of storing, in storage means, the corrected articulatory motion, which is the articulatory motion corrected in the articulatory motion correcting step.

6. The articulatory feature extraction method according to claim 5, further comprising a component extraction step of extracting a velocity component and an acceleration component from the displacement component of the articulatory feature series,
wherein, in the articulatory motion correcting step, the articulatory motion is corrected into the corrected articulatory motion based on at least one of the displacement component and the velocity and acceleration components extracted in the component extraction step.

7. The articulatory feature extraction method according to claim 6, further comprising a pattern recognition step of recognizing, based on at least one of the displacement component, the velocity component, and the acceleration component, whether the transition of the displacement component observed along the time axis is a concave pattern or a convex pattern,
wherein, in the articulatory motion correcting step, the articulatory motion is corrected into the corrected articulatory motion based on the pattern recognized in the pattern recognition step.

8. The articulatory feature extraction method according to claim 6 or 7,
wherein, in the articulatory motion correcting step, the articulatory motion is corrected into the corrected articulatory motion by passing at least one of the displacement component, the velocity component, and the acceleration component through a neural network.

9. An articulatory feature extraction program for causing a computer to function as each processing means of the articulatory feature extraction device according to any one of claims 1 to 4.