JP3413384B2 - Articulation state estimation display method and computer-readable recording medium recording computer program for the method

Info

Publication number: JP3413384B2
Application number: JP2000062049A
Authority: JP
Inventors: 建武党; 清志本多
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2000-03-07
Filing date: 2000-03-07
Publication date: 2003-06-03
Anticipated expiration: 2020-03-07
Also published as: JP2001249675A

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a system and method for estimating the shape of a speaker's articulatory organs from a human speech waveform, and in particular to a system and method that use a physiological three-dimensional articulatory model to visualize and display the articulatory state of the articulatory organs based on the speech waveform.

[0002]

[Description of the Related Art] Learning a foreign language, and conversation in a foreign language in particular, is considered difficult for Japanese speakers. Various causes are conceivable, but a major one is thought to be that speakers do not understand the articulatory state of their own speech organs or how to adjust it. To pronounce a foreign language, a learner must control articulatory organs such as the tongue and lower jaw appropriately so as to imitate a teacher's pronunciation. Such learning may be manageable for those who hear and speak the foreign language daily, but it is extremely difficult for the average Japanese person, who has almost no such opportunity.

[0003] These difficulties ultimately stem from the fact that, with nothing to serve as a teacher, the learner gets no feedback on how to adjust his or her articulatory organs to produce the correct pronunciation. The learner therefore practices in a self-taught way, and pronunciation does not improve at all. Even when a teacher is available, it is in fact not easy to listen to the teacher's pronunciation and control one's articulatory organs correctly so as to produce a similar sound. To begin with, one cannot easily tell even what shape one's own articulatory organs take while speaking. As a result, the learner may never make progress in the foreign language.

[0004] These problems are not limited to foreign-language learning. Consider a person whose control of the articulatory organs is impaired for some reason, or a person with a hearing impairment who cannot reliably grasp the relationship between the speech he or she produces and the articulatory organs. For such a person to communicate with others by voice, it is desirable to train control of the articulatory organs so that pronunciation is as close to normal as possible. In this case too, however, it is not easy to know how to control one's own articulatory organs to produce the desired pronunciation, and consequently it is difficult to speak appropriately.

[0005] To solve these problems, it is not enough to teach a speech sound together with its corresponding articulatory state and train the learner to reproduce that state. It is desirable to apply feedback to the articulatory state by making the learner aware of the difference between his or her articulatory state while actually speaking and that of a model. This requires estimating the learner's articulatory state from speech.

[0006] Current methods for estimating the articulatory state from a speech wave fall broadly into two groups. The first uses an idealized vocal tract model to estimate an acoustic tube equivalent to the acoustic features of the speech (Schroeder, M.R.: "Determination of the geometry of the human vocal tract by acoustic measurement," J. Acoust. Soc. Am., 4, p. 1002 (1967); Yehia, H. and Itakura, F.: "A method to combine acoustic and morphological constraints in the speech production inverse problem," Speech Comm. 18, 151-174 (1996)). The second estimates articulatory targets using an articulatory model with stronger constraints (Shirai and Honda: "Estimation of articulatory parameters from speech waves," IEICE, 61, 409-416 (1978)).

[0007]

[Problems to Be Solved by the Invention] However, methods that estimate the articulatory state from a speech wave face the problem that the correspondence between the speech wave and the articulatory state is not unique. Various constraints have been proposed to solve this inverse problem, but in every case the accuracy of the estimate is low and its reliability is not high.

[0008] Accordingly, an object of the present invention is to provide a method for estimating and displaying an articulatory state that can estimate, from a speech wave, the articulatory state in which the speech was produced and present it in a form easier to understand than before, and to provide a computer-readable recording medium storing a computer program that causes a computer to operate as an articulatory state display device.

[0009]

[Means for Solving the Problems] According to one aspect of the present invention, a method for estimating and displaying an articulatory state, which estimates and displays the articulatory state of a speaker's articulatory organs from an input speech signal, includes the steps of: extracting predetermined parameters from the input speech signal; setting an initial articulatory target based on the extracted parameters; and estimating, based on the articulatory target and the predetermined parameters, the articulatory state of the speaker's speech organs corresponding to the input speech signal. The estimating step includes the steps of: estimating the articulatory state corresponding to the input speech signal from the articulatory target using a physiological three-dimensional articulatory model; generating a synthesized speech signal from the estimated articulatory state using a predetermined acoustic model; extracting the predetermined parameters from the synthesized speech signal; and updating the articulatory target so that the parameters extracted from the synthesized speech signal approach a predetermined relationship with the parameters extracted from the input speech signal. The method further includes the steps of: determining whether the parameters extracted from the synthesized speech signal satisfy the predetermined relationship with the parameters extracted from the input speech signal; and, according to the result of that determination, selectively performing either a process of repeating the processing from the estimating step based on the updated articulatory target, or a process of displaying on a display device the articulatory state of the articulatory organs obtained by driving the physiological three-dimensional articulatory model according to the articulatory target.

[0010] According to this method, estimating the articulatory state from the speech wave allows the articulatory state of the articulatory organs during an utterance to be displayed visually. The speaker or another person can then check that articulatory state visually and more easily bring it into line with the correct one.

[0011] Preferably, the method according to the present invention further includes the step of displaying the articulatory state as an animation by performing, for each frame contained in the stream of the input speech signal, the steps from extracting the predetermined parameters from the input speech signal through the selectively performing step.

[0012] Because an articulatory state is displayed for each frame in the stream of the input speech signal, the articulatory state of the articulatory organs during an utterance is shown as an animation. This makes the display easier to understand and facilitates learning utterances that require the articulatory state to change dynamically.

[0013] Preferably, the displaying step includes the step of displaying the shape of the articulatory organs three-dimensionally on the display device.

[0014] A three-dimensional display allows the articulatory state to be recognized intuitively. More preferably, the step of displaying three-dimensionally includes the step of displaying the three-dimensional shape of the articulatory organs on the display device by arranging a plurality of two-dimensional cross-sectional shapes of the articulatory organs.

[0015]

[Embodiments of the Invention] To improve the accuracy and reliability of articulatory state estimates as described above, it is desirable to use an articulatory model that is faithful to the human speech production mechanism. In this connection, a method has been proposed for constructing a three-dimensional physiological articulatory model from three-dimensional MRI (magnetic resonance imaging) data of a specific speaker (Dang and Honda: "Synthesis of vowel sequences using a physiological articulatory model," Proc. Acoust. Soc. Jpn. Meeting (1998, 9); Dang, J. and Honda, K., "Speech production of vowel sequences using a physiological articulatory model," Proc. ICSLP98, Vol. 5, pp. 1767-1770 (1998); Dang and Honda: "A speech synthesis method using a physiological articulatory model," Proc. Acoust. Soc. Jpn. Meeting, 243-244 (1999, 9)). Based on the physiological properties of the speech organs (articulatory organs) and the surrounding muscles, this model incorporates physiological constraints relating to the structure of the main speech organs and of the surrounding muscles. By describing the relationship between articulatory targets and muscle contraction forces, it can reproduce the human speech production mechanism almost faithfully. Applying this physiological articulatory model may therefore make it possible to obtain more natural vocal tract shapes from speech waves.

[0016] The system according to Embodiment 1 of the present invention described below uses this physiological articulatory model to estimate the articulatory state from a speech wave in real time and display it as an animation, so that speakers can easily understand their own articulatory state. The system can be used as an aid in learning a foreign language and in pronunciation learning by children or people with disabilities.

[Hardware Configuration]

An articulatory state display device for implementing the method according to Embodiment 1 of the present invention is described below. This device is realized by a computer, such as a personal computer or workstation, and software executed on that computer. From the waveform of speech uttered by a person, it estimates the speaker's articulatory state and displays it as an animation, and it also produces synthesized speech based on the estimated articulatory state. FIG. 1 shows the external appearance of the articulatory state display device.

[0017] Referring to FIG. 1, the articulatory state estimation and display device 20 includes: a computer main body 40 having a CD-ROM (Compact Disc Read-Only Memory) drive 50 and an FD (Flexible Disk) drive 52; a display 42 connected to the computer main body 40 as a display device; a keyboard 46 and a mouse 48, likewise connected to the computer main body 40 as input devices; a microphone 30 connected to the computer main body 40 for capturing speech uttered by the speaker; and a speaker 32 with a built-in amplifier for outputting synthesized speech.

[0018] FIG. 2 shows the configuration of the articulatory state estimation and display device 20 in block diagram form. As shown in FIG. 2, the computer main body 40 of the device 20 includes, in addition to the CD-ROM drive 50 and the FD drive 52, the following components, each connected to a bus 66: a CPU (Central Processing Unit) 56; a ROM (Read Only Memory) 58; a RAM (Random Access Memory) 60; a hard disk 54; an audio capture device 68 for capturing speech from the microphone 30; and a sound board 70 with a built-in sound source for generating, from the sound signal supplied by the CPU 56, a signal that drives the speaker 32. A CD-ROM 62 is loaded into the CD-ROM drive 50, and an FD 64 is loaded into the FD drive 52.

[0019] As already stated, the main part of this articulatory state display device is realized by computer hardware and software executed by the CPU 56. Such software is generally distributed on a storage medium such as the CD-ROM 62 or the FD 64, read from the medium by the CD-ROM drive 50 or the FD drive 52, and stored once on the hard disk 54. Alternatively, when the device is connected to a network, the software is first copied to the hard disk 54 from a server on the network. It is then read from the hard disk 54 into the RAM 60 and executed by the CPU 56. When the device is connected to a network, the software may also be loaded directly into the RAM 60 and executed without being stored on the hard disk 54.

[0020] The computer hardware shown in FIGS. 1 and 2 and its operating principles are conventional. The most essential part of the present invention is therefore the software stored on a storage medium such as the FD drive 52, the FD 64, or the hard disk 54.

[0021] As a recent general trend, various program modules are provided as part of a computer's operating system, and an application program may do no more than call these modules in a predetermined sequence as needed to carry out processing. In that case, the software for realizing the articulatory state display device does not itself contain those functional modules, and the device of this embodiment is realized only when the software cooperates with the operating system on the computer hardware. As long as a common platform is used, however, there is no need to distribute software that includes such modules. The software without those modules, and the recording medium on which it is recorded (and the data signal when the software is distributed over a network), can be regarded as constituting the embodiment.

[Functional Configuration]

Referring to FIG. 3, the articulatory state estimation and display device 20 according to Embodiment 1 includes: a radiation characteristic correction unit 80, which applies pre-emphasis to the real speech to correct for the radiation characteristic, that is, the effect of radiation from the lips and the like; an adaptive filter 82 and an autocorrelation calculation unit 84, which flatten the spectrum of the corrected speech; an LPC (linear predictive coding) cepstrum calculation unit 90, which computes LPC cepstrum coefficients as acoustic parameters from the spectrally flattened input speech signal; an initial articulatory target storage unit 100, which stores LPC cepstra obtained in advance for several speech sounds as initial articulatory targets for the iterative processing described later; an initial target setting unit 96, which compares the LPC cepstrum coefficients of the real speech obtained by the LPC cepstrum calculation unit 90 with the LPC cepstrum coefficients of the initial articulatory targets stored in the storage unit 100 and selects the one closest to the coefficients of the real speech as the initial articulatory target; and an articulatory model iterative calculation unit 98. Starting from the initial articulatory target set by the initial target setting unit 96, the iterative calculation unit 98 drives the articulatory model to update the articulatory state, and iteratively updates the articulatory target so that the difference between the LPC cepstrum coefficients computed from the synthesized speech obtained from the updated articulatory state and those computed by the LPC cepstrum calculation unit 90 becomes small. It takes the articulatory state finally obtained as the speaker's articulatory state when the real speech originally given was uttered, and displays (108) the shape of the articulatory organs as an animation from that state.
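The acoustic front end described above (pre-emphasis followed by LPC cepstrum calculation) can be pictured with the following minimal sketch. It is an illustration under stated assumptions, not the patent's implementation: the function names, the pre-emphasis coefficient 0.97, and the analysis order are hypothetical, and the adaptive-filter spectral flattening of units 82 and 84 is not reproduced because the text does not specify it. The sketch uses the standard autocorrelation method with the Levinson-Durbin recursion and the standard LPC-to-cepstrum recursion.

```python
def pre_emphasis(x, alpha=0.97):
    """First-order pre-emphasis y[n] = x[n] - alpha * x[n-1] (alpha is an assumed value)."""
    if not x:
        return []
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]


def autocorrelation(x, max_lag):
    """r[k] = sum over n of x[n] * x[n+k], for k = 0..max_lag."""
    return [sum(x[n] * x[n + k] for n in range(len(x) - k))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations for the predictor x[n] ~ sum_k a[k] * x[n-k].

    Returns (a[1..order], residual energy). Assumes r[0] > 0 (non-silent frame).
    """
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                      # reflection coefficient of stage i
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err


def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum c[1..n_ceps] of the all-pole filter 1 / (1 - sum_k a[k] z^-k)."""
    p = len(a)
    c = [0.0] * (n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]


def frame_lpc_cepstrum(frame, order=12, n_ceps=12):
    """Pre-emphasize one frame and return its LPC cepstrum coefficients."""
    r = autocorrelation(pre_emphasis(frame), order)
    a, _ = levinson_durbin(r, order)
    return lpc_to_cepstrum(a, n_ceps)
```

The same `frame_lpc_cepstrum` helper could be applied to both the real speech and the synthesized speech, mirroring the text's requirement that the two signals pass through identical processing before their parameters are compared.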

[0022] The articulatory model iterative calculation unit 98 includes: an articulatory model unit 102, which uses the articulatory target initially set by the initial target setting unit 96 to estimate the contraction patterns of the articulatory organs and their surrounding muscles and drives the articulatory model with the muscle contraction forces to generate a new vocal tract shape; an acoustic model unit 104, which outputs (110) synthesized speech based on the vocal tract shape generated by the articulatory model unit 102; an adaptive filter 86 and an autocorrelation calculation unit 88, which receive the synthesized speech output by the acoustic model unit 104 and flatten its spectrum by the same processing as the adaptive filter 82 and the autocorrelation calculation unit 84; an LPC cepstrum calculation unit 92, which computes LPC cepstrum coefficients as acoustic parameters from the spectrally flattened synthesized speech in the same way as the LPC cepstrum calculation unit 90; and an articulatory target update unit 94, which estimates and updates the articulatory target for the articulatory model unit 102 so that the LPC cepstrum coefficients output by the LPC cepstrum calculation unit 90 and those output by the LPC cepstrum calculation unit 92 approach each other, that is, so that the difference between them becomes small.

[0023] Referring to FIG. 4, the acoustic model unit 104 includes: an area function estimation unit 126, which estimates the vocal tract cross-sectional area shape from the three cross-sectional shapes 124 of the articulatory organ region output by the articulatory model unit 102; and an electrical circuit model unit 128, which generates synthesized speech by varying parameters according to the vocal tract cross-sectional area shape output by the area function estimation unit 126, driving an electrical circuit model, and applying a sound source.

[Software Control Structure]

Referring to FIG. 5, the software for realizing the articulatory state estimation and display device 20 shown in FIGS. 3 and 4 has the following control structure. In the example below, radiation characteristic correction and spectral flattening are assumed to have already been applied to the input speech. In this embodiment, radiation characteristic correction, spectral flattening, LPC cepstrum coefficient calculation, speech synthesis, and animated display of the vocal tract state based on the articulatory model are all performed in software, but any of them may instead be performed by a dedicated LSI (Large Scale Integrated) circuit. The processing described below is performed for each frame of the input speech signal stream. Although the description refers to frames, the point is that the processing is applied to whatever predetermined unit of the input speech signal is the target of speech processing.
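The per-frame processing mentioned above can be pictured with a small framing helper. This is a minimal sketch rather than anything stated in the patent: the function name `split_into_frames`, the frame length, and the hop size are illustrative assumptions, since the text does not say how the stream is cut into frames.

```python
def split_into_frames(samples, frame_len, hop):
    """Cut a sample sequence into (possibly overlapping) frames.

    frame_len and hop are in samples; a trailing partial frame is dropped.
    Both values are assumptions chosen by the caller, not taken from the patent.
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]


# Example: 10 samples, 4-sample frames advancing by 2 samples.
frames = split_into_frames(list(range(10)), frame_len=4, hop=2)
```

Each frame returned by such a helper would then go through the estimation steps independently, which is what lets the displayed shapes form an animation.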

[0024] First, initial LPC cepstrum coefficients are computed for the input real speech (140). This corresponds to the processing in the LPC cepstrum calculation unit 90. The initial articulatory target whose acoustic parameters are closest to the computed LPC cepstrum coefficients is then selected from among the candidates stored in the initial articulatory target storage unit 100 (142). In this embodiment, standard articulatory targets for the five Japanese vowels are prepared in advance as initial articulatory targets, and the vowel closest to the first real speech input is selected as the initial articulatory target.
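Step 142 amounts to a nearest-neighbor lookup among stored candidates. The sketch below assumes a Euclidean distance between cepstrum vectors and a dictionary that pairs each vowel label with a reference cepstrum and an articulatory target. The distance measure, the data layout, and all numeric values are hypothetical placeholders, since the patent only says that the closest candidate is chosen.

```python
import math


def cepstral_distance(c1, c2):
    """Euclidean distance between two cepstrum vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(c1, c2)))


def select_initial_target(input_cepstrum, candidates):
    """Return (label, target) of the candidate whose cepstrum is nearest the input.

    candidates maps a label to a (reference_cepstrum, articulatory_target) pair.
    """
    best = min(candidates,
               key=lambda label: cepstral_distance(input_cepstrum,
                                                   candidates[label][0]))
    return best, candidates[best][1]


# Placeholder candidates for the five Japanese vowels. The numbers are made up
# purely to make the example runnable and carry no phonetic meaning.
CANDIDATES = {
    "a": ([1.0, 0.2], "target_a"),
    "i": ([-0.8, 0.9], "target_i"),
    "u": ([-0.3, -0.6], "target_u"),
    "e": ([0.1, 0.8], "target_e"),
    "o": ([0.6, -0.7], "target_o"),
}

label, target = select_initial_target([0.9, 0.1], CANDIDATES)
```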

[0025] Next, this articulatory target is used to estimate the contraction patterns of the muscles of the articulatory organs, and the articulatory model is driven according to the resulting muscle contraction forces (144) to generate a new vocal tract shape (146). Using this vocal tract shape, the acoustic model outputs (110) synthesized speech.
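The patent's acoustic model is an electrical circuit model driven by the estimated area function, and its details are not given here. As a deliberately simpler stand-in, the sketch below uses the textbook lossless concatenated-tube view of the vocal tract: it computes junction reflection coefficients from a list of cross-sectional areas and the quarter-wavelength resonances of a uniform tube closed at one end. The function names, the sign convention, and the sound speed of 350 m/s are assumptions for illustration only.

```python
def reflection_coefficients(areas):
    """Reflection coefficient at each junction of a lossless tube model.

    Uses the convention r_k = (A_{k+1} - A_k) / (A_{k+1} + A_k); other texts
    use the opposite sign. areas must contain positive values.
    """
    return [(a_next - a_cur) / (a_next + a_cur)
            for a_cur, a_next in zip(areas, areas[1:])]


def uniform_tube_resonances(length_m, count, sound_speed=350.0):
    """Resonances F_n = (2n - 1) * c / (4 * L) of a closed-open uniform tube."""
    return [(2 * n - 1) * sound_speed / (4.0 * length_m)
            for n in range(1, count + 1)]


# A uniform area function has no internal reflections.
flat = reflection_coefficients([3.0, 3.0, 3.0, 3.0])

# A 17.5 cm uniform tube gives resonances near 500, 1500, and 2500 Hz.
neutral = uniform_tube_resonances(0.175, 3)
```

A non-uniform area function shifts these resonances, which is the property an acoustic model exploits when it maps a vocal tract shape to the spectral parameters that are compared with the input speech.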

[0026] The synthesized speech obtained in this way undergoes the same spectral flattening that was applied before step 140, and LPC cepstrum coefficients are then computed as in step 140 (148).

[0027] The acoustic parameters of the synthesized speech and the input speech are compared to determine whether their difference is within a predetermined range (150). If it is, the iteration has brought the articulatory target close to a nearly constant value, and further processing would serve little purpose. Processing therefore ends here, and the articulatory state is displayed as an image based on the articulatory model obtained at this point (154). The three two-dimensional cross-sectional shapes 124 output by the articulatory model unit 102 can be used for this image. Arranging them three-dimensionally, as in FIG. 6, presents the estimated shape of the speaker's articulatory organs in a visual, intuitively understandable form.
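Arranging two-dimensional sections in three dimensions can be sketched as assigning each contour a position along a third axis. The helper below is a hypothetical illustration: the function name, the choice of the third axis as a lateral offset, and the example coordinates are all assumptions, and actual rendering is left to whatever graphics library is used.

```python
def stack_sections(sections, lateral_positions):
    """Place 2-D contours at given lateral offsets to form 3-D polylines.

    sections: list of contours, each a list of (x, y) points in the section plane.
    lateral_positions: one offset per contour along the axis normal to that plane.
    Returns a list of contours whose points are (x, y, z) tuples.
    """
    if len(sections) != len(lateral_positions):
        raise ValueError("need exactly one lateral position per section")
    return [[(x, y, z) for (x, y) in contour]
            for contour, z in zip(sections, lateral_positions)]


# Three toy contours placed at made-up offsets of -1, 0, and +1 units.
toy_sections = [[(0.0, 0.0), (1.0, 0.5)],
                [(0.0, 0.1), (1.0, 0.7)],
                [(0.0, 0.0), (1.0, 0.5)]]
stacked = stack_sections(toy_sections, [-1.0, 0.0, 1.0])
```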

[0028] If the determination in step 150 finds that the difference between the synthesized speech and the input speech is large, the articulatory target is updated so as to bring the acoustic parameters obtained here closer to those obtained from the real speech (152), and control returns to step 144. By repeating the processing from step 144 onward, the articulatory target eventually settles to a constant value. That is, an articulatory target is obtained that drives the articulatory model so as to yield an articulatory state closely matching the actual speaker's articulation, and an affirmative determination is obtained in step 150.
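The loop through steps 144 to 152 is an analysis-by-synthesis iteration, and its shape can be sketched as follows. Everything specific here is an assumption: the patent does not state the update rule, so the sketch substitutes finite-difference gradient descent on a squared parameter error, and a toy linear function stands in for the articulatory model plus acoustic model plus LPC cepstrum analysis. The function names, step size, and tolerance are hypothetical.

```python
def squared_error(p, q):
    """Sum of squared differences between two parameter vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def estimate_target(input_params, initial_target, synthesize_params,
                    tolerance=1e-6, step=0.1, max_iter=500, eps=1e-6):
    """Iteratively adjust the target until synthesized parameters match the input.

    synthesize_params(target) stands in for steps 144-148 (model, synthesis,
    parameter extraction). Returns (target, iterations_used).
    """
    target = list(initial_target)
    for iteration in range(max_iter):
        err = squared_error(synthesize_params(target), input_params)
        if err <= tolerance:                      # step 150: close enough, stop
            return target, iteration
        grad = []                                 # step 152: update the target
        for i in range(len(target)):
            bumped = target[:]
            bumped[i] += eps
            bumped_err = squared_error(synthesize_params(bumped), input_params)
            grad.append((bumped_err - err) / eps)
        target = [t - step * g for t, g in zip(target, grad)]
    return target, max_iter


def toy_forward_model(target):
    """Toy stand-in mapping a 2-D target to 2-D acoustic parameters."""
    t0, t1 = target
    return [2.0 * t0 + t1, t0 - t1]


observed = toy_forward_model([0.3, -0.2])         # pretend this came from real speech
estimated, used = estimate_target(observed, [0.0, 0.0], toy_forward_model)
```

With a real articulatory model the forward map is nonlinear and the inverse is not unique, which is why the text starts the iteration from the nearest stored vowel target rather than from an arbitrary point.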

[0029] When the articulatory state has been displayed, control returns to step 140 and the above processing is performed again on the speech of the next frame. Repeating this over the input speech signal stream displays the shape of the speaker's articulatory organs during the utterance on the display 42 as an animation. By comparing this shape with, for example, an animation of a model articulatory organ shape displayed in another window, speakers can apply highly effective feedback to their own articulatory movements.

[0030] In the system of the embodiment described above, the animation of the articulatory organs based on the articulatory model is displayed by arranging three sagittal cross-sectional shapes, but the number of sagittal cross-sections is not limited to three. An appropriate number can be chosen according to the performance of the computer and peripheral devices and the requirements of the application. Also, instead of achieving a three-dimensional display simply by arranging a plurality of sagittal cross-sections side by side, the articulatory model unit 102 may output three-dimensional shape data representing the surface of the articulatory organs (the inner surface of the vocal tract), and the three-dimensional shape may be rendered from that data. Furthermore, a three-dimensional shape is not always necessary; in such cases, only one sagittal cross-section may be displayed, or a plurality of cross-sections may be displayed separately.
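One way to arrange a configurable number of sagittal cross-sections into a three-dimensional layout is sketched below. The even spacing, the choice of the lateral axis as `z`, and the names `slice_offsets` and `stack_slices` are assumptions for illustration; the description does not specify how the planes are positioned.

```python
from typing import List, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]


def slice_offsets(n_slices: int, spacing: float) -> List[float]:
    """Lateral positions of n_slices parallel planes, centred on 0.0.

    Assumption: planes are evenly spaced and symmetric about the
    midsagittal plane.
    """
    if n_slices < 1:
        raise ValueError("need at least one slice")
    centre = (n_slices - 1) / 2.0
    return [(i - centre) * spacing for i in range(n_slices)]


def stack_slices(contours: List[List[Point2D]], spacing: float) -> List[List[Point3D]]:
    """Lift each 2-D sagittal contour (x, y) into 3-D.

    Every point of a contour receives its plane's lateral offset as z.
    """
    offsets = slice_offsets(len(contours), spacing)
    return [[(x, y, z) for (x, y) in contour] for contour, z in zip(contours, offsets)]
```

With three contours this reproduces the three-plane arrangement of the embodiment, and passing a single contour degenerates to the one-cross-section display mentioned at the end of the paragraph.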

[0031] The system of the embodiment described above is stand-alone, but the animation display can also be transmitted over a network to a computer at another location and displayed there as well. The transmission may be one-way or two-way. With two-way animation display, for example when a language teacher instructs a learner at a remote location, the teacher can check the learner's vocal tract shape and give more appropriate instructions for correcting it, and the learner can control his or her own articulatory organs more appropriately by viewing the teacher's vocal tract shape.

[0032] When computers are connected by such a network, the connection is not limited to one-to-one. When there is one teacher and a plurality of learners, broadcasting the vocal tract shape animation obtained from the teacher's speech to those learners can be expected to make such a language learning system more efficient.
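The one-teacher, many-learners arrangement can be sketched in-process as a simple fan-out. Everything here is an assumption for illustration: the class name `ShapeBroadcaster`, the JSON payload layout, and the use of in-memory queues in place of real network connections. The description does not specify any transport or message format.

```python
import json
import queue
from typing import Dict, List


class ShapeBroadcaster:
    """Fan one teacher's vocal tract frames out to any number of learners.

    Each learner is modelled as an in-memory queue (assumption); a real
    system would replace the queue with a network connection.
    """

    def __init__(self) -> None:
        self._learners: Dict[str, "queue.Queue[bytes]"] = {}

    def join(self, learner_id: str) -> "queue.Queue[bytes]":
        """Register a learner and return the inbox it should read from."""
        inbox: "queue.Queue[bytes]" = queue.Queue()
        self._learners[learner_id] = inbox
        return inbox

    def broadcast(self, frame_index: int, contour: List[List[float]]) -> int:
        """Serialize one frame once and deliver it to every learner.

        Returns the number of learners the frame was delivered to.
        """
        payload = json.dumps({"frame": frame_index, "contour": contour}).encode("utf-8")
        for inbox in self._learners.values():
            inbox.put(payload)
        return len(self._learners)


def decode(payload: bytes) -> dict:
    """Inverse of the serialization used in broadcast."""
    return json.loads(payload.decode("utf-8"))
```

Serializing once and delivering the same bytes to every learner keeps the teacher-side cost independent of the per-learner decoding work.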

[0033] Although the system of the embodiment disclosed here displays the vocal tract shape as a real-time animation, it does not deal with the content of the speech. However, by defining constraints that allow the speech being uttered to be estimated accurately from the vocal tract shape, and by tracking changes in the vocal tract shape over time, this technique may also be applicable to speech recognition.

[0034] The embodiment disclosed here should be considered illustrative in all respects and not restrictive. The scope of the present invention is defined by the claims rather than by the above description, and is intended to include all modifications within the meaning and scope equivalent to the claims.

[Brief description of drawings]

FIG. 1 is an external view of a system according to a first embodiment of the present invention.

FIG. 2 is a block diagram showing the hardware configuration of the system according to the first embodiment of the present invention.

FIG. 3 is a functional block diagram of the system according to the first embodiment of the present invention.

FIG. 4 is a block diagram of the speech synthesis processing unit in the system according to the first embodiment of the present invention.

FIG. 5 is a flowchart outlining the processing performed by the system according to the first embodiment of the present invention.

FIG. 6 is a diagram showing an example of the articulatory model displayed as an animation.

[Explanation of symbols]

20 articulation state display device, 30 microphone, 32 speaker, 40 computer main body, 42 display, 82, 86 adaptive filter, 84, 88 autocorrelation calculation unit, 90, 92 LPC cepstrum calculation unit, 94 articulation target update unit, 96 initial target setting unit, 98 articulatory model iterative calculation unit, 102 articulatory model unit, 104 acoustic model unit.
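The symbol list names autocorrelation calculation units and LPC cepstrum calculation units. Textbook forms of those computations are sketched below for orientation. This is an assumption about what such units might compute, not the embodiment's code: the function names, the absence of windowing, and the choice of the Levinson-Durbin recursion are illustrative.

```python
from typing import List, Sequence


def autocorrelation(x: Sequence[float], max_lag: int) -> List[float]:
    """r[k] = sum over n of x[n] * x[n + k], for k = 0..max_lag (no windowing)."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def levinson_durbin(r: Sequence[float], order: int) -> List[float]:
    """Return a[0..order], a[0] = 1, for the error filter A(z) = 1 + sum a[k] z^-k."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or degenerate frame: keep remaining coefficients at zero
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a


def lpc_cepstrum(a: Sequence[float], n_ceps: int) -> List[float]:
    """Cepstrum c[1..n_ceps] of the all-pole filter 1/A(z).

    Uses c[n] = -a[n] - (1/n) * sum_{k=1}^{n-1} k * c[k] * a[n-k],
    with a[n] taken as 0 for n greater than the LPC order.
    """
    p = len(a) - 1
    c = [0.0] * (n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = sum(k * c[k] * a[n - k] for k in range(max(1, n - p), n))
        c[n] = (-a[n] if n <= p else 0.0) - acc / n
    return c[1:]
```

For a first-order model with a[1] = -0.5, the recursion gives the series 0.5^n / n, which is the known expansion of the log of 1/(1 - 0.5 z^-1) and serves as a quick sanity check.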

Continuation of front page

(56) References
JP-A-6-348297 (JP, A)
JP-A-10-254497 (JP, A)
JP-A-59-71096 (JP, A)
JP-A-59-124385 (JP, A)
JP-A-11-259097 (JP, A)
JP-A-62-184500 (JP, A)
JP-A-62-184499 (JP, A)
JP-A-11-202897 (JP, A)
Jianwu Dang, Kiyoshi Honda, "Generation of three-dimensional vocal tract shape based on a physiological articulatory model", Proceedings of the Acoustical Society of Japan, Acoustical Society of Japan, March 1998, Spring 1998 I, pp. 265-266
Atsushi Sugiura, Masafumi Matsumura, "Measurement of three-dimensional vocal tract shape by magnetic resonance imaging", IEICE Technical Report, Institute of Electronics, Information and Communication Engineers, January 26, 1990, SP89-109 to 118, pp. 65-72
Yu Sasaki et al., "Interactive representation of three-dimensional vocal tract shape for speech education", Proceedings of the Acoustical Society of Japan, Acoustical Society of Japan, March 1998, Spring 1998 I, pp. 341-342

(58) Fields investigated (Int. Cl.⁷, DB name)
G10L 13/00

Claims

(57) [Claims]

1. A method for estimating and displaying an articulation state of a speaker's articulatory organs from an input speech signal, comprising: a step of extracting a predetermined parameter from the input speech signal; a step of setting an initial articulation target based on the extracted predetermined parameter; and a step of estimating, based on the articulation target and the predetermined parameter, the articulation state of the speaker's speech organs corresponding to the input speech signal, wherein the estimating step includes: a step of estimating, using a physiological three-dimensional articulatory model, the articulation state corresponding to the input speech signal based on the articulation target; a step of generating a synthesized speech signal with a predetermined acoustic model based on the estimated articulation state; a step of extracting the predetermined parameter from the synthesized speech signal; and a step of updating the articulation target so that the predetermined parameter extracted from the synthesized speech signal approaches a predetermined relationship with the predetermined parameter extracted from the input speech signal, the method further comprising: a step of determining whether the predetermined parameter extracted from the synthesized speech signal satisfies the predetermined relationship with the predetermined parameter extracted from the input speech signal; and a step of selectively performing, according to the determination result of the determining step, either a process of repeating the processing from the estimating step based on the updated articulation target, or a process of displaying on a display device the articulation state of the articulatory organs obtained by driving the physiological three-dimensional articulatory model according to the articulation target.

2. The method for estimating and displaying an articulation state according to claim 1, further comprising a step of displaying the articulation state as an animation by performing, for each frame included in a stream of the input speech signal, the processing from the step of extracting a predetermined parameter from the input speech signal through the step of selectively performing.

3. The method for estimating and displaying an articulation state according to claim 1, wherein the displaying step includes a step of three-dimensionally displaying the shape of the articulatory organs on the display device.

4. The method for estimating and displaying an articulation state according to claim 3, wherein the three-dimensionally displaying step includes a step of displaying a three-dimensional shape of the articulatory organs on the display device by arranging a plurality of two-dimensional cross-sectional shapes of the articulatory organs.

5. A computer-readable recording medium storing a computer program for causing a computer to execute an articulation state estimation and display method for estimating and displaying an articulation state of a speaker's articulatory organs from an input speech signal, wherein the method comprises: a step of extracting a predetermined parameter from the input speech signal; a step of setting an initial articulation target based on the extracted predetermined parameter; and a step of estimating, based on the articulation target and the predetermined parameter, the articulation state of the speaker's speech organs corresponding to the input speech signal, wherein the estimating step includes: a step of estimating, using a physiological three-dimensional articulatory model, the articulation state corresponding to the input speech signal based on the articulation target; a step of generating a synthesized speech signal with a predetermined acoustic model based on the estimated articulation state; a step of extracting the predetermined parameter from the synthesized speech signal; and a step of updating the articulation target so that the predetermined parameter extracted from the synthesized speech signal approaches a predetermined relationship with the predetermined parameter extracted from the input speech signal, the method further comprising: a step of determining whether the predetermined parameter extracted from the synthesized speech signal satisfies the predetermined relationship with the predetermined parameter extracted from the input speech signal; and a step of selectively performing, according to the determination result of the determining step, either a process of repeating the processing from the estimating step based on the updated articulation target, or a process of displaying on a display device the articulation state of the articulatory organs obtained by driving the physiological three-dimensional articulatory model according to the articulation target.

6. The computer-readable recording medium according to claim 5, wherein the method further comprises a step of displaying the articulation state as an animation by performing, for each frame included in a stream of the input speech signal, the processing from the step of extracting a predetermined parameter from the input speech signal through the step of selectively performing.

7. The computer-readable recording medium according to claim 5, wherein the displaying step includes a step of three-dimensionally displaying the shape of the articulatory organ on the display device.

8. The computer-readable recording medium according to claim 7, wherein the three-dimensionally displaying step includes a step of displaying a three-dimensional shape of the articulatory organs on the display device by arranging a plurality of cross-sectional shapes of the articulatory organs.
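
The estimate, synthesize, compare and update cycle recited in the independent claims can be illustrated with a toy sketch. It is not the claimed implementation. The callable `synthesize_params` is a hypothetical stand-in that lumps together the articulatory model, the acoustic model and parameter extraction; the "predetermined relationship" is assumed here to be a Euclidean distance at or below `tolerance`; and the coordinate-wise trial-step update is a substituted, generic derivative-free rule, since the actual update rule is not given in this section.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Euclidean distance between two parameter vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def estimate_target(
    input_params: Sequence[float],
    initial_target: Vector,
    synthesize_params: Callable[[Vector], Vector],
    tolerance: float = 1e-3,
    max_iters: int = 200,
    step: float = 0.5,
) -> Tuple[Vector, float, int]:
    """Refine the target until synthesized parameters match the input ones.

    Returns (target, remaining error, iterations used). The caller would
    then drive the articulatory model with the returned target and display
    the resulting state.
    """
    target = list(initial_target)
    error = distance(synthesize_params(target), input_params)
    iters = 0
    # Determination step: stop once the assumed relationship holds.
    while error > tolerance and iters < max_iters:
        iters += 1
        improved = False
        # Update step (assumed rule): try moving each coordinate both ways.
        for i in range(len(target)):
            for delta in (step, -step):
                trial = list(target)
                trial[i] += delta
                trial_error = distance(synthesize_params(trial), input_params)
                if trial_error < error:
                    target, error, improved = trial, trial_error, True
                    break
        if not improved:
            step *= 0.5  # no trial helped: search on a finer scale
    return target, error, iters
```

As a usage example, with the toy forward map `lambda t: [2.0 * t[0] + 1.0, 0.5 * t[1]]` and input parameters generated from a known target, the loop recovers that target to within the tolerance, which mirrors the affirmative determination that ends the iteration for a frame.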