JP2789158B2

JP2789158B2 - Voice recognition device

Info

Publication number: JP2789158B2
Application number: JP5146428A
Authority: JP
Inventors: 安弘和田; 光男川人
Original assignee: Ei Tei Aaru Ningen Joho Tsushin Kenkyusho Kk
Current assignee: Ei Tei Aaru Ningen Joho Tsushin Kenkyusho Kk
Priority date: 1993-06-17
Filing date: 1993-06-17
Publication date: 1998-08-20
Anticipated expiration: 2013-08-20
Also published as: JPH075897A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of industrial application] The present invention relates to a speech recognition device, and more particularly to a speech recognition device capable of recognizing time-series data such as speech.

[0002]

[Prior art and problems to be solved by the invention] Conventional speech recognition devices have attempted to recognize continuous speech by acoustic analysis. However, acoustic analysis alone cannot recognize 100% of continuous speech. This is evident from the fact that humans, too, do not recognize speech by auditory analysis alone. For this reason, research is being conducted on applying to speech recognition devices the fact that humans understand speech as meaningful language.

[0003] In other words, taking into account the constraints on speech that expresses words and sentences, recognition has been attempted by resolving the uncertainty of the acoustic analysis with word-level analysis, and the uncertainty of the word-level analysis with sentence-level analysis. However, a speech recognition device using this recognition method requires the development of artificial intelligence equivalent to the human brain.

[0004] Therefore, in view of the limits of the acoustic analysis of speech, an object of the present invention is to provide a speech recognition device that can recognize speech from the characteristics of the movements of the articulators, such as the mouth and tongue, that determine the speech when a human utters it.

[0005]

[0006]

[0007]

[Means for solving the problems] A speech recognition device according to the present invention comprises: motion trajectory estimating means for estimating a first motion trajectory of an articulator based on an input first acoustic signal; via point estimating means for estimating via points of the first motion trajectory based on the first motion trajectory estimated by the motion trajectory estimating means; phoneme correspondence means for converting each of the via points estimated by the via point estimating means into a phoneme to generate a phoneme sequence; and speech generating means for generating a second acoustic signal based on the via points estimated by the via point estimating means or the phoneme sequence generated by the phoneme correspondence means. The second acoustic signal generated by the speech generating means is compared with the first acoustic signal, the estimation of the via points by the via point estimating means and the generation of the second acoustic signal by the speech generating means are repeated until the difference becomes smaller than a predetermined first threshold value, and the resulting phoneme sequence is output as the recognition result of the first acoustic signal.

[0008] Preferably, the via point estimating means includes via point extracting means for extracting via points on or near the first motion trajectory estimated by the motion trajectory estimating means, and first motion trajectory generating means for generating a second motion trajectory of the articulator based on the via points extracted by the via point extracting means. The second motion trajectory generated by the first motion trajectory generating means is compared with the first motion trajectory estimated by the motion trajectory estimating means, the extraction of the via points by the via point extracting means and the generation of the second motion trajectory by the first motion trajectory generating means are repeated until the difference becomes smaller than a predetermined second threshold value, and the resulting via points are output.

[0009] Preferably, the speech generating means includes via point converting means for converting the phoneme sequence generated by the phoneme correspondence means into via points of a third motion trajectory of the articulator, second motion trajectory generating means for generating the third motion trajectory based on the via points converted by the via point converting means, and acoustic signal generating means for generating the second acoustic signal based on the third motion trajectory generated by the second motion trajectory generating means.

[0010]

[Operation] The speech recognition device according to the present invention estimates the motion trajectory of the articulators that are presumed to move so as to produce the input acoustic signal, and further extracts the feature points that specify that motion trajectory. Continuous speech can thereby be recognized as a phoneme sequence corresponding to the feature points.

[0011]

[Embodiment] FIG. 1 is a block diagram showing the process, in a speech recognition device according to one embodiment of the present invention, from the input of an acoustic signal until the speech is recognized as a phoneme sequence corresponding to that acoustic signal.

[0012] Referring to FIG. 1, the speech recognition device of this embodiment includes a phoneme sequence extraction unit 20 that extracts a sequence of phonemes from an acoustic signal, and a speech generation unit 30 that generates an acoustic signal from the sequence of phonemes. The phoneme sequence extraction unit 20 is composed of an acoustic signal input device 1 to which an acoustic signal is input, a motion trajectory estimation device 2 that estimates a motion trajectory, a via point estimation device 3 that estimates via points, and a phoneme correspondence device 4 that converts the via points into phonemes. The via point estimation device 3 includes a via point extraction device 11 and a first motion trajectory generation device 12.

The motion trajectory estimation device 2 estimates the articulatory parameters of a predetermined articulatory model from the acoustic signal by a model matching method, as disclosed, for example, in Katsuhiko Shirai and Masaaki Honda, "Estimation of Articulatory Parameters from Speech Waves," Transactions of the Institute of Electronics and Communication Engineers, Vol. J61-A, No. 5, pp. 409-416 (1978). Here, the articulatory model expresses the vocal tract shape based on the structure of the articulators and is disclosed, for example, in Shirai and Honda, "Modeling of the articulatory mechanism and estimation of articulatory parameters by nonlinear multiple regression analysis," Trans. IECE (A), J59-A, 8, p. 668 (1976).

Since the acoustic parameters are produced according to the articulatory parameters, the acoustic parameters are given as a function of the articulatory parameters. The acoustic parameters are parameters obtained by an equivalent transformation of the acoustic signal. The basic method of estimating the articulatory parameters is to find the articulatory parameters that yield the given acoustic signal (in practice, converted into acoustic parameters): the articulatory parameters are varied so as to minimize the evaluation function J expressed by the following equation (1).

$$J = \{y_s - y(x)\}^{T}\,\{y_s - y(x)\} \qquad (1)$$

Here, $y_s$ is the acoustic parameter (an N-dimensional vector) obtained from the acoustic signal, $x$ is the articulatory parameter, $y(x)$ is a function that estimates the acoustic parameter (an N-dimensional vector) from the articulatory parameter $x$, and $T$ denotes transposition. In this way, the articulatory parameter $x$ that realizes the acoustic parameter $y_s$, that is, the motion trajectory, is estimated.
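The minimization of the evaluation function J in equation (1) can be illustrated with a small sketch. Everything below is an assumption made for illustration: `acoustic_model` is a hypothetical two-parameter stand-in for y(x), not the articulatory model of the cited papers, and plain gradient descent with a numerical gradient stands in for the model matching method, whose actual optimizer is not described here. Applying the estimate frame by frame would yield a sequence of articulatory parameters, i.e. a trajectory.

```python
import math


def acoustic_model(x):
    """Hypothetical y(x): maps 2 articulatory parameters to 3 acoustic parameters."""
    a, b = x
    return [a + 0.5 * b, math.tanh(a) - b, 0.3 * a * b]


def evaluation_j(y_s, x):
    """Equation (1): J = (y_s - y(x))^T (y_s - y(x))."""
    return sum((ys - y) ** 2 for ys, y in zip(y_s, acoustic_model(x)))


def estimate_articulation(y_s, x0, step=0.05, iters=2000, eps=1e-6):
    """Vary x to reduce J by gradient descent (assumed optimizer, for illustration)."""
    x = list(x0)
    for _ in range(iters):
        grad = []
        for i in range(len(x)):
            hi = x[:]
            lo = x[:]
            hi[i] += eps
            lo[i] -= eps
            # Central-difference approximation of dJ/dx_i.
            grad.append((evaluation_j(y_s, hi) - evaluation_j(y_s, lo)) / (2 * eps))
        x = [xi - step * g for xi, g in zip(x, grad)]
    return x


if __name__ == "__main__":
    target = acoustic_model([0.4, -0.2])  # synthetic "observed" acoustic parameters
    print(estimate_articulation(target, [0.0, 0.0]))
```

The step size and iteration count were chosen for this toy model only; a real articulatory-to-acoustic mapping would likely need a more robust optimizer.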

[0013] FIG. 3 is a diagram showing a preprocessed acoustic signal input to the speech recognition device according to one embodiment of the present invention. FIG. 4 is a diagram showing the trajectories and via points of the articulators estimated by the via point estimation device of the speech recognition device according to one embodiment of the present invention. FIG. 5 is a diagram showing the phoneme sequence obtained as the result of speech recognition by the phoneme sequence extraction unit of the speech recognition device according to one embodiment of the present invention.

[0014] The operation of the device will now be described with reference to FIGS. 1 to 5. An acoustic signal is input to the acoustic signal input device 1, and preprocessing such as noise processing is performed on the input data. Then, for the preprocessed acoustic signal as shown in FIG. 3, the motion trajectory estimation device 2 estimates the motion trajectory of each articulator that is presumed to move so as to produce that preprocessed acoustic signal. The estimated motion trajectory is then input to the via point estimation device 3.

[0015] The via point extraction device 11 therefore extracts, on or near the motion trajectory, the via points needed to generate the input motion trajectory, and supplies them to the first motion trajectory generation device 12. The first motion trajectory generation device 12 generates a trajectory passing through the extracted via points.

[0016] The via point estimation device 3 then compares the generated motion trajectory with the input motion trajectory, and repeats the extraction of via points by the via point extraction device 11 and the generation of the motion trajectory by the first motion trajectory generation device 12 until the accuracy is sufficient, that is, until the difference becomes smaller than a predetermined threshold value. The via points and motion trajectories obtained by this repetition are, for example, as shown in FIG. 4. Here, the curve labeled TBY is the trajectory of the vertical movement of the tongue dorsum (the central part of the tongue) over time. Likewise, the curves labeled TTY, JY and LLY are the vertical movements over time of the tongue tip, the jaw and the lower lip, respectively.
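The extract-generate-compare loop of the via point estimation device 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the device's actual procedure: piecewise-linear interpolation is a stand-in for the first motion trajectory generation device 12, the greedy rule of adding the worst-fitting sample as a new via point is assumed, and the function names are hypothetical.

```python
import math


def generate_trajectory(via):
    """Stand-in generator: piecewise-linear path through sorted (index, value) via points."""
    out = []
    for (i0, v0), (i1, v1) in zip(via, via[1:]):
        for i in range(i0, i1):
            out.append(v0 + (v1 - v0) * (i - i0) / (i1 - i0))
    out.append(via[-1][1])
    return out


def extract_via_points(trajectory, threshold):
    """Repeat extraction and generation until the maximum error is below threshold."""
    if threshold <= 0 or len(trajectory) < 2:
        raise ValueError("need a positive threshold and at least two samples")
    n = len(trajectory)
    via = [(0, trajectory[0]), (n - 1, trajectory[-1])]
    while True:
        generated = generate_trajectory(via)
        errors = [abs(g - t) for g, t in zip(generated, trajectory)]
        worst = max(range(n), key=errors.__getitem__)
        if errors[worst] < threshold:
            return via, generated
        # Assumed greedy rule: promote the worst-fitting sample to a via point.
        via = sorted(via + [(worst, trajectory[worst])])


if __name__ == "__main__":
    tongue_height = [math.sin(2 * math.pi * i / 99) for i in range(100)]
    points, _ = extract_via_points(tongue_height, 0.05)
    print(len(points), "via points")
```

Because the error at an existing via point is zero, each pass adds a new sample, so the loop ends after at most as many passes as there are samples.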

[0017] The via points estimated in this way by the via point estimation device 3 must be converted into phonemes in order to recognize the speech, and are therefore input to the phoneme correspondence device 4. The phoneme correspondence device 4 converts the via points into phonemes, whereby a phoneme sequence 5 as shown in FIG. 5 is identified.

[0018] Accordingly, it should be possible to take this phoneme sequence 5 as the recognition of the acoustic signal input to the acoustic signal input device 1. Conversely, however, unless the acoustic signal input to the acoustic signal input device 1 can be generated from the phoneme sequence 5, it is not sufficient to treat the phoneme sequence 5 as the recognition result of the input acoustic signal.

[0019] For this reason, the speech recognition device is provided with the speech generation unit 30. The speech generation unit 30 includes a via point conversion device 6 that converts phonemes into via points, a second motion trajectory generation device 7 that generates a motion trajectory, and an acoustic signal generation device 8 that generates an acoustic signal. The second motion trajectory generation device 7 includes a motion command conversion device 9 and a motion prediction device 10.

[0020] The via point conversion device 6 converts the phoneme sequence 5 into a sequence of target via points for the posture of each articulator, that is, a sequence of points through which the motion trajectories of the tongue, jaw, lips and so on are expected to pass. The sequence of target via points produced by the via point conversion device 6 is input to the second motion trajectory generation device 7.

[0021] The motion command conversion device 9 calculates a torque (motion command) from a motion trajectory. The motion prediction device 10 calculates a motion trajectory from a torque. For example, the motion command conversion device 9 can employ the inverse dynamics model (IDM) described in the Transactions of the Institute of Electronics, Information and Communication Engineers D-II, Vol. J75-D-II, No. 5, pp. 991-999, May 1992, and the motion prediction device 10 can employ the forward dynamics model (FDM) described in the same journal. When such a motion command conversion device 9 and motion prediction device 10 are combined, a nonlinear optimization problem can be solved. Consequently, by means of the motion command conversion device 9 and the motion prediction device 10, the second motion trajectory generation device 7 generates the motion trajectory of each articulator by minimizing the integral, over the movement time, of the square of the torque change, the muscle tension change, the motion command change, or the motion command.

[0022] As a method other than the above for generating the motion trajectory, it is also possible to minimize the integral, over the movement time, of the squared jerk (the time derivative of acceleration) of the motion trajectory, computed approximately using spline functions. What has been said about the second motion trajectory generation device 7 applies equally to the first motion trajectory generation device 12.
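For intuition about the minimum-jerk criterion, the simplest case has a well-known closed form: for a single movement between two positions that starts and ends at rest (zero velocity and acceleration), the trajectory minimizing the integrated squared jerk is a fifth-order polynomial in normalized time. The sketch below covers only that two-point case; it is not the spline-based approximation for multiple via points mentioned in the text, and the function name is hypothetical.

```python
def minimum_jerk(x0, xf, duration, t):
    """Position at time t on the rest-to-rest minimum-jerk path from x0 to xf."""
    tau = min(max(t / duration, 0.0), 1.0)  # normalized time, clamped to [0, 1]
    shape = 10 * tau**3 - 15 * tau**4 + 6 * tau**5
    return x0 + (xf - x0) * shape


if __name__ == "__main__":
    # Sample a 0.2 s jaw-lowering movement from 0.0 to -1.0 (arbitrary units).
    for k in range(5):
        t = 0.05 * k
        print(f"t={t:.2f}  x={minimum_jerk(0.0, -1.0, 0.2, t):+.4f}")
```

The profile is symmetric about the midpoint of the movement, which gives the bell-shaped velocity curve typical of this criterion.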

[0023] The motion trajectory of the articulators generated in this way is input to the acoustic signal (speech waveform) generation device 8, which generates an acoustic signal from the input motion trajectory. The acoustic signal generated by the acoustic signal generation device 8 can therefore be compared with the acoustic signal input to the acoustic signal input device 1. If this comparison shows sufficient accuracy, the phoneme sequence 5 itself is taken as the speech recognition result.

[0024] If the accuracy is not sufficient, however, the via point estimation device 3 estimates the via points again, and the acoustic signal generation device 8 generates an acoustic signal again via the via point conversion device 6 and the second motion trajectory generation device 7. This is repeated until the generated acoustic signal is sufficiently accurate, that is, until the difference between the acoustic signal generated by the acoustic signal generation device 8 and the acoustic signal input to the acoustic signal input device 1 becomes smaller than a predetermined threshold value.

[0025] In this way, the speech recognition device can recognize the phoneme sequence obtained when the accuracy has become sufficient as the phoneme sequence corresponding to the acoustic signal input to the acoustic signal input device 1.
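The overall loop (estimate via points, convert to phonemes, resynthesize, compare, repeat) can be outlined as below. This is a structural sketch only: the three callables are placeholders for devices 3, 4 and 6 to 8; the summed squared difference, the `round_index` argument used to request a new estimate, and the `max_rounds` cap are all assumptions, since the text does not state how the difference is measured or how the re-estimation varies between rounds.

```python
def recognize(input_signal, estimate_via_points, to_phonemes, synthesize,
              threshold, max_rounds=20):
    """Return the phoneme sequence whose resynthesis matches the input, or None."""
    for round_index in range(max_rounds):
        via_points = estimate_via_points(input_signal, round_index)  # device 3
        phonemes = to_phonemes(via_points)                            # device 4
        generated = synthesize(phonemes)                              # devices 6 to 8
        # Assumed difference measure: summed squared sample difference.
        difference = sum((g - s) ** 2 for g, s in zip(generated, input_signal))
        if difference < threshold:
            return phonemes
    return None  # no sufficiently accurate resynthesis within max_rounds
```

Passing the stages in as functions keeps the loop independent of any particular articulatory model; the cap on rounds is a practical safeguard that the text itself does not mention.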

[0026] FIG. 2 is a block diagram showing the process, in a speech recognition device according to another embodiment of the present invention, from the input of an acoustic signal until the speech is recognized as a phoneme sequence corresponding to that acoustic signal.

[0027] In the embodiment shown in FIG. 1, the via points estimated by the via point estimation device 3 of the phoneme sequence extraction unit 20 are first converted into phonemes by the phoneme correspondence device 4. The resulting phoneme sequence 5 is converted by the via point conversion device 6 of the speech generation unit 30 into a sequence of target via points for the posture of the articulators, and the second motion trajectory generation device 7 generates a motion trajectory using this sequence of target via points.

[0028] In this embodiment, by contrast, the via point estimation device 3 and the second motion trajectory generation device 7 are connected directly, as shown in FIG. 2. In this case the via points are not converted into phonemes, but, as in the embodiment shown in FIG. 1, the acoustic signal generated by the acoustic signal generation device 8 can be compared with the acoustic signal input to the acoustic signal input device 1.

[0029] However, even though the acoustic signals can be compared, the object of the present invention is to recognize the input acoustic signal as a phoneme sequence.

[0030] Therefore, the phoneme correspondence device 4 converts into phonemes the via points obtained when the acoustic signal generated by the acoustic signal generation device 8 matches the acoustic signal input to the acoustic signal input device 1 with sufficient accuracy.

[0031] As a result, as in the embodiment shown in FIG. 1, the acoustic signal input to the acoustic signal input device 1 can be recognized as a phoneme sequence.

[0032]

[Effect of the invention] As described above, according to the present invention, an acoustic signal produced by the movement of the articulators is converted into, and extracted as, feature points that specify the movement of those articulators, so that the acoustic signal can be recognized as a phoneme sequence corresponding to the feature quantities.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the process, in a speech recognition device according to one embodiment of the present invention, from the input of an acoustic signal until the speech is recognized as a phoneme sequence corresponding to that acoustic signal.

FIG. 2 is a block diagram showing the process, in a speech recognition device according to another embodiment of the present invention, from the input of an acoustic signal until the speech is recognized as a phoneme sequence corresponding to that acoustic signal.

FIG. 3 is a diagram showing a preprocessed acoustic signal input to the speech recognition device according to one embodiment of the present invention.

FIG. 4 is a diagram showing the trajectories and via points of the articulators estimated by the via point estimation device of the speech recognition device according to one embodiment of the present invention.

FIG. 5 is a diagram showing the phoneme sequence obtained as the result of speech recognition by the phoneme sequence extraction unit of the speech recognition device according to one embodiment of the present invention.

[Explanation of symbols]

1 Acoustic signal (speech waveform) input device
2 Motion trajectory estimation device
3 Via point estimation device
4 Phoneme correspondence device
5 Phoneme sequence
6 Via point conversion device
7 Second motion trajectory generation device
8 Acoustic signal (speech waveform) generation device
9 Motion command conversion device
10 Motion prediction device
11 Via point extraction device
12 First motion trajectory generation device
20 Phoneme sequence extraction unit
30 Speech generation unit

Continuation of the front page

(56) References: Japanese Examined Patent Publication No. 52-14562 (JP, B2); Transactions of the Institute of Electronics, Information and Communication Engineers (May 1978), Vol. J61-A, No. 5, pp. 409-416

(58) Fields searched (Int. Cl.⁶, DB name): G10L 9/10 301; JICST file (JOIS)

Claims

(57) [Claims]

1. A speech recognition device comprising: motion trajectory estimating means for estimating a first motion trajectory of an articulator based on an input first acoustic signal; via point estimating means for estimating via points of the first motion trajectory based on the first motion trajectory estimated by the motion trajectory estimating means; phoneme correspondence means for converting each of the via points estimated by the via point estimating means into a phoneme to generate a phoneme sequence; and speech generating means for generating a second acoustic signal based on the via points estimated by the via point estimating means or the phoneme sequence generated by the phoneme correspondence means, wherein the second acoustic signal generated by the speech generating means is compared with the first acoustic signal, the estimation of the via points by the via point estimating means and the generation of the second acoustic signal by the speech generating means are repeated until the difference becomes smaller than a predetermined first threshold value, and the resulting phoneme sequence is output as a recognition result of the first acoustic signal.

2. The speech recognition device according to claim 1, wherein the via point estimating means includes: via point extracting means for extracting the via points on or near the first motion trajectory estimated by the motion trajectory estimating means; and first motion trajectory generating means for generating a second motion trajectory of the articulator based on the via points extracted by the via point extracting means, and wherein the second motion trajectory generated by the first motion trajectory generating means is compared with the first motion trajectory estimated by the motion trajectory estimating means, the extraction of the via points by the via point extracting means and the generation of the second motion trajectory by the first motion trajectory generating means are repeated until the difference becomes smaller than a predetermined second threshold value, and the resulting via points are output.

3. The speech recognition device according to claim 1, wherein the speech generating means includes: via point converting means for converting the phoneme sequence generated by the phoneme correspondence means into via points of a third motion trajectory of the articulator; second motion trajectory generating means for generating the third motion trajectory based on the via points converted by the via point converting means; and acoustic signal generating means for generating the second acoustic signal based on the third motion trajectory generated by the second motion trajectory generating means.