JP2007018006A

JP2007018006A - Speech synthesis system, speech synthesis method, and speech synthesis program

Info

Publication number: JP2007018006A
Application number: JP2006259082A
Authority: JP
Inventors: Hiroyuki Manabe; 宏幸真鍋; Akira Hiraiwa; 明平岩; Toshiaki Sugimura; 利明杉村
Original assignee: NTT Docomo Inc
Current assignee: NTT Docomo Inc
Priority date: 2006-09-25
Filing date: 2006-09-25
Publication date: 2007-01-25
Anticipated expiration: 2022-03-04
Also published as: JP4381404B2

Abstract

PROBLEM TO BE SOLVED: To maintain a high recognition rate in speech recognition, even at low sound volume, without being affected by surrounding noise.

SOLUTION: The speech recognition system includes: a sound information processing section 10 which acquires a sound signal and calculates a sound information parameter based on changes in the acquired sound signal; an electromuscular signal processing section 13 which acquires potential changes on the surface of an object as an electromuscular signal and calculates an electromuscular signal parameter based on changes in the acquired sound signal; an image information processing section 16 which acquires video of a photographed object as image information and calculates an image information parameter based on changes in the object in the video; a speech recognition means 20 which recognizes speech based on the sound information parameter, the electromuscular signal parameter, and the image information parameter; and a recognition result presentation means 21 which presents the recognition result produced by the speech recognition means.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a speech synthesis system, a speech synthesis method, and a speech synthesis program that recognize sounds such as speech and synthesize speech based on the recognized speech.

Conventional speech detection devices employ speech recognition technology that treats uttered speech as an acoustic signal and recognizes and processes the speech signal by analyzing the frequency content of that acoustic signal. Spectral envelopes and similar features are used for this purpose.

However, for this speech recognition technology to yield good detection results, the utterance must have a certain volume, and speech information cannot be detected unless an acoustic signal produced by the utterance is input. Consequently, because the speaker's voice disturbs people nearby during voice input, such speech detection devices could not be used in places where quiet is required, such as offices, libraries, and public facilities. In addition, in places with loud ambient noise, crosstalk occurs and speech detection performance degrades.

In response, research on acquiring speech information from sources other than acoustic signals has also been conducted. If speech information can be acquired from non-acoustic information, a user can speak without producing sound, which solves the problems described above. One speech recognition method based on visual information from the lips applies image processing to images captured by a video camera (see, for example, Patent Document 1 or Patent Document 2).

Furthermore, there is research on recognizing the type of vowel uttered by processing the myoelectric signals generated by the movement of muscles around the mouth (see, for example, Non-Patent Document 1). Non-Patent Document 1 describes passing the myoelectric signal through a band-pass filter and then counting the number of threshold crossings to discriminate the five vowels (a, i, u, e, o).
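The threshold-crossing count mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the sample values, and the threshold are hypothetical, the band-pass filtering step is omitted, and nothing here reproduces the actual procedure of Non-Patent Document 1.

```python
def count_threshold_crossings(samples, threshold):
    """Count how many times the signal crosses `threshold` in either direction."""
    crossings = 0
    for prev, cur in zip(samples, samples[1:]):
        # A crossing occurs when consecutive samples lie on opposite sides.
        if (prev < threshold) != (cur < threshold):
            crossings += 1
    return crossings


# Hypothetical, already band-pass-filtered myoelectric samples.
filtered_emg = [0.0, 0.4, 0.9, 0.3, -0.2, 0.7, 0.8, 0.1]
crossing_count = count_threshold_crossings(filtered_emg, 0.5)
```

A vowel classifier built on this feature would compare the counts obtained from each electrode channel against ranges learned per vowel; those ranges are not given in the text.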

As another approach, Japanese Patent Application Laid-Open No. 7-181888 discloses a method that processes the myoelectric signals of muscles around the mouth with a neural network and detects not only the vowels but also the consonants uttered by the speaker. In addition, multimodal interfaces that use multiple input channels rather than information from a single input channel have been proposed and implemented.

On the other hand, for conventional speech synthesis systems, a method has been developed in which data characterizing a speaker's voice is stored in advance and speech is synthesized in accordance with the speaker's utterances.
Patent Document 1: JP-A-52-112205
Patent Document 2: JP-A-6-43897
Non-Patent Document 1: Noboru Sugie et al., "A speech Employing a Speech Synthesizer Vowel Discrimination from Perioral Muscles Activities and Vowel Production," IEEE Transactions on Biomedical Engineering, Vol. 32, No. 7, pp. 485-490

However, speech detection methods that acquire speech information from sources other than acoustic information have a lower recognition rate than speech recognition that uses acoustic information. In particular, consonants produced by the movement of muscles inside the mouth have been difficult to recognize.

Moreover, because conventional speech synthesis systems synthesize speech based on data characterizing the speaker's voice, as described above, the synthesized speech sounds mechanical and unnatural, and the speaker's emotions and the like cannot be expressed appropriately.

The present invention has been made in view of the above problems, and an object thereof is to provide a speech recognition system, method, and program that can maintain a high recognition rate even at low volume without being affected by ambient noise. An object of another invention is to provide a speech synthesis system, method, and program that, by using speech recognition for speech synthesis, make the synthesized speech more natural and appropriately express the speaker's emotions and the like.

To solve the above problems, the present invention is characterized by: acquiring an acoustic signal and calculating an acoustic information parameter based on changes in the acquired acoustic signal; acquiring potential changes on the surface of an object as a myoelectric signal and calculating a myoelectric signal parameter based on changes in the acquired acoustic signal; acquiring video of the photographed object as image information and calculating an image information parameter based on changes in the object in the video; recognizing speech based on the acoustic information parameter, the myoelectric signal parameter, and the image information parameter; and presenting the recognition result.

According to the present invention, speech recognition is performed using a plurality of parameters derived from the acoustic signal, the myoelectric signal, and the image information, so noise robustness and the like can be greatly improved.

Another invention is characterized by: recognizing speech; acquiring the spectrum of the acoustic signal from the acoustic information as a first spectrum; generating, as a second spectrum, the spectrum of an acoustic signal reconstructed from the recognition result of the speech recognition means; comparing the first spectrum with the second spectrum; generating a corrected spectrum according to the comparison result; and outputting speech synthesized from the corrected spectrum.

According to this invention, speech is synthesized based not only on the spectrum obtained from the acoustic information but also on the spectrum obtained from speech recognized using the other parameters, so ambient noise can be effectively removed.

In the speech recognition of the above two inventions, it is desirable to perform recognition processing on each of the acoustic information parameter, the myoelectric signal parameter, and the image information parameter, compare the individual recognition results, and perform final recognition processing based on the comparison. Alternatively, the speech recognition may perform recognition processing using the acoustic information parameter, the myoelectric signal parameter, and the image information parameter simultaneously.

As another form of speech recognition, it is preferable to construct a hierarchical network in which element groups, each a set of nonlinear elements having a data input part and an output part, are arranged hierarchically from upstream to downstream; the output parts of upstream nonlinear elements are connected to the input parts of downstream nonlinear elements between adjacent element groups; each nonlinear element is assigned a weighting coefficient for each connection to its input part and for each combination of those connections; and the data output downstream and the connections from the output part are determined according to the data input to the input part and the weighting coefficients. The acoustic information parameter, the myoelectric signal parameter, and the image information parameter are input from the upstream side, and the data output from the most downstream element group is taken as the recognized speech.

When this hierarchical network is used, a learning function that changes the weighting coefficients assigned to each nonlinear element can be realized by inputting sample data from the downstream side of the network and propagating the data back toward the upstream side.

As described above, according to the speech recognition system, method, and program of the present invention, a high recognition rate can be maintained even at low volume without being affected by ambient noise. In addition, according to the speech synthesis system, method, and program of the other invention, using speech recognition for speech synthesis makes the synthesized speech more natural and allows the speaker's emotions and the like to be expressed appropriately.

[First Embodiment]
(Basic configuration)
A speech recognition system according to an embodiment of the present invention is described in detail below. FIG. 1 is a block diagram showing the basic configuration of the speech recognition system according to the present embodiment.

As shown in the figure, the speech recognition system includes an acoustic information processing unit 10, a myoelectric signal processing unit 13, an image information processing unit 16, and an information comprehensive recognition unit 19.

The acoustic information processing unit 10 processes the acoustic information produced during an utterance. It includes acoustic signal acquisition means 11 for acquiring the acoustic signal during the utterance, and acoustic signal processing means 12 for extracting acoustic information parameters, for example by separating the spectral envelope and the spectral fine structure of the acoustic signal obtained by the acoustic signal acquisition means.

The acoustic signal acquisition means 11 is a sound pickup device such as a microphone. It detects the sound produced during an utterance and transmits the acquired acoustic signal to the acoustic signal processing means 12.

The acoustic signal processing means 12 calculates, from the acoustic signal acquired by the acoustic signal acquisition means 11, acoustic information parameters that the speech recognition means 20 can process. It cuts the acoustic signal into segments with a set time window and calculates the acoustic information parameters by applying to each segment an analysis method commonly used in speech recognition, such as short-time spectrum analysis, cepstrum analysis, maximum likelihood spectrum estimation, the covariance method, PARCOR analysis, or LSP analysis.

The myoelectric signal processing unit 13 detects and processes the movement of muscles around the mouth during an utterance. It includes myoelectric signal acquisition means 14 for acquiring the myoelectric signals that accompany the movement of muscles around the mouth during the utterance, and myoelectric signal processing means 15 for extracting myoelectric signal parameters, for example by calculating the power of, or performing frequency analysis on, the myoelectric signals obtained by the myoelectric signal acquisition means.

The myoelectric signal acquisition means 14 is a device that detects the myoelectric signals accompanying the activity of muscles around the mouth during an utterance; it detects potential changes on the skin surface around the speaker's mouth. During an utterance, multiple muscles around the mouth act in coordination. To capture the activity of these muscles, the myoelectric signal acquisition means 14 derives multiple myoelectric signals from multiple skin surface electrodes corresponding to the muscles that are active during the utterance, amplifies them, and transmits them to the myoelectric signal processing means 15.

The myoelectric signal processing means 15 is a device that calculates myoelectric signal parameters from the multiple myoelectric signals transmitted from the myoelectric signal acquisition means 14. Specifically, it cuts the myoelectric signals into segments with a set time window and, for each segment, performs spectrum analysis or calculates an average-amplitude feature such as the root mean square (RMS), the average rectified value (ARV), or the integrated electromyogram (IEMG) to obtain the myoelectric signal parameters.
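The average-amplitude features named above can be computed per time window roughly as in the following sketch. The window length, hop size, sampling interval, and sample values are hypothetical, and the IEMG is taken here as the rectified signal integrated over the window, which is one common definition rather than one confirmed by the text.

```python
import math


def frame_signal(samples, window, hop):
    """Cut the signal into windows of `window` samples, advancing by `hop`."""
    return [samples[i:i + window]
            for i in range(0, len(samples) - window + 1, hop)]


def rms(frame):
    """Root mean square of one window."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def arv(frame):
    """Average rectified value: mean of the absolute amplitudes."""
    return sum(abs(x) for x in frame) / len(frame)


def iemg(frame, dt):
    """Integrated EMG: rectified signal integrated over the window (dt = sample period)."""
    return sum(abs(x) for x in frame) * dt


# Hypothetical single-channel myoelectric samples.
emg = [3.0, -4.0, 1.0, -1.0, 2.0, -2.0]
frames = frame_signal(emg, window=2, hop=2)
emg_params = [(rms(f), arv(f), iemg(f, dt=0.5)) for f in frames]
```

With several electrodes, the same features would be computed per channel and concatenated into one myoelectric signal parameter vector per window.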

The image information processing unit 16 detects the spatial deformation around the mouth during an utterance and performs image processing. It includes image information acquisition means 17 for capturing the spatial deformation around the mouth during the utterance with a video camera, and image information processing means 18 for extracting motion parameters around the lips from the image information obtained by the image information acquisition means 17.

The image information acquisition means 17 is an imaging device such as a video camera that captures the movement around the mouth during an utterance. It detects the movement around the mouth as images and transmits them to the image information processing means 18.

The image information processing means 18 is a device that calculates image information parameters from the image information obtained by the image information acquisition means 17. Specifically, it extracts features of the movement around the mouth from the image information by optical flow and calculates the image information parameters from them.

The information comprehensive recognition unit 19 integrates and recognizes the various information obtained from the acoustic information processing unit, the myoelectric signal processing unit, and the image information processing unit, and presents the recognition result. It includes speech recognition means 20, which compares and integrates the acoustic information parameters, the myoelectric signal parameters, and the image information parameters obtained by the respective units and decides the speech recognition result, and recognition result presentation means 21, which presents the recognition result obtained by the speech recognition means.

The speech recognition means 20 is an arithmetic device that performs speech recognition using the acoustic information parameters, myoelectric signal parameters, and image information parameters acquired from the units 10, 13, and 16. When speech can be recognized adequately from the acoustic information parameters, for example when ambient noise is low or the utterance is loud, the speech recognition means 20 can perform speech recognition from the acoustic information parameters alone. When speech cannot be recognized adequately from the acoustic information parameters alone, for example when ambient noise is high or the utterance is quiet, the speech recognition means 20 can perform speech recognition that also takes into account the information obtained from the myoelectric signal parameters and the image information parameters.

Furthermore, when the recognition rate is low during speech recognition using the myoelectric signal parameters and the image information parameters, the speech recognition means 20 can raise the overall recognition rate by using the acoustic information parameters for phonemes and the like that would otherwise be misrecognized.

The recognition result presentation means 21 is an output device that outputs the recognition result of the speech recognition means 20. It can be, for example, a speech output device that plays the recognition result back to the speaker as audio, or a display monitor such as a liquid crystal display that shows the result as text on a screen. By providing a communication interface or the like, the recognition result presentation means 21 may also output the speech recognition result as data to an application running on a terminal device such as a personal computer, in addition to presenting it to the speaker.

(Basic operation)
The speech recognition system according to the present embodiment, having the basic configuration described above, operates as follows. FIG. 2 is a flowchart showing the operation of the speech recognition system according to the present embodiment.

First, the speaker starts speaking (S101). The acoustic signal, myoelectric signal, and image information produced while the speaker is speaking are detected by the acoustic signal acquisition means 11, the myoelectric signal acquisition means 14, and the image information acquisition means 17, respectively (S102 to S104). From the detected acoustic signal, myoelectric signal, and image information, the acoustic signal processing means 12, the myoelectric signal processing means 15, and the image information processing means 18 calculate the acoustic information parameters, the myoelectric signal parameters, and the image information parameters, respectively (S105 to S107).

The speech recognition means 20 performs speech recognition on the calculated parameters (S108), and the recognition result presentation means 21 presents the speech recognition result (S109). As described above, the recognition result can be presented as audio or displayed on a screen.

(Operation of each means)
The operation of each means in the basic configuration is described in detail below.

(1) Speech recognition means
FIG. 4 is a block diagram illustrating the speech recognition means 20. Here, recognition processing is performed on each of the acoustic information parameters, the myoelectric signal parameters, and the image information parameters, the individual recognition results are compared, and final recognition processing is performed based on the comparison.

Specifically, as shown in the figure, before performing final speech recognition, the speech recognition means 20 according to the present embodiment performs speech recognition separately using only the acoustic information parameters, only the myoelectric signal parameters, and only the image information parameters, and then integrates the recognition results obtained from each to produce the final result. If two or more of the individual recognition results agree, the agreed result becomes the final recognition result. If none of the results agree, the result judged to have the highest recognition confidence becomes the final recognition result.
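The integration rule just described (agreement first, otherwise highest confidence) can be sketched as below. The dictionary layout, the modality names, and the numeric confidence scores are assumptions made for illustration; the text does not say how recognition confidence is measured.

```python
def integrate_results(results):
    """Pick a final label from per-modality results.

    `results` maps a modality name to a (label, confidence) pair.
    """
    votes = {}
    for label, _confidence in results.values():
        votes[label] = votes.get(label, 0) + 1

    best_label, best_count = max(votes.items(), key=lambda kv: kv[1])
    if best_count >= 2:
        # Two or more recognizers agree: use the agreed label.
        return best_label

    # No agreement: fall back to the most confident single recognizer.
    return max(results.values(), key=lambda r: r[1])[0]


agreeing = {"acoustic": ("a", 0.9), "emg": ("i", 0.6), "image": ("i", 0.5)}
disagreeing = {"acoustic": ("a", 0.4), "emg": ("i", 0.7), "image": ("u", 0.5)}
final_agreeing = integrate_results(agreeing)
final_disagreeing = integrate_results(disagreeing)
```

In the first example the myoelectric and image recognizers outvote a more confident acoustic recognizer, which matches the rule as stated; a deployment might prefer to weigh confidence against agreement, but that is outside what the text specifies.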

Also, suppose it is known in advance that speech recognition using only the myoelectric signal parameters has a low recognition rate for a particular phoneme or utterance pattern. When the speech recognition results from the other parameters suggest that such an utterance is being made, the recognition rate of the final result can be improved by ignoring the recognition result obtained from the myoelectric signal parameters.

Furthermore, when the acoustic information parameters indicate that ambient noise is high or that the utterance is quiet, the final speech recognition reduces the influence of the recognition result obtained from the acoustic information parameters and gives more weight to the recognition results obtained from the myoelectric signal parameters and the image information parameters. Commonly used methods can be applied for the speech recognition performed on each individual parameter.
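One way to reduce the influence of the acoustic result under noise is a weighted sum of per-modality scores, sketched here. The weights, the noise threshold, and the per-label score format are all hypothetical; the text only states that the acoustic result should count for less and the other two for more.

```python
def modality_weights(noise_level, noise_threshold=0.5):
    """Return per-modality weights; the acoustic weight is reduced under noise."""
    if noise_level > noise_threshold:
        return {"acoustic": 0.2, "emg": 1.0, "image": 1.0}
    return {"acoustic": 1.0, "emg": 1.0, "image": 1.0}


def weighted_decision(scores, weights):
    """Sum weighted per-label scores over modalities and return the best label."""
    totals = {}
    for modality, label_scores in scores.items():
        w = weights.get(modality, 0.0)
        for label, score in label_scores.items():
            totals[label] = totals.get(label, 0.0) + w * score
    return max(totals, key=totals.get)


# Hypothetical per-label scores from each single-modality recognizer.
scores = {
    "acoustic": {"a": 0.9, "i": 0.1},
    "emg": {"a": 0.3, "i": 0.7},
    "image": {"a": 0.4, "i": 0.6},
}
quiet_result = weighted_decision(scores, modality_weights(noise_level=0.1))
noisy_result = weighted_decision(scores, modality_weights(noise_level=0.9))
```

In quiet conditions the confident acoustic score dominates, while under noise the myoelectric and image scores decide the outcome.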

Instead of the above method, the speech recognition means 20 may perform speech recognition from the three parameters together. FIG. 3 is an explanatory diagram illustrating the operation of the speech recognition means 20 when it performs speech recognition from the three parameters.

One method of performing speech recognition from the three parameters uses a neural network. As shown in the figure, this neural network is constructed by arranging element groups, each a set of nonlinear elements having a parameter input part and an output part, hierarchically from upstream to downstream, and by connecting the output parts of the upstream nonlinear elements to the input parts of the downstream nonlinear elements between adjacent element groups.

Each nonlinear element is assigned a weighting coefficient for each connection to its input part and for each combination of those connections, and the parameters output downstream and the connections from the output part are determined according to the parameters input to the input part and the weighting coefficients. Specifically, in the speech recognition means 20, the network receives the acoustic information parameters, the myoelectric signal parameters, and the image information parameters as input and outputs vowels and consonants.

In the present embodiment, a fully connected three-layer neural network is used (see Nishikawa and Kitamura, "Neural Networks and Measurement Control", Asakura Shoten, pp. 18-50).
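A forward pass through a fully connected three-layer network (input, hidden, output) fed with the concatenated parameters might look like the following sketch. The layer sizes, the sigmoid activation, the random initial weights, and the five-vowel output set are assumptions for illustration; the text does not give the dimensions or the activation function used.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_layer(n_in, n_out, rng):
    """One weight row per unit; the last entry of each row is the bias."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
            for _ in range(n_out)]


def layer_forward(weights, inputs):
    """Weighted sum of the inputs plus bias, passed through the nonlinearity."""
    return [sigmoid(sum(w * x for w, x in zip(row[:-1], inputs)) + row[-1])
            for row in weights]


def forward(hidden_w, output_w, acoustic, emg, image):
    """Concatenate the three parameter vectors and propagate them downstream."""
    x = list(acoustic) + list(emg) + list(image)
    hidden = layer_forward(hidden_w, x)
    return layer_forward(output_w, hidden)


PHONEMES = ["a", "i", "u", "e", "o"]  # hypothetical output set
rng = random.Random(0)
hidden_w = init_layer(6, 4, rng)              # 2 + 2 + 2 input parameters
output_w = init_layer(4, len(PHONEMES), rng)
out = forward(hidden_w, output_w, [0.2, 0.7], [0.1, 0.9], [0.5, 0.3])
predicted = PHONEMES[out.index(max(out))]
```

With untrained random weights the predicted phoneme is meaningless; the weights have to be learned beforehand.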

In this neural network, the weighting coefficients must be learned in advance. In the present embodiment, learning is performed by backpropagation. To that end, the speaker performs utterances following utterance patterns prepared in advance, the accompanying acoustic information parameters, myoelectric signal parameters, and image information parameters are acquired, and the weighting coefficients are learned from these parameters using the prepared utterance patterns as teacher signals. This learning process is described later.
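A minimal backpropagation sketch for such a three-layer network is given below, assuming sigmoid units, a squared-error loss, and per-sample gradient descent. The two training samples, the one-hot teacher signals, the learning rate, and the layer sizes are invented for illustration and are not taken from the text.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(w_hid, w_out, x):
    """Return hidden and output activations; the last weight in each row is the bias."""
    h = [sigmoid(sum(w * v for w, v in zip(row[:-1], x)) + row[-1]) for row in w_hid]
    y = [sigmoid(sum(w * v for w, v in zip(row[:-1], h)) + row[-1]) for row in w_out]
    return h, y


def train_step(w_hid, w_out, x, target, lr):
    """One backpropagation update; returns the squared error before the update."""
    h, y = forward(w_hid, w_out, x)
    # Output-layer error terms (squared error, sigmoid derivative).
    d_out = [(y[k] - target[k]) * y[k] * (1.0 - y[k]) for k in range(len(y))]
    # Hidden-layer error terms: output errors propagated back through w_out.
    d_hid = [h[j] * (1.0 - h[j]) * sum(d_out[k] * w_out[k][j] for k in range(len(y)))
             for j in range(len(h))]
    for k, row in enumerate(w_out):
        for j in range(len(h)):
            row[j] -= lr * d_out[k] * h[j]
        row[-1] -= lr * d_out[k]
    for j, row in enumerate(w_hid):
        for i in range(len(x)):
            row[i] -= lr * d_hid[j] * x[i]
        row[-1] -= lr * d_hid[j]
    return sum((y[k] - target[k]) ** 2 for k in range(len(y)))


rng = random.Random(1)
w_hid = [[rng.uniform(-0.5, 0.5) for _ in range(3 + 1)] for _ in range(4)]
w_out = [[rng.uniform(-0.5, 0.5) for _ in range(4 + 1)] for _ in range(2)]
# Hypothetical (parameter vector, teacher signal) pairs for two utterance patterns.
samples = [([1.0, 0.0, 0.2], [1.0, 0.0]), ([0.0, 1.0, 0.8], [0.0, 1.0])]

first_error = sum(train_step(w_hid, w_out, x, t, 0.5) for x, t in samples)
last_error = first_error
for _ in range(2000):
    last_error = sum(train_step(w_hid, w_out, x, t, 0.5) for x, t in samples)
```

The hidden-layer error terms are computed before any weight is modified, so the backward pass uses the same weights as the forward pass.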

In the utterance action a speaker performs when speaking, the myoelectric signal arises earlier in time than the acoustic signal and the image information. The speech recognition means 20 according to the present embodiment therefore has a function of synchronizing the acoustic signal, the myoelectric signal, and the image information by delaying only the myoelectric signal parameters.

The neural network of the speech recognition means 20, having received the various parameters as input, outputs the phoneme to which the input parameters correspond. When a phoneme is uttered, the corresponding myoelectric signal is produced earlier in time than the acoustic signal and the image information, so the parameters can also be synchronized by delaying the myoelectric signal before it is input to the neural network.
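Delaying the myoelectric parameter stream before concatenation can be sketched as follows, assuming all three streams are already framed at the same rate. The lead of one frame and the zero padding are hypothetical choices; the text does not state how large the delay is.

```python
def delay_stream(frames, delay, fill):
    """Delay a frame sequence by `delay` frames, padding the start with `fill`."""
    if delay <= 0:
        return list(frames)
    return ([fill] * delay + list(frames))[:len(frames)]


def synchronize(acoustic, emg, image, emg_lead_frames):
    """Build one network input per frame, with the EMG stream delayed."""
    zero_frame = [0.0] * len(emg[0])
    emg_delayed = delay_stream(emg, emg_lead_frames, zero_frame)
    return [a + e + v for a, e, v in zip(acoustic, emg_delayed, image)]


# Hypothetical one-dimensional parameter frames for three time steps.
acoustic = [[1.0], [2.0], [3.0]]
emg = [[10.0], [20.0], [30.0]]
image = [[100.0], [200.0], [300.0]]
inputs = synchronize(acoustic, emg, image, emg_lead_frames=1)
```

After the shift, the myoelectric frame that preceded an acoustic frame by one step sits in the same input vector as that acoustic frame.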

A recurrent neural network that feeds the immediately preceding recognition result back to the input can also be used as this neural network. In the present invention, the algorithm used for recognition is not limited to neural networks; other recognition algorithms such as hidden Markov models (HMMs) can also be used.

With this speech recognition means 20, even if one of the acoustic information parameters, the myoelectric signal parameters, and the image information parameters becomes useless for speech recognition, because the utterance is quiet, the ambient noise is loud, or the myoelectric signal could not be detected properly, the final speech recognition can still be performed using the parameters that remain meaningful, so noise robustness and the like are greatly improved.

In the speech recognition means 20 according to the present embodiment, the speech recognition based on acoustic information can use any of the acoustic-signal-based speech recognition methods currently in use. The speech recognition based on myoelectric signals can use the method described in Noboru Sugie et al., "A speech Employing a Speech Synthesizer Vowel Discrimination from Perioral Muscles Activities and Vowel Production," IEEE Transactions on Biomedical Engineering, Vol. 32, No. 7, pp. 485-490, or the method described in Japanese Patent Application Laid-Open No. 7-181888. The speech recognition based on image information can use the method described in Japanese Patent Application Laid-Open No. 2001-51693 or No. 2000-206986. Speech recognition methods other than those listed above can also be used.

Furthermore, speech recognition in the present invention may be performed using only one of the method shown in FIG. 3 and the method shown in FIG. 4. Alternatively, the method shown in FIG. 4 may be performed first and, when speech cannot be recognized with any of the parameters, speech recognition using the neural network shown in FIG. 3 may be performed. The final speech recognition may also be performed by comparing or integrating the recognition result of the method shown in FIG. 3 with that of the method shown in FIG. 4.

In the present embodiment, the neural network shown in FIG. 3 has been described as an example of a method for performing speech recognition using the three parameters. However, the present invention is not limited to this, and speech can also be recognized from the three parameters by a method other than a neural network.

(2) Acoustic signal processing means and myoelectric signal processing means
The operations of the acoustic signal processing means 12 and the myoelectric signal processing means 15 described above will now be described in detail. FIG. 6 is a diagram for explaining an example of extracting the acoustic information parameter and the myoelectric signal parameter.

The acoustic signal and the myoelectric signal detected by the acoustic signal acquisition means 11 and the myoelectric signal acquisition means 14 are first cut out with a time window by the acoustic signal processing means 12 and the myoelectric signal processing means 15, respectively ((a) in the figure). Next, a spectrum is extracted from the cut-out signal using an FFT ((b) in the figure). A 1/3-octave analysis is then performed on the extracted spectrum ((c) in the figure), and the power of each band is calculated and used as the acoustic information parameter or the myoelectric signal parameter ((d) in the figure). These parameters are sent to the speech recognition means 20, where speech recognition is performed.
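The extraction steps (a) to (d) can be sketched roughly as follows. This is a minimal illustration, not the implementation of the embodiment: a naive DFT stands in for the FFT, a plain slice stands in for the time window, and the function names, the lowest band edge `f_low`, and the band count `n_bands` are all assumptions introduced here.

```python
import cmath
import math


def power_spectrum(frame):
    """(b) Power spectrum of one frame via a naive DFT (stand-in for an FFT)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):  # bins from 0 Hz up to the Nyquist frequency
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def third_octave_band_powers(spectrum, bin_hz, f_low=100.0, n_bands=9):
    """(c)+(d) Sum the power in consecutive 1/3-octave bands.

    f_low and n_bands are hypothetical choices, not values from the source.
    Band b covers [f_low * 2**(b/3), f_low * 2**((b+1)/3)).
    """
    powers = []
    for b in range(n_bands):
        lo = f_low * 2 ** (b / 3)
        hi = f_low * 2 ** ((b + 1) / 3)
        powers.append(sum(p for k, p in enumerate(spectrum)
                          if lo <= k * bin_hz < hi))
    return powers


def extract_band_powers(signal, sample_rate, start, length):
    """(a) Cut out a frame, then return its 1/3-octave band powers."""
    frame = signal[start:start + length]
    spectrum = power_spectrum(frame)
    return third_octave_band_powers(spectrum, sample_rate / len(frame))
```

The same routine could be applied to the acoustic signal and to the myoelectric signal, with `f_low` and `n_bands` chosen to suit each signal's frequency range.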

Note that the acoustic information parameter and the myoelectric signal parameter in the present invention can also be extracted by methods other than that shown in FIG. 6.

(3) Image information processing means
The operation of the image information processing means 18 described above will now be described in detail. FIG. 7 is a diagram for explaining a method of extracting the image information parameter.

First, the positions of feature points around the mouth are extracted from an image of the area around the mouth at time t0 ((a) in the figure, S501). The positions can be extracted by attaching markers to the feature points around the mouth and taking the marker positions as the feature point positions, or by searching for the feature points in the captured image. The positions may be two-dimensional positions on the image, or three-dimensional positions may be extracted using a plurality of cameras.

Next, as at time t0, the positions of the feature points around the mouth are extracted at time t1, when Δt has elapsed from time t0 ((b) in the figure, S502). The amount of movement of each feature point is then calculated as the difference between its positions at time t0 and time t1 ((c) in the figure, S503). The parameter is generated from this calculation result ((d) in the figure, S504).
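Steps S503 and S504 amount to a per-point difference of coordinates. The sketch below assumes, purely for illustration, that the parameter is the flattened list of displacement components. The source does not say how the parameter is assembled from the movement amounts, and the function name is hypothetical.

```python
def movement_parameters(points_t0, points_t1):
    """Return the displacement of each feature point between t0 and t1.

    Each point is a tuple of coordinates (2-D or 3-D). The result is a flat
    list [dx0, dy0, dx1, dy1, ...]; this layout is an assumption.
    """
    if len(points_t0) != len(points_t1):
        raise ValueError("both frames must contain the same feature points")
    params = []
    for p0, p1 in zip(points_t0, points_t1):
        # S503: movement amount = position at t1 minus position at t0
        params.extend(c1 - c0 for c0, c1 in zip(p0, p1))
    return params  # S504: used as the image information parameter
```

For example, two mouth-corner points that move from `(0, 0)` and `(2, 1)` to `(1, 0)` and `(2, 3)` give the parameter list `[1, 0, 0, 2]`.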

Note that the image information parameter can also be extracted by methods other than that shown in FIG. 7.

(Learning process)
Next, the learning process mentioned above will be described. FIG. 8 is a flowchart for explaining the learning process in the present embodiment. To improve speech recognition accuracy in the present embodiment, it is important to learn the characteristics of the individual speaker's utterances. The learning method described here assumes that speech recognition is performed using the neural network described above; when speech recognition is performed by another method, a learning method suited to that method is adopted as appropriate.

In the present embodiment, as shown in the figure, the speaker first starts an utterance (S301, S302). While speaking, the speaker inputs the content being uttered, that is, the teacher data (sample data) for learning, using a keyboard or the like (S305). In parallel, the speech recognition system detects the acoustic signal, the myoelectric signal, and the image information (S303), and extracts parameters from each of them (S304).

Learning is then performed on the extracted parameters based on the teacher signal input from the keyboard (S306). That is, the teacher data is input from the downstream side of the hierarchical network described above and propagated back toward the upstream side, thereby changing the weighting coefficient assigned to each nonlinear element.

Thereafter, when the recognition error during learning falls to or below a certain value, it is determined that learning is complete (S307), and learning ends (S308). On the other hand, if it is determined in step S307 that learning is not yet complete, learning is repeated through steps S302 to S306.
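The repeat-until-converged structure of steps S302 to S308 can be illustrated with a deliberately simplified sketch. It trains a single sigmoid unit with the delta rule rather than the multi-layer hierarchical network of the embodiment, and the function names, learning rate, error threshold, and epoch limit are all assumptions made for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_until_converged(samples, targets, error_threshold=0.01,
                          learning_rate=0.5, max_epochs=10000):
    """Adjust weights from teacher data until the mean squared error is small.

    Single-unit simplification: one weight per input parameter plus a bias.
    Returns (weights, bias, epochs_used, final_error).
    """
    weights = [0.0] * len(samples[0])
    bias = 0.0
    error = float("inf")
    for epoch in range(1, max_epochs + 1):
        squared_error = 0.0
        for x, target in zip(samples, targets):
            output = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            delta = target - output  # teacher signal minus network output
            squared_error += delta * delta
            # S306: change the weighting coefficients using the error
            weights = [w + learning_rate * delta * xi
                       for w, xi in zip(weights, x)]
            bias += learning_rate * delta
        error = squared_error / len(samples)
        if error <= error_threshold:  # S307: error at or below the set value
            return weights, bias, epoch, error  # S308: learning ends
    return weights, bias, max_epochs, error


def predict(weights, bias, x):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
```

In a multi-layer network the weight update would be replaced by backpropagation through every layer; only the outer loop and the stopping test carry over unchanged.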

(Effect)
According to the speech recognition system of the present embodiment described above, speech recognition is performed using a plurality of parameters obtained from acoustic information, the myoelectric signal, and image information, so noise resistance and the like are greatly improved. That is, having three types of input interface makes the system less susceptible to noise and the like, and even if one of the three interfaces cannot be used, speech recognition can still be performed using the remaining interfaces, which improves the speech recognition rate. As a result, it is possible to provide a speech recognition system that can recognize speech adequately even when the speaker speaks at a low volume or in a place with loud ambient noise.

[Second Embodiment]
A speech synthesis system can be configured by applying the speech recognition system described above. FIG. 9 is a flowchart showing the operation when speech synthesis is performed using the speech recognition system described above.

As shown in the figure, the speech synthesis system according to the present embodiment performs operation steps S202 to S208 of the speech recognition system described above, then, in step S209, removes noise other than the acoustic signal uttered by the speaker from the detected acoustic signal, and, in step S20, outputs clear synthesized speech.

To describe this speech synthesis in detail: as shown in FIG. 10, the present embodiment uses the recognition result of the speech recognition system to reconstruct the spectrum of the uttered phoneme from feature quantities such as its formant frequencies. This reconstructed spectrum ((a) in the figure) is multiplied by the spectrum of the detected acoustic signal, which contains noise components ((c) in the figure), to obtain the spectrum of the speech with the noise removed ((d) in the figure). The noise-removed speech spectrum is then converted by an inverse Fourier transform and output as a noise-removed acoustic signal ((b) in the figure). In other words, the acoustic signal containing noise components is output after passing through a filter whose frequency characteristic is given by the reconstructed spectrum.
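The filtering described above can be sketched as a bin-by-bin multiplication in the frequency domain followed by an inverse transform. The example below is only an illustration under stated assumptions: naive DFT and inverse DFT routines stand in for the Fourier transforms, the reconstructed spectrum is represented as a list of real gains (one per frequency bin), and the function names are hypothetical. How the gains are derived from formant frequencies is not shown.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse DFT, returning the real part of each output sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def filter_with_reconstructed_spectrum(noisy_signal, reconstructed_gains):
    """Multiply the noisy spectrum by the reconstructed spectrum, then invert.

    reconstructed_gains must hold one real gain per DFT bin and be symmetric
    (gain[k] == gain[n - k]) so that the output stays real-valued.
    """
    noisy_spectrum = dft(noisy_signal)            # spectrum (c)
    cleaned = [s * g for s, g in zip(noisy_spectrum, reconstructed_gains)]
    return idft(cleaned)                          # acoustic signal (b)
```

With a 32-sample frame containing a speech-like tone at bin 4 plus an interfering tone at bin 10, gains of 1.0 at bins 4 and 28 and 0.0 elsewhere return the bin-4 tone alone, since every bin the reconstructed spectrum does not cover is suppressed.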

According to the speech recognition system of the present embodiment, speech recognition is performed by various methods, and the signal reconstructed from the recognition result and the detected acoustic signal are used to separate the acoustic signal uttered by the speaker from the ambient noise. As a result, even when the ambient noise is loud, clear synthesized speech that preserves the speaker's own voice can be output. Thus, according to the present embodiment, even when the speaker speaks at a low volume or in a noisy place, synthesized speech can be output that sounds to the other party as if the speaker were speaking normally in a noise-free environment.

In the present embodiment, the method of the first embodiment described above is used for the speech recognition process. However, the present invention is not limited to this; speech recognition may be performed using parameters other than acoustic information, and speech may be synthesized from that result together with the acoustic information.

[Third Embodiment]
The speech recognition system and the speech synthesis system described above can be implemented in the following form. FIG. 11 is a diagram for explaining the third embodiment of the speech recognition and synthesis system.

As shown in the figure, the speech recognition and synthesis system according to the present embodiment comprises a mobile phone body 30 and a wristwatch-type terminal 31 that is separate from the mobile phone body 30.

The mobile phone body 30 is a known mobile phone to which the acoustic information processing unit 10, the myoelectric signal processing unit 13, the speech recognition means 20, and the speech synthesis means described above have been added. The myoelectric signal acquisition means 14 and the acoustic signal acquisition means 11 are provided on the surface of the mobile phone body 30. In the present embodiment, the myoelectric signal acquisition means 14 consists of a plurality of skin surface electrodes arranged so that they can contact the skin of the speaker 32, and the acoustic signal acquisition means 11 consists of a microphone placed near the mouth of the speaker 32.

The mobile phone body 30 also has built-in communication means and has a function of transmitting the synthesized speech, synthesized based on the recognition result of the speech recognition means 20, as the call voice of the speaker 32.

The wristwatch-type terminal 31 includes the image information processing unit 16 and the recognition result presentation means 21 described above. It has a video camera, provided on the surface of the wristwatch-type terminal 31, serving as the image information acquisition means 17, and a screen display device serving as the recognition result presentation means 21.

In a speech recognition and synthesis system configured in this way, the myoelectric signal acquisition means 14 and the acoustic signal acquisition means 11 of the mobile phone body 30 acquire the myoelectric signal and the acoustic signal from the speaker 32, and the image information acquisition means 17 of the wristwatch-type terminal 31 acquires image information of the speaker 32. The mobile phone body 30 and the wristwatch-type terminal 31 communicate with each other by wire or wirelessly; the signals are gathered in the speech recognition means 20 built into the mobile phone body 30, speech recognition is performed, and the recognition result is sent by wire or wirelessly to the wristwatch-type terminal 31 and displayed on the recognition result presentation means 21. In addition, the mobile phone body 30 synthesizes clear speech with the ambient noise removed, based on the recognition result, and transmits it to the other party of the call.

In the present embodiment, the speech recognition means is built into the mobile phone body 30 and the recognition result is displayed on the recognition result presentation means 21 of the wristwatch-type terminal 31. However, the speech recognition means may instead be provided on the wristwatch-type terminal 31 side, or speech recognition and speech synthesis may be performed on another terminal capable of communicating with the devices 30 and 31. The recognition result may be output as voice from the mobile phone body 30, displayed on the screen of the wristwatch-type terminal 31 (or the mobile phone body 30), or output to another terminal that communicates with them.

[Fourth Embodiment]
The speech recognition system and the speech synthesis system described above can also be implemented in the following form. FIG. 12 is a diagram for explaining the fourth embodiment of the present invention.

As shown in the figure, the speech recognition and synthesis system according to the present embodiment comprises a glasses-shaped holding device 41 that can be worn on the head of the speaker 32; a video camera serving as the image information acquisition means 17, fixed to the holding device 41 so that it can capture the area around the mouth of the speaker 32, who is the sound source; a fixing part 42; a see-through HMD serving as the recognition result presentation means 21; and speech recognition means built into the holding device 41. Skin surface electrodes serving as the myoelectric signal acquisition means 14 and a microphone serving as the acoustic signal acquisition means 11 are attached to the fixing part 42.

By wearing such a speech recognition and synthesis system, the speaker 32 can perform speech recognition and speech synthesis hands-free.

The speech recognition means may be housed in the holding device 41 or in an external terminal capable of communicating with the holding device 41. The recognition result may be displayed on the see-through HMD (a transmissive display unit), output as voice from an output device such as a speaker provided on the holding device 41, or output to an external terminal. Furthermore, when a voice output device such as a speaker is provided on the holding device 41, speech synthesized based on the speech recognition may be output from it.

[Fifth Embodiment]
The speech recognition system, speech synthesis system, and methods according to the first to fourth embodiments described above can be realized by executing a program written in a predetermined computer language on a general-purpose computer such as a personal computer, or on an IC chip provided in a mobile phone or the like.

Such a communication control program can be recorded on a recording medium readable by a computer 115 as shown in FIG. 13 (a floppy (registered trademark) disk 116, a CD-ROM 117, a RAM 118, or a cassette tape 119). By installing the program via this recording medium through the computer 115, or directly into the memory or the like of the mobile phone body 30, the speech recognition system and the speech synthesis system described in the above embodiments can be realized.

FIG. 1 is a block diagram for explaining the basic configuration of the speech recognition system according to the first embodiment.
FIG. 2 is a flowchart for explaining the operation of the speech recognition system according to the first embodiment.
FIG. 3 is an explanatory diagram for explaining the operation of the speech recognition means according to the first embodiment.
FIG. 4 is an explanatory diagram for explaining the operation of the speech recognition means according to the first embodiment.
FIG. 5 is an explanatory diagram for explaining the operation of the hierarchical network in the speech recognition means according to the first embodiment.
FIG. 6 is an explanatory diagram for explaining the parameter extraction process in the first embodiment.
FIG. 7 is an explanatory diagram for explaining the parameter extraction process in the first embodiment.
FIG. 8 is a flowchart for explaining the learning process in the first embodiment.
FIG. 9 is a flowchart for explaining the operation of the speech synthesis system according to the second embodiment.
FIG. 10 is an explanatory diagram for explaining the operation of the speech synthesis system according to the second embodiment.
FIG. 11 is an explanatory diagram of the speech recognition and synthesis system according to the third embodiment.
FIG. 12 is an explanatory diagram of the speech recognition and synthesis system according to the fourth embodiment.
FIG. 13 is a perspective view of computer-readable recording media on which the speech recognition program and the speech synthesis program according to the fifth embodiment are recorded.

Explanation of symbols

10 ... Acoustic information processing unit
11 ... Acoustic signal acquisition means
12 ... Acoustic signal processing means
13 ... Myoelectric signal processing unit
14 ... Myoelectric signal acquisition means
15 ... Myoelectric signal processing means
16 ... Image information processing unit
17 ... Image information acquisition means
18 ... Image information processing means
19 ... Information comprehensive recognition unit
20 ... Speech recognition means
21 ... Recognition result presentation means
30 ... Mobile phone body
31 ... Wristwatch-type terminal
32 ... Speaker
41 ... Holding device
42 ... Fixing part

Claims

1. A speech synthesis system comprising:
speech recognition means for recognizing speech;
sound acquisition means for acquiring an acoustic signal;
means for acquiring a spectrum of the acoustic signal as a first spectrum from the acquired acoustic information;
means for generating, as a second spectrum, a spectrum of an acoustic signal reconstructed from a recognition result of the speech recognition means;
means for comparing the first spectrum with the second spectrum and generating a modified spectrum according to the comparison result; and
output means for outputting speech synthesized from the modified spectrum.

2. The speech synthesis system according to claim 1, wherein the output means includes communication means for transmitting the synthesized speech as data.

3. A speech synthesis method comprising:
a step (1) of recognizing speech;
a step (2) of acquiring an acoustic signal;
a step (3) of acquiring a spectrum of the acoustic signal as a first spectrum from the acquired acoustic information;
a step (4) of generating, as a second spectrum, a spectrum of an acoustic signal reconstructed from a recognition result of the speech recognition;
a step (5) of comparing the first spectrum with the second spectrum and generating a modified spectrum according to the comparison result; and
a step (6) of outputting speech synthesized from the modified spectrum.

4. The speech synthesis method according to claim 3, wherein the step (6) includes a step of transmitting the synthesized speech as data.

5. A speech synthesis program for causing a computer to execute a process comprising:
a step (1) of recognizing speech;
a step (2) of acquiring an acoustic signal;
a step (3) of acquiring a spectrum of the acoustic signal as a first spectrum from the acquired acoustic information;
a step (4) of generating, as a second spectrum, a spectrum of an acoustic signal reconstructed from a recognition result of the speech recognition;
a step (5) of comparing the first spectrum with the second spectrum and generating a modified spectrum according to the comparison result; and
a step (6) of outputting speech synthesized from the modified spectrum.

6. The speech synthesis program according to claim 5, wherein the step (6) includes a step of transmitting the synthesized speech as data.