JP4189744B2 - Voiceless communication system

Info

Publication number: JP4189744B2
Application number: JP2003271166A
Authority: JP
Inventors: 義章新倉
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2003-07-04
Filing date: 2003-07-04
Publication date: 2008-12-03
Anticipated expiration: 2023-07-04
Also published as: JP2005033568A

Description

The present invention relates to a voiceless communication system, and more particularly to a voiceless communication system that enables a telephone call without the user producing any voice.

As a conventional means for voiceless calls, there are known a speech processing device and a telephone device comprising a transmitter that recognizes speech from the lip shape and transmits a signal, and a receiver that receives the signal and decodes it into speech (see Patent Document 1, FIGS. 6 and 11, paragraph 0039).

FIG. 9 is a diagram showing the configuration of the device described in Patent Document 1. As shown in FIG. 9(a), the transmitter of this device comprises a lip shape recognition unit 1a that receives image data of the lip shape and outputs source data for speech synthesis corresponding to that image data, and a speech synthesis unit 2a, a speech encoding unit 3a, and a transmission unit 4a that respectively perform speech synthesis, encoding, and transmission based on the source data for speech synthesis. The aim is to enable a call, without speaking aloud, with a receiver side comprising a reception unit 5a and a speech decoding unit 6a. As shown in FIG. 9(b), the lip shape recognition unit 91 comprises an input unit 1b that photographs the shape around the lips and outputs image data, a memory unit 4b in which source data for speech synthesis corresponding to image data is stored as a database 5b, and a control unit 2b that outputs, to a data output unit 3b, the source data for speech synthesis corresponding to the image data from the input unit 1b.

Patent Document 1: Japanese Patent Laid-Open No. 10-240283

The conventional speech processing device and telephone device are said to be capable of generating the speech data that the user intends to utter from the lips and the like. However, Patent Document 1 merely discloses generating source data for speech synthesis corresponding to image data on the basis of a database; it does not disclose any specific, logical recognition process for correctly generating the speech the user intends.

(Purpose)
An object of the present invention is to provide a novel voiceless communication system that recognizes the shape of the user's mouth and outputs the correct, intended speech according to that shape.
Another object of the present invention is to provide a voiceless communication system that includes a confirmation operation by the user on the recognition of the mouth shape.
Still another object of the present invention is to provide a voiceless communication system that uses a mask for photographing the shape of the mouth.

To enable calls in situations where speaking aloud is difficult, the present invention provides a novel means of producing the correct utterance by identifying the shape of the user's mouth during speech and synthesizing voice from it. The voiceless communication system of the present invention comprises an image photographing unit that photographs the mouth, an image analysis unit that generates character data from the photographed image data, a character string conversion unit that generates character string data from the character data, a speech synthesis output unit that synthesizes voice data from the character string data, a voice output unit (for example, 12 in FIG. 1) that converts the voice data into voice, and a recognition result determination switch (for example, 13 in FIG. 1) that selects whether or not to output the character string data to an external output unit. The system further comprises a basic mouth shape image database that stores the relationship between the shape of the mouth in image data and character data, and the image analysis unit generates character data from the shape of the mouth in the photographed image by referring to the basic mouth shape image database.

Further, when the same character data has continued a given number of times in succession, the image analysis unit thins out the subsequent character data of that run before output to the character string conversion unit, and controls the character string conversion unit so that it outputs the character data received up to that point as character string data.

Furthermore, the system comprises a mask that covers the mouth and is used for photographing it, with at least an illumination unit that illuminates the mouth and the image photographing unit that photographs the mouth arranged inside the mask.

More specifically, the user's mouth, illuminated by the illumination unit 3, is photographed by the image photographing unit 1 at each time interval set by the timer unit 2, and the result is passed to the image analysis unit 4. The image analysis unit 4 refers to the basic mouth shape image database 5, determines which character the mouth shape most resembles when uttered, and outputs the result as character data to the character string conversion unit 6. The character string conversion unit 6 compares the character string data formed by arranging a plurality of character data with the contents of the vocabulary database 7, searches for a plurality of vocabulary entries close to that character string data, and outputs one or more of them, in descending order of degree of match, as candidate character string data to the speech synthesis output unit 10. The speech synthesis output unit 10 converts the character string data into voice data and outputs it to the earphone 12. The user judges from the audible sound of the earphone 12 whether the result is what was intended and operates the recognition result determination switch 13, which can signal recognition OK or recognition NG. In the case of recognition OK, the speech synthesis output unit 10 outputs the voice data to the external output unit 11; otherwise, the next candidate is output from the earphone 12. Repeating this process realizes the utterance (FIG. 1).

According to the present invention, speech synthesis is achieved by sequentially converting the mouth shape into character data, the character data into character string data, and the character string data into voice data. The user does not need to utter any sound at all and can determine the utterance content through mouth movement and hand operation, so the user can communicate with the other party by outputting the correct intended content to a speaker, the microphone terminal of a telephone, or the like. This is extremely effective in environments where speaking aloud is not possible, such as on a train, and in cases where the user cannot produce a voice for physical reasons.

In addition, by using a switch that controls the timing of the start of recognition in relation to the mouth movement and a switch for judging the recognition result, the correct voice data intended by the user can be synthesized reliably. Furthermore, a way of moving the mouth that makes the same character data continue a given number of times, such as pausing the utterance for a fixed time, can be treated as a character string delimiter that starts character string conversion. With this configuration, the character string data up to that point is generated, converted into voice data, and uttered automatically without any switch operation.

Moreover, by using a mask that has the illumination unit and the image photographing unit on its inner side and that covers the user's mouth, the system is not affected by the brightness of the place of use, and the user can utter the intended content without people nearby perceiving what is being said.

(Description of configuration)
FIG. 1 is a diagram showing an embodiment of the voiceless communication system of the present invention. The voiceless communication system of this embodiment comprises: an image photographing unit 1 that photographs the area around the user's mouth (the mouth portion) and outputs image data (an image); a timer unit 2 that outputs a signal causing the image photographing unit 1 to photograph at fixed time intervals; an illumination unit 3 for brightly illuminating the user's mouth portion; an image analysis unit 4 that analyzes the image photographed by the image photographing unit 1; a basic mouth shape image database 5 that records images of the mouth shape the user makes when uttering specific characters, that is, the relationship between each specific character and the image of the user's mouth shape corresponding to it; a character string conversion unit 6 that outputs the character string inferred from the results of the image analysis unit 4; a vocabulary database 7 holding the vocabulary data used when the character string conversion unit 6 generates a character string; a recognition start notification switch 9 with which the user notifies the character string conversion unit 6 of the timing for converting characters into a character string; a speech synthesis output unit 10 that outputs the character string as voice; an external output unit 11 for outputting the voice to the outside; an earphone 12 through which the user listens to the output of the speech synthesis output unit 10; and a recognition result determination switch 13 with which the user, having judged from the output of the speech synthesis output unit 10 whether the recognition result is the intended character string, notifies the character string conversion unit 6.

The configuration and function of each unit are described below.
The image photographing unit 1 is a photographing device that captures the shape of the user's mouth as a still image. The timer unit 2 is a timer with a settable time interval that outputs a timing signal for photographing the shape of the user's mouth at each fixed interval. The timer unit 2 sends a photographing instruction signal to the image photographing unit 1 at each interval, and the image photographing unit 1 photographs the user's mouth in response to that signal. The illumination unit 3 brightly illuminates the user's mouth, the subject photographed by the image photographing unit 1, so that a clear image can be captured.

The image analysis unit 4 analyzes the image photographed by the image photographing unit 1, identifies from the mouth shape in the image the sound that is closest to it when the user normally speaks, and outputs the character data (character) corresponding to that sound. The basic mouth shape image database 5 is a database of the relationship between the sounds the user normally utters and the image data of the mouth shape at those times, and is referred to when the image analysis unit 4 identifies a sound from the mouth shape in an image.

The character string conversion unit 6 collects the characters output from the image analysis unit 4, joins the characters the user is uttering, and identifies candidate character strings for output by referring to the vocabulary database 7. The vocabulary database 7 is the set of vocabulary entries referred to when the character string conversion unit 6 identifies the character string the user intends to utter; the character string conversion unit 6 searches the vocabulary database 7 for that character string and thereby identifies it.

The recognition start notification switch 9 is a switch operated manually by the user. When the user presses it, the character string conversion unit 6 is notified to convert the characters output so far from the image analysis unit 4 into a character string.

The speech synthesis output unit 10 has a function of outputting the character string identified by the character string conversion unit 6 as voice from the external output unit 11 or the earphone 12. The earphone 12 presents the output of the speech synthesis output unit 10 to the user as voice so that the user can judge whether or not to send that voice output to the external output unit 11. The external output unit 11 has a function of outputting the voice output to the voice input section of a mobile phone or the like.

The recognition result determination switch 13 is operated by the user, who judges from the voice output by the earphone 12 whether the character string identified by the character string conversion unit 6 is correct. With it the user selects whether to send that voice output to the external output unit 11 or to discard the character string and instruct the character string conversion unit 6 to identify a new character string, namely the next candidate. When the instruction from the recognition result determination switch 13 is recognition OK, the voice output is switched to the external output unit 11. When the instruction is recognition NG, the voice output is discarded.

FIG. 2 is a diagram showing the image photographing means of this embodiment. It consists of a mask part A1 that can be worn over the user's mouth, an earphone part A2 that can be worn on the user's ear, a switch part A4 that can be held and operated by hand, and an external output part A3 that can be connected to an external device or the like. In this example, the mask part A1 can contain either the image photographing unit 1 of FIG. 1 alone, or the units from the image photographing unit 1 through the speech synthesis output unit 10, and the switch part A4 contains the recognition start notification switch 9 and the recognition result determination switch 13 of FIG. 1. The user wears the mask part A1 over the mouth and the earphone part A2 on the ear, moves the mouth without producing a voice, checks the voice of the recognized character string through the earphone part A2, and operates the switch part A4 so that only correctly recognized results are output from the speech synthesis output to the external output unit 11.

(Description of operation)
Next, the operation of this embodiment is described in detail with reference to FIGS. 1, 2, 3, 4, 5, and 6.
FIG. 3 is an operation flowchart of this embodiment. The user moves only the mouth, without producing a voice, as if uttering arbitrary characters (step B1). Meanwhile, each time a fixed time interval (1 second) elapses, the timer unit 2 sends a photographing instruction signal to the image photographing unit 1. Each time it receives this signal, the image photographing unit 1 photographs the user's mouth (step B3) and transmits the photographed image to the image analysis unit 4 (step B4). Because the photographing interval is set to 1 second, which is longer than the interval between words in normal conversation, the user needs to move the mouth so as to utter the characters one at a time in step with that interval.
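The timer-driven capture of FIG. 3 can be illustrated with a minimal Python sketch. The names capture_loop, capture, and analyze are hypothetical and do not appear in the specification; the sleep function is passed in as a parameter only so that the loop can be exercised without real waiting.

```python
import time
from typing import Callable, List


def capture_loop(
    capture: Callable[[], object],
    analyze: Callable[[object], str],
    interval_s: float = 1.0,
    n_frames: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Photograph the mouth once per interval and collect one character per image.

    capture stands in for the image photographing unit 1 and analyze for the
    image analysis unit 4; both are assumptions made for this sketch.
    """
    characters = []
    for _ in range(n_frames):
        sleep(interval_s)                   # timer unit 2: wait for the next tick
        image = capture()                   # step B3: photograph the mouth
        characters.append(analyze(image))   # step B4: hand the image to analysis
    return characters
```

With interval_s set to 1.0 the loop matches the 1-second spacing of the first embodiment; a smaller value corresponds to the n-second interval of the second embodiment.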

FIG. 4 is an operation flowchart of the image analysis unit 4. The image analysis unit 4 receives the image photographed by the image photographing unit 1 (step C1), retrieves one item of image data (an image) from the basic mouth shape image database 5 (step C2), compares the two, and quantifies their degree of match (step C3). It repeats this for every image in the basic mouth shape image database 5 (step C4), determines the sound corresponding to the mouth shape with the highest degree of similarity, and transmits the corresponding character data (character) to the character string conversion unit 6 (step C5).
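Steps C1 to C5 amount to a best-match search over the database. The following Python sketch shows one possible form. The specification does not state how the degree of match is computed, so the score used here (the reciprocal of one plus the mean absolute pixel difference) is an assumption, as are the names match_score and recognize_character and the representation of an image as a flat list of numbers.

```python
from typing import Dict, Optional, Sequence


def match_score(image: Sequence[float], template: Sequence[float]) -> float:
    """Return a degree of match in (0, 1]; 1.0 means the images are identical.

    Assumed metric: 1 / (1 + mean absolute difference). The specification only
    says that the degree of match is quantified (step C3).
    """
    if len(image) != len(template):
        raise ValueError("image and template must have the same size")
    mean_diff = sum(abs(p - q) for p, q in zip(image, template)) / len(image)
    return 1.0 / (1.0 + mean_diff)


def recognize_character(
    image: Sequence[float], database: Dict[str, Sequence[float]]
) -> Optional[str]:
    """Return the character whose stored mouth-shape image best matches image."""
    best_char, best_score = None, -1.0
    for char, template in database.items():   # steps C2 and C4: visit every entry
        score = match_score(image, template)  # step C3: quantify the match
        if score > best_score:
            best_char, best_score = char, score
    return best_char                          # step C5: character to pass on
```

A practical system would compare extracted mouth features rather than raw pixels; the linear scan over all entries is the part that corresponds to the flowchart.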

FIG. 5 is an operation flowchart of the character string conversion unit 6, and FIG. 6 is a diagram showing the data structure of the vocabulary database. The character string conversion unit 6 gathers the character data received one character at a time and converts it into character string data (a character string). Until the user presses the recognition start notification switch 9, the character string conversion unit 6 joins the characters received from the image analysis unit 4 in the order received to form a character string (steps D1 and D2). When the recognition start notification switch 9 is pressed, the character string conversion unit 6 reads one entry from the vocabulary database 7 (step D3). As shown in FIG. 6, each entry consists of sound data, made up of the vowels, "n" (ん), and the like that can be read from the mouth shape, and the output word data paired with that sound data. The character string conversion unit 6 compares the sound data with the character string built from the characters received from the image analysis unit 4, quantifies the degree of match (step D4), and holds the value paired with the output word data. It does this for every entry in the vocabulary database 7 (step D5); on completion it sorts the output word data in descending order of degree of match and outputs the output word data with the highest degree of match to the speech synthesis output unit 10.
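The vocabulary lookup of steps D3 to D5 can be sketched in Python as a scored ranking. The specification does not define the degree of match between the recognized string and the sound data, so this sketch substitutes the similarity ratio of difflib.SequenceMatcher from the Python standard library. The function name rank_candidates and the romanized sample entries are hypothetical and are not taken from FIG. 6.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

# Hypothetical vocabulary entries: (sound data, output word data).
VOCABULARY: List[Tuple[str, str]] = [
    ("oaou", "ohayou"),
    ("oniia", "konnichiwa"),
    ("aiaou", "arigatou"),
]


def rank_candidates(recognized: str, vocabulary: List[Tuple[str, str]]) -> List[str]:
    """Return the output words ordered from best to worst match."""
    scored = []
    for sound, word in vocabulary:                              # steps D3 and D5
        score = SequenceMatcher(None, recognized, sound).ratio()  # step D4
        scored.append((score, word))
    # Sort by descending degree of match; ties keep vocabulary order.
    scored.sort(key=lambda pair: -pair[0])
    return [word for _, word in scored]
```

The first element of the returned list corresponds to the candidate sent to the speech synthesis output unit 10; the remaining elements are the later candidates used when the user signals recognition NG.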

The speech synthesis output unit 10 outputs the received output word data to the earphone 12. The user listens to it and, if it is the intended result, presses the recognition OK button of the recognition result determination switch 13. The recognition OK notification is sent to the speech synthesis output unit 10, which then outputs that output word data to the external output unit 11. If the result is not what the user intended, the user presses the recognition NG button of the recognition result determination switch 13. The recognition NG notification is sent to the character string conversion unit 6, which in response outputs the output word data with the next highest degree of match to the speech synthesis output unit 10. The speech synthesis output unit 10 again outputs that output word data to the earphone 12, the user judges it, and the same processing is repeated. Through this processing the user can correctly output the intended content without producing a voice.
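The OK/NG confirmation cycle described above can be sketched as a simple loop over the ranked candidates. The names confirm_candidates, play_to_earphone, and user_says_ok are hypothetical stand-ins for the earphone 12 and the recognition result determination switch 13. The specification does not say what happens when every candidate is rejected, so returning None in that case is an assumption.

```python
from typing import Callable, Iterable, Optional


def confirm_candidates(
    candidates: Iterable[str],
    play_to_earphone: Callable[[str], None],
    user_says_ok: Callable[[str], bool],
) -> Optional[str]:
    """Play candidates in order until the user accepts one.

    Returns the accepted word, which would then go to the external output
    unit 11, or None if the user rejects every candidate (assumed behavior).
    """
    for word in candidates:
        play_to_earphone(word)   # candidate is heard only by the user
        if user_says_ok(word):   # recognition OK pressed
            return word
        # recognition NG pressed: discard and try the next candidate
    return None
```

Only the accepted word leaves the loop, which reflects the point that rejected candidates are never sent to the other party.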

(Other embodiments)
Next, a second embodiment of the present invention is described in detail with reference to the drawings.
FIG. 7 is a diagram showing the second embodiment of the present invention. The second embodiment enables image recognition and speech synthesis from mouth movements close to those of ordinary conversation, and also eliminates the need for the user to signal the timing of character conversion.

In this embodiment, the recognition start notification switch 9 shown in FIG. 1 is unnecessary, and an image analysis unit that analyzes images continuously (hereinafter, "continuous image analysis unit") 14 is provided. The timer unit 15 sends a signal to the image photographing unit 1 every n seconds, an interval shorter than 1 second; the image photographing unit 1 photographs the user's mouth shape every n seconds and transmits the images continuously to the continuous image analysis unit 14. The continuous image analysis unit 14 is configured to also notify the character string conversion unit 6 of the start of conversion, which is what makes the recognition start notification switch 9 of FIG. 1 unnecessary.

For this purpose, the continuous image analysis unit 14 has, in addition to the function of continuously comparing the images photographed by the image photographing unit 1 with the images in the basic mouth shape image database 5, a function of thinning out unnecessary image information. Because images are photographed at intervals shorter than 1 second, the same mouth shape may be photographed more than once, and the duplicated subsequent images must be excluded. For example, when the same character is recognized three or more times in succession, the unit recognizes this as a break in the user's utterance, thins out the subsequent characters from the run of identical characters (for example, all but the first), outputs the leading character data to the character string conversion unit 6, and notifies the character string conversion unit 6 of the start of recognition.
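The thinning of repeated characters can be illustrated with a short Python sketch built on itertools.groupby. The function name thin_runs is hypothetical. The threshold of three follows the example in the text; leaving shorter runs untouched is an assumption, since the text only describes what happens once the threshold is reached.

```python
from itertools import groupby
from typing import Iterable, List


def thin_runs(characters: Iterable[str], threshold: int = 3) -> List[str]:
    """Collapse each run of threshold or more identical characters to one.

    Runs shorter than threshold are kept as they are (assumed behavior).
    """
    thinned: List[str] = []
    for char, run in groupby(characters):
        length = len(list(run))
        if length >= threshold:
            thinned.append(char)             # keep only the first of the run
        else:
            thinned.extend([char] * length)  # short run: keep every character
    return thinned
```

For example, thin_runs("oaaaou") yields the characters o, a, o, u, because the run of three a characters is reduced to its first element.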

FIG. 8 is an operation flowchart of the second embodiment. The operation of the second embodiment is described below with reference to FIGS. 7 and 8.
The second embodiment is the same as the first embodiment except that the photographing interval signaled by the timer unit 2 is n (< 1) seconds, that the images photographed by the image photographing unit 1 are input to the continuous image analysis unit 14, which recognizes character data and passes it to the character string conversion unit 6, and that the continuous image analysis unit 14 notifies the character string conversion unit 6 of the start of recognition, such as the identification of character string data.

Referring to FIG. 8, in the second embodiment the character string stored for internal processing in the continuous image analysis unit 14 is first initialized (step E1), after which an image photographed by the image photographing unit 1 is sent to the continuous image analysis unit 14 (step E2). As with the image analysis unit 4 of the first embodiment, the continuous image analysis unit 14 reads one image from the basic mouth shape image database 5 (step E3), compares it with the received image, and quantifies the degree of match (step E4). It does this for every entry in the basic mouth shape image database 5 (step E5). When all entries have been checked, it appends the character corresponding to the entry with the highest degree of match to the end of the stored internal character string (step E6) and outputs the character data to the character string conversion unit 6. Unless the same character has been appended three or more times in succession, processing repeats from step E2 and the internal character string continues to grow (step E7, NO).
When the same character has been appended three or more times in succession at step E7 (step E7, YES), this is recognized as a break in the user's utterance; all but the first of the run of identical characters are thinned out (step E8), the leading character data is output to the character string conversion unit 6, and the character string conversion unit 6 is notified of the start of recognition (step E9).
The subsequent operation of the character string conversion unit 6, performed on the basis of the character data and the recognition start notification produced by the continuous image analysis unit 14 as described above, is the same as in the first embodiment.
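Steps E1 to E9 can be read as a small streaming state machine. The following Python sketch is one possible reading under stated assumptions: the per-image matching of steps E2 to E5 is assumed to have already produced a stream of recognized characters, the generator name segment_stream is hypothetical, and resetting the internal string after each notification is an assumption because FIG. 8 is not reproduced in the text. The sketch also simplifies step E6 by emitting the whole string at the delimiter rather than forwarding each character as it is recognized.

```python
from typing import Iterable, Iterator, List


def segment_stream(characters: Iterable[str], threshold: int = 3) -> Iterator[str]:
    """Yield one character string each time an utterance break is detected.

    A break is threshold identical characters in a row (step E7). The run is
    thinned to its first character (step E8) and the string so far is emitted,
    which stands in for the recognition start notification (step E9).
    """
    buffer: List[str] = []                    # step E1: initialize internal string
    for char in characters:                   # steps E2 to E5 assumed done upstream
        buffer.append(char)                   # step E6: append recognized character
        tail = buffer[-threshold:]
        if len(tail) == threshold and len(set(tail)) == 1:  # step E7: YES
            del buffer[-(threshold - 1):]     # step E8: keep first of the run
            yield "".join(buffer)             # step E9: hand over, start recognition
            buffer = []                       # assumed: start a new string
```

With this reading, the stream "oaounnnaiaouxxx" is split into "oaoun" and "aiaoux". Note that a mouth held in one shape for a long time would keep producing single-character strings under this sketch, so a real implementation would likely need an additional rule for prolonged pauses.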

As described above, in the first embodiment the user must move the mouth in 1-second steps, so the mouth movements corresponding to speech cannot be made smoothly. In this embodiment, by setting the notification interval n of the timer unit 2 to a value smaller than 1 second, the user can move the mouth without leaving a gap between individual characters. The start of recognition can also be triggered by ordinary conversational gestures, such as holding the mouth still or closing it, which cause a specific character to be output repeatedly. Because the user does not need to signal each character string by operating a switch, a smoother voiceless call is possible.

FIG. 1 is a diagram showing the first embodiment of the voiceless communication system of the present invention.
FIG. 2 is a diagram showing the image photographing means of the first embodiment.
FIG. 3 is an operation flowchart of the first embodiment.
FIG. 4 is an operation flowchart of the image analysis unit 4.
FIG. 5 is an operation flowchart of the character string conversion unit 6.
FIG. 6 is a diagram showing the data structure of the vocabulary database.
FIG. 7 is a diagram showing the second embodiment of the present invention.
FIG. 8 is an operation flowchart of the second embodiment.
FIG. 9 is a diagram showing the prior art.

Explanation of symbols

1 Image photographing unit
2 Timer unit
3 Illumination unit
4 Image analysis unit
5 Basic mouth shape image database
6 Character string conversion unit
7 Vocabulary database
9 Recognition start notification switch
10 Speech synthesis output unit
11 External output unit
12 Earphone
13 Recognition result determination switch
14 Continuous image analysis unit

Claims

1. A voiceless communication system comprising: an image photographing unit that photographs the mouth; an image analysis unit that generates character data from the photographed image data; a character string conversion unit that generates character string data from the character data; a speech synthesis output unit that synthesizes voice data from the character string data; a voice output unit that converts the voice data into voice; and a recognition result determination switch that selects whether or not to output the character string data to an external output unit, wherein, when the same character data has continued a given number of times, the image analysis unit treats this as a character string delimiter and thins out the subsequent character data of the continuous character data before output to the character string conversion unit, and the character string conversion unit converts the character data input from the image analysis unit up to that time into character string data and outputs it.

2. The voiceless communication system according to claim 1, further comprising a basic mouth shape image database that stores the relationship between the shape of the mouth in image data and character data, wherein the image analysis unit generates character data from the shape of the mouth in the photographed image by referring to the basic mouth shape image database.

3. The voiceless communication system according to claim 1 or 2, further comprising a mask that covers the mouth and is used for photographing the mouth, wherein at least an illumination unit that illuminates the mouth and the image photographing unit that photographs the mouth are arranged inside the mask.