JP4845183B2

JP4845183B2 - Remote dialogue method and apparatus

Info

Publication number: JP4845183B2
Application number: JP2005336002A
Authority: JP
Inventors: 淳善本
Original assignee: National Institute of Information and Communications Technology
Current assignee: National Institute of Information and Communications Technology
Priority date: 2005-11-21
Filing date: 2005-11-21
Publication date: 2011-12-28
Anticipated expiration: 2025-11-21
Also published as: JP2007142957A

Description

The present invention relates to a method and an apparatus for conducting remote dialogue by transmitting and receiving audio and video between terminals connected via a network.

Systems that transmit and receive video together with audio for remote dialogue, such as VC (video chat) and videophones, are becoming widespread.
A typical present-day set of VC equipment can recognize the position of the subject's face and track roughly whether the eyes are open or closed. However, because of limits on pixel count, accuracy, and PC computing power, it has not in practice reached a level at which the direction of the subject's gaze can be tracked.

In VC, both parties basically share a bust-up (head-and-shoulders) image during the dialogue. Consequently, neither party can see how the other's hands are operating the keyboard or mouse. In addition, users often look down while operating such equipment, so the dialogue stalls and an awkward lull can result.
If both parties could share the fact that a PC function is being invoked, and what is being invoked, this would help the dialogue proceed smoothly and effectively. However, no prior art actively makes this possible.

Related prior art includes the following document.
Patent Document 1 relates to a video conference system, and is characterized by comprising a display panel that displays received data in RGB and a circuit that superimposes received video information on the display panel.
Japanese Patent Application Laid-Open No. 2002-271763, “Video Conference System”

This is used in equipment for presentations and the like: the presenter's speech is continuously subjected to speech recognition, specific keywords are accepted as commands, and a specific operation is executed in response to each command.
For example, at a point where the presenter wants to change scenes, the presenter utters “Now” and, within a predetermined timing window, makes a specific gesture toward the camera. The utterance “Now” acts as a trigger that activates the image recognition means, and the gesture command instructing superimposition is recognized from the received images within the predetermined timing window.

With this method, speech recognition must run at all times, so the speech-analysis device inevitably analyzes every utterance of the presenter. If the device is installed away from the presenter, its size, power consumption, heat, and noise matter little; in general, however, it is installed beside the presenter, where a small, power-saving device with little heat and no noise is preferable. Achieving that requires measures such as reducing and simplifying the total amount of computation the device performs.
Speech recognition also requires analyzing which vowels and consonants have just been uttered. If the recognition algorithm is simplified to lighten the load, the all-important commands become more likely to be misrecognized.
Furthermore, when the system relies on speech recognition, words spoken casually and without thought may be mistaken for commands, so the user must always take care to pronounce words correctly. The approach therefore suits formal speech such as presentations but not informal chat.
For these reasons, neither continuous speech recognition nor the use of speech as a trigger can be called well suited to VC.

When computation must run continuously, it can be executed with a very light load depending on the algorithm. There are several approaches, for example reducing the amount of information processed per unit time, or building the main computation from simple additions and subtractions.
For speech recognition, the amount of information processed per unit time can be reduced by lowering the sampling frequency or the bit depth. With this approach the utterance itself becomes somewhat indistinct, and even human listeners mishear more often. This is tolerable if attention is limited to changes in loudness or changes in pitch (frequency), but identifying and extracting the content of the utterance (words and so on) becomes difficult. Word recognition is also unlikely to be achievable with simple additions and subtractions alone.
For motion recognition, the amount of information processed per unit time can be reduced by lowering the number of frames per second and the frame size, discarding color information and keeping only luminance, and reducing the bit depth of that luminance information. Fine motions are lost with this approach, but for large motions or easily predicted motions the data can be cut down considerably. Moreover, motion recognition is in many cases based on simple addition and subtraction between successive frames.
Motion recognition can therefore be executed with a very light load, depending on the algorithm.
Using as the trigger a special motion that is unlikely to occur unless the person intends it has the further advantage of suppressing misrecognition.
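The data reduction and frame differencing described above can be sketched as follows. This is only an illustration under stated assumptions, not the method of the patent: frames are taken to be lists of rows of 8-bit luminance values, and the names `reduce_frame`, `motion_score`, `has_large_motion`, the subsampling step, the bit depth, and the threshold are all hypothetical.

```python
from typing import List

Frame = List[List[int]]  # rows of 8-bit luminance values (0-255); color already discarded


def reduce_frame(frame: Frame, step: int = 4, bits: int = 4) -> Frame:
    """Shrink the frame by keeping every `step`-th pixel and only the top `bits` bits."""
    shift = 8 - bits
    return [[pixel >> shift for pixel in row[::step]] for row in frame[::step]]


def motion_score(prev: Frame, curr: Frame) -> int:
    """Sum of absolute differences between two frames (subtractions and additions only)."""
    return sum(
        abs(a - b)
        for row_prev, row_curr in zip(prev, curr)
        for a, b in zip(row_prev, row_curr)
    )


def has_large_motion(prev: Frame, curr: Frame, threshold: int = 8) -> bool:
    """Report motion only when the reduced frames differ by more than `threshold`."""
    return motion_score(reduce_frame(prev), reduce_frame(curr)) > threshold


still = [[100] * 16 for _ in range(16)]
noisy = [[103] * 16 for _ in range(16)]  # small sensor noise
moved = [[200 if r < 8 and c < 8 else 100 for c in range(16)] for r in range(16)]

print(has_large_motion(still, noisy))  # fine changes vanish after bit reduction
print(has_large_motion(still, moved))  # a large change survives the reduction
```

In this toy setting the small brightness change disappears once the low-order bits are dropped, while the large block change still produces a clear difference, which mirrors the trade-off the text describes.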

Patent Document 2 is prior art that uses a motion as a trigger.
Japanese Patent No. 3160108, “Driving support system”

This relates to a system that supports the driving of an automobile or the like. It is characterized by recognizing the object the driver is gazing at from various kinds of information about the outside of the vehicle and the driver's line of sight detected by a line-of-sight detection circuit, and outputting the desired outside-vehicle information to a CRT.
In presenting the situation outside the vehicle to the driver to contribute to safe driving, a specific motion of the driver is used as the trigger for presenting information.
Driving an automobile constrains the driver's motion to some degree, which suits continuous acquisition of line-of-sight information, but applying this directly to VC is difficult. Line-of-sight extraction, an existing technique, is possible in principle but still expensive, and in general the more strictly the line of sight is tracked, the more invasive the equipment tends to be for the user. It is hard to believe that ordinary VC users would readily accept this, so further ingenuity is needed.

Non-Patent Document 1 is prior art that, in VC, uses speech produced in an unusual manner as a trigger.
Proceedings of the European Conference on Speech Communication and Technology (Eurospeech 2003), 1201-1204, Sep. 2003

This uses the pitch of the voice as the trigger: utterances spoken normally are left as they are, and utterances spoken at an intentionally high pitch are treated as commands.
A speech recognition device runs on the PC; utterances in a normal voice are recognized and converted to text. When there is an utterance such as “save” in an intentionally high voice, “save” is recognized as a command and forwarded to the appropriate software. In this case, the data converted to text up to the utterance of “save” is saved on the PC.

However, in a dialogue between two or more people, as in VC, it is difficult to control voice qualities such as pitch intentionally. In a two-party dialogue, for example, a speaker's voice may crack in excitement, or emotions such as laughter may be expressed, and these may be mistaken for commands.
The technique is therefore effective for a person speaking alone in a quiet room, but unsuited to dialogue.
Moreover, because speech recognition runs constantly and speech is used as the trigger, the user is in effect under a mental constraint to avoid the trigger words, so the technique can hardly be called suitable for VC, which consists mainly of free dialogue.

Accordingly, an object of the present invention is to provide a remote dialogue method that places only a light load on the apparatus yet recognizes commands accurately, that allows commands to be shared with the dialogue partner so as to suit VC, and that permits natural dialogue, together with an apparatus for carrying out the method.

To solve the above problem, the remote dialogue method of the present invention is used in a system that comprises terminals each having at least audio and image input/output means and recognition means, communication means, and a computer, and a network through which data is exchanged among the plurality of terminals, and in which remote dialogue is conducted by transmitting and receiving audio and video between the terminals. The method is characterized in that, when a predetermined specific action of the subject imaged by the image input means is recognized by the image recognition means, recognition of that specific action is used as a trigger to activate command voice recognition means within the voice recognition means, and the command voice recognition means recognizes a command from the audio picked up by the audio input means.

Here, the voice that the command voice recognition means recognizes as a command may be a predetermined keyword.

The specific action may be an action that continues over time, and the audio picked up between its start and its end may be regarded as the command.

The command may be output to software on the terminal, and that software may execute the command.

The command may be transmitted to the dialogue partner's terminal and executed by software on that terminal, which contributes to the sharing of information.

The specific action may be an action of a special kind that is unlikely to occur unless the subject intends it, which contributes to suppressing misrecognition.

The content of the command for the same voice command may be varied according to the difference in the specific action.

A wink is useful as the specific action.

A right wink may be the specific action that specifies execution of the subsequent command on the user's own terminal, and a left wink may be the specific action that specifies execution of the subsequent command on the dialogue partner's terminal.

Rotating the head until one eye is out of the range imaged by the image input means is also useful as the specific action.

Placing a finger on a predetermined part of the head is also useful as the specific action.

Furthermore, covering a specific eye with the palm can usefully substitute for a wink. Raising a specific shoulder, showing a specific fingernail, or covering a specific nostril is also useful. For greater precision, it is also useful to use as the specific action the showing, hiding, or moving of a worn item, or part of one, that incorporates a retroreflective member for easy optical identification: for example a headset (a microphone and speaker worn on the head), earphones, glasses, a ring, a bracelet, a necklace, earrings, pierced jewelry for the ear or elsewhere, or a decorated fingernail.
Thus, usable specific actions include showing a predetermined body part or worn item that was not visible on the partner's terminal, and hiding a predetermined body part or worn item that was visible on the partner's terminal.

The remote dialogue apparatus of the present invention is used in a system that comprises terminals each having at least audio and image input/output means and recognition means, communication means, and a computer, and a network through which data is exchanged among the plurality of terminals, and in which remote dialogue is conducted by transmitting and receiving audio and video between the terminals. The apparatus is characterized by comprising command voice recognition means that is activated, when a predetermined specific action of the subject imaged by the image input means is recognized by the image recognition means, with that recognition as a trigger, and that recognizes a command from the audio picked up by the audio input means.

The audio input/output means may be a microphone and speaker worn on the head, with a light-emitting member attached at a position that is near the ear when worn.

It is also useful for the audio input/output means to be a microphone and speaker worn on the shoulders, with light-emitting members attached at the positions of both shoulders when worn.

According to the present invention, a specific action serves as the trigger for command input, so the load on the apparatus is light and misrecognition of commands is suppressed. In addition, commands can be shared with the dialogue partner in a natural form, which contributes to the smooth progress of the dialogue.

Embodiments of the present invention are described below with reference to the drawings.
FIG. 1 is an explanatory diagram showing an overview of a VC system.
The PC runs various applications for speech recognition, image recognition, communication, and so on; input devices such as a mouse and keyboard and output devices such as a monitor are connected to it, and it is provided with VC equipment such as a microphone, speaker, and video camera. Each such terminal is connected to other terminals via a communication line such as the Internet, enabling real-time VC.

The present invention is based on using, as the trigger for command input, a predetermined action of a special kind that does not ordinarily occur unless intended. Compared with Non-Patent Document 1, this corresponds to entering the command input mode by a specific action instead of by a change in voice pitch.
FIG. 2 is an explanatory diagram showing the main part of a conventional VC system according to Non-Patent Document 1.
A similar device is at the other end of the communication line, and VC is carried out between the two parties. The communication means includes a network card, networking software, and the like. The VC software exchanges the video and audio of the user and the partner through the communication means.
The “other software” in the figure represents, specifically, a browser being used to search over the network, for example.
In contrast, FIG. 3 is an explanatory diagram showing the main part of the VC system of the present invention.
In the configuration of the present invention, the video and audio used by the VC software are put to use. The configuration also exchanges data directly with the communication means and the other software.
Thus, in Non-Patent Document 1, an utterance with changed pitch, for example “save” spoken in a high voice, simultaneously performs three things: entering the command input mode (the high voice), giving the command (“save”), and leaving the command input mode (the end of the high voice). In the present invention, by contrast, the action (or its continuation) represents the on/off state of the command mode, and the utterance is the command.

FIG. 4 is a flowchart showing an overview of the system for recognizing a command voice in the present invention.
With an image input device such as a camera and the image recognition software running, the subject, that is, the dialogue participant facing the camera, is imaged. When the image recognition software recognizes a predetermined specific action of the subject, that recognition is used as a trigger to activate an audio input device such as a microphone and the speech recognition software. The command voice recognition software then recognizes the command voice in the audio picked up by the audio input device, and a command is generated from the command voice and output.
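The flow of FIG. 4 can be sketched as a small event loop. This is an illustrative sketch only: the event tuples, the function name `run_dialogue_loop`, and the injected `is_trigger` and `recognize_command` callables are hypothetical stand-ins for the image recognition and command voice recognition software, and the choice to disarm after one utterance is an assumption, since the text does not specify when the recognizer stops.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# ("frame", action label seen in the image) or ("audio", one picked-up utterance)
Event = Tuple[str, str]


def run_dialogue_loop(
    events: Iterable[Event],
    is_trigger: Callable[[str], bool],
    recognize_command: Callable[[str], Optional[str]],
) -> List[str]:
    """Wait for the specific action, then pass the next utterance to the recognizer."""
    commands: List[str] = []
    armed = False  # True once the specific action has been recognized
    for kind, payload in events:
        if kind == "frame":
            if is_trigger(payload):
                armed = True
        elif kind == "audio" and armed:
            command = recognize_command(payload)  # only runs while armed
            if command is not None:
                commands.append(command)
            armed = False  # assumption: one utterance per trigger
    return commands


analyzed: List[str] = []  # records which utterances reached the recognizer


def keyword_recognizer(utterance: str) -> Optional[str]:
    analyzed.append(utterance)
    return utterance if utterance in {"save", "search"} else None


events = [
    ("audio", "save"),   # ordinary speech before the trigger: never analyzed
    ("frame", ""),
    ("frame", "wink"),   # the specific action
    ("audio", "save"),   # analyzed and accepted as a command
    ("audio", "search"), # ignored again, the recognizer is no longer armed
]
print(run_dialogue_loop(events, lambda action: action == "wink", keyword_recognizer))
print(analyzed)
```

The `analyzed` list makes the load argument visible: utterances outside the triggered window never reach the recognizer at all.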

Similarly, FIG. 5 is a flowchart showing a system overview of another embodiment.
In the example of FIG. 4, recognition of the specific action is used only as a trigger; in this example it is also reflected in command generation.
That is, a difference in the specific action, such as the difference between a right-eye wink and a left-eye wink, is combined with the specific voice, and a command with different content is generated for each combination.
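One simple way to picture the combination of action and voice in FIG. 5 is a lookup table keyed by both. The sketch below is hypothetical: the action labels, the keyword, and the resulting command names are invented for illustration, loosely following the “desktop” and “Internet” meanings mentioned elsewhere in the description.

```python
from typing import Dict, Optional, Tuple

# (recognized specific action, recognized keyword) -> generated command.
# All entries are illustrative assumptions, not values from the patent.
COMMAND_TABLE: Dict[Tuple[str, str], str] = {
    ("right_wink", "search"): "search_desktop",
    ("left_wink", "search"): "search_internet",
}


def generate_command(action: str, keyword: str) -> Optional[str]:
    """Return the command for this action and keyword pair, or None if undefined."""
    return COMMAND_TABLE.get((action, keyword))


print(generate_command("right_wink", "search"))
print(generate_command("left_wink", "search"))
print(generate_command("nod", "search"))
```

The same spoken keyword yields a different command depending on which action preceded it, and an unregistered combination yields no command.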

FIG. 6 is also a flowchart showing a system overview of another embodiment.
In this example, both the partner and the user are recognized. A program using sequential interrupts performs recognition interrupt processing for the partner and for the user, so that the actions of both are processed in parallel.

FIG. 7 is a flowchart showing an overview of a prior-art system.
It differs from the present invention in that the audio input device is running from the start.
Whereas much of the prior art loops waiting for a distinct voice, the present invention loops waiting for a coarse motion, which reduces the computational load.

Similarly, FIG. 8 is a flowchart showing an overview of a prior-art system.
Whereas the example of FIG. 7 uses recognition of a specific voice as the trigger, this example uses recognition of a specific change in the voice.

In either case, in the present invention the process moves to the step of outputting a command only on the combination of a specific action that is unlikely to occur unless intended and a spoken command such as a clear, predetermined keyword, so commands can be relayed accurately with a low load.
The output command is input to software on the terminal where the command voice was entered, or on the dialogue partner's terminal across the communication line, and that software executes the command.

A wink is one example of the specific action.
In ordinary dialogue a wink rarely appears unless the person intends it. Holding a wink for several seconds is rarer still.
Choosing an action that continues over time as the specific action in this way helps command voice recognition, for example by allowing the audio picked up between the start and end of the action to be treated as the command.

FIGS. 9 to 11 are explanatory diagrams showing how commands are forwarded according to the kind of wink. In FIG. 9, a right-eye wink outputs a command to the user's own terminal; in FIG. 10, a left-eye wink outputs a command to the partner's terminal; and in FIG. 11, a wink of both eyes outputs a command to both the user's and the partner's terminals.
The wink used as the specific action is preferably the closing of one eye.
This makes it possible to vary the content of the command for the same voice command according to the difference in the specific action, so variety can be added to commands easily.
For example, the winks of the left and right eyes can be given different meanings: a right wink specifies that the subsequent command is executed on the user's own terminal, and a left wink specifies that it is executed on the dialogue partner's terminal.
In this case, each user's own device is set to ignore that user's own left wink, to treat the partner's left wink as valid, and to take the partner's voice as the command at that time.
Conversely, to reduce misrecognition, the setting can be that only a command interpreted identically by the user's own device and the partner's device is recognized as a genuine command. Multiplexing recognition in parallel in this way improves recognition accuracy.
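The routing of FIGS. 9 to 11 and the cross-check between the two devices can be sketched as two small functions. This is a minimal illustration under assumptions: the labels "right", "left", "both", "self", and "partner", and the function names `targets_for` and `confirmed_command`, are hypothetical and do not come from the patent.

```python
from typing import Dict, Optional, Set

# Which terminals execute the command that follows each kind of wink
# (right: own terminal, left: partner's terminal, both eyes: both terminals).
ROUTES: Dict[str, Set[str]] = {
    "right": {"self"},
    "left": {"partner"},
    "both": {"self", "partner"},
}


def targets_for(wink: str) -> Set[str]:
    """Return the set of terminals that should execute the subsequent command."""
    return set(ROUTES.get(wink, set()))


def confirmed_command(local: Optional[str], remote: Optional[str]) -> Optional[str]:
    """Accept a command only when both devices interpreted it identically."""
    if local is not None and local == remote:
        return local
    return None


print(sorted(targets_for("both")))
print(confirmed_command("search", "search"))
print(confirmed_command("search", "save"))
```

Requiring agreement trades some responsiveness for accuracy: a command that only one device recognized, or that the two devices recognized differently, is dropped.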

Even when VC is not actually in use, an individual can use the existing VC equipment alone. The meaning given to each eye's wink can be changed as appropriate; for example, a right wink may mean “on the desktop ...” and a left wink “on the Internet ...”.
A wink is defined here as closing one eye for a time t or longer. t is set as appropriate, for example to 0.5 seconds, and shorter actions are ignored. Blinking closes both eyes at the same time, briefly and repeatedly, so there is no risk of it being mistaken for a wink.
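The wink definition above (one eye closed for at least t, with blinks and shorter closures ignored) can be sketched as follows. The sample format, the function name `detect_wink`, and the assumption that "left" and "right" are already expressed from the subject's point of view are all hypothetical; a real image recognizer would supply the per-frame eye states.

```python
from typing import Iterable, Optional, Tuple

# (time in seconds, left eye closed?, right eye closed?) for each analyzed frame
Sample = Tuple[float, bool, bool]


def detect_wink(samples: Iterable[Sample], t: float = 0.5) -> Optional[str]:
    """Return "left" or "right" once exactly one eye has stayed closed for t seconds."""
    current: Optional[str] = None  # the single eye currently closed, if any
    start = 0.0                    # time at which `current` began
    for time, left_closed, right_closed in samples:
        if left_closed != right_closed:
            eye: Optional[str] = "left" if left_closed else "right"
        else:
            eye = None  # both open, or both closed (a blink)
        if eye != current:
            current, start = eye, time  # state changed, restart the timer
        if current is not None and time - start >= t:
            return current
    return None


long_right = [(i / 10, False, True) for i in range(7)]             # right eye closed 0.0-0.6 s
blink = [(0.0, False, False), (0.1, True, True), (0.2, True, True), (0.3, False, False)]
short_left = [(0.0, True, False), (0.1, True, False), (0.2, False, False)]

print(detect_wink(long_right))  # a wink held for at least t
print(detect_wink(blink))       # both eyes closed: not a wink
print(detect_wink(short_left))  # one eye closed, but shorter than t
```

Because a blink closes both eyes, it never starts the timer, so only a deliberate one-eyed closure of sufficient length is reported.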

Using a wink as the specific action allows, for example, the following usage.
Utterances made during normal activity are ignored by the speech recognition device. When the user keeps just one eye closed, the speech recognition device starts, and the utterances made until both eyes are open again are recognized. The recognized command is then sent to the appropriate software according to what was recognized.
For example, partway through a normal dialogue with both eyes open, the user keeps one eye closed, says “Search Fukuzawa Yukichi”, and then opens the eye. In this case a search engine searches for “Fukuzawa Yukichi” and the results are displayed on the monitor.
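The usage just described, in which only speech uttered while one eye is held closed is treated as a command, can be sketched like this. The event names "wink_start", "wink_end", and "speech", the function name `commands_while_winking`, and the rule that the first word is the command and the rest its argument are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

Event = Tuple[str, str]  # ("wink_start", ""), ("wink_end", ""), or ("speech", recognized text)


def commands_while_winking(events: Iterable[Event]) -> List[Tuple[str, str]]:
    """Collect (command, argument) pairs from speech uttered between wink start and end."""
    results: List[Tuple[str, str]] = []
    listening = False
    heard: List[str] = []
    for kind, text in events:
        if kind == "wink_start":
            listening, heard = True, []
        elif kind == "wink_end":
            if listening and heard:
                # Assumption: the first word names the command, the rest is its argument.
                verb, _, argument = " ".join(heard).partition(" ")
                results.append((verb, argument))
            listening = False
        elif kind == "speech" and listening:
            heard.append(text)
    return results


events = [
    ("speech", "hello there"),       # normal dialogue: ignored
    ("wink_start", ""),
    ("speech", "search"),
    ("speech", "Fukuzawa Yukichi"),
    ("wink_end", ""),
    ("speech", "thanks"),            # normal dialogue again: ignored
]
print(commands_while_winking(events))
```

A caller could then hand the pair to whichever software handles the command, for example passing the argument of a "search" command to a search engine.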

According to this system, the dialogue partner can also recognize what the user is doing, which has the advantage that the situation is shared. In the conventional case, where the user looks down and types “Fukuzawa Yukichi” on the keyboard, the partner cannot tell what the user is doing. Those few seconds produce an awkward lull, and the partner becomes bored. Typically a redundant exchange then becomes necessary, such as “Wait a moment ... (clatter of keys)”, “What did you just do?”, “I searched for Fukuzawa Yukichi. The first result is ...”.
An action such as a wink has the advantage that command input is not hindered even when the hands are occupied with the mouse or keyboard.
Also, when the dialogue partner closes an eye, one can understand that the partner has entered the command input mode, so the content of the command uttered next is easier to follow and the sense of presence is maintained.
If the partner's side also has the same application installed, the partner's side can likewise search for “Fukuzawa Yukichi” and view the results, which helps the conversation continue with those results as reference material. This could also be applied to explaining a PC's functions remotely to, for example, elderly people who can switch the PC on but are unused to operating it, or to distance learning combined with application software.

It is also effective to define the specific action as rotating the head until one eye is out of the camera's view.
FIGS. 12 to 14 are explanatory diagrams showing the specific actions of another embodiment. In FIG. 12, rotating the head to the right outputs a command to the user's own terminal; in FIG. 13, rotating it to the left outputs a command to the partner's terminal; and in FIG. 14, rotating it upward outputs a command to both the user's and the partner's terminals.
Such coarse movements can be recognized adequately by today's inexpensive equipment. In addition, since a monitor is usually nearly flat, pictures and text shown on it can be read with one eye without strain for a short time. The user can therefore check in real time whether the spoken command is being recognized correctly.

It is also effective to set the specific action to be touching a specific finger to a specific part of the head.
FIGS. 15 to 17 are explanatory views showing the specific actions of another embodiment. FIG. 15 shows a command being output to the user's own terminal by placing a finger on the right temple, FIG. 16 shows a command being output to the partner's terminal by placing a finger on the left temple, and FIG. 17 shows a command being output to both the user's and the partner's terminals by placing a finger on the forehead.

FIGS. 18 to 20 are explanatory views showing the specific actions of another embodiment. FIG. 18 shows a command being output to the user's own terminal by placing a finger of the right hand on the nose, FIG. 19 shows a command being output to the partner's terminal by placing a finger of the left hand on the nose, and FIG. 20 shows a command being output to both the user's and the partner's terminals by placing several fingers on the nose.
As these examples show, both the part that touches (a finger or the like) and the part that is touched can be chosen in many ways, so the change in command content associated with each specific action can likewise be configured and changed flexibly and easily.

A light-emitting member can also be worn on the head or elsewhere to aid recognition of the specific action.
FIG. 21 is an explanatory view showing a state in which a light-emitting member is worn near the ear, and FIG. 22 is an explanatory view showing the head rotated in that state.
In the illustrated example, a speaker with an attached LED is worn on the left ear.
With this arrangement, the LED light allows the action to be recognized even when the face is not illuminated. When the head is rotated to the left, the user's own nose and right cheek block the LED light from reaching the camera, so the system can be configured, for example, to start command input mode when no LED light has been detected for a certain period or longer.
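
The timing rule described above (no LED light for a certain period starts command input mode) can be sketched as a small state holder. This is a minimal Python sketch under stated assumptions, not the disclosed implementation: the class name `LedOcclusionDetector`, the one-second default for `hold_seconds`, and the choice to leave command input mode as soon as the LED is visible again are all hypothetical. Whether the LED is visible in a given frame is assumed to be supplied by the image recognition means.

```python
class LedOcclusionDetector:
    """Enter command input mode after the LED has been hidden long enough."""

    def __init__(self, hold_seconds=1.0):
        # Assumed threshold; the text only says "a certain period of time".
        self.hold_seconds = hold_seconds
        self._last_seen = None  # timestamp of the last frame with the LED visible
        self.command_mode = False

    def update(self, led_visible, timestamp):
        """Feed one frame observation and return the current mode flag."""
        if led_visible:
            self._last_seen = timestamp
            # Assumption: the mode ends when the LED reappears.
            self.command_mode = False
        elif (self._last_seen is not None
              and timestamp - self._last_seen >= self.hold_seconds):
            self.command_mode = True
        return self.command_mode


if __name__ == "__main__":
    detector = LedOcclusionDetector(hold_seconds=1.0)
    for visible, t in [(True, 0.0), (False, 0.5), (False, 1.2), (True, 1.3)]:
        print(t, detector.update(visible, t))
```

Requiring the LED to have been seen at least once before the timer can fire avoids entering command input mode when the user is simply not wearing the device.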

FIG. 23 is an explanatory view showing a state in which light-emitting members are worn at both shoulder positions, and FIG. 24 is an explanatory view showing the right shoulder raised in that state.
In the illustrated example, a microphone and speaker set fitted with LEDs is worn at both shoulder positions.
Normally, the two LED lights appear in the camera image roughly level with each other. When the right shoulder is raised or the upper body is tilted to the left, as illustrated, the LED light on the right shoulder rises relative to the other. The specific action can thus be recognized from the positional relationship between multiple LED lights.
Note that this shoulder-worn audio input/output device does not cause the ear pain that conventional headphones do, even after long use. If it has a built-in battery and transmits and receives audio wirelessly, the user can also leave the seat while still wearing it, which is convenient.
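
One way to evaluate the positional relationship between the two shoulder LEDs is to measure the tilt of the line joining them. The Python sketch below is illustrative only. It assumes that the image coordinates of each LED are already available, that the two LEDs have already been labeled as belonging to the right and left shoulder, and that the image y axis grows downward. The function name `shoulder_gesture`, the 10-degree threshold, and the symmetric "left shoulder up" result are hypothetical additions.

```python
import math


def shoulder_gesture(right_led, left_led, threshold_deg=10.0):
    """Classify shoulder posture from two LED positions given as (x, y).

    Image coordinates are assumed, so a smaller y means higher in the frame.
    """
    dx = abs(left_led[0] - right_led[0])
    # Positive when the right-shoulder LED is higher in the image.
    rise = left_led[1] - right_led[1]
    angle = math.degrees(math.atan2(rise, dx))
    if angle >= threshold_deg:
        return "right_shoulder_up"
    if angle <= -threshold_deg:
        return "left_shoulder_up"
    return "level"


if __name__ == "__main__":
    print(shoulder_gesture((100, 200), (300, 205)))  # nearly horizontal
    print(shoulder_gesture((100, 180), (300, 240)))  # right LED clearly higher
```

Using an angle rather than a raw pixel difference keeps the threshold independent of how far the subject sits from the camera.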

Furthermore, if LEDs with several different emission wavelengths are prepared and the wavelength is used as an ID, each person can be identified without face recognition even when several people are in front of the camera.
If each person's function assignment information is stored on a server, the same function assignments can be used on any PC within a given network.
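
The identification step could, for example, match an estimated wavelength against a table of registered LEDs and then look up that person's assignments. The Python sketch below is a hypothetical illustration: the wavelengths, user names, assignment entries, and tolerance are invented for the example, the in-memory dictionary `ASSIGNMENTS` merely stands in for the server, and it is assumed that an estimate of the LED wavelength (in nanometres) can be obtained from the camera image.

```python
# Hypothetical registered LED wavelengths (nm) used as user IDs.
LED_IDS = {470: "user_a", 525: "user_b", 630: "user_c"}

# Stand-in for the per-user function assignments stored on a server.
ASSIGNMENTS = {
    "user_a": {"right_wink": "web_search"},
    "user_b": {"right_wink": "open_mail"},
    "user_c": {"head_right": "web_search"},
}


def identify_user(measured_nm, tolerance_nm=15.0):
    """Return the user whose registered wavelength is nearest, or None."""
    nearest = min(LED_IDS, key=lambda nm: abs(nm - measured_nm))
    if abs(nearest - measured_nm) <= tolerance_nm:
        return LED_IDS[nearest]
    return None


def assignments_for(measured_nm):
    """Look up the function assignments of the user wearing the detected LED."""
    user = identify_user(measured_nm)
    return ASSIGNMENTS.get(user, {}) if user is not None else {}


if __name__ == "__main__":
    print(identify_user(628.0))
    print(assignments_for(472.0))
```

Keeping the assignments keyed by user rather than by terminal is what lets the same settings follow a person to any PC on the network.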

According to the present invention, the system can be built with a light processing load by combining existing devices, yet commands are relayed accurately and can be shared with the conversation partner in a natural manner, so the dialogue proceeds smoothly.
Because the interface is natural to use, people unfamiliar with PC operation, such as the elderly, can learn it easily, and it also becomes possible to personify applications and integrate them into everyday living spaces.
The invention can also be applied to correspondence education and the like, and thus has a wide range of uses and is highly useful industrially.

FIG. 1: Explanatory diagram showing an overview of the VC system
FIG. 2: Explanatory diagram showing the main part of the conventional VC system according to Non-Patent Document 1
FIG. 3: Explanatory diagram showing the main part of the VC system of the present invention
FIG. 4: Flowchart showing an outline of the system for recognizing command voice
FIG. 5: Another embodiment of the same
FIG. 6: Another embodiment of the same
FIG. 7: Flowchart showing an outline of the prior-art system
FIG. 8: Another example of the same
FIG. 9: Explanatory diagram showing the mode of command transfer according to the mode of the wink
FIG. 10: The same, in another state
FIG. 11: The same, in another state
FIG. 12: Explanatory diagram showing the specific action of another embodiment using rotation of the head
FIG. 13: The same, in another state
FIG. 14: The same, in another state
FIG. 15: Explanatory diagram showing the specific action of another embodiment in which a finger is used to point
FIG. 16: The same, in another state
FIG. 17: The same, in another state
FIG. 18: Explanatory diagram showing the specific action of another embodiment in which the nose is pointed to
FIG. 19: The same, in another state
FIG. 20: The same, in another state
FIG. 21: Explanatory diagram showing a state in which a light-emitting member is worn near the ear
FIG. 22: The same, in another state
FIG. 23: Explanatory diagram showing a state in which light-emitting members are worn at both shoulder positions
FIG. 24: The same, in another state

Claims

A remote dialogue method in which a plurality of terminals, each comprising at least voice and image input/output means and recognition means, communication means, and a computer, exchange data through a network, and in which the audio and video obtained by the voice input means and the image input means of each terminal are transmitted and received between the terminals to conduct a remote dialogue, the method comprising:
a step of recognizing, by the image recognition means, a predetermined specific action of a subject captured by the image input means; and
a step of recognizing, by the voice recognition means, a command from the voice picked up by the voice input means, with recognition of the specific action serving as a trigger.

The remote dialogue method according to claim 1, wherein the voice recognized as the command by the voice recognition means is a predetermined keyword.

The remote dialogue method according to claim 1, wherein the specific action is a continuous action, and the command is recognized from the voice picked up between the start and the end of the specific action.

The remote dialogue method according to any one of claims 1 to 3, wherein the command recognized by the voice recognition means is output to software provided in the terminal, and the content of the command is executed by that software.

The remote dialogue method according to any one of claims 1 to 4, wherein the command recognized by the voice recognition means is transmitted to the terminal of the conversation partner, and the content of the command is executed by software provided in the partner's terminal.

The remote dialogue method according to any one of claims 1 to 5, wherein the command content for the same voice command is changed according to the difference in the specific action.

The remote dialogue method according to any one of claims 1 to 6, wherein the specific action is a wink.

The remote dialogue method according to claim 6 or 7, wherein a right wink is the specific action designating that the subsequent command be executed on the user's own terminal, and a left wink is the specific action designating that the subsequent command be executed on the conversation partner's terminal.

The remote dialogue method according to claim 1, wherein the specific action is an action of rotating the head until one eye is no longer imaged by the image input means.

The remote dialogue method according to claim 1, wherein the specific action is an action of placing a finger on a predetermined part of the head.

The remote dialogue method according to claim 1, wherein the specific action is an action of showing a predetermined part of the body that was not previously visible.

The remote dialogue method according to claim 1, wherein the specific action is an action of hiding a predetermined part of the body that was previously visible.

The remote dialogue method according to any one of claims 1 to 6, wherein the specific action is an action of showing a body-worn item that was not previously visible.

The remote dialogue method according to any one of claims 1 to 6, wherein the specific action is an action of hiding a body-worn item.

A remote dialogue apparatus in which a plurality of terminals, each comprising at least voice and image input/output means and recognition means, communication means, and a computer, exchange data through a network, and in which the audio and video obtained by the voice input means and the image input means of each terminal are transmitted and received between the terminals to conduct a remote dialogue, the apparatus comprising:
image recognition means for recognizing a predetermined specific action of a subject captured by the image input means; and
voice recognition means for recognizing a command from the voice picked up by the voice input means, with recognition of the specific action serving as a trigger.

The remote dialogue apparatus according to claim 15, wherein the voice input/output means is a microphone and a speaker worn on the head, and a light-emitting member is attached at a position near the ear that is imaged by the image input means when the subject is wearing them.

The remote dialogue apparatus according to claim 15, wherein the voice input/output means is a microphone and a speaker worn on the shoulders, and a light-emitting member is attached at a position near the shoulders that is imaged by the image input means when the subject is wearing them.