JP7035686B2

JP7035686B2 - Remote calling devices, remote calling programs, and remote calling methods

Info

Publication number: JP7035686B2
Application number: JP2018056535A
Authority: JP
Inventors: 尚也川畑
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2018-03-23
Filing date: 2018-03-23
Publication date: 2022-03-15
Anticipated expiration: 2038-03-23
Also published as: JP2019169856A

Description

The present invention relates to a remote call device, a remote call program, and a remote call method, and can be applied, for example, to call start processing in a video conference system, a telephone conference system, or the like.

In recent years, there have been increasing opportunities to talk and communicate with remote locations, for example in video conferences and telework, using remote call systems such as video conference systems and telephone conference systems.

In a remote call system, to talk with a remote call partner, the user enters or selects contact information such as the partner's telephone number with an input device connected to the remote call system (for example, a mouse, keyboard, or remote control) and then connects. In recent years, with the spread of mobile terminals (for example, smartphones and tablet PCs), remote call systems also run on mobile terminals. In that case the user often connects to a remote location by entering the contact with the on-screen keyboard, by touching a contact shown on the touch panel display, or by touching the image of the call partner shown on the screen.

Further, Patent Document 1 proposes a communication support robot system that incorporates a remote call system into a robot and supports communication between close relatives and elderly people living alone.

In the communication support robot system described in Patent Document 1, touching the image of a close relative or an elderly person displayed on a touch panel display connects to that call partner and starts the call.

Patent Document 1: Japanese Unexamined Patent Publication No. 2015-184597

However, in the communication support robot system described in Patent Document 1, a call is started and ended in the same way as in a conventional remote call system: the user connects by entering the call partner's contact with an input device or by touching a contact displayed on the touch panel display. Because connecting in this conventional way differs from starting an actual face-to-face conversation, the sense of presence (the feeling of talking face to face) is very low.

Suppose instead that the voice recognition system mounted on the communication support robot is used: the user utters a calling voice, such as the name of the call partner or a command that starts a conversation (for example, "Hello" or "Excuse me"), the utterance is input to the voice recognition system, and the connection destination is determined from the recognition result so that the connection can be started. Even then, the call partner is determined and connected only after the calling voice has been input to the voice recognition processing, so the calling voice itself is not conveyed to the call partner. As a result, the call partner is connected abruptly, feels discomfort or anxiety, and the sense of presence does not improve.

Furthermore, when the monitor that displays the other party's image in a remote call system is a large screen, the microphone is often installed near the monitor so that it does not get in the way. Because the user talks from a position where the whole screen is visible, the distance between the user and the microphone increases. Therefore, when people nearby are talking or an air conditioner is loud, the noise in the usage environment is large, the user's voice becomes hard to hear, and conversation is not possible.

Therefore, a remote call device, a remote call program, and a remote call method are desired with which, even when the usage environment is noisy, the user can connect to a call partner and start a call with a calling voice in the same way as in an actual face-to-face conversation, the calling voice is conveyed to the call partner before the call starts, and the call is disconnected when the conversation ends with a voice that ends it (hereinafter referred to as a "disconnect voice"), such as a command that ends a conversation (for example, "Goodbye" or "Excuse me").

The first aspect of the present invention is characterized by having: (1) a signal processing unit that performs predetermined signal processing using input signals from a plurality of microphones that pick up sound; (2) an audio buffer unit that holds the processed signal output from the signal processing unit for a fixed period; (3) a voice recognition unit that performs voice recognition on the processed signal; (4) a command determination unit that uses the result of the voice recognition unit to determine whether the processed signal is a connection command voice; (5) a connection determination unit that determines the connection destination when the command determination unit determines that the processed signal is the connection command voice; and (6) an output switching unit that normally outputs the processed signal output from the signal processing unit to a network communication unit that transmits it to a remote call partner, and that, when the command determination unit determines that the processed signal is the connection command voice, outputs the processed signal held in the audio buffer unit to the network communication unit after connection to the call partner's terminal and, once the held processed signal has been output, switches back to outputting the processed signal output from the signal processing unit to the network communication unit.

The remote call program of the second aspect of the present invention is characterized by causing a computer to function as: (1) a signal processing unit that performs predetermined signal processing using input signals from a plurality of microphones that pick up sound; (2) an audio buffer unit that holds the processed signal output from the signal processing unit for a fixed period; (3) a voice recognition unit that performs voice recognition on the processed signal; (4) a command determination unit that uses the result of the voice recognition unit to determine whether the processed signal is a connection command voice; (5) a connection determination unit that determines the connection destination when the command determination unit determines that the processed signal is the connection command voice; and (6) an output switching unit that normally outputs the processed signal output from the signal processing unit to a network communication unit that transmits it to a remote call partner, and that, when the command determination unit determines that the processed signal is the connection command voice, outputs the processed signal held in the audio buffer unit to the network communication unit after connection to the call partner's terminal and, once the held processed signal has been output, switches back to outputting the processed signal output from the signal processing unit to the network communication unit.

The third aspect of the present invention is a remote call method used in a remote call device that has a signal processing unit, an audio buffer unit, a voice recognition unit, a command determination unit, a connection determination unit, and an output switching unit, characterized in that: (1) the signal processing unit performs predetermined signal processing using input signals from a plurality of microphones that pick up sound; (2) the audio buffer unit holds the processed signal output from the signal processing unit for a fixed period; (3) the voice recognition unit performs voice recognition on the processed signal; (4) the command determination unit uses the result of the voice recognition unit to determine whether the processed signal is a connection command voice; (5) the connection determination unit determines the connection destination when the command determination unit determines that the processed signal is the connection command voice; and (6) the output switching unit normally outputs the processed signal output from the signal processing unit to a network communication unit that transmits it to a remote call partner, and, when the command determination unit determines that the processed signal is the connection command voice, outputs the processed signal held in the audio buffer unit to the network communication unit after connection to the call partner's terminal and, once the held processed signal has been output, switches back to outputting the processed signal output from the signal processing unit to the network communication unit.

According to the present invention, even when the user is away from the microphone, the user starts the connection with the same words used in an actual face-to-face conversation, namely the name of the call partner and words that open a conversation, and ends the connection with words that close a conversation. This reproduces the way a conversation begins and ends, so both parties can feel a high sense of presence.

A block diagram showing the configuration of the remote call device according to the first embodiment.
An explanatory diagram showing an example of the command list unit according to the first embodiment.
An example of the device arrangement and the user's position in the room at one of the bases where the remote call device according to the first embodiment is used.
A block diagram showing the configuration of the remote call device according to the second embodiment.
A block diagram showing the configuration of the remote call device according to the third embodiment.

(A) First Embodiment
Embodiments of the remote call device, the remote call program, and the remote call method of the present invention are described in detail below with reference to the drawings.

The first embodiment illustrates a case where the remote call device, remote call program, and remote call method of the present invention described above are applied to the microphone input unit of, for example, a video conference system or a telephone conference system.

(A-1) Configuration of the First Embodiment
FIG. 1 is a block diagram showing the configuration of the remote call device 100 according to the first embodiment.

The remote call device 100 of the first embodiment may, for example, be built as a dedicated board, may be realized by writing the remote call program to a DSP (digital signal processor), or may be realized by a CPU and software (the remote call program) executed by the CPU. In any of these cases it can be represented functionally as in FIG. 1.

In FIG. 1, the remote call device 100 according to the first embodiment has a microphone array 101, a microphone amplifier 102, an AD converter 103, a call processing unit 104, an NW communication unit 105, a network 106, a DA converter 107, a speaker amplifier 108, a speaker 109, a video camera 120, and a monitor 121.

The microphone array 101 consists of a plurality of microphones that pick up human voices and other sounds.

The microphone amplifier 102 amplifies the plurality of input signals picked up by the microphone array 101.

The AD converter 103 converts the plurality of input signals amplified by the microphone amplifier 102 from analog signals to digital signals. Hereinafter, the signal converted by the AD converter 103 is referred to as the "microphone array input signal".

The call processing unit 104 applies signal processing to the plurality of input microphone input signals and outputs the processed signal to its output terminal. At the same time, it stores the processed signal in an audio buffer. It also performs voice recognition on the processed signal and, when the recognition result matches one of the commands in the command list unit, outputs the connection determination result and, for a fixed time, the sound signal stored in the audio buffer. When that fixed-time output is complete, it outputs the processed signal again.

The NW communication unit 105 performs connection processing with the network 106 based on the connection determination result output from the call processing unit 104. After connection, the remote call device 100 exchanges voice, via the NW communication unit 105, with another NW communication unit (another remote call device) connected to the network 106.

The DA converter 107 converts the voice from the network 106 (the sound signal received via the NW communication unit 105) from a digital signal to an analog signal.

The speaker amplifier 108 amplifies the analog signal.

The speaker 109 converts the electric signal into air vibration and outputs it as sound.

Next, the detailed configuration of the call processing unit 104 is described.

The call processing unit 104 has an input terminal 110, a signal processing unit 111, an audio buffer unit 112, a voice recognition unit 113, a command list unit 114, a command determination unit 115, an output switching unit 116, an output terminal 117, a connection determination unit 118, and a connection determination result output terminal 119.

The input terminal 110 is an interface through which the microphone array input signal is input to the call processing unit 104.

The signal processing unit 111 applies signal processing to the microphone array input signal and outputs the processed signal.

The audio buffer unit 112 is a buffer that holds the processed signal for a fixed time.

The voice recognition unit 113 performs voice recognition on the processed signal and outputs the recognition result.

The command list unit 114 is a list in which commands are held. For example, the command list unit 114 holds the list of commands in a text file, as shown in FIG. 2. FIG. 2 is only an example; various contents and formats can be used for the data that is held.

The command determination unit 115 determines whether the voice recognition result exists in the command list of the command list unit 114 and outputs the determination result.

The output switching unit 116 decides which sound signal to output based on the command determination result and outputs that sound signal.

The output terminal 117 is an interface that outputs the sound signal of the call processing unit 104.

The connection determination unit 118 determines the connection destination on the network 106 (for example, the other party's videophone) based on the voice recognition result from the voice recognition unit 113 and the command determination result from the command determination unit 115.

The connection determination result output terminal 119 outputs the connection determination result to the NW communication unit 105.

The video camera 120 is an imaging device installed at the user's own base (the base where the remote call device 100 is installed). The image captured by the video camera 120 is transmitted to the network 106 by the NW communication unit 105.

The monitor 121 is a video output device. The image output by the monitor 121 is, for example, an image captured by a video camera installed at the other party's base. This image (encoded data) is received by the NW communication unit 105 via the network 106, decoded, and then input to the monitor 121.

(A-2) Operation of the First Embodiment
The operation of the remote call device 100 according to the first embodiment is described in detail below.

FIG. 3 shows an example of the device arrangement and the user's position in the room at one of the bases where the remote call device according to the first embodiment is used. The same remote call device 100 is assumed to be installed at the other party's base. The room 151 is, for example, a conference room. Its height only needs to allow the monitor 121 to be installed easily with ample margin (for example, the height of the monitor 121 plus several meters, or 2 m or more). Its floor area only needs to allow the monitor 121, the microphone array 101, the speaker 109, and so on to be installed easily with ample margin, or to be large enough for the user 152 to hold a conversation (for example, several meters in width and depth).

First, when the remote call device 100 starts operating, the monitor 121 at the user's own base displays the image captured by the video camera 120 of the remote call device 100 at the other party's base. Likewise, the monitor 121 at the other party's base displays the image captured by the video camera 120 at the user's own base. At this point audio is not connected; each base shows only the image captured by the other base's camera 120, so the two bases can see each other's situation.

When the user 152 approaches the monitor 121 to look at the image of the other party's base and wants to talk with a person who appears in that overall image, the user 152 utters a calling voice to start the conversation.

An analog sound signal in which the voice and other sounds produced by the user 152 are superimposed with environmental sound is input to each microphone of the microphone array 101.

The analog sound signal input to the microphone array 101 is amplified by the microphone amplifier 102, converted from an analog signal to a digital signal by the AD converter 103, and input to the input terminal 110 of the call processing unit 104 as the microphone input signal x(m, n). Here m is a parameter that identifies each microphone in the microphone array 101, and n is a parameter that indicates the time series of the input signal.

When a signal starts to arrive at the input terminal 110 of the call processing unit 104, the microphone input signal x(m, n) is first input to the signal processing unit 111.

The signal processing unit 111 performs microphone array processing on the input signal, namely directivity processing or sound source separation processing that separates sound sources. One directivity processing method is, for example, delay-and-sum array processing, a conventional microphone array technique, carried out according to equations (1) and (2).
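Equations (1) and (2) did not survive in this text. A standard delay-and-sum form that is consistent with the symbol definitions given for them is sketched below. The 1/M normalization, the zero-based microphone index m, and the uniform linear array geometry are assumptions, not taken from the source; in a discrete-time implementation the delay would additionally be converted to samples using the sampling rate.

```latex
x'(n) = \frac{1}{M} \sum_{m=0}^{M-1} x(m,\, n - \tau_m) \qquad (1)

\tau_m = \frac{m \, d \, \sin\theta}{c} \qquad (2)
```

With this form, steering to the front of the array (θ = 0) gives τ_m = 0 for every microphone, which matches the example stated with the symbol definitions.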

In equation (1), x'(n) is the microphone array processed signal, M is the number of microphones, and τ_m is the delay amount. In equation (2), d is the microphone spacing, θ is the directivity angle, and c is the speed of sound. For example, when directivity is formed toward the front of the microphone array, θ = 0 and therefore τ_m = 0.

A wide variety of methods can be used for this signal processing. For example, it may be another conventional microphone array process other than delay-and-sum array processing, or a microphone array process that uses two microphone arrays to pick up sound from a specific area. The signal processing unit 111 outputs the calculated microphone array processed signal x'(n) to the audio buffer unit 112, the voice recognition unit 113, and the output switching unit 116.
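As an illustration of the delay-and-sum option mentioned above, the following is a minimal sketch and not the patent's implementation. It assumes a uniform linear array, a zero-based microphone index, integer-sample delays obtained by rounding, a sampling rate `fs`, and a speed of sound `c` of 340 m/s; the function name `delay_and_sum` and all parameter names are hypothetical.

```python
import math


def delay_and_sum(x, d, theta, c=340.0, fs=16000):
    """Delay-and-sum beamformer sketch.

    x     : list of M channels, each a list of N samples, i.e. x[m][n]
    d     : microphone spacing in meters (assumed uniform linear array)
    theta : steering angle in radians (0 = front of the array)
    c     : speed of sound in m/s (assumed value)
    fs    : sampling rate in Hz (assumed value)
    Returns the processed signal x'(n) as a list of N floats.
    """
    num_mics = len(x)
    num_samples = len(x[0])
    # Per-microphone delay in whole samples (rounding is a simplification).
    delays = [round(m * d * math.sin(theta) / c * fs) for m in range(num_mics)]
    out = []
    for n in range(num_samples):
        acc = 0.0
        for m in range(num_mics):
            k = n - delays[m]
            if 0 <= k < num_samples:  # samples outside the frame count as zero
                acc += x[m][k]
        out.append(acc / num_mics)
    return out
```

With `theta = 0` every delay is zero and the output is simply the average of the channels, which corresponds to the front-facing case described in the text.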

While the remote call device 100 is operating, the output switching unit 116 initially outputs a silent signal to the output terminal 117 as the output signal y(n), as shown in equation (3).
y(n) = 0 ... (3)

At the same time, the call processing unit 104 stores the microphone array processed signal x'(n) at the write position write_index of the audio buffer buffer(n) of the audio buffer unit 112, according to equation (4). After storing it, the call processing unit 104 advances (increments) the write position write_index, as shown in equation (5).
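Equations (4) and (5) did not survive in this text. A form consistent with the description (store at write_index, then increment, with BUFFER_SIZE appearing in the increment) is the usual ring-buffer update sketched below; the modulo wrap-around is an assumption.

```latex
\mathrm{buffer}(\mathrm{write\_index}) = x'(n) \qquad (4)

\mathrm{write\_index} = (\mathrm{write\_index} + 1) \bmod \mathrm{BUFFER\_SIZE} \qquad (5)
```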

BUFFER_SIZE in equation (5) is the length of the audio buffer of the audio buffer unit 112.

Also at the same time, the call processing unit 104 has the voice recognition unit 113 perform voice recognition on the microphone array processed signal x'(n) and outputs the recognition result to the command determination unit 115.

The command determination unit 115 compares the voice recognition result with the command list held in the command list unit 114 (for example, the command list of FIG. 2) and determines whether a "person name" in the command list and a "connection command" in the command list were recognized in succession (for example, "XX-san, hello"). The command determination unit 115 then outputs the determination result to the output switching unit 116, and outputs the determination result and the voice recognition result to the connection determination unit 118. For example, the determination result is 1 when a "person name" and a "connection command" were recognized in succession, and 0 otherwise.
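The determination described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the recognition result is assumed to arrive as plain text, the entries in `PERSON_NAMES` are hypothetical, `CONNECT_COMMANDS` reuses the example greetings from the text, and the function names are invented for this sketch.

```python
import re

# Hypothetical command list contents; the actual format of FIG. 2 is not shown here.
PERSON_NAMES = ["Tanaka", "Suzuki"]
CONNECT_COMMANDS = ["hello", "excuse me"]


def _normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", text.lower()).split())


def judge_command(recognized):
    """Return (1, name) if a person name is immediately followed by a
    connection command in the recognized text, otherwise (0, None)."""
    text = f" {_normalize(recognized)} "
    for name in PERSON_NAMES:
        for command in CONNECT_COMMANDS:
            # Surrounding spaces enforce whole-word matching.
            if f" {name.lower()} {command} " in text:
                return 1, name
    return 0, None
```

For instance, `judge_command("Tanaka, hello!")` yields a determination result of 1 together with the matched name, while a greeting with no preceding name yields 0.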

The connection determination unit 118 makes a connection determination based on the voice recognition result from the voice recognition unit 113 and the command determination result from the command determination unit 115, and outputs the connection determination result to the NW communication unit 105. For example, suppose the determination result is 1 and the command determination unit 115 outputs the voice recognition result "XX-san, hello". If XX is near the remote call device 100 installed at the other party's base, the connection determination unit 118 outputs a signal for connecting to that remote call device 100 to the connection determination result output terminal 119. Whether XX is near the remote call device 100 installed at a base is determined, for example, from information registered in advance about the people near each terminal.
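The lookup against pre-registered information can be sketched as below. This is only an illustration: the registry `PRESENCE`, its site identifiers, the names in it, and the function `decide_connection` are all hypothetical, and the source does not specify how the registered information is stored.

```python
# Hypothetical registry: people registered in advance as being near each site's terminal.
PRESENCE = {
    "site-b": {"Tanaka", "Sato"},
    "site-c": {"Suzuki"},
}


def decide_connection(determination_result, name):
    """Return the site to connect to, or None when no connection should be made.

    determination_result : 1 if a person name plus connection command was recognized
    name                 : the recognized person name (or None)
    """
    if determination_result != 1 or name is None:
        return None
    for site, people in PRESENCE.items():
        if name in people:
            return site
    return None  # the named person is not registered near any terminal
```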

The NW communication unit 105 performs connection processing with the network 106 based on the connection determination result output via the connection determination result output terminal 119. After connection, the remote call device exchanges voice via the NW communication unit 105.

Meanwhile, as long as the command determination unit 115 does not find a "person name" followed by a "connection command" in the recognition result of the voice recognition unit 113, the output switching unit 116 continues to output a silent signal to the output terminal 117. When the command determination unit 115 does find a "person name" followed by a "connection command", the output switching unit 116 calculates the read position read_index of the audio buffer unit 112 according to equation (6).
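Equation (6) did not survive in this text. Given that it computes read_index from the playback length LEN, one plausible ring-buffer form is sketched below; placing the read position LEN samples behind the current write position, and the modulo wrap-around, are assumptions.

```latex
\mathrm{read\_index} = (\mathrm{write\_index} - \mathrm{LEN} + \mathrm{BUFFER\_SIZE}) \bmod \mathrm{BUFFER\_SIZE} \qquad (6)
```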

上記（６）式のＬＥＮは、オーディオバッファ部１１２に保持されている処理信号を再生する長さである。なお、ＬＥＮの決定方法は、種々の方法を広く適用することができ、例えば、オーディオバッファ部１１２のバッファサイズと同じ長さ（ＬＥＮ＝ＢＵＦＦＥＲ＿ＳＩＺＥ）とするなどの定数とする方法が存在する。また、オーディオバッファ部１１２に保持されているマイク入力信号に音声区間処理を行い、バッファに保持されている音の長さを求めて、その長さをＬＥＮとする方法でも良い。 The LEN of the above equation (6) is a length for reproducing the processing signal held in the audio buffer unit 112. As the method for determining LEN, various methods can be widely applied, and for example, there is a method in which the length is the same as the buffer size of the audio buffer unit 112 (LEN = BUFFER_SIZE). Alternatively, a method may be used in which the microphone input signal held in the audio buffer unit 112 is subjected to audio section processing, the length of the sound held in the buffer is obtained, and the length is set to LEN.
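The second way of choosing LEN mentioned above (voice section processing on the buffered signal) could look like the following sketch. It assumes a simple frame-energy threshold as the voice section detector, which the source does not specify, and it assumes the read position lies LEN samples behind the write position with wrap-around, since equation (6) itself is not reproduced here. The names `voiced_length` and `read_start`, the threshold, and the frame size are hypothetical.

```python
def voiced_length(samples, threshold=0.01, frame=160):
    """Length in samples from the first voiced frame to the end of the buffer.

    `samples` is the buffered signal in chronological order (oldest first).
    """
    for start in range(0, len(samples), frame):
        chunk = samples[start:start + frame]
        energy = sum(s * s for s in chunk) / len(chunk)
        if energy > threshold:
            return len(samples) - start
    return 0  # no voiced frame found


def read_start(write_index, length, buffer_size):
    """Assumed read position: `length` samples behind the write position, wrapped."""
    return (write_index - length) % buffer_size
```

With the constant method, `length` would simply be the buffer size.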

そして、出力切替え部１１６は、以下の（７）式に示すようにオーディオバッファ部１１２に保持されている音信号を出力信号ｙ（ｎ）として出力端子１１７に一定時間（例えば、ＬＥＮの時間長分）出力し、以下の（８）式に示すように読出し位置ｒｅａｄ＿ｉｎｄｅｘを進める（インクリメン卜する）。

Then, as shown in the following equation (7), the output switching unit 116 outputs the sound signal held in the audio buffer unit 112 to the output terminal 117 as the output signal y(n) for a fixed time (for example, the time length of LEN), and advances (increments) the read position read_index as shown in the following equation (8).

ＮＷ通信部１０５は、出力端子１１７から介して出力された出力信号ｙ（ｎ）をネットワーク１０６で接続している相手のＮＷ通信部に送信する。 The NW communication unit 105 transmits the output signal y (n) output from the output terminal 117 to the NW communication unit of the other party connected by the network 106.

出力切替え部１１６は、オーディオバッファ部１１２に保持されている音信号を一定時間出力すると、以下の（９）式に示すように、マイクアレイ処理信号ｘ’（ｎ）を出力信号ｙ（ｎ）として出力端子１１７に出力する。 When the output switching unit 116 has output the sound signal held in the audio buffer unit 112 for the fixed time, it outputs the microphone array processing signal x'(n) to the output terminal 117 as the output signal y(n), as shown in the following equation (9).
y(n) = x'(n) … (9)

ＮＷ通信部１０５は、出力端子１１７を介して出力された出力信号ｙ（ｎ）を引き続きネットワーク１０６で接続している相手のＮＷ通信部に送信する。 The NW communication unit 105 continues to transmit the output signal y(n) output via the output terminal 117 to the NW communication unit of the other party connected via the network 106.
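The three output phases described above (silence before the connection command, playback of the buffered call voice, then the live microphone array signal) can be sketched as a small state machine. This is an illustration under assumptions: the class and method names are hypothetical, the buffer is treated as a ring buffer with wrap-around, and the read position is assumed to start LEN samples behind the write position.

```python
from enum import Enum


class Mode(Enum):
    SILENT = 0    # before the connection command: y(n) is silence
    PLAYBACK = 1  # after connecting: y(n) comes from the audio buffer
    LIVE = 2      # afterwards: y(n) = x'(n)


class OutputSwitcher:
    def __init__(self, buffer_size):
        self.buffer = [0.0] * buffer_size
        self.write_index = 0
        self.read_index = 0
        self.remaining = 0
        self.mode = Mode.SILENT

    def on_connect(self, length):
        """Start playback of the last `length` buffered samples."""
        size = len(self.buffer)
        length = min(length, size)
        self.read_index = (self.write_index - length) % size
        self.remaining = length
        self.mode = Mode.PLAYBACK if length > 0 else Mode.LIVE

    def process(self, x):
        """Store one processed sample x'(n) and return the output sample y(n)."""
        size = len(self.buffer)
        if self.mode is Mode.SILENT:
            y = 0.0
        elif self.mode is Mode.PLAYBACK:
            # Read before writing so the oldest sample is not overwritten first.
            y = self.buffer[self.read_index]
            self.read_index = (self.read_index + 1) % size
            self.remaining -= 1
            if self.remaining == 0:
                self.mode = Mode.LIVE
        else:
            y = x
        self.buffer[self.write_index] = x
        self.write_index = (self.write_index + 1) % size
        return y
```

For example, after feeding samples 1, 2, 3 in the silent phase and calling `on_connect(3)`, the next three calls to `process` return 1, 2, 3 from the buffer, and the call after that returns its own input.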

一方、ネットワーク１０６から送信されてきた相手側の音声はＮＷ通信部１０５を介して、ＤＡ変換器１０７によりデジタル信号からアナログ信号に変換後、スピーカアンプ１０８で増幅され、スピーカ１０９により出力される。 On the other hand, the voice of the other party transmitted from the network 106 is converted from a digital signal to an analog signal by the DA converter 107 via the NW communication unit 105, amplified by the speaker amplifier 108, and output by the speaker 109.

しばらくして、通話を終了する場合は、使用者１５２が切断音声を発話して会話を終了する。 After a while, when the call is terminated, the user 152 utters a disconnect voice to end the conversation.

マイクアレイ１０１に入力されたアナログの音信号は、マイクアンプ１０２で増幅され、ＡＤ変換器１０３でアナログ信号からデジタル信号に変換され、呼びかけ処理部１０４の入力端子１１０にマイク入力信号ｘ（ｍ，ｎ）として入力され、信号処理部１１１に入力される。 The analog sound signal input to the microphone array 101 is amplified by the microphone amplifier 102, converted from an analog signal to a digital signal by the AD converter 103, input to the input terminal 110 of the call processing unit 104 as the microphone input signal x(m, n), and passed to the signal processing unit 111.

信号処理部１１１は、入力信号に対してマイクアレイ処理を行い、指向性処理や音源を分離する音源分離処理を行い、算出したマイクアレイ処理信号ｘ’（ｎ）をオーディオバッファ部１１２と音声認識部１１３と出力切替え部１１６に出力する。 The signal processing unit 111 performs microphone array processing on the input signal, such as directivity processing and sound source separation processing that separates sound sources, and outputs the calculated microphone array processing signal x'(n) to the audio buffer unit 112, the voice recognition unit 113, and the output switching unit 116.

出力切替え部１１６は、（９）式に示すように、マイクアレイ処理信号ｘ’（ｎ）を出力信号ｙ（ｎ）として出力端子１１７に出力する。 As shown in the equation (9), the output switching unit 116 outputs the microphone array processing signal x'(n) to the output terminal 117 as the output signal y (n).

また、呼びかけ処理部１０４は、同時にマイクアレイ処理信号ｘ’（ｎ）を、（４）式に従い、オーディオバッファ部１１２のオーディオバッファｂｕｆｆｅｒ（ｎ）の書込み位置ｗｒｉｔｅ＿ｉｎｄｅｘの位置に保持する。保持した後、呼びかけ処理部１０４は、（５）式に示すように、書込み位置ｗｒｉｔｅ＿ｉｎｄｅｘを進める（インクリメン卜する）。 Further, the call processing unit 104 simultaneously holds the microphone array processing signal x'(n) at the write position write_index of the audio buffer buffer (n) of the audio buffer unit 112 according to the equation (4). After holding, the call processing unit 104 advances (increments) the writing position write_index as shown in the equation (5).

コマンド判定部１１５は、音声認識の結果とコマンドリスト部１１４に保持されているコマンド一覧を比較し、音声認識の結果が「切断コマンド」の一覧に存在するか否かの判定を行う。そして、コマンド判定部１１５は、コマンドリストにある「切断コマンド」が音声認識された場合（例えば、「さようなら」など）のみ、判定結果を出力切替え部１１６、及び接続判定部１１８に出力する（例えば、通話中に「切断コマンド」が音声認識された場合は判定結果を２、それ以外は０など）。接続判定部１１８は、音声認識部１１３による音声認識結果及びコマンド判定部１１５に基づくコマンド判定結果に基づいて、切断判定を行い、ＮＷ通信部１０５に相手側のＮＷ通信部と切断する信号を接続判定結果出力端子１１９に出力する。 The command determination unit 115 compares the voice recognition result with the command list held in the command list unit 114 and determines whether or not the voice recognition result exists in the list of "disconnect commands". Only when a "disconnect command" in the command list (for example, "goodbye") is voice-recognized does the command determination unit 115 output the determination result to the output switching unit 116 and the connection determination unit 118 (for example, the determination result is 2 when a "disconnect command" is recognized during a call, and 0 otherwise). The connection determination unit 118 makes a disconnection determination based on the voice recognition result from the voice recognition unit 113 and the command determination result from the command determination unit 115, and outputs to the connection determination result output terminal 119 a signal instructing the NW communication unit 105 to disconnect from the NW communication unit on the other side.

ＮＷ通信部１０５は、接続判定結果出力端子１１９を介して出力された接続判定結果に基づき、相手のＮＷ通信部との切断処理を行う。 The NW communication unit 105 performs disconnection processing with the other party's NW communication unit based on the connection determination result output via the connection determination result output terminal 119.

一方、出力切替え部１１６は、相手と接続されてからは、コマンド判定部１１５で音声認識部１１３の音声認識の結果がコマンドリスト部１１４の切断コマンド一覧に存在しないと判定された場合には、マイクアレイ処理信号を出力端子１１７に出力し続け、コマンド判定部１１５で音声認識部１１３の音声認識の結果がコマンドリスト部１１４の切断コマンド一覧に存在すると判定された場合には、（３）式に示すように、無音信号を出力信号ｙ（ｎ）として出力端子１１７に出力する。 On the other hand, if the command determination unit 115 determines that the voice recognition result of the voice recognition unit 113 does not exist in the disconnect command list of the command list unit 114 after the output switching unit 116 is connected to the other party, When the microphone array processing signal is continuously output to the output terminal 117 and the command determination unit 115 determines that the result of the voice recognition of the voice recognition unit 113 exists in the disconnect command list of the command list unit 114, the equation (3) is used. As shown in the above, a silent signal is output to the output terminal 117 as an output signal y (n).
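The command determination and the resulting output behaviour during a call can be sketched as follows. The result codes 1, 2 and 0 follow the examples given in the text; the command lists, the known names, the word-level input format, and the function names are hypothetical choices for illustration.

```python
CONNECT_COMMANDS = {"hello"}       # hypothetical entries of the command list
DISCONNECT_COMMANDS = {"goodbye"}
KNOWN_NAMES = {"tanaka", "suzuki"}


def determine(words, in_call):
    """Return 1 for a name followed by a connection command (when idle),
    2 for a disconnect command (during a call), and 0 otherwise."""
    words = [w.lower() for w in words]
    if in_call:
        return 2 if any(w in DISCONNECT_COMMANDS for w in words) else 0
    for first, second in zip(words, words[1:]):
        if first in KNOWN_NAMES and second in CONNECT_COMMANDS:
            return 1
    return 0


def output_sample(result, in_call, live):
    """During a call, pass the live sample through until a disconnect is detected."""
    if in_call and result != 2:
        return live
    return 0.0  # silent signal
```

The sketch ignores the buffered playback phase that follows a connection command, which is handled by the output switching unit.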

（Ａ－３）第１の実施形態の効果
以上のように、第１の実施形態によれば、遠隔通話装置１００は、複数のマイクを使用して音声を強調する信号処理を行い、信号処理した信号を一度オーディオバッファに保持し、同時に信号処理した信号に対して音声認識を行い、その音声認識の結果が呼びかけ音声か否かを判定し、呼びかけ音声の場合には、通話相手に接続してからバッファに保持している呼びかけ音声を出力することで、呼びかけ音声が相手に伝わってから会話を開始し、会話が開始されてから、信号処理した信号に対して音声認識を行い、その音声認識の結果が切断音声か否かを判定し、切断音声の場合には、切断することにより対面での会話に近い状態を再現でき、高い臨場感で会話を開始することができる。 (A-3) Effect of First Embodiment As described above, according to the first embodiment, the remote communication device 100 performs signal processing that emphasizes voice using a plurality of microphones, holds the processed signal in the audio buffer, and at the same time performs voice recognition on the processed signal to determine whether the recognition result is a call voice. If it is, the device connects to the other party and then outputs the call voice held in the buffer, so that the conversation starts after the call voice has reached the other party. After the conversation has started, the device performs voice recognition on the processed signal to determine whether the recognition result is a disconnect voice and, if it is, disconnects. In this way a state close to a face-to-face conversation can be reproduced, and a conversation can be started with a high sense of presence.

また、第１の実施形態の遠隔通話装置１００は、使用環境の雑音が大きい環境においても、呼びかけ音声の収音は複数のマイク（マイクアレイ）を使用し、また音声を強調する信号処理を行っているため、音声認識部１１３において精度良い音声認識や通話を行うことができる。 Further, the remote communication device 100 of the first embodiment uses a plurality of microphones (microphone arrays) for picking up the call voice even in an environment where the usage environment is noisy, and performs signal processing for emphasizing the voice. Therefore, the voice recognition unit 113 can perform accurate voice recognition and a call.

（Ｂ）第２の実施形態
次に、本発明の遠隔通話装置、遠隔通話プログラム、及び遠隔通話方法の第２の実施形態を、図面を参照しながら詳細に説明する。 (B) Second Embodiment Next, a second embodiment of the remote call device, the remote call program, and the remote call method of the present invention will be described in detail with reference to the drawings.

第２の実施形態は、本発明の遠隔通話装置の音出力方法が、第１の実施形態と異なっている場合を例示する。 The second embodiment exemplifies the case where the sound output method of the remote communication device of the present invention is different from the first embodiment.

（Ｂ－１）第２の実施形態の構成
図４は、第２の実施形態に係る遠隔通話装置２００の構成を示すブロック図である。 (B-1) Configuration of the Second Embodiment FIG. 4 is a block diagram showing the configuration of the remote communication device 200 according to the second embodiment.

第２の実施形態の遠隔通話装置２００は、距離推定部２０２、アンプ制御部２０３、デジタルアンプ部２０４を構成要素とする点が第１の実施形態の遠隔通話装置１００と異なる。それ以外の構成要素は、第１の実施形態に係る図１の遠隔通話装置１００の構成要素と同一、又は対応するものである。なお、図４において、第１の実施形態に係る遠隔通話装置１００の構成要素と同一、又は対応するものについては同一の符号を付している。 The remote communication device 200 of the second embodiment is different from the remote communication device 100 of the first embodiment in that the distance estimation unit 202, the amplifier control unit 203, and the digital amplifier unit 204 are components. The other components are the same as or correspond to the components of the remote communication device 100 of FIG. 1 according to the first embodiment. In FIG. 4, the same or corresponding components of the remote communication device 100 according to the first embodiment are designated by the same reference numerals.

また、第１の実施形態と同一、又は対応する構成要素の詳細な説明は重複するため、ここでは省略する。 Further, since the detailed description of the same or corresponding component as that of the first embodiment is duplicated, it is omitted here.

図４において、本発明の第２の実施形態に係る遠隔通話装置２００は、マイクアレイ１０１、マイクアンプ１０２、ＡＤ変換器１０３、呼びかけ処理部２０１、ＮＷ通信部１０５、ネットワーク１０６、ＤＡ変換器１０７、スピーカアンプ１０８、スピーカ１０９、ビデオカメラ１２０、及びモニター１２１を有する。 In FIG. 4, the remote communication device 200 according to the second embodiment of the present invention includes a microphone array 101, a microphone amplifier 102, an AD converter 103, a call processing unit 201, a NW communication unit 105, a network 106, and a DA converter 107. It has a speaker amplifier 108, a speaker 109, a video camera 120, and a monitor 121.

また、呼びかけ処理部２０１は、入力端子１１０、距離推定部２０２、アンプ制御部２０３、デジタルアンプ部２０４、信号処理部１１１、オーディオバッファ部１１２、音声認識部１１３、コマンドリスト部１１４、コマンド判定部１１５、出力切替え部１１６、出力端子１１７、接続判定部１１８、及び接続判定結果出力端子１１９を有する。 Further, the call processing unit 201 includes an input terminal 110, a distance estimation unit 202, an amplifier control unit 203, a digital amplifier unit 204, a signal processing unit 111, an audio buffer unit 112, a voice recognition unit 113, a command list unit 114, and a command determination unit. It has 115, an output switching unit 116, an output terminal 117, a connection determination unit 118, and a connection determination result output terminal 119.

距離推定部２０２は、複数のマイク入力信号から使用者の方向や位置、距離などの位置情報を推定し、推定した位置情報を出力する。 The distance estimation unit 202 estimates position information such as the direction, position, and distance of the user from a plurality of microphone input signals, and outputs the estimated position information.

アンプ制御部２０３は、推定した位置情報から複数のマイク入力信号を増幅するためのアンプ値を算出し出力する。 The amplifier control unit 203 calculates and outputs an amplifier value for amplifying a plurality of microphone input signals from the estimated position information.

デジタルアンプ部２０４は、アンプ制御部２０３から出力されたアンプ値に基づいて複数のマイク入力信号を増幅する。 The digital amplifier unit 204 amplifies a plurality of microphone input signals based on the amplifier value output from the amplifier control unit 203.

（Ｂ－２）第２の実施形態の動作
第２の実施形態に係る遠隔通話装置２００における音声処理の基本的な動作は、第１の実施形態で説明した音声処理と同様である。 (B-2) Operation of the Second Embodiment The basic operation of the voice processing in the remote communication device 200 according to the second embodiment is the same as the voice processing described in the first embodiment.

以下では、第１の実施形態と異なる点である距離推定部２０２、アンプ制御部２０３、及びデジタルアンプ部２０４における処理動作を中心に詳細に説明する。 Hereinafter, the processing operations in the distance estimation unit 202, the amplifier control unit 203, and the digital amplifier unit 204, which are different from the first embodiment, will be described in detail.

距離推定部２０２は、複数のマイク入力信号から使用者の方向や位置などの位置情報を推定する。使用者の位置情報の推定の手法は、種々の方法を広く適用することができ、例えば、複数のマイク入力信号の内２つのマイク入力信号を使用して相互相関関数を算出し、相互相関関数から使用者の方向を推定しても良い。また、複数のマイク入力信号の内４つのマイク入力信号を抽出しその中から２つのマイク入力信号を使用して２つの相互相関関数を算出し、２つの相互相関関数から２つの方向と各マイクの位置から使用者の位置を算出しても良い。距離推定部２０２は、推定した位置情報をアンプ制御部２０３に出力する。 The distance estimation unit 202 estimates position information such as the direction and position of the user from a plurality of microphone input signals. Various methods can be applied to estimate the user's position information. For example, a cross-correlation function may be calculated from two of the microphone input signals, and the user's direction estimated from that cross-correlation function. Alternatively, four of the microphone input signals may be extracted, two cross-correlation functions calculated from them using two microphone input signals each, and the user's position calculated from the two directions obtained from the two cross-correlation functions and the position of each microphone. The distance estimation unit 202 outputs the estimated position information to the amplifier control unit 203.
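The two-microphone case above can be sketched as a time-delay estimate followed by a far-field angle calculation. This is one common way to turn a cross-correlation peak into a direction and is an assumption here, as the source does not give the formula; the function names, the sign convention, and the default speed of sound are also illustrative choices.

```python
import math


def estimate_delay(a, b, max_lag):
    """Lag in samples that maximises sum_n a[n] * b[n + lag].

    A positive result means signal b is delayed relative to signal a.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, sample in enumerate(a):
            m = n + lag
            if 0 <= m < len(b):
                score += sample * b[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def direction_deg(lag, sample_rate, mic_spacing, speed_of_sound=343.0):
    """Far-field arrival angle measured from the array broadside, in degrees."""
    ratio = speed_of_sound * lag / (sample_rate * mic_spacing)
    ratio = max(-1.0, min(1.0, ratio))  # guard against rounding outside asin's domain
    return math.degrees(math.asin(ratio))
```

With two such directions from two microphone pairs at known positions, the user's position could be obtained by intersecting the two bearing lines, which corresponds to the four-microphone variant mentioned in the text.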

アンプ制御部２０３は、距離推定部２０２から出力される位置推定情報からデジタルアンプの値を決定する。アンプ値の決定の手法は、種々の方法を広く適用することができ、例えば、以下の（１０）式のように距離に応じて決定する方法がある。

The amplifier control unit 203 determines the value of the digital amplifier from the position estimation information output from the distance estimation unit 202. Various methods can be widely applied to the method of determining the amplifier value, and for example, there is a method of determining according to the distance as in the following equation (10).

上記（１０）式のＡＭＰはアンプ値、ｋはマイクアレイからの距離（単位はメートル）である。また、マイクアレイからの距離によってＡＭＰ値が連続的に決定するようにしても良い。アンプ制御部２０３は、決定したアンプ値をデジタルアンプ部２０４に出力する。 The AMP of the above equation (10) is an amplifier value, and k is a distance (unit: meters) from the microphone array. Further, the AMP value may be continuously determined by the distance from the microphone array. The amplifier control unit 203 outputs the determined amplifier value to the digital amplifier unit 204.
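The two ways of choosing the amplifier value mentioned above, in steps by distance or continuously, can be sketched as follows. The thresholds, gains, reference distance, and upper limit are hypothetical values chosen for illustration and are not the values of equation (10), which is not reproduced here.

```python
def amp_stepwise(k):
    """Amplifier value chosen in steps from the distance k in meters (hypothetical steps)."""
    if k < 1.0:
        return 1.0
    if k < 2.0:
        return 2.0
    return 4.0


def amp_continuous(k, reference=1.0, max_amp=8.0):
    """Amplifier value growing in proportion to distance beyond a reference distance.

    The gain is clamped to [1.0, max_amp] so that nearby speech is not attenuated
    and distant noise is not amplified without limit.
    """
    return min(max_amp, max(1.0, k / reference))
```

The proportional form reflects the roughly inverse relationship between sound pressure and distance in free field; a real room would likely need tuning.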

デジタルアンプ部２０４は、以下の（１１）式に示すように、アンプ制御部２０３から出力されたアンプ値とマイク入力信号を乗算する。 As shown in the following equation (11), the digital amplifier unit 204 multiplies the microphone input signal by the amplifier value output from the amplifier control unit 203.
x''(m, n) = AMP · x(m, n) … (11)

デジタルアンプ部２０４は、乗算して増幅されたマイク入力信号ｘ’’（ｍ，ｎ）を信号処理部１１１に出力する。 The digital amplifier unit 204 outputs the multiplied and amplified microphone input signal x ″ (m, n) to the signal processing unit 111.

信号処理部１１１は、増幅されたマイク入力信号ｘ’’（ｍ，ｎ）を用いて、先述のマイクアレイ処理を行い、指向性処理や音源を分離する音源分離処理をする。 The signal processing unit 111 performs the above-mentioned microphone array processing using the amplified microphone input signal x ″ (m, n), and performs directivity processing and sound source separation processing for separating sound sources.

（Ｂ－３）第２の実施形態の効果
以上のように、第２の実施形態によれば、遠隔通話装置２００は、使用者がモニター１２１から離れた箇所で呼びかけた場合でも、マイク入力信号から使用者の距離や位置、方向を推定してデジタルアンプを制御し、適切な音量にすることで適切な音量で音声を伝えることができる。 (B-3) Effect of Second Embodiment As described above, according to the second embodiment, even when the user calls from a place away from the monitor 121, the remote communication device 200 estimates the user's distance, position, and direction from the microphone input signals and controls the digital amplifier accordingly, so that the voice can be transmitted at an appropriate volume.

（Ｃ）第３の実施形態
次に、本発明の音声処理装置、音声処理プログラム、及び音声処理方法の第３の実施形態を、図面を参照しながら詳細に説明する。 (C) Third Embodiment Next, a third embodiment of the voice processing device, the voice processing program, and the voice processing method of the present invention will be described in detail with reference to the drawings.

第３の実施形態は、本発明の遠隔通話装置の距離推定方法が、第２の実施形態と異なっている場合を例示する。 The third embodiment exemplifies the case where the distance estimation method of the remote communication device of the present invention is different from the second embodiment.

（Ｃ－１）第３の実施形態の構成
図５は、第３の実施形態に係る遠隔通話装置の構成を示すブロック図である。 (C-1) Configuration of Third Embodiment FIG. 5 is a block diagram showing a configuration of a remote communication device according to a third embodiment.

第３の実施形態の遠隔通話装置３００は、第２の実施形態の遠隔通話装置２００構成の距離推定部２０２の入力が複数のマイク入力信号ではなく、距離センサー３０２とする点が第２の実施形態の遠隔通話装置２００と異なる。なお、図５において、第２の実施形態に係る遠隔通話装置２００の構成要素と同一、又は対応するものについては同一の符号を付している。 The remote communication device 300 of the third embodiment differs from the remote communication device 200 of the second embodiment in that the input to the distance estimation unit 202 in the configuration of the remote communication device 200 is not a plurality of microphone input signals but the output of a distance sensor 302. In FIG. 5, components that are the same as or correspond to those of the remote communication device 200 according to the second embodiment are designated by the same reference numerals.

また、第２の実施形態と同一、又は対応する構成要素の詳細な説明は重複するため、ここでは省略する。 Further, since the detailed description of the components that are the same as or correspond to the second embodiment is duplicated, they will be omitted here.

図５において、本発明の第３の実施形態に係る遠隔通話装置３００は、マイクアレイ１０１、マイクアンプ１０２、ＡＤ変換器１０３、呼びかけ処理部３０１、ＮＷ通信部１０５、ネットワーク１０６、ＤＡ変換器１０７、スピーカアンプ１０８、スピーカ１０９、ビデオカメラ１２０、及びモニター１２１を有する。 In FIG. 5, the remote communication device 300 according to the third embodiment of the present invention includes a microphone array 101, a microphone amplifier 102, an AD converter 103, a call processing unit 301, a NW communication unit 105, a network 106, and a DA converter 107. It has a speaker amplifier 108, a speaker 109, a video camera 120, and a monitor 121.

また、呼びかけ処理部３０１は、入力端子１１０、距離センサー３０２、距離センサー入力端子３０３、距離推定部３０４、アンプ制御部２０３、デジタルアンプ部２０４、信号処理部１１１、オーディオバッファ部１１２、音声認識部１１３、コマンドリスト部１１４、コマンド判定部１１５、出力切替え部１１６、出力端子１１７、接続判定部１１８、及び接続判定結果出力端子１１９を有する。 Further, the call processing unit 301 includes an input terminal 110, a distance sensor 302, a distance sensor input terminal 303, a distance estimation unit 304, an amplifier control unit 203, a digital amplifier unit 204, a signal processing unit 111, an audio buffer unit 112, and a voice recognition unit. It has 113, a command list unit 114, a command determination unit 115, an output switching unit 116, an output terminal 117, a connection determination unit 118, and a connection determination result output terminal 119.

距離センサー３０２は、使用者１５２の距離や位置、方向を検出するセンサーである。 The distance sensor 302 is a sensor that detects the distance, position, and direction of the user 152.

距離センサー入力端子３０３は、距離センサー３０２のセンサー入力信号を呼びかけ処理部４０１に入力するインタフェースである。 The distance sensor input terminal 303 is an interface that inputs the sensor input signal of the distance sensor 302 to the call processing unit 401.

距離推定部３０４は、距離センサー入力端子３０３を介して入力される距離センサーの入力信号から、使用者１５２の位置情報を推定し、推定した位置情報を出力する。 The distance estimation unit 304 estimates the position information of the user 152 from the input signal of the distance sensor input via the distance sensor input terminal 303, and outputs the estimated position information.

（Ｃ－２）第３の実施形態の動作
第３の実施形態に係る遠隔通話装置３００における音声処理の基本的な動作は、第１の実施形態、及び第２の実施形態で説明した呼びかけ処理と同様である。 (C-2) Operation of the Third Embodiment The basic operation of the voice processing in the remote communication device 300 according to the third embodiment is the call processing described in the first embodiment and the second embodiment. Is similar to.

以下では、第１の実施形態、及び第２の実施形態と異なる点である距離センサー３０２、距離センサー入力端子３０３、距離推定部３０４における処理動作を中心に詳細に説明する。 Hereinafter, the processing operations in the distance sensor 302, the distance sensor input terminal 303, and the distance estimation unit 304, which are different from the first embodiment and the second embodiment, will be described in detail.

距離センサー３０２は、センサーを使用して使用者の方向や位置などの位置情報を推定する。センサーは、種々のセンサーを広く適用することができ、例えば、赤外線センサーを使用しても良いし、超音波センサーを使用しても良い。距離推定部３０４は、センサー入力信号が距離センサー入力端子３０３に入力されると処理を開始する。 The distance sensor 302 uses the sensor to estimate position information such as the direction and position of the user. As the sensor, various sensors can be widely applied, and for example, an infrared sensor or an ultrasonic sensor may be used. The distance estimation unit 304 starts processing when the sensor input signal is input to the distance sensor input terminal 303.

距離センサー入力端子４０３に信号が入力され始めると、距離推定部３０４は、距離センサーの入力信号に基づき使用者１５２からマイクアレイ１０１までの距離や方向、位置を推定する。 When a signal starts to be input to the distance sensor input terminal 403, the distance estimation unit 304 estimates the distance, direction, and position from the user 152 to the microphone array 101 based on the input signal of the distance sensor.
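As one illustration of the distance estimation from a sensor signal, the sketch below assumes an ultrasonic sensor that reports the round-trip echo time in seconds. The sensor interface is not specified in the source, so the input format, the function names, and the median smoothing step are all assumptions.

```python
import statistics


def distance_from_echo(round_trip_s, speed_of_sound=343.0):
    """Distance in meters from an ultrasonic round-trip time in seconds."""
    if round_trip_s < 0:
        raise ValueError("round-trip time must be non-negative")
    # The pulse travels to the user and back, so halve the path length.
    return speed_of_sound * round_trip_s / 2.0


def smoothed_distance(round_trips_s):
    """Median of several readings, to reduce the effect of a single outlier."""
    return statistics.median(distance_from_echo(t) for t in round_trips_s)
```

The resulting distance would then be passed to the amplifier control in the same way as the microphone-based estimate of the second embodiment.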

アンプ制御部２０３は、距離推定部２０２から出力される位置推定情報からデジタルアンプの値を決定する。アンプ値の決定の手法は、種々の方法を広く適用することができ、例えば、上記（１０）式のように距離に応じて決定する方法がある。また、マイクアレイからの距離によってＡＭＰ値が連続的に決定するようにしても良い。アンプ制御部２０３は、決定したアンプ値をデジタルアンプ部２０４に出力する。 The amplifier control unit 203 determines the value of the digital amplifier from the position estimation information output from the distance estimation unit 202. Various methods can be widely applied to the method of determining the amplifier value, and for example, there is a method of determining according to the distance as in the above equation (10). Further, the AMP value may be continuously determined by the distance from the microphone array. The amplifier control unit 203 outputs the determined amplifier value to the digital amplifier unit 204.

（Ｃ－３）第３の実施形態の効果
以上のように、第３の実施形態によれば、距離センサーを設けたことにより、使用者の距離を正確に計測して適切なアンプ値を増幅することで、使用者の通話する位置によってマイク入力信号が小さくならず、適切な音量にすることで適切な音量で音声を伝えることができる。 (C-3) Effect of Third Embodiment As described above, according to the third embodiment, providing the distance sensor makes it possible to measure the user's distance accurately and amplify with an appropriate amplifier value. The microphone input signal therefore does not become small depending on where the user is speaking, and the voice can be transmitted at an appropriate volume.

（Ｄ）他の実施形態
上述した各実施形態においても、種々の変形実施形態を説明したが、本発明は以下の変形実施形態についても適用することができる。 (D) Other Embodiments Although various modified embodiments have been described in each of the above-described embodiments, the present invention can also be applied to the following modified embodiments.

（Ｄ－１）上述した各実施形態で説明した遠隔通話装置は、例えば、電話会議で通話を開始するときに、音声の入力によるコマンドで通話を開始する装置に搭載されるようにしても良い。 (D-1) The remote communication device described in each of the above-described embodiments may be mounted on a device that starts a call by a command by voice input when starting a call in a conference call, for example. ..

（Ｄ－２）上述した各実施形態で説明した遠隔通話装置の、呼びかけ処理部やＮＷ通信部はネットワーク上の処理装置（例えば、サーバなど）で処理されるようにしても良い。 (D-2) The call processing unit and the NW communication unit of the remote communication device described in each of the above-described embodiments may be processed by a processing device (for example, a server) on the network.

（Ｄ－３）上述した各実施形態で説明した遠隔通話装置では、マイクアレイ１０１は、図２で示したようにモニター１２１の前方に配置される例を示したが、配置される例はこれに限らない。例えば、マイクアレイ１０１は、モニター１２１の上部又は側面に配置されても良い。また、遠隔通話装置がプロジェクターとスクリーンを備えている場合、プロジェクターからの投影映像を結像させるためのスクリーンをモニター１２１の代替えとしても良い。このスクリーンは普通のスクリーンでも良いし、音を透過するスクリーンでも良い。音を透過するスクリーンの場合、マイクアレイ１０１は、スクリーンの後方に配置しても良い。 (D-3) In the remote communication device described in each of the above-described embodiments, the microphone array 101 is arranged in front of the monitor 121 as shown in FIG. 2, but this is the example in which the microphone array 101 is arranged. Not limited to. For example, the microphone array 101 may be arranged on the upper part or the side surface of the monitor 121. Further, when the remote communication device includes a projector and a screen, a screen for forming an image of a projected image from the projector may be used as an alternative to the monitor 121. This screen may be an ordinary screen or a screen that transmits sound. In the case of a screen that transmits sound, the microphone array 101 may be arranged behind the screen.

１００…遠隔通話装置、１０１…マイクアレイ、１０２…マイクアンプ、１０３…ＡＤ変換器、１０４…呼びかけ処理部、１０５…ＮＷ通信部、１０６…ネットワーク、１０７…ＤＡ変換器、１０８…スピーカアンプ、１０９…スピーカ、１１０…入力端子、１１１…信号呼びかけ処理部、１１２…オーディオバッファ部、１１３…音声認識部、１１４…コマンドリスト部、１１５…コマンド判定部、１１６…出力切替え部、１１７…出力端子、１１８…接続判定部、１１９…接続判定結果出力端子、１２０…ビデオカメラ、１２１…モニター、１５１…部屋、１５２…使用者、２００…遠隔通話装置、２０１…呼びかけ処理部、２０２…距離推定部、２０３…アンプ制御部、２０４…デジタルアンプ部、３００…遠隔通話装置、３０１…呼びかけ処理部、３０２…距離センサー、３０３…距離センサー入力端子、３０４…距離推定部、４０１…呼びかけ処理部、４０２…距離推定部、４０３…距灘センサー入力端子。 100 ... remote communication device, 101 ... microphone array, 102 ... microphone amplifier, 103 ... AD converter, 104 ... call processing unit, 105 ... NW communication unit, 106 ... network, 107 ... DA converter, 108 ... speaker amplifier, 109 ... Speaker, 110 ... Input terminal, 111 ... Signal call processing unit, 112 ... Audio buffer unit, 113 ... Voice recognition unit, 114 ... Command list unit, 115 ... Command determination unit, 116 ... Output switching unit, 117 ... Output terminal, 118 ... Connection determination unit, 119 ... Connection determination result output terminal, 120 ... Video camera, 121 ... Monitor, 151 ... Room, 152 ... User, 200 ... Remote communication device, 201 ... Call processing unit, 202 ... Distance estimation unit, 203 ... Amplifier control unit, 204 ... Digital amplifier unit, 300 ... Remote communication device, 301 ... Call processing unit, 302 ... Distance sensor, 303 ... Distance sensor input terminal, 304 ... Distance estimation unit, 401 ... Call processing unit, 402 ... Distance estimation unit, 403 ... Distance sensor input terminal.

Claims

A remote communication device comprising:
a signal processing unit that performs predetermined signal processing using the input signals of a plurality of microphones that collect sound;
an audio buffer unit that holds the processed signal output from the signal processing unit for a certain period of time;
a voice recognition unit that performs voice recognition on the processed signal;
a command determination unit that uses the result of the voice recognition unit to determine whether or not the processed signal is a connection command voice;
a connection determination unit that determines the connection destination when the command determination unit determines that the processed signal is the connection command voice; and
an output switching unit that normally outputs the processed signal output from the signal processing unit to a network communication unit that transmits it to a remote call partner, that, when the command determination unit determines that the processed signal is the connection command voice, outputs the processed signal held in the audio buffer unit to the network communication unit after the connection to the call partner's terminal has been made, and that, after outputting the processed signal held in the audio buffer unit, switches back so that the processed signal output from the signal processing unit is again output to the network communication unit.

The remote communication device according to claim 1, further comprising:
a distance estimation unit that estimates the distance to the user who uses the remote communication device;
a microphone amplifier control unit that determines an amplification value for the input signals of the plurality of microphones according to the distance estimated by the distance estimation unit; and
a digital amplifier unit that amplifies the input signals of the plurality of microphones by multiplying them by the amplification value determined by the microphone amplifier control unit.

A remote call program that causes a computer to function as:
a signal processing unit that performs predetermined signal processing using the input signals of a plurality of microphones that collect sound;
an audio buffer unit that holds the processed signal output from the signal processing unit for a certain period of time;
a voice recognition unit that performs voice recognition on the processed signal;
a command determination unit that uses the result of the voice recognition unit to determine whether or not the processed signal is a connection command voice;
a connection determination unit that determines the connection destination when the command determination unit determines that the processed signal is the connection command voice; and
an output switching unit that normally outputs the processed signal output from the signal processing unit to a network communication unit that transmits it to a remote call partner, that, when the command determination unit determines that the processed signal is the connection command voice, outputs the processed signal held in the audio buffer unit to the network communication unit after the connection to the call partner's terminal has been made, and that, after outputting the processed signal held in the audio buffer unit, switches back so that the processed signal output from the signal processing unit is again output to the network communication unit.

A remote call method used in a remote communication device that has a signal processing unit, an audio buffer unit, a voice recognition unit, a command determination unit, a connection determination unit, and an output switching unit, wherein:
the signal processing unit performs predetermined signal processing using the input signals of a plurality of microphones that collect sound;
the audio buffer unit holds the processed signal output from the signal processing unit for a certain period of time;
the voice recognition unit performs voice recognition on the processed signal;
the command determination unit uses the result of the voice recognition unit to determine whether or not the processed signal is a connection command voice;
the connection determination unit determines the connection destination when the command determination unit determines that the processed signal is the connection command voice; and
the output switching unit normally outputs the processed signal output from the signal processing unit to a network communication unit that transmits it to a remote call partner, outputs, when the command determination unit determines that the processed signal is the connection command voice, the processed signal held in the audio buffer unit to the network communication unit after the connection to the call partner's terminal has been made, and, after outputting the processed signal held in the audio buffer unit, switches back so that the processed signal output from the signal processing unit is again output to the network communication unit.