JP6464465B2 - Conversation support device, method for controlling conversation support device, and program for conversation support device

Info

Publication number: JP6464465B2
Application number: JP2017042240A
Authority: JP
Inventors: 一博中臺; 圭佑中村
Original assignee: Honda Motor Co Ltd
Current assignee: Honda Motor Co Ltd
Priority date: 2017-03-06
Filing date: 2017-03-06
Publication date: 2019-02-06
Anticipated expiration: 2033-11-29
Also published as: JP2017129873A

Description

The present invention relates to a conversation support device, a method for controlling the conversation support device, and a program for the conversation support device.

Patent Document 1 discloses a hearing aid device that assists hearing visually by recognizing the speech in a conversation and displaying it as text. In that device, a speech recognition unit recognizes the speech collected by a microphone, and text corresponding to the recognized content is shown on a display means. The speaker uses a transmitter and the listener uses a receiver. The transmitter includes a microphone, a speech recognition circuit, a transmission unit, and the like, and the transmission unit sends the text information corresponding to the recognized content to the receiver. The receiver includes a reception unit, a CPU (central processing unit), a display, and the like, and shows the text on the display when it receives text information from the transmitter.

Patent Document 1: Japanese Unexamined Patent Application Publication No. H9-206329 (JP-A-9-206329)

However, the technique described in Patent Document 1 assumes that the speaker and the listener each use a hearing aid device. Consequently, when several voices are mixed together at the input of a microphone, it is difficult to recognize each voice.

The present invention has been made in view of the above problem. Its object is to provide a conversation support device, a method for controlling the conversation support device, and a program for the conversation support device that can recognize each voice and assist hearing even when there are multiple speakers.

(1) To achieve the above object, a conversation support device according to one aspect of the present invention includes: a voice input unit that inputs voice signals of two or more users; a speech recognition unit that recognizes the voice signals input to the voice input unit; a sound source estimation unit that estimates the sound source direction of the voice signals input to the voice input unit; a display unit on which the recognition results of the speech recognition unit are displayed; and an image processing unit that sets a display area for each user in the image display area of the display unit and, when voice is detected from a direction different from that of any speaker whose sound source has already been localized, determines that a new speaker has joined the conference and displays that speaker's utterance content between the text display frames of the adjacent, already recognized speakers.
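The new-speaker handling in (1) can be illustrated with a minimal sketch. It assumes that directions are given in degrees and that a direction counts as "different" when it lies more than a tolerance away from every known speaker. The function names and the `tolerance` value are hypothetical and do not come from the patent. Keeping the known directions sorted means a new speaker lands between the two neighbouring speakers, which is where a new text display frame would be placed.

```python
def angular_gap(a, b):
    """Smallest absolute difference between two directions, in degrees."""
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)


def register_direction(known, direction, tolerance=20.0):
    """Return (updated_directions, is_new_speaker).

    `known` is a list of already localized speaker directions in degrees.
    `tolerance` is a hypothetical threshold, not a value from the source.
    """
    direction %= 360.0
    if any(angular_gap(direction, d) <= tolerance for d in known):
        return list(known), False           # matches an existing speaker
    updated = sorted(known + [direction])   # lands between its two neighbours
    return updated, True


speakers = [0.0, 180.0]                     # two speakers facing each other
speakers, is_new = register_direction(speakers, 90.0)
print(speakers, is_new)                     # [0.0, 90.0, 180.0] True
```

A real device would also have to decide when a speaker has left and how to treat noisy direction estimates. The sketch leaves both out.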

(2) In the conversation support device according to one aspect of the present invention, the image processing unit may make at least one of the following differ from user to user: the display color of the display area corresponding to each user, its pattern, an icon displayed in the display area, and an avatar displayed in the display area.

(3) In the conversation support device according to one aspect of the present invention, the image processing unit may display an image based on the sound source direction estimated by the sound source estimation unit in the display area of the display unit corresponding to each user.

(4) The conversation support device according to one aspect of the present invention may include a sound source separation unit that separates the voice signals input to the voice input unit by user. The image processing unit may then display, in the display area of the display unit corresponding to each user, the recognition results for the separated voice signals of the users other than the user to whom that display area corresponds.

(5) The conversation support device according to one aspect of the present invention may include a position estimation unit that estimates the positions of the users. The image processing unit may then set or rearrange the display area corresponding to each user within the image display area of the display unit, at a position that depends on the user's position estimated by the position estimation unit.

(6) In the conversation support device according to one aspect of the present invention, the position estimation unit may estimate the positions of the users using the voice signals input to the voice input unit.

(7) The conversation support device according to one aspect of the present invention may include a translation unit that translates the recognition results of the speech recognition unit. The image processing unit may then display the translation results produced by the translation unit in the display area of the display unit corresponding to each user.

(8) The conversation support device according to one aspect of the present invention may include a language information detection unit that detects the language spoken by each user. The translation unit may then translate the recognition results of the users other than the user to whom a display area corresponds into the language detected by the language information detection unit.

(9) The conversation support device according to one aspect of the present invention may include a communication unit that communicates with another conversation support device. The voice input unit may input a voice signal that the communication unit has received from the other conversation support device, and the speech recognition unit may recognize, among the voice signals input from the voice input unit, the voice signals of users other than the user to whom the display area corresponds.

(10) The conversation support device according to one aspect of the present invention may include an input unit for selecting part of the image displayed on the display unit. When the part of the image selected through the input unit is a recognition result, the image processing unit may display other recognition candidates corresponding to the selected recognition result on the display unit, correct the recognition result to the candidate selected from among them through the input unit, and transmit the corrected recognition result to the other conversation support device via the communication unit.

(11) To achieve the above object, a method for controlling a conversation support device according to one aspect of the present invention includes: a voice input procedure in which a voice input unit inputs voice signals of two or more users; a speech recognition procedure in which a speech recognition unit recognizes the voice signals input by the voice input procedure; a sound source estimation procedure in which a sound source estimation unit estimates the sound source direction of the voice signals input by the voice input procedure; and an image processing procedure in which an image processing unit sets a display area for each user in the image display area of a display unit on which the recognition results of the speech recognition procedure are displayed and, when voice is detected from a direction different from that of any speaker whose sound source has already been localized, determines that a new speaker has joined the conference and displays that speaker's utterance content between the text display frames of the adjacent, already recognized speakers.

(12) To achieve the above object, a program for a conversation support device according to one aspect of the present invention causes a computer of the conversation support device to execute: a voice input procedure of inputting voice signals of two or more users; a speech recognition procedure of recognizing the voice signals input by the voice input procedure; a sound source estimation procedure of estimating the sound source direction of the voice signals input by the voice input procedure; and an image processing procedure of setting a display area for each user in the image display area of a display unit on which the recognition results of the speech recognition procedure are displayed and, when voice is detected from a direction different from that of any speaker whose sound source has already been localized, determining that a new speaker has joined the conference and displaying that speaker's utterance content between the text display frames of the adjacent, already recognized speakers.

According to configuration (1), (11), or (12) described above, each voice can be recognized and hearing can be assisted even when there are multiple speakers. In addition, when the number of participants increases, the user can see at a glance between which speakers the new speaker has joined.
According to configuration (1), (11), or (12) of the present invention, the recognition results are easier for the user to see, which improves convenience for the user.
According to aspect (3) of the present invention, each user can easily identify his or her own display area.
According to aspect (4) of the present invention, the direction of each speaker can be estimated and the utterances of each speaker can be separated with high accuracy. Moreover, because the other speakers can visually check an accurate rendering of their counterpart's utterance on the conversation support device, the speakers' hearing can be assisted.
According to aspects (5) and (6) of the present invention, the display position is placed closest to each speaker, so the text data (recognition results) of the other speakers' utterances is easier for that speaker to read.

According to aspects (7) and (8) of the present invention, the translation results produced by the translation unit are displayed in the display area of the display unit (image display unit 15) corresponding to each user. The other speakers can therefore visually check their counterpart's utterance on the conversation support device, and the speakers' hearing can be assisted.
According to aspect (9) of the present invention, speech recognition can be performed using multiple conversation support devices.
According to aspect (10) of the present invention, a user's utterance content can be presented correctly to the other users.

FIG. 1 is a block diagram showing the configuration of the conversation support device according to the first embodiment.
FIG. 2 is a diagram illustrating an example in which the microphones according to the first embodiment are built into the main body.
FIG. 3 is a diagram illustrating an example in which the microphones according to the first embodiment are built into the cover.
FIG. 4 is a diagram illustrating an example in which the speakers according to the first embodiment use close-talking microphones.
FIG. 5 is a diagram illustrating an example of the menu image according to the first embodiment.
FIG. 6 is a diagram illustrating an example of the screen pattern image displayed on the image display unit when there are two speakers according to the first embodiment.
FIG. 7 is a diagram illustrating an example of the screen pattern image displayed on the image display unit when there are three speakers according to the first embodiment.
FIG. 8 is a diagram illustrating an example of the screen pattern image displayed on the image display unit when there are four speakers according to the first embodiment.
FIG. 9 is a flowchart of the processing procedure performed by the conversation support device according to the first embodiment.
FIG. 10 is a diagram for explaining the experimental environment.
FIG. 11 is an image displayed on the image display unit before the conversation starts.
FIG. 12 is an image displayed on the image display unit after the first speaker says "Good evening".
FIG. 13 is an image displayed on the image display unit after the second speaker says "Good evening" following FIG. 12.
FIG. 14 is an image displayed on the image display unit after the first speaker has spoken four times and the second speaker has spoken three times.
FIG. 15 is a diagram illustrating an example of the image displayed on the image display unit when there are three speakers.
FIG. 16 is a block diagram showing the configuration of the conversation support device according to the second embodiment.
FIG. 17 is a diagram illustrating examples of combinations of units corresponding to the microphone array according to the second embodiment.
FIG. 18 is a diagram illustrating an example of sound source localization according to the second embodiment.
FIG. 19 is a flowchart of the processing procedure performed by the conversation support device according to the second embodiment.
FIG. 20 is a diagram illustrating the processing performed when the number of speakers changes according to the second embodiment.
FIG. 21 is a block diagram showing the configuration of the conversation support device according to the third embodiment.
FIG. 22 is a diagram illustrating an example of the arrangement of multiple conversation support devices according to the third embodiment.
FIG. 23 is a diagram illustrating an example of the images displayed on the image display unit of each conversation support device according to the third embodiment.
FIG. 24 is a block diagram showing the configuration of the conversation support device according to the third embodiment.
FIG. 25 is a block diagram showing the configuration of the conversation support device according to the fourth embodiment.
FIG. 26 is a diagram illustrating an example of the image displayed on the image display unit of the conversation support device according to the fourth embodiment.

First, an overview of the present invention is given.
In the present invention, information representing an utterance contained in the audio signal picked up by the microphones is displayed in the display areas of persons other than the speaker who made it. When there are multiple speakers, the display area of the display unit is divided into as many areas as there are speakers, each speaker is associated with one of the divided areas, and the information representing the utterances is displayed in the associated areas.
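As a rough illustration of dividing the display by the number of speakers, the sketch below splits a rectangle into equal vertical strips and associates one strip with each speaker. The strip layout, the function name `split_display`, and the pixel sizes are assumptions made for brevity. The screen patterns the patent actually uses are the ones shown in its figures.

```python
def split_display(width, height, num_speakers):
    """Split a width x height display into one rectangle per speaker.

    Returns a list of (x, y, w, h) tuples. Side-by-side vertical strips
    are a hypothetical layout, not the one taken from the source.
    """
    if num_speakers < 1:
        raise ValueError("at least one speaker is required")
    edges = [round(i * width / num_speakers) for i in range(num_speakers + 1)]
    return [(edges[i], 0, edges[i + 1] - edges[i], height)
            for i in range(num_speakers)]


# Associate each speaker with one of the divided areas.
regions = split_display(1200, 800, 3)
areas = {f"Sp{i + 1}": region for i, region in enumerate(regions)}
print(areas["Sp2"])   # (400, 0, 400, 800)
```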

Embodiments of the present invention are described below with reference to the drawings.

[First Embodiment]
FIG. 1 is a block diagram showing the configuration of the conversation support device 1 according to this embodiment. As shown in FIG. 1, the conversation support device 1 includes a sound collection unit 11 (voice input unit), an acoustic signal acquisition unit 12 (voice input unit), a speech recognition unit 13 (speech recognition unit, position estimation unit), an image processing unit 14, an image display unit 15 (display unit), and an input unit 16. The image processing unit 14 includes an image pattern generation unit 141, a display image generation unit 142, and an image composition unit 143.
The conversation support device 1 is, for example, a tablet terminal, a mobile phone, a portable game machine, or a terminal with an image display unit on the surface of a table. In the following embodiments, the conversation support device 1 is described as a tablet-type terminal (hereinafter also called a tablet terminal).

The input unit 16 is a touch panel sensor provided on the image display unit 15. It outputs the on-screen coordinates touched by the user to the image processing unit 14. The input unit 16 may instead be an external input device, such as a keyboard or a mouse, connected by wire or wirelessly.

The sound collection unit 11 records acoustic signals on N channels (N is an integer greater than 1, for example 8) and transmits the recorded N-channel acoustic signals to the acoustic signal acquisition unit 12. The sound collection unit 11 includes N microphones 101-1 to 101-N that receive sound waves with components in a given frequency band (for example, 200 Hz to 4 kHz). The sound collection unit 11 may transmit the recorded N-channel acoustic signals wirelessly or by wire. When N is greater than 1, it suffices that the acoustic signals are synchronized across channels at transmission. In the following description, a microphone among 101-1 to 101-N that need not be singled out is simply called the microphone 101. As described later, the microphones 101 of the sound collection unit 11 may be built into the conversation support device 1, attached to it, or be close-talking microphones used by the speakers.

The acoustic signal acquisition unit 12 acquires the N acoustic signals recorded by the N microphones 101 of the sound collection unit 11. It generates frequency-domain input signals by applying a Fourier transform, frame by frame, to the N time-domain acoustic signals it has acquired, and outputs the N Fourier-transformed acoustic signals to the speech recognition unit 13. The N acoustic signals may include information that identifies the microphones 101-1 to 101-N, or information indicating the directions in which the microphones 101-1 to 101-N are mounted. The mounting directions may be estimated from the orientation obtained by an orientation sensor (not shown) of the conversation support device 1 together with the positional relationship of the microphones 101 built into the conversation support device 1.
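The frame-by-frame Fourier transform can be sketched as follows. This is a minimal illustration that uses a naive discrete Fourier transform from the Python standard library. The frame length, the hop size, and the absence of a window function are assumptions made to keep the example short, and the function names are hypothetical.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of one frame (O(n^2), for clarity)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]


def frames_to_spectra(signal, frame_len, hop):
    """Cut one channel into frames and transform each frame."""
    starts = range(0, len(signal) - frame_len + 1, hop)
    return [dft(signal[s:s + frame_len]) for s in starts]


def transform_channels(channels, frame_len=8, hop=4):
    """Apply the per-frame transform to every channel independently."""
    return [frames_to_spectra(ch, frame_len, hop) for ch in channels]


# Two-channel toy input: a sinusoid with one cycle per frame, and silence.
tone = [math.sin(2 * math.pi * t / 8) for t in range(16)]
spectra = transform_channels([tone, [0.0] * 16])
print(len(spectra), len(spectra[0]), len(spectra[0][0]))   # 2 3 8
```

The result is indexed as `spectra[channel][frame][frequency_bin]`. A practical implementation would use an FFT routine and a window function instead of the naive transform.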

The speech recognition unit 13 performs speech recognition processing on the acoustic signals input from the acoustic signal acquisition unit 12 and recognizes the utterance content (for example, text representing words or sentences). When the acoustic signals come from multiple speakers, the speech recognition unit 13 identifies the speakers and recognizes the utterance content of each identified speaker. The speech recognition unit 13 may also estimate, for example by the MUSIC (MUltiple SIgnal Classification) method, that the direction of the microphone 101 that acquired the acoustic signal with the highest signal level among the N-channel acoustic signals input from the acoustic signal acquisition unit 12 is the direction of the speaker. The speech recognition unit 13 then outputs information indicating the speaker, information indicating the direction of the speaker, and the recognition data to the image processing unit 14.
The speech recognition unit 13 includes, for example, a hidden Markov model (HMM), which serves as the acoustic model, and a word dictionary.
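Taking the direction of the loudest microphone as the speaker direction can be sketched as below. This is a plain level comparison and not an implementation of the MUSIC method. The RMS level measure, the function names, and the mounting directions in `mic_directions` are hypothetical choices made for illustration.

```python
import math


def rms(samples):
    """Root-mean-square level of one channel."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def loudest_microphone(channels):
    """Index of the channel with the highest RMS level."""
    levels = [rms(ch) for ch in channels]
    return max(range(len(levels)), key=levels.__getitem__)


def speaker_direction(channels, mic_directions):
    """Direction (degrees) of the microphone with the strongest signal."""
    return mic_directions[loudest_microphone(channels)]


mic_directions = [0.0, 90.0, 180.0, 270.0]   # hypothetical mounting directions
channels = [
    [0.1, -0.1, 0.1],
    [0.8, -0.7, 0.9],   # strongest channel
    [0.2, -0.2, 0.1],
    [0.0, 0.1, 0.0],
]
print(speaker_direction(channels, mic_directions))   # 90.0
```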

The speech recognition unit 13 calculates acoustic features of the acoustic signal at predetermined time intervals (for example, every 10 ms). The acoustic features are, for example, 34th-order Mel-frequency cepstrum coefficients (MFCC); a feature vector consisting of a static Mel-scale log spectrum (static MSLS), a delta MSLS, and one delta power; or a set consisting of a static Mel-scale log spectrum (MSLS), a delta MSLS, and one delta power. The speech recognition unit 13 determines phonemes from the calculated acoustic features using the acoustic model, and recognizes words from the phoneme sequence formed by those phonemes using the word dictionary.
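To make the 10 ms feature interval concrete, the sketch below computes the log power of consecutive 10 ms frames and its frame-to-frame difference, as a simple stand-in for the delta power term. It does not compute MFCC or MSLS. The 16 kHz sample rate, the non-overlapping frames, and the first-order difference used as the delta are all assumptions made for the example.

```python
import math


def frame_log_power(signal, sample_rate=16000, step_ms=10):
    """Log power of consecutive, non-overlapping frames of step_ms each."""
    step = sample_rate * step_ms // 1000
    powers = []
    for start in range(0, len(signal) - step + 1, step):
        frame = signal[start:start + step]
        energy = sum(s * s for s in frame) / step
        powers.append(math.log(energy + 1e-12))   # small floor avoids log(0)
    return powers


def delta(values):
    """First-order difference between neighbouring frames (a simple delta)."""
    return [b - a for a, b in zip(values, values[1:])]


# 30 ms of signal at 16 kHz: silence, then a constant level for two frames.
signal = [0.0] * 160 + [0.5] * 160 + [0.5] * 160
log_power = frame_log_power(signal)
delta_power = delta(log_power)
print(len(log_power), delta_power[1])   # 3 0.0
```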

Because the conversation support device 1 of this embodiment is a device that assists hearing, the speakers who use it have difficulty hearing their counterpart's utterances. For example, when there are two speakers and the first speaker Sp1 speaks, text data based on the utterance of the first speaker Sp1 is first displayed on the image display unit 15. The second speaker Sp2 then utters a reply to the text data displayed on the image display unit 15. After that, the first speaker Sp1 utters a reply to the text data displayed on the image display unit 15. When the two speakers do not speak at the same time in this way, the device can determine which speaker is speaking, and recognize the utterance content, from the features of the acoustic signal or from the acoustic signal of the microphone 101 with the highest signal level. As shown in FIG. 1, this requires no sound localization or sound separation processing on the acoustic signals input from the acoustic signal acquisition unit 12.

The image pattern generation unit 141 generates a menu image, described later, based on the on-screen coordinates input from the input unit 16, and outputs the generated menu image to the image composition unit 143. Based on the on-screen coordinates input from the input unit 16, the image pattern generation unit 141 also generates a screen pattern image according to what the user selected on the menu screen, and outputs the generated screen pattern image to the image composition unit 143. As described later, the screen pattern image is a display image that depends on the number of speakers.

The display image generation unit 142 generates text data corresponding to the recognition data for each speaker input from the speech recognition unit 13, and outputs the generated text data for each speaker to the image composition unit 143. Based on the information indicating the speaker directions input from the speech recognition unit 13, the display image generation unit 142 also generates an image indicating the direction of each speaker and outputs it to the image composition unit 143.

The image composition unit 143 displays the menu image generated by the image pattern generation unit 141 on the image display unit 15. In the display image generated by the image pattern generation unit 141, the image composition unit 143 composes the image so that the text data for each speaker input from the display image generation unit 142 is displayed in the display areas of the speakers other than the one who spoke. The image composition unit 143 also composes the image so that the image indicating the direction of each speaker, input from the display image generation unit 142, is displayed in that speaker's display area. The image composition unit 143 displays the composed image on the image display unit 15.
The image composition unit 143 may instead compose the image so that the text data for each speaker is displayed in the display area of the speaker who spoke.
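The routing of text to display areas described above can be sketched as follows. The function `compose_display` and its `target` parameter are hypothetical names. With `target="others"`, an utterance appears in the areas of everyone except the speaker. With `target="self"`, it appears in the speaker's own area, which is one reading of the alternative mentioned above. Only text routing is modelled, and no image is drawn.

```python
def compose_display(speakers, utterances, target="others"):
    """Map each speaker's display area to the lines of text shown in it.

    `utterances` is a list of (speaker, text) pairs in spoken order.
    """
    if target not in ("others", "self"):
        raise ValueError("target must be 'others' or 'self'")
    areas = {s: [] for s in speakers}
    for speaker, text in utterances:
        for owner in speakers:
            # "others": every area except the speaker's own; "self": only it.
            if (owner != speaker) == (target == "others"):
                areas[owner].append(text)
    return areas


talk = [("Sp1", "Good evening"), ("Sp2", "Hello")]
print(compose_display(["Sp1", "Sp2"], talk))
# {'Sp1': ['Hello'], 'Sp2': ['Good evening']}
```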

The image display unit 15 displays the image output by the image composition unit 143. The image display unit 15 is, for example, a liquid crystal display device, an organic EL (electroluminescence) display device, or an electronic ink display device.

Next, the microphones 101 of the sound collection unit 11 are described with reference to FIGS. 2 to 4. FIG. 2 is a diagram illustrating an example in which the microphones 101 according to this embodiment are built into the main body 201. FIG. 3 is a diagram illustrating an example in which the microphones 101 according to this embodiment are built into the cover 202. FIG. 4 is a diagram illustrating an example in which the speakers according to this embodiment use close-talking microphones. The number and arrangement of the microphones 101 shown in FIGS. 2 to 4 are examples, and the invention is not limited to them. In FIGS. 2 to 4, reference numeral 201 denotes the main body of the conversation support device 1. The shape of the main body 201 is not limited to a vertically long rectangle and may be a square, a horizontally long rectangle, a circle, an ellipse, or a polygon.

In the example shown in FIG. 2, eight microphones 101 are incorporated in the periphery of the main body 201. Of the eight microphones 101, the microphones 101-1 to 101-4 are mounted along the right long side as viewed on the page, and the remaining microphones 101-5 to 101-8 are mounted along the left long side as viewed on the page.

In the example shown in FIG. 3, eight microphones 101 are incorporated in the periphery of a cover 202 that is removable from the main body 201. Each microphone 101 may be connected to the main body 201 by wire or wirelessly. As in FIG. 2, the microphones 101-1 to 101-4 are mounted along the right long side as viewed on the page, and the remaining microphones 101-5 to 101-8 are mounted along the left long side as viewed on the page. Although an example in which the microphones 101 are incorporated in the cover 202 has been described, as another example the microphones 101 may be incorporated in a bumper or the like that protects the main body 201.

In the example shown in FIG. 4, four speakers use close-talking microphones 101-1 to 101-4, respectively. Each microphone 101 may be connected to the main body 201 by wire or wirelessly. The position of each microphone 101 corresponds to the position of its speaker. When there are two speakers, the speakers are desirably positioned to the left and right of, or above and below, the main body 201 as viewed on the page. When there are four speakers, the speakers are desirably positioned at the upper right, lower right, lower left, and upper left of the main body 201 as viewed on the page, as shown in FIG. 4.

Next, the menu screen operated through the input unit 16 will be described. FIG. 5 is a diagram illustrating an example of the menu image 301 according to the present embodiment.
In FIG. 5, the menu image 301 is displayed on the image display unit 15 when the user selects or switches the display screen. The menu image 301 includes a speaker selection menu area 311, selection menu areas 312 to 315 for the languages spoken by the first to fourth speakers, respectively, and a screen rotation selection menu area 316. The menu image 301 shown in FIG. 5 is an example; all the menu items may be displayed as a single menu as in FIG. 5, or they may be divided by item into a plurality of menu images.

In the speaker selection menu area 311, one of the speakers operates the input unit 16 to make a selection according to the number of speakers who will converse using the conversation support apparatus 1. In the speaker selection menu area 311, for example, "Two speakers (green)" indicates that there are two speakers and that the display color corresponding to the second speaker is green. In the example shown in FIG. 5, the display colors corresponding to the first to fourth speakers are fixed, but the display colors may be selectable within a range in which no two speakers have the same hue and the colors remain visually distinguishable. In this case, for example, the first speaker may change the display color by operating "One speaker (red)" with the input unit 16. For example, after the first speaker selects red, the image pattern generation unit 141 may display, in the speaker selection menu area 311 for the other speakers, colors that are visually distinguishable even when adjacent to this red.

The selection menu area 312 for the language spoken by the first speaker is a menu image with which the language used by the first speaker is selected by operating the input unit 16. Similarly, the selection menu areas 313 to 315 for the languages spoken by the second to fourth speakers are menu images with which the languages used by the second to fourth speakers are selected by operating the input unit 16. In the example shown in FIG. 5, the language used by a speaker is selected from four languages (Japanese, English, French, and Chinese), but the number of languages is not limited to this. Also, in the example shown in FIG. 5, the speaker selection menu area 311 and the language selection menu areas 312 to 315 for the first to fourth speakers are all displayed in Japanese, but the present invention is not limited to this. For example, in the selection menu area 312 for the language spoken by the first speaker, "First speaker: language Japanese" may be displayed in Japanese, "First speaker: language English" in English, "First speaker: language French" in French, and "First speaker: language Chinese" in Chinese.

The screen rotation selection menu area 316 contains an instruction to lock the displayed image so that it does not rotate, an instruction to rotate the screen orientation by 90 degrees, and an instruction to invert the screen orientation (rotate it by 180 degrees). The lock instruction keeps the display screen from rotating even when the conversation support apparatus 1 has a function that rotates the screen orientation according to the output of a sensor (not shown) that detects rotation of the main body. The 90-degree rotation and inversion instructions rotate or invert the display orientation so that, given the number and arrangement of the speakers, the image displayed on the image display unit 15 is as easy as possible for the speakers to see.
For example, when a projector (not shown) is connected to the conversation support apparatus 1, the image displayed on the image display unit 15 is projected onto a screen. In this case, rotating the conversation support apparatus 1 would also rotate the projected image, which may make it hard for a speaker to tell which display area to look at. To prevent this, the displayed image is locked so that it does not rotate.
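The interaction between the rotation lock and the 90-degree and 180-degree instructions can be sketched as follows. This is only an illustrative sketch: the function name `display_angle`, its parameters, and the snapping of the sensor reading to multiples of 90 degrees are assumptions and are not specified in this description.

```python
def display_angle(sensor_angle, manual_angle=0, locked=False):
    """Return the angle in degrees at which the image is drawn.

    sensor_angle: rotation of the main body reported by a (hypothetical) sensor.
    manual_angle: rotation chosen in the menu (0, 90, or 180 degrees).
    locked:       True when the "do not rotate" instruction is active.
    """
    if manual_angle not in (0, 90, 180):
        raise ValueError("manual_angle must be 0, 90, or 180")
    # When locked, the sensor output is ignored so the image stays fixed.
    auto = 0 if locked else round(sensor_angle / 90) * 90
    return (auto + manual_angle) % 360


if __name__ == "__main__":
    print(display_angle(90))                # follows the sensor
    print(display_angle(90, locked=True))   # locked: sensor ignored
```

With the lock active, only the menu-selected rotation affects the result, which matches the projector case in which the apparatus may be turned without the projected image turning.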

Next, examples of the screen pattern images generated by the image pattern generation unit 141 will be described with reference to FIGS. 6 to 8. FIG. 6 is a diagram illustrating an example of the screen pattern image displayed on the image display unit 15 when there are two speakers according to the present embodiment. FIG. 7 is a diagram illustrating an example of the screen pattern image displayed on the image display unit 15 when there are three speakers according to the present embodiment.
FIG. 8 is a diagram illustrating an example of the screen pattern image displayed on the image display unit 15 when there are four speakers according to the present embodiment. The arrangements and color schemes shown in FIGS. 6 to 8 are examples, and the present invention is not limited to them. In FIGS. 6 to 8, the images 301A to 301C are the screen pattern images displayed on the image display unit 15. As in the menu image shown in FIG. 5, the color corresponding to each speaker is red for the first speaker, green for the second speaker, blue for the third speaker, and yellow for the fourth speaker.
In the examples below, the display areas corresponding to the speakers are distinguished by different colors, but the present invention is not limited to this. The display areas corresponding to the speakers may be distinguished by, for example, different patterns, an icon for each speaker, or an avatar, that is, an anthropomorphic image for each speaker. In this case, each speaker can identify the corresponding display area without relying on color, so a monochrome image display device or an electronic ink display device can be used for the image display unit 15, which reduces power consumption.

As shown in FIG. 6, when there are two speakers, the image 301A is divided into upper and lower parts as viewed on the page. For example, the upper area is assigned as the display area for information presented to the first speaker, and the lower area is assigned as the display area for information presented to the second speaker.
The first presentation image 321A, which is the display area for information presented to the first speaker, includes a first character presentation image 322A, which is a character display area in which the text of utterances by the second speaker is displayed, as described later. The first character presentation image 322A is, for example, white. The first character presentation image 322A also includes an orientation image 323A indicating the direction of the first speaker, as described later. In the example shown in FIG. 6, the first speaker is directly in front of the upper side as viewed on the page.
The second presentation image 331A, which is the display area for information presented to the second speaker, includes a second character presentation image 332A, which is a character display area in which the text of utterances by the first speaker is displayed, and an orientation image 333A indicating the direction of the second speaker. In the example shown in FIG. 6, the second speaker is at the lower right as viewed on the page. The second character presentation image 332A is, for example, white.

As shown in FIG. 7, when there are three speakers, the image 301B is divided into three parts. For example, the upper left area is assigned as the display area for information presented to the first speaker, the lower left area as the display area for information presented to the second speaker, and the right area as the display area for information presented to the third speaker.
Each of the first presentation image 321B to the third presentation image 341B, which are the display areas corresponding to the first to third speakers, includes one of the first character presentation image 322B to the third character presentation image 342B, which are character display areas in which the text of utterances by the speakers other than the speaker concerned is displayed, and one of the orientation images 323B to 343B indicating the direction of the speaker concerned. The first character presentation image 322B to the third character presentation image 342B are, for example, white.
As an example, the third character presentation image 342B of the third presentation image 341B, which is the display area for information presented to the third speaker, displays the text of utterances by the first and second speakers. The orientation image 343B represents the direction of the third speaker.

As shown in FIG. 8, when there are four speakers, the image 301C is divided into four parts. For example, the upper left area is assigned as the display area for information presented to the first speaker, and the lower left area as the display area for information presented to the second speaker. The lower right area is assigned as the display area for information presented to the third speaker, and the upper right area as the display area for information presented to the fourth speaker.
Each of the first presentation image 321C to the fourth presentation image 351C, which are the display areas corresponding to the first to fourth speakers, includes one of the first character presentation image 322C to the fourth character presentation image 352C, which are character display areas in which the text of utterances by the speakers other than the speaker concerned is displayed, and one of the orientation images 323C to 353C indicating the direction of the speaker concerned. The first character presentation image 322C to the fourth character presentation image 352C are, for example, white.

As an example, the fourth character presentation image 352C of the fourth presentation image 351C, which is the display area for information presented to the fourth speaker, displays the text of utterances by the first to third speakers. The orientation image 353C represents the direction of the fourth speaker.
Note that each speaker may enter an initial speaker direction, for example in FIG. 8, by operating the touch-panel input unit 16 provided on the image display unit 15 to indicate his or her own direction. In this case, the conversation support apparatus 1 may continue to display the text of the utterance content in the orientation corresponding to the entered initial speaker direction.

As in the examples shown in FIGS. 6 to 8, the image pattern generation unit 141 divides the display areas so that, for example, the character display areas in which the text of the other speakers' utterances is displayed are equal in size. Alternatively, the image pattern generation unit 141 may detect, based on the detection result of a tilt detection sensor (not shown) provided in the conversation support apparatus 1, that the apparatus is placed at a tilt on a table or the like, calculate the ratio of the sizes of the character display areas according to the detected tilt angle, and determine the size of each divided area according to the sizes of the character display areas given by the calculated ratio.
Which of the two to four divided areas corresponds to which speaker is stored in advance in the image pattern generation unit 141. The image pattern generation unit 141 may switch which area corresponds to which speaker in accordance with an instruction input from the input unit 16. For example, when the second speaker and the fourth speaker exchange positions in FIG. 8, the second speaker may operate the touch-panel input unit 16 provided on the image display unit 15 so as to move the second presentation image 331C into the area of the fourth presentation image 351C, thereby swapping the second presentation image 331C and the fourth presentation image 351C. In this way, even when speakers change positions partway through a conversation, they can continue to view a screen that retains the conversation content displayed so far, which improves convenience for the speakers.
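The stored correspondence between divided areas and speakers, and the swap performed when two speakers change places, can be sketched as follows. This is a minimal illustration under stated assumptions: the `Region` type, the normalized coordinates (origin at the upper left of the page), the `LAYOUTS` table, and the function names are all hypothetical, and only the equal-size division of FIGS. 6 to 8 is modeled, not the tilt-dependent sizing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Rectangle in normalized screen coordinates (hypothetical type)."""
    x: float
    y: float
    w: float
    h: float


# Area order follows the text: speaker 1, 2, 3, 4.
LAYOUTS = {
    2: [Region(0, 0, 1, 0.5), Region(0, 0.5, 1, 0.5)],          # upper, lower
    3: [Region(0, 0, 0.5, 0.5), Region(0, 0.5, 0.5, 0.5),       # upper left, lower left
        Region(0.5, 0, 0.5, 1)],                                # right
    4: [Region(0, 0, 0.5, 0.5), Region(0, 0.5, 0.5, 0.5),       # upper left, lower left
        Region(0.5, 0.5, 0.5, 0.5), Region(0.5, 0, 0.5, 0.5)],  # lower right, upper right
}


def assign_regions(num_speakers):
    """Return {speaker number: Region} for 2 to 4 speakers."""
    return dict(enumerate(LAYOUTS[num_speakers], start=1))


def swap_regions(assignment, a, b):
    """Return a new assignment with the areas of speakers a and b exchanged."""
    swapped = dict(assignment)
    swapped[a], swapped[b] = assignment[b], assignment[a]
    return swapped


if __name__ == "__main__":
    layout = assign_regions(4)
    print(swap_regions(layout, 2, 4))
```

Because the utterance text would be stored per speaker rather than per area in such a design, exchanging only the area assignment leaves the conversation displayed so far intact.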

Next, the processing procedure performed by the conversation support apparatus 1 will be described. FIG. 9 is a flowchart of the processing procedure performed by the conversation support apparatus 1 according to the present embodiment.
(Step S1) A speaker selects the number of users by operating the input unit 16 on the menu image 301 shown in FIG. 5. Next, each speaker selects the language to be used by operating the input unit 16 on the menu image 301 shown in FIG. 5. Next, the input unit 16 outputs the on-screen coordinate information selected by the speakers to the image processing unit 14.

(Step S2) Based on the on-screen coordinate information input from the input unit 16, the image pattern generation unit 141 generates a screen pattern image according to the content selected by the user on the menu screen, and outputs the generated screen pattern image to the image composition unit 143. Next, the image composition unit 143 displays the menu image generated by the image pattern generation unit 141 on the image display unit 15.
(Step S3) The acoustic signal acquisition unit 12 starts acquiring the N acoustic signals recorded by the N microphones 101 of the sound collection unit 11, for example after detecting that the start of recognition has been instructed through the input unit 16, or at the timing when step S1 is performed. Next, the acoustic signal acquisition unit 12 outputs the N Fourier-transformed acoustic signals to the speech recognition unit 13.

(Step S4) The speech recognition unit 13 performs speech recognition processing for each speaker on the acoustic signals input from the acoustic signal acquisition unit 12 to recognize the utterance content. Next, the speech recognition unit 13 estimates the direction of each speaker, for example based on the orientation of the microphone 101 that acquired the acoustic signal with the highest signal level while that speaker was speaking. Next, the speech recognition unit 13 outputs information indicating the speaker, information indicating the direction of the speaker, and the recognition data to the image processing unit 14.

(Step S5) The display image generation unit 142 generates character data corresponding to the recognition data for each speaker input from the speech recognition unit 13, and outputs the generated character data for each speaker to the image display unit 15. Based on the information indicating the direction of the speaker input from the speech recognition unit 13, the display image generation unit 142 generates an image of information indicating the direction of each speaker, and outputs the generated image to the image composition unit 143.

(Step S6) In the display image generated by the image pattern generation unit 141, the image composition unit 143 composes the image so that the character data for each speaker, input from the display image generation unit 142, is displayed in the display areas other than that of the speaker who spoke. Next, the image composition unit 143 composes the image so that the image of information indicating the direction of each speaker, input from the display image generation unit 142, is displayed in the display area of that speaker. Next, the image composition unit 143 displays the composed image on the image display unit 15.
This completes the processing performed by the conversation support apparatus 1.
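The direction estimate of step S4 and the routing of recognized text in step S6 can be sketched as follows. This is a sketch under stated assumptions, not the apparatus's implementation: using RMS amplitude as the "signal level", the per-microphone angle list `mic_angles`, and the function names are all hypothetical, and the speech recognition itself is omitted.

```python
import math


def rms(samples):
    """Root-mean-square amplitude, used here as an assumed measure of signal level."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def estimate_direction(frames, mic_angles):
    """Return the angle of the microphone with the highest signal level.

    frames:     one list of samples per microphone, captured during an utterance.
    mic_angles: orientation of each microphone in degrees (hypothetical layout).
    """
    levels = [rms(frame) for frame in frames]
    loudest = max(range(len(levels)), key=levels.__getitem__)
    return mic_angles[loudest]


def route_text(speaker, text, speakers):
    """Map recognized text to the display areas of every speaker except the one who spoke."""
    return {listener: text for listener in speakers if listener != speaker}


if __name__ == "__main__":
    frames = [[0.1, -0.1], [0.5, -0.5], [0.0, 0.0]]
    print(estimate_direction(frames, [0, 90, 180]))
    print(route_text(1, "Good evening", [1, 2, 3]))
```

The variant noted earlier, in which text is shown in the area of the speaker who spoke, would correspond to inverting the condition in `route_text`.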

<Explanation of experimental results>
Here, an example of the results of an experiment conducted using the conversation support apparatus 1 according to the present embodiment will be described. FIG. 10 is a diagram for explaining the experimental environment.
As shown in FIG. 10, the conversation support apparatus 1 is placed at a tilt on a table 401.
The conversation support apparatus 1 is placed so that one of its long sides is in contact with the table 401. The experiment was conducted in a room of a predetermined size. There were two speakers, a first speaker Sp1 and a second speaker Sp2, both seated on chairs 402.

FIG. 11 shows an image 501 displayed on the image display unit 15 before the conversation starts. In the image 501, the first presentation image 521 in the upper part as viewed on the page is the area presented to the first speaker Sp1, and the second presentation image 531 in the lower part as viewed on the page is the area presented to the second speaker Sp2. On the image display unit 15 of the conversation support apparatus 1 in FIG. 10, the first presentation image 521 is displayed on the left and the second presentation image 531 on the right as seen from the first speaker Sp1 and the second speaker Sp2. The image shown in FIG. 11 is displayed on the image display unit 15 after the first speaker Sp1 or the second speaker Sp2 has selected two as the number of speakers (step S2). The first presentation image 521 includes a first character presentation image 522, and the second presentation image 531 includes a second character presentation image 532.

FIG. 12 shows the image displayed on the image display unit 15 after the first speaker Sp1 utters "Good evening". As shown in FIG. 12, the image processing unit 14 displays, in the second character presentation image 532, "Good evening" as an image 534A showing the characters recognized from the utterance of the first speaker Sp1. At this point the second speaker Sp2 has not yet spoken, so the direction of the second speaker Sp2 is unknown. For this reason, as shown in FIG. 12, the image processing unit 14 displays the image 534A, "Good evening", oriented in the initial direction. In addition, the image processing unit 14 displays, in the first character presentation image 522, an orientation image 523 indicating the direction of the first speaker Sp1. In the orientation image 523, the direction in which the arrowhead points is the direction of the first speaker Sp1.

FIG. 13 shows the image displayed on the image display unit 15 after the second speaker Sp2 utters "Good evening" following FIG. 12. Because the second speaker Sp2 has now spoken, the speech recognition unit 13 estimates the direction of the second speaker Sp2. The image processing unit 14 then displays, in the second character presentation image 532, an orientation image 533 indicating the direction of the second speaker Sp2. As a result, the image 534A displayed in the second character presentation image 532 is rotated by the display image generation unit 142 to match the direction of the second speaker Sp2.
Furthermore, the image processing unit 14 displays, in the first character presentation image 522, "Good evening" as an image 524A showing the characters recognized from the utterance of the second speaker Sp2, oriented according to the direction of the first speaker Sp1.

FIG. 14 shows the image displayed on the image display unit 15 after the first speaker Sp1 has spoken four times and the second speaker Sp2 has spoken three times.
The first character presentation image 522 displays images 524A to 524C of the characters recognized from the utterances of the second speaker Sp2. As shown in FIG. 14, the images 524A to 524C are displayed in order from the far side of the image display unit 15 toward the near side as seen by the first speaker Sp1. The second character presentation image 532 displays images 534A to 534D of the characters recognized from the utterances of the first speaker Sp1. As shown in FIG. 14, the images 534A to 534D are displayed in order from the far side of the image display unit 15 toward the near side as seen by the second speaker Sp2. In FIG. 14, the utterance order is, for example, image 534A -> image 524A -> image 534B -> image 524B -> image 534C -> image 524C -> image 534D.

Note that the display image generation unit 142 may determine, for example, whether the first character presentation image 522 has been filled with images corresponding to recognized characters and, when it determines that the first character presentation image 522 is full, erase the images starting from the one corresponding to the oldest utterance or scroll the images. The image processing unit 14 may also be configured so that, when the first speaker Sp1 wants to see an image corresponding to an erased utterance, the first speaker Sp1 can refer to past utterances by operating the touch-panel input unit 16 provided on the image display unit 15 to call up the character images previously displayed in the first character presentation image 522.
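The behavior of erasing the oldest utterance when a character display area is full, while keeping past utterances available on request, can be sketched as follows. This is an illustrative sketch only: the class name `TextArea`, the fixed line capacity, and the method names are assumptions not found in this description.

```python
from collections import deque


class TextArea:
    """Hypothetical model of one character presentation image."""

    def __init__(self, visible_lines):
        self._history = []                          # every utterance, oldest first
        self._visible = deque(maxlen=visible_lines)  # what currently fits on screen

    def add(self, utterance):
        """Show a new utterance; the oldest visible one is dropped when the area is full."""
        self._history.append(utterance)
        self._visible.append(utterance)

    def visible(self):
        """Utterances currently displayed, oldest first."""
        return list(self._visible)

    def history(self):
        """All utterances, including erased ones, for recall via the touch panel."""
        return list(self._history)


if __name__ == "__main__":
    area = TextArea(visible_lines=2)
    for text in ["first", "second", "third"]:
        area.add(text)
    print(area.visible())
    print(area.history())
```

A bounded `deque` discards from the opposite end automatically, which corresponds to erasing from the oldest utterance; a scrolling display would instead render a movable window over the full history.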

Although FIGS. 12 to 14 show examples in which all the images displayed in the first character presentation image 522 are the same size, the present invention is not limited to this. For example, the image of the most recently recognized utterance may be displayed large, for example in the center of the first character presentation image 522, and images based on past utterances may be displayed small. The same applies to the images displayed in the second character presentation image 532.
In FIGS. 12 to 14, the display image generation unit 142 may determine the character size so that the characters corresponding to each conversation fit on one line. Alternatively, the display image generation unit 142 may display the image corresponding to the recognized characters over several lines at a predetermined character size. In this case, the voice recognition unit 13 may include information indicating phrase boundaries in the recognition data that it outputs to the image processing unit 14. Then, when the display image generation unit 142 determines that a recognized sentence does not fit on one line at the predetermined character size, it may use the phrase information input from the voice recognition unit 13 to wrap the sentence at a phrase boundary.
Further, in the example shown in FIG. 14, the images 524A to 524C corresponding to the recognized characters may be displayed in a color corresponding to the second speaker Sp2. Similarly, the images 534A to 534D corresponding to the recognized characters may be displayed in a color corresponding to the first speaker Sp1.
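Wrapping a recognized sentence only at phrase boundaries can be sketched as follows. The function name `wrap_at_phrases`, the representation of the phrase information as a list of strings, and the character-count width limit are assumptions made for illustration; the text does not specify the data format.

```python
def wrap_at_phrases(phrases, max_chars):
    """Greedily pack phrases into lines, breaking only between phrases.

    phrases:   the sentence already split at phrase boundaries (assumed format)
    max_chars: how many characters fit on one line at the chosen character size
    """
    lines, current = [], ""
    for phrase in phrases:
        if current and len(current) + len(phrase) > max_chars:
            lines.append(current)   # close the line at a phrase boundary
            current = phrase
        else:
            current += phrase       # no separator: Japanese text has no spaces
    if current:
        lines.append(current)
    return lines
```

A single phrase longer than `max_chars` is left on its own line rather than split, since the sketch never breaks inside a phrase.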

Next, an example of an image displayed on the image display unit 15 when there are three speakers will be described. FIG. 15 is a diagram illustrating an example of an image 601 displayed on the image display unit 15 when there are three speakers.
The image 601 corresponds to the image 301B shown in FIG. 7, and reference numerals 621, 622, 623, 631, 632, 633, 641, 642, and 643 correspond to reference numerals 321B, 322B, 323B, 331B, 332B, 333B, 341B, 342B, and 343B in FIG. 7, respectively.

In the example shown in FIG. 15, the first speaker first utters "Hello". As a result, an image 634A corresponding to the recognized characters is displayed in the second character presentation image 632, and an image 644A corresponding to the recognized characters is displayed in the third character presentation image 642. The images 634A and 644A may be displayed in a color corresponding to the first speaker, for example red. Alternatively, the display image generation unit 142 may add information indicating that the first speaker spoke to the images 634A and 644A. Such information is, for example, a name, an avatar corresponding to the first speaker, an icon corresponding to the first speaker, or a mark in the color corresponding to the first speaker (for example, a red circle). This helps users recognize visually which speaker's utterance has been recognized.

Next, the second speaker utters "Hey!". As a result, an image 625B corresponding to the recognized characters is displayed in the first character presentation image 622, and an image 645B corresponding to the recognized characters is displayed in the third character presentation image 642. In this case as well, the images 625B and 645B may be displayed in a color corresponding to the second speaker, for example green.
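The routing in this three-speaker example, where each recognized utterance appears in every display area except the speaker's own and is tagged with the speaker's color, can be sketched as below. The function name and the dictionary layout are hypothetical; red and green come from the example above, while blue for the third speaker is an assumption.

```python
# Colors for speakers 1 and 2 follow the example; blue for speaker 3 is assumed.
SPEAKER_COLORS = {1: "red", 2: "green", 3: "blue"}

def route_utterance(speaker, text, speakers=(1, 2, 3)):
    """Return {display_area_owner: (text, color)} for everyone except the speaker."""
    color = SPEAKER_COLORS[speaker]
    return {owner: (text, color) for owner in speakers if owner != speaker}
```

For instance, `route_utterance(1, "Hello")` targets the display areas of speakers 2 and 3 only, mirroring images 634A and 644A.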

As described above, the conversation support apparatus 1 according to the present embodiment includes: a voice input unit (sound collection unit 11 and acoustic signal acquisition unit 12) that inputs the voice signals of two or more users; a voice recognition unit 13 that recognizes the voice signals input to the voice input unit; a display unit (image display unit 15) on which the recognition results recognized by the voice recognition unit are displayed; and an image processing unit 14 that sets a display area corresponding to each user in the image display area (321A, 322A, 331A, 332A, 321B, 322B, 331B, 332B, 341B, 342B, 321C, 322C, 331C, 332C, 341C, 342C, 351C, 352C) of the display unit (image display unit 15).

With this configuration, the conversation support apparatus 1 of the present embodiment can recognize each speaker's voice and assist hearing even when there are multiple speakers. In addition, because sound localization processing and sound separation processing are not performed on the acoustic signal acquired by the acoustic signal acquisition unit 12, both the amount of computation of the apparatus and the number of its functional units can be reduced.

The conversation support apparatus 1 according to the present embodiment also includes a sound source estimation unit (acoustic signal acquisition unit 12) that estimates the sound source direction of a user, and the image processing unit 14 displays the recognition result recognized by the voice recognition unit in the display area corresponding to each user on the image display unit 15, at a display angle based on the sound source direction estimated by the sound source estimation unit.

With this configuration, the conversation support apparatus 1 of the present embodiment can display character data at an angle corresponding to the orientation of the speaker, as shown in FIGS. 13 to 15. As a result, the recognition results are easier for the user to read, which improves user convenience.

Further, in the conversation support apparatus 1 according to the present embodiment, the image processing unit 14 displays an image based on the sound source direction estimated by the sound source estimation unit (acoustic signal acquisition unit 12) in the display area corresponding to each user on the image display unit 15.

With this configuration, the conversation support apparatus 1 of the present embodiment can display the images 523, 533, 623, 633, and 643 indicating the orientation of the speakers on the image display unit 15, as shown in FIGS. 13 to 15, so that each user can easily identify his or her own display area.

In the present embodiment, an example was described in which, when there are four speakers for example, display areas corresponding to the number of people selected from the menu are displayed as shown in FIG. 8, but the present invention is not limited to this. For example, the conversation support apparatus 1 registers the voices of the four speakers (first speaker Sp1 to fourth speaker Sp4) before the conversation starts. Then, with the four speakers each at a predetermined position, when the four speakers speak in turn, the voice recognition unit 13 of the conversation support apparatus 1 estimates the position of each speaker using the uttered voice. The image processing unit 14 may then determine or rearrange each image display position based on the position of each speaker estimated by the voice recognition unit 13.

For example, assume that four speakers (first speaker Sp1 to fourth speaker Sp4) are positioned as shown in FIG. 4, and that the first speaker Sp1 to the fourth speaker Sp4 use the microphones 101-1 to 101-4, respectively.
The voice recognition unit 13 performs voice recognition on each speaker's utterance in turn. It then arranges the display area of the first speaker Sp1 at the upper right of the fourth presentation image 351C in FIG. 8, the display area of the second speaker Sp2 at the lower right of the third presentation image 341C in FIG. 8, the display area of the third speaker Sp3 at the lower left of the second presentation image 331C in FIG. 8, and the display area of the fourth speaker Sp4 at the upper left of the first presentation image 321C in FIG. 8. In this way, when multiple speakers do not speak at the same time and the environment in which the conversation support apparatus 1 is used has little noise, the processing described above can be performed without sound source localization or sound source separation, as in the conversation support apparatus 1 of the present embodiment.
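One way to picture the position-based arrangement is a lookup from an estimated speaker position to the nearest screen quadrant. The sketch below rests on several assumptions that are not stated in the text: that the estimated position is available as (x, y) coordinates relative to the display center, that the passage refers to the four quadrants of the FIG. 8 layout, and that those quadrants hold the presentation images as listed in the table.

```python
# Assumed reading of FIG. 8: which presentation image occupies which quadrant.
QUADRANT_TO_AREA = {
    ("right", "upper"): "351C",  # fourth presentation image
    ("right", "lower"): "341C",  # third presentation image
    ("left", "lower"): "331C",   # second presentation image
    ("left", "upper"): "321C",   # first presentation image
}

def area_for_position(x, y):
    """Pick the display area in the quadrant where the speaker is estimated to be.

    x, y: hypothetical speaker coordinates relative to the display center
          (x > 0 is right, y > 0 is up).
    """
    horizontal = "right" if x >= 0 else "left"
    vertical = "upper" if y >= 0 else "lower"
    return QUADRANT_TO_AREA[(horizontal, vertical)]
```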

As described above, the conversation support apparatus 1 according to the present embodiment includes a position estimation unit (voice recognition unit 13) that estimates the positions of the users, and the image processing unit 14 sets or rearranges the display area corresponding to each user in the image display area of the display unit, at a position corresponding to the user position estimated by the position estimation unit.
With this configuration, in the conversation support apparatus 1 of the present embodiment, each display area is placed at the position closest to its speaker, so that the character data (recognition results) recognized from the other speakers' utterances is easier for that speaker to read.

[Second Embodiment]
FIG. 16 is a block diagram showing the configuration of a conversation support apparatus 1A according to the present embodiment. As shown in FIG. 16, the conversation support apparatus 1A includes a sound collection unit 11, an acoustic signal acquisition unit 12, a voice recognition unit 13A, an image processing unit 14, an image display unit 15, an input unit 16, a sound source localization unit 21 (sound source estimation unit), a sound source separation unit 22, a language information detection unit 23, and a translation unit 24. The image processing unit 14 includes an image pattern generation unit 141, a display image generation unit 142, and an image composition unit 143.
Functional units having the same functions as those of the conversation support apparatus 1 described with reference to FIG. 1 are denoted by the same reference numerals, and their description is omitted.

The sound source localization unit 21 estimates the azimuth angle of a sound source based on the input signal from the acoustic signal acquisition unit 12, and outputs azimuth angle information indicating the estimated azimuth angle, together with the N-channel acoustic signal, to the sound source separation unit 22. The azimuth angle estimated by the sound source localization unit 21 is, for example, a direction in the horizontal plane, measured with reference to the direction from the centroid of the positions of the N microphones of the sound collection unit 11 to one predetermined microphone among the N microphones. For example, the sound source localization unit 21 estimates the azimuth angle using the GSVD-MUSIC (Generalized Singular Value Decomposition-Multiple Signal Classification; MUSIC using generalized singular value decomposition) method.
Other sound source direction estimation methods, such as the WDS-BF (Weighted Delay and Sum Beam Forming) method or the MUSIC method, may also be used to estimate the azimuth angle.
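The azimuth convention described above (an angle in the horizontal plane, referenced to the direction from the microphone centroid to one predetermined microphone) can be illustrated with plain geometry. The sketch does not implement GSVD-MUSIC or any other localization method; it only shows how an angle would be expressed in that reference frame. The function name, the 2-D coordinates, and the counter-clockwise sign convention are assumptions.

```python
import math

def azimuth_deg(mic_positions, ref_index, source_xy):
    """Azimuth of source_xy measured from the centroid-to-reference-microphone
    direction, counter-clockwise, in degrees within [0, 360)."""
    # Centroid of the N microphone positions.
    cx = sum(x for x, _ in mic_positions) / len(mic_positions)
    cy = sum(y for _, y in mic_positions) / len(mic_positions)
    # Reference direction: centroid -> predetermined microphone.
    rx, ry = mic_positions[ref_index]
    ref_angle = math.atan2(ry - cy, rx - cx)
    # Direction of the source as seen from the centroid.
    src_angle = math.atan2(source_xy[1] - cy, source_xy[0] - cx)
    return math.degrees(src_angle - ref_angle) % 360.0
```

With four microphones on the unit circle and microphone 0 as the reference, a source straight "up" the y axis is at about 90 degrees.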

The sound source separation unit 22 acquires the N-channel acoustic signal output from the sound source localization unit 21 and separates it into acoustic signals for each speaker using, for example, the GHDSS (Geometric High-order Decorrelation-based Source Separation) method. The GHDSS method is described later. Alternatively, the sound source separation unit 22 may perform sound source separation using, for example, independent component analysis (ICA). The sound source separation unit 22 outputs the separated acoustic signal for each speaker and the azimuth angle information input from the sound source localization unit 21 to the language information detection unit 23.
The sound source separation unit 22 may first separate noise from the speakers' acoustic signals using, for example, a room transfer function stored in the unit itself, and then separate the acoustic signals for each speaker. The sound source separation unit 22 may also, for example, calculate an acoustic feature quantity for each of the N channels of the acoustic signal and separate the signal into acoustic signals for each speaker based on the calculated acoustic feature quantities and the azimuth angle information input from the sound source localization unit 21.

The language information detection unit 23 detects the language of each speaker by a well-known method, for each of the per-speaker acoustic signals input from the sound source separation unit 22. The language information detection unit 23 outputs information indicating the detected language of each speaker, together with the per-speaker acoustic signals and the azimuth angle information input from the sound source separation unit 22, to the voice recognition unit 13A. The language information detection unit 23 detects the language of each speaker by, for example, referring to a language database. The language database may be provided in the conversation support apparatus 1A, or may be connected via a wired or wireless network.

The voice recognition unit 13A performs voice recognition processing on the acoustic signal input from the acoustic signal acquisition unit 12, based on the information indicating the language of each speaker, the per-speaker acoustic signals, and the azimuth angle information input from the language information detection unit 23, and recognizes the utterance content (for example, text representing words or sentences). The voice recognition unit 13A outputs the utterance content, information indicating the speaker, information indicating the orientation of the speaker, the recognition data, and the information indicating the language of each speaker to the translation unit 24.

The translation unit 24 translates the utterance content as necessary, based on the utterance content, the information indicating the speaker, and the information indicating the language of each speaker input from the voice recognition unit 13A, and outputs information indicating the translated utterance content to the image processing unit 14, either in addition to or in place of the information input from the voice recognition unit 13A. As a specific example, a case in which there are two speakers, the first speaker Sp1 using Japanese and the second speaker Sp2 using English, will be described with reference to FIG. 14. In this case, the translation unit 24 translates the utterance content so that the images 534A to 534D displayed in the second character presentation image 532 show the Japanese spoken by the first speaker Sp1 translated into English, the language used by the second speaker Sp2. Likewise, the translation unit 24 translates the utterance content so that the images 524A to 524C displayed in the first character presentation image 522 show the English spoken by the second speaker Sp2 translated into Japanese, the language used by the first speaker Sp1.
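The "translate as necessary" decision can be sketched as a small function that passes text through unchanged when speaker and listener share a language. The function name, the language codes, and the `translate` callback are hypothetical; the text does not specify how the translation itself is performed.

```python
def text_for_listener(text, source_lang, listener_lang, translate):
    """Return the text to show a listener, translating only when languages differ.

    translate: caller-supplied function (text, source_lang, target_lang) -> str;
               a stand-in for whatever translation engine the apparatus uses.
    """
    if source_lang == listener_lang:
        return text
    return translate(text, source_lang, listener_lang)


# Usage with a stub translator that only tags the text.
stub = lambda t, src, dst: f"[{src}->{dst}] {t}"
shown_to_sp2 = text_for_listener("konnichiwa", "ja", "en", stub)
```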

<GHDSS method>
Here, an outline of the GHDSS method used in the sound source separation unit 22 will be described. The GHDSS method integrates the GC (Geometric Constraint-based source separation) method and the HDSS (High-order Decorrelation-based Source Separation) method. The GHDSS method is a kind of blind separation processing (blind deconvolution). In the GHDSS method, a separation matrix [V(ω)] is calculated sequentially, and the input speech vector [x(ω)] is multiplied by the calculated separation matrix [V(ω)] to estimate the sound source vector [u(ω)], thereby separating the signal into acoustic signals for each sound source. The separation matrix [V(ω)] is a pseudo-inverse matrix of the transfer function matrix [H(ω)], whose elements are the transfer functions from each sound source to each microphone 101 of the sound collection unit 11. The input speech vector [x(ω)] is a vector whose elements are the frequency domain coefficients of the acoustic signal of each channel. The sound source vector [u(ω)] is a vector whose elements are the frequency domain coefficients of the acoustic signal emitted by each sound source.
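The separation step itself, multiplying the input vector by the separation matrix for one frequency bin, can be sketched with plain complex arithmetic. The sketch covers only the product of [V(ω)] and [x(ω)]; it does not show how GHDSS updates the separation matrix. The function name and the 2-by-2 numeric values are hypothetical, chosen so that V is the exact inverse of a known mixing matrix H.

```python
def separate(V, x):
    """Estimate the source vector u = V x for a single frequency bin.

    V: separation matrix as a list of rows of complex numbers
    x: input vector, one complex frequency-domain coefficient per channel
    """
    return [sum(v * xi for v, xi in zip(row, x)) for row in V]


# Hypothetical example: two sources mixed by H = [[1, 0.5], [0.5, 1]].
s = [1 + 1j, 2 + 0j]                       # true source coefficients
x = [s[0] + 0.5 * s[1], 0.5 * s[0] + s[1]]  # what the two microphones observe
V = [[4 / 3, -2 / 3], [-2 / 3, 4 / 3]]      # inverse of H (pseudo-inverse for square H)
u = separate(V, x)                          # recovers s up to rounding error
```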

When calculating the separation matrix [V(ω)], the GHDSS method estimates the sound source vector [u(ω)] so as to minimize each of two cost functions: the separation sharpness J_SS and the geometric constraint J_GC.
Here, the separation sharpness J_SS is an index value representing the degree to which one sound source is erroneously separated as another sound source, and is represented by, for example, the following expression (1).

In expression (1), ||...||² denotes the Frobenius norm, * denotes the conjugate transpose of a vector or matrix, and diag(...) denotes the diagonal matrix consisting of the diagonal elements of its argument.
The geometric constraint J_GC is an index value representing the degree of error of the sound source vector [u(ω)], and is represented by, for example, the following expression (2).

In expression (2), [I] denotes the identity matrix.
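The bodies of expressions (1) and (2) are not reproduced in this text. As a reading aid only, the forms commonly given for these two GHDSS cost functions, written with the symbols defined above, are shown below. This is an assumption based on the surrounding definitions (Frobenius norm, conjugate transpose, diag, and the identity matrix) and may differ in detail from the original expressions.

```latex
J_{SS}\bigl([V(\omega)]\bigr)
  = \Bigl\| [u(\omega)][u(\omega)]^{*}
      - \mathrm{diag}\bigl([u(\omega)][u(\omega)]^{*}\bigr) \Bigr\|^{2}
  \qquad (1)

J_{GC}\bigl([V(\omega)]\bigr)
  = \Bigl\| \mathrm{diag}\bigl([V(\omega)][H(\omega)] - [I]\bigr) \Bigr\|^{2}
  \qquad (2)
```

Under this reading, J_SS becomes small when the off-diagonal terms of [u(ω)][u(ω)]* vanish, that is, when the separated sources are mutually uncorrelated, and J_GC becomes small when the diagonal of [V(ω)][H(ω)] is close to that of the identity matrix.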

When the microphones 101-1 to 101-N of the sound collection unit 11 form a microphone array, a speaker does not need to input or select, on the conversation support apparatus 1A, information indicating that he or she is about to speak. In this case, the conversation support apparatus 1A can use the microphone array to separate the utterances of each speaker.

Next, examples of combinations of the units shown in FIG. 16 will be described. FIG. 17 is a diagram illustrating examples of combinations of units for each type of microphone array according to the present embodiment.
In FIG. 17, microphone array 1 is a microphone array in which an array of microphones 101 is built into the conversation support apparatus 1A, as shown in FIG. 2. Microphone array 2 is a microphone array in which the microphones 101 are connected to the conversation support apparatus 1A by wire or wirelessly, as shown in FIG. 3. Microphone array 3 is a microphone array in which each speaker uses a close-talking microphone 101, for example at the mouth, and the microphones 101 are connected to the conversation support apparatus 1A by wire or wirelessly, as shown in FIG. 4.

As shown in the first row of FIG. 17, with microphone arrays (also simply called arrays) 1 to 3, the conversation support apparatus 1A need not include the sound source localization unit 21 and the sound source separation unit 22 when, depending on the positions of the speakers and the like, the acoustic signals are already well localized and separated. In addition, the conversation support apparatus 1A need not include the language information detection unit 23 and the translation unit 24 when, for example, no translation is needed or all speakers use the same language. That is, the language information detection unit 23 and the translation unit 24 may be optional.

As shown in the second row of FIG. 17, with microphone arrays 1 and 2, the conversation support apparatus 1A need not include the sound source separation unit 22 when, depending on the positions of the speakers and the like, the acoustic signals are already well separated. In addition, the conversation support apparatus 1A need not include the language information detection unit 23 and the translation unit 24 when, for example, no translation is needed or all speakers use the same language.
As shown in the third row of FIG. 17, with microphone arrays 1 and 2, the conversation support apparatus 1A may include the sound source localization unit 21 and the sound source separation unit 22, depending on the positions of the speakers and the like. In addition, the conversation support apparatus 1A need not include the language information detection unit 23 and the translation unit 24 when, for example, no translation is needed or all speakers use the same language.

FIG. 18 is a diagram illustrating an example of sound source localization according to the present embodiment.
As shown in FIG. 18, four speakers Sp1 to Sp4 surround the conversation support apparatus 1A. The speaker Sp1 selects in advance the fourth presentation image 351C, which is closest to Sp1, and the speaker Sp2 selects in advance the first presentation image 321C, which is closest to Sp2. The speaker Sp3 selects in advance the second presentation image 331C, which is closest to Sp3, and the speaker Sp4 selects in advance the third presentation image 341C, which is closest to Sp4.

When the speaker Sp1 has not yet spoken, the conversation support apparatus 1A cannot estimate the direction in which the speaker Sp1 is located. For this reason, the conversation support apparatus 1A first performs sound source localization processing on the acquired acoustic signals over all 360 degrees with respect to the surface of the image display unit 15 of the conversation support apparatus 1A. Then, when the speaker Sp1 speaks, sound source localization is performed based on this utterance. Because this processing makes it possible to estimate the direction of the speaker Sp1's utterance, the conversation support apparatus 1A may thereafter change the search range for the acoustic signal of the speaker Sp1 to, for example, the angular range θ₁, based on the fourth presentation image 351C (display area) closest to the speaker Sp1. This reduces the amount of computation in the sound source localization processing and also improves the accuracy of sound source localization. Similarly, after the speaker Sp2 speaks, the conversation support apparatus 1A may change the search range for the acoustic signal of the speaker Sp2 to, for example, the angular range θ₂, based on the first presentation image 321C closest to the speaker Sp2. After the speaker Sp3 speaks, the conversation support apparatus 1A may change the search range for the acoustic signal of the speaker Sp3 to, for example, the angular range θ₃, based on the second presentation image 331C closest to the speaker Sp3, and after the speaker Sp4 speaks, it may change the search range for the acoustic signal of the speaker Sp4 to, for example, the angular range θ₄, based on the third presentation image 341C closest to the speaker Sp4. Each of the angles θ₁ to θ₄ is, for example, 90 degrees.

Although the example above describes four speakers with reference to FIG. 18, the present invention is not limited to this. When there are two speakers, the search range for the acoustic signal of each speaker may be changed from 360 degrees to, for example, an angular range of 180 degrees, based on the display area (321A, 331A) of each speaker as shown in FIG. 6. Alternatively, when there are three speakers, the search range for the acoustic signal of each speaker may be changed from 360 degrees to, for example, an angular range of 120 degrees. That is, the conversation support apparatus 1A may change the search range for each speaker based on that speaker's display area. Because this narrows the search range, the conversation support apparatus 1A can improve the accuracy of direction estimation and reduce its amount of computation.
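The narrowing of the search range (180 degrees for two speakers, 120 for three, 90 for four) amounts to dividing 360 degrees by the number of speakers. A minimal sketch follows; centering each sector on the speaker's display area and the function names are assumptions, since the text gives only the sector widths.

```python
def search_range(center_deg, num_speakers):
    """Return (start, end) of a sector of width 360/num_speakers around center_deg.

    center_deg:   hypothetical direction of the speaker's display area, in degrees
    num_speakers: number of speakers; the sketch assumes at least 2
    """
    width = 360.0 / num_speakers
    start = (center_deg - width / 2.0) % 360.0
    end = (center_deg + width / 2.0) % 360.0
    return start, end


def in_range(azimuth_deg, start, end):
    """True if azimuth_deg lies in the sector from start to end, allowing wrap past 360."""
    azimuth_deg %= 360.0
    if start <= end:
        return start <= azimuth_deg <= end
    return azimuth_deg >= start or azimuth_deg <= end
```

With four speakers and a display area facing 0 degrees, the sector runs from 315 to 45 degrees, so a candidate direction of 350 degrees is searched and one of 90 degrees is skipped.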

Further, the voice recognition unit 13A may, for example, perform speaker identification. For example, before recognition starts, each speaker registers his or her voice in the conversation support apparatus 1A in advance. Then, when the speaker Sp1 speaks, for example, the voice recognition unit 13A may identify the acoustic signal of the speaker Sp1 from among the acoustic signals separated by the sound source separation unit 22.

The language displayed in the image presented to each speaker may be based on a language selected in advance from a menu. For example, when the speaker Sp1 selects Japanese from the menu as the language to use, the translation unit 24 may translate what another speaker said in French and display the translated result in the first character presentation image 322C. In this way, even if the other speakers speak in French, English, or Chinese, the conversation support apparatus 1A may display all of their utterances in Japanese in the fourth character presentation image 352C in FIG. 18.

The processing procedure performed by the conversation support apparatus 1A will now be described. FIG. 19 is a flowchart of the processing procedure performed by the conversation support apparatus 1A according to the present embodiment.
(Steps S101 to S103) Steps S101 to S103 are performed in the same manner as steps S1 to S3 (see FIG. 9). In step S101, each speaker may select, in the menu image 301, whether or not to translate the utterances of the other speakers.

(Step S104) The sound source localization unit 21 estimates the azimuth angle of the sound source based on the input signal from the acoustic signal acquisition unit 12, and outputs azimuth angle information indicating the estimated azimuth angle, together with the N-channel acoustic signal, to the sound source separation unit 22. After step S104 ends, the sound source localization unit 21 advances the process to step S105.
(Step S105) The sound source separation unit 22 acquires the N-channel acoustic signal output by the sound source localization unit 21 and separates it into an acoustic signal for each speaker using, for example, the GHDSS method. Next, the sound source separation unit 22 outputs the separated acoustic signal for each speaker and the azimuth angle information input from the sound source localization unit 21 to the language information detection unit 23. After step S105 ends, the sound source separation unit 22 advances the process to step S106.

(Step S106) The language information detection unit 23 detects the language of each speaker by a well-known method, for each of the per-speaker acoustic signals input from the sound source separation unit 22. The language information detection unit 23 outputs information indicating the detected language of each speaker, together with the per-speaker acoustic signals and the azimuth angle information input from the sound source separation unit 22, to the speech recognition unit 13A. After step S106 ends, the language information detection unit 23 advances the process to step S107.

(Step S107) The speech recognition unit 13A performs speech recognition processing on the acoustic signal input from the acoustic signal acquisition unit 12 and recognizes the utterance content, based on the information indicating the language of each speaker, the per-speaker acoustic signals, and the azimuth angle information input from the language information detection unit 23. Next, the speech recognition unit 13A outputs the utterance content, information indicating the speaker, information indicating the direction of the speaker, the recognition data, and the information indicating the language of each speaker to the translation unit 24. After step S107 ends, the speech recognition unit 13A advances the process to step S108.

(Step S108) The translation unit 24 translates the utterance content based on the utterance content, the information indicating the speaker, and the information indicating the language of each speaker input from the speech recognition unit 13A. The translation unit 24 then adds the information indicating the translated utterance content to the information input from the speech recognition unit 13A, or replaces the corresponding information with it, and outputs the result to the image processing unit 14. After step S108 ends, the translation unit 24 advances the process to step S109.
(Steps S109 to S110) Steps S109 to S110 are performed in the same manner as steps S5 to S6 (see FIG. 9).
This completes the processing performed by the conversation support apparatus 1A.
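The data flow of steps S104 to S108 can be sketched as a chain of stages. This is only an outline under stated assumptions: every stage is passed in as a placeholder callable, the `Utterance` record and the `run_pipeline` name are hypothetical, and no real localization, separation (such as GHDSS), language detection, recognition, or translation is implemented here.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: int        # index of the separated source (assumed identifier)
    azimuth_deg: float  # direction estimated in S104
    language: str       # language detected in S106
    text: str           # recognized (and possibly translated) content


def run_pipeline(signal, localize, separate, detect_language,
                 recognize, translate, target_language):
    """Pass a multichannel signal through stages S104 to S108 in order."""
    azimuths = localize(signal)                  # S104: sound source localization
    per_speaker = separate(signal, azimuths)     # S105: sound source separation
    results = []
    for speaker, (azimuth, audio) in enumerate(zip(azimuths, per_speaker)):
        language = detect_language(audio)        # S106: language detection
        text = recognize(audio, language)        # S107: speech recognition
        if language != target_language:          # S108: translation if needed
            text = translate(text, language, target_language)
        results.append(Utterance(speaker, azimuth, language, text))
    return results
```

The returned records correspond to what is handed to the image processing unit 14 for display in steps S109 to S110.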

In the example shown in FIG. 19, the conversation support apparatus 1A includes and uses all of the functional units shown in FIG. 16, but the present invention is not limited to this. As shown in FIG. 17, the conversation support apparatus 1A may select the functional units and processing to be used according to the microphone array.

As described above, the conversation support apparatus 1A according to the present embodiment includes a sound source estimation unit (sound source localization unit 21) that estimates the sound source direction of a user, and the image processing unit 14 causes the recognition result recognized by the speech recognition unit 13A to be displayed in the display area of the display unit (image display unit 15) corresponding to each user, at a display angle based on the sound source direction estimated by the sound source estimation unit.
In addition, the conversation support apparatus 1A according to the present embodiment includes a sound source separation unit 22 that separates the speech signal input to the speech input unit (sound collection unit 11, acoustic signal acquisition unit 12) for each user, and the image processing unit 14 causes, among the per-user speech signals separated by the sound source separation unit, the recognition results for users other than the user corresponding to a display area to be displayed in that display area of the display unit.

With this configuration, in the conversation support apparatus 1A of the present embodiment, the sound source localization unit 21 performs sound source localization processing and the sound source separation unit 22 performs sound source separation processing, so that the direction of each speaker can be estimated and the utterances of each speaker can be separated accurately even when conditions for sound source localization or separation are poor. As a result, with the conversation support apparatus 1A of the present embodiment, the other speakers can visually confirm a speaker's utterance on the conversation support apparatus 1A with high accuracy, so the hearing of the speakers can be supported.

In addition, the conversation support apparatus 1A according to the present embodiment includes a translation unit 24 that translates the recognition result recognized by the speech recognition unit 13A, and the image processing unit 14 causes the translation result produced by the translation unit to be displayed in the display area of the display unit (image display unit 15) corresponding to each user.
In addition, the conversation support apparatus 1A according to the present embodiment includes a language information detection unit 23 that detects the language spoken by a user, and the translation unit 24 translates the recognition results of users other than the user corresponding to a display area into the language detected by the language information detection unit.

With this configuration, because the conversation support apparatus 1A of the present embodiment includes the language information detection unit 23 and the translation unit 24, the utterances of the other speakers can be displayed visually on the conversation support apparatus 1A as needed, even when the speakers use different languages. As a result, with the conversation support apparatus 1A of the present embodiment, the other speakers can visually confirm a speaker's utterance on the conversation support apparatus 1A, so the hearing of the speakers can be supported.

In the present embodiment, an example in which a plurality of speakers use the conversation support apparatus 1A has been described, but the present invention is not limited to this. The conversation support apparatus 1A may be used by a single speaker. For example, when the speaker registers Japanese as the language to be used in the initial state and then speaks in English, the conversation support apparatus 1A may translate the English spoken by the speaker into Japanese, the registered language, and display it in the image presentation area corresponding to the speaker. In this way, the conversation support apparatus 1A of the present embodiment can provide support for foreign language learning.

Further, in the present embodiment, when one of the speakers leaves, the departing speaker may input or select, on the conversation support apparatus 1A, information indicating the departure at the time of leaving.
For example, when the number of speakers decreases from four to three, the conversation support apparatus 1A may change from the layout shown in FIG. 8 to the layout shown in FIG. 7.
Conversely, when a speaker joins partway through, the joining speaker may input or select, on the conversation support apparatus 1A, information indicating participation. For example, when the number of speakers increases from three to four, the conversation support apparatus 1A may change from the layout shown in FIG. 7 to the layout shown in FIG. 8.

FIG. 20 is a diagram for explaining the processing performed when the number of speakers changes according to the present embodiment. In the example shown in FIG. 20, three speakers Sp1 to Sp3 are using the conversation support apparatus 1A. In FIG. 20, the utterance directions of the speakers Sp1 to Sp3 have already been estimated.
For example, when the positions of the three speakers Sp1 to Sp3 hardly change and the acoustic signals localized by the sound source localization unit 21 include an acoustic signal whose utterance direction differs from those of the speakers Sp1 to Sp3, the conversation support apparatus 1A may determine that a new speaker Sp4 has joined the conversation. In the example shown in FIG. 20, the speaker Sp4 is speaking from the upper right as viewed facing the page. In this case, the conversation support apparatus 1A may estimate the utterance direction of the new speaker Sp4 and, based on the estimated result, switch to a display screen for four speakers such as that shown in FIG. 8. Because the position of the speaker Sp4 is between the speakers Sp1 and Sp3, the conversation support apparatus 1A may re-lay out the display areas so that the display area for information corresponding to the speaker Sp4 is inserted between the first presentation image 621 and the third presentation image 641.
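The detection of a new speaker from an unfamiliar utterance direction could be sketched as below. The tolerance of 20 degrees, the function names, and the use of sorted azimuths to decide where the new display area is inserted are illustrative assumptions; the embodiment does not state a threshold or an ordering rule.

```python
def angular_distance(a_deg, b_deg):
    """Smallest absolute difference between two azimuths, in degrees."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def update_speakers(known_azimuths, observed_azimuth, tolerance_deg=20.0):
    """Return (azimuths ordered around the device, whether a speaker was added).

    An observed direction within tolerance_deg of a known speaker is treated
    as that speaker; otherwise it is treated as a newly joined speaker.
    """
    for azimuth in known_azimuths:
        if angular_distance(azimuth, observed_azimuth) <= tolerance_deg:
            return sorted(known_azimuths), False
    # Sorting by azimuth places the new speaker between its two neighbours,
    # mirroring the insertion of the new display area between existing ones.
    return sorted(known_azimuths + [observed_azimuth]), True
```

When the second return value is true, the display would switch from the three-speaker layout to the four-speaker layout.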

[Third Embodiment]
In the first and second embodiments, examples with a single conversation support apparatus 1 or conversation support apparatus 1A were described. In the present embodiment, an example with a plurality of conversation support apparatuses 1 or conversation support apparatuses 1A is described. The plurality of conversation support apparatuses 1 or conversation support apparatuses 1A may be used, for example, one per speaker.

FIG. 21 is a block diagram showing the configuration of the conversation support apparatus 1B according to the present embodiment. As shown in FIG. 21, the conversation support apparatus 1B includes a sound collection unit 11, an acoustic signal acquisition unit 12B, a speech recognition unit 13A, an image processing unit 14, an image display unit 15, an input unit 16, a sound source localization unit 21, a sound source separation unit 22, a language information detection unit 23, a translation unit 24, and a communication unit 31. The image processing unit 14 includes an image pattern generation unit 141, a display image generation unit 142, and an image composition unit 143.
Functional units having the same functions as those of the conversation support apparatus 1A described with reference to FIG. 16 are given the same reference numerals, and their description is omitted. Although FIG. 21 shows an example in which the conversation support apparatus 1B is based on the conversation support apparatus 1A shown in FIG. 16, the conversation support apparatus 1B may instead be based on the conversation support apparatus 1 shown in FIG. 1. That is, depending on the application, the conversation support apparatus 1B need not include some of the sound source localization unit 21, the sound source separation unit 22, the language information detection unit 23, and the translation unit 24.

The acoustic signal acquisition unit 12B acquires M acoustic signals recorded by the M microphones 101 (M is an integer of 1 or more) of the sound collection unit 11. For example, when M is 2, it acquires two acoustic signals recorded by two microphones 101. The acoustic signal acquisition unit 12B outputs the acquired M acoustic signals to the sound source localization unit 21 and the communication unit 31. The acoustic signal acquisition unit 12B also acquires L acoustic signals (L is an integer of 1 or more) input from the communication unit 31, and outputs the acquired L acoustic signals to the sound source localization unit 21. When the acoustic signals acquired from the communication unit 31 include identification information that identifies a terminal, the acoustic signal acquisition unit 12B may also output this identification information to the sound source localization unit 21.

The communication unit 31 transmits the M acoustic signals input from the acoustic signal acquisition unit 12B to the other conversation support apparatuses 1B. The communication unit 31 also outputs the L acoustic signals received from the other conversation support apparatuses 1B to the acoustic signal acquisition unit 12B. For example, when the communication unit 31 receives two acoustic signals from each of three conversation support apparatuses 1B, it outputs the six received acoustic signals (2 signals × 3 apparatuses) to the acoustic signal acquisition unit 12B. The communication unit 31 may also include identification information that identifies a terminal in the acoustic signals that it outputs to the acoustic signal acquisition unit 12B.
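As a small illustration of how the L received signals relate to the M signals sent by each remote apparatus, the sketch below flattens per-apparatus signal lists while keeping the identifying label with each signal. The dictionary layout, the string device labels, and the function name `gather_remote_signals` are assumptions made for this example only.

```python
def gather_remote_signals(received):
    """Flatten {device_id: [signal, ...]} into a list of (device_id, signal).

    With two signals from each of three remote apparatuses, the result has
    L = 2 * 3 = 6 entries, each tagged with the terminal it came from.
    """
    return [(device_id, signal)
            for device_id, signals in received.items()
            for signal in signals]
```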

FIG. 22 is a diagram illustrating an example of the arrangement of a plurality of conversation support apparatuses 1B according to the present embodiment. In the example shown in FIG. 22, four conversation support apparatuses 1B-1 to 1B-4 are arranged in a grid. Each of the conversation support apparatuses 1B-1 to 1B-4 includes two of the microphones 101-1 to 101-8. For example, the conversation support apparatus 1B-1 includes the microphone 101-1 and the microphone 101-2.
The conversation support apparatuses 1B-1 to 1B-4 each have the configuration shown in FIG. 21. The conversation support apparatuses 1B-1 to 1B-4 communicate with one another via the communication unit 31 provided in each apparatus.

As shown in FIG. 22, the image display unit 15 of each of the conversation support apparatuses 1B-1 to 1B-4 has a single display area for the provided information and a single character presentation image (701 to 704, respectively). The character presentation image 701 of the conversation support apparatus 1B-1 displays text obtained by recognizing the utterances of the speakers of the conversation support apparatuses 1B-2 to 1B-4. The character presentation image 702 of the conversation support apparatus 1B-2 displays text obtained by recognizing the utterances of the speakers of the conversation support apparatuses 1B-1, 1B-3, and 1B-4. The character presentation image 703 of the conversation support apparatus 1B-3 displays text obtained by recognizing the utterances of the speakers of the conversation support apparatuses 1B-1, 1B-2, and 1B-4. The character presentation image 704 of the conversation support apparatus 1B-4 displays text obtained by recognizing the utterances of the speakers of the conversation support apparatuses 1B-1 to 1B-3.

That is, when four conversation support apparatuses 1B-1 to 1B-4 are used as shown in FIG. 22, the conversation support apparatus 1B-1 transmits the acoustic signals it has collected to the other conversation support apparatuses 1B-2 to 1B-3 via the communication unit 31 and wireless communication. Meanwhile, the acoustic signals collected by the other conversation support apparatuses 1B-2 to 1B-3 are transmitted to the conversation support apparatus 1B-1 via the communication unit 31 of each apparatus and wireless communication. As a result, the conversation support apparatus 1B-1 performs speech recognition on each of the acoustic signals received from the conversation support apparatuses 1B-2 to 1B-3 and displays the recognized characters on the image display unit 15. Each of the conversation support apparatuses 1B-1 to 1B-4 may perform speech recognition processing directly on the acoustic signals received from the other conversation support apparatuses 1B.

Although FIG. 22 shows an example in which the four conversation support apparatuses 1B-1 to 1B-4 are installed adjacent to one another, the present invention is not limited to this. For example, each of the conversation support apparatuses 1B-1 to 1B-4 may be placed near its respective speaker.

FIG. 23 is a diagram illustrating an example of an image displayed on the image display unit 15 of each conversation support apparatus 1B according to the present embodiment. FIG. 23 is an example of the image displayed on the image display unit 15 of the conversation support apparatus 1B-3 among the four conversation support apparatuses 1B-2 to 1B-4 shown in FIG. 22.
In FIG. 23, the image in the area indicated by reference numeral 720 is an image corresponding to the speakers. The image in the area indicated by reference numeral 720 includes an image 721 indicating the speaker corresponding to the conversation support apparatus 1B-1, an image 722 indicating the speaker corresponding to the conversation support apparatus 1B-2, an image 723 indicating the speaker corresponding to the conversation support apparatus 1B-3, and an image 724 indicating the speaker corresponding to the conversation support apparatus 1B-4. The image 721 is, for example, red; the image 722 is, for example, green; the image 723 is, for example, blue; and the image 724 is, for example, yellow. The images 721 to 724 corresponding to the conversation support apparatuses 1B-1 to 1B-4 are not limited to colored images. For example, they may be avatars, icons, names, or the like corresponding to the conversation support apparatuses 1B-1 to 1B-4.
The images displayed in the character presentation image 703 are an image 731 based on the recognition data of the utterance of the speaker corresponding to the conversation support apparatus 1B-1, an image 732 based on the recognition data of the utterance of the speaker corresponding to the conversation support apparatus 1B-2, and an image 734 based on the recognition data of the utterance of the speaker corresponding to the conversation support apparatus 1B-4. These images 731 to 734 may be displayed in the colors corresponding to the images 721 to 724, or may be displayed with an avatar, icon, name, or the like added. In the case of an avatar, icon, or name, it may be displayed, for example, to the left of each of the images 731 to 734. This display processing is performed by the image processing unit 14.

Alternatively, only one conversation support apparatus 1C may include all of the functional units. The other three conversation support apparatuses may then include the sound collection unit 11, the acoustic signal acquisition unit 12B, the communication unit 31, the image processing unit 14, and the image display unit 15. In this case, the conversation support apparatus 1C having all of the functions may acquire the acoustic signals from the other conversation support apparatuses 1C by communication and perform sound source localization processing, sound source separation processing, speech recognition processing, image generation processing, and the like. The generated image data may then be transmitted to each conversation support apparatus 1C.

FIG. 24 is a block diagram showing the configuration of the conversation support apparatus 1C according to the present embodiment. As shown in FIG. 24, the conversation support apparatus 1C includes a sound collection unit 11, an acoustic signal acquisition unit 12C, a speech recognition unit 13A, an image processing unit 14C, an image display unit 15, an input unit 16, a sound source localization unit 21, a sound source separation unit 22, a language information detection unit 23, a translation unit 24, and a communication unit 31C. Functional units having the same functions as those of the conversation support apparatus 1B described with reference to FIG. 21 are given the same reference numerals, and their description is omitted. Although FIG. 24 shows an example in which the conversation support apparatus 1C has a configuration based on the conversation support apparatus 1B shown in FIG. 21, the conversation support apparatus 1C may instead be based on the conversation support apparatus 1 shown in FIG. 1. That is, depending on the application, the conversation support apparatus 1C need not include some of the sound source localization unit 21, the sound source separation unit 22, the language information detection unit 23, and the translation unit 24.

The acoustic signal acquisition unit 12C acquires M acoustic signals recorded by the M microphones 101 (M is an integer of 1 or more) of the sound collection unit 11. The acoustic signal acquisition unit 12C outputs the acquired M acoustic signals to the sound source localization unit 21. The acoustic signal acquisition unit 12C also acquires L acoustic signals (L is an integer of 1 or more) input from the communication unit 31C, and outputs the acquired L acoustic signals to the sound source localization unit 21. When the acoustic signals acquired from the communication unit 31C include identification information that identifies a terminal, the acoustic signal acquisition unit 12C may also output this identification information to the sound source localization unit 21.

The image processing unit 14C generates, for each terminal corresponding to a speaker, character data and an image indicating the direction of the speaker, based on the information indicating the speaker, the information indicating the direction of the speaker, and the recognition data output by the translation unit 24. The image processing unit 14C outputs the generated per-terminal character data and image indicating the direction of the speaker to the communication unit 31C. The image processing unit 14C also causes the character data and the image indicating the direction of the speaker for the speaker corresponding to its own apparatus to be displayed on the image display unit 15.

The communication unit 31C outputs the L acoustic signals received from the other conversation support apparatuses 1C to the acoustic signal acquisition unit 12C. The communication unit 31C transmits the per-terminal character data and image indicating the direction of the speaker, input from the image processing unit 14C, to the corresponding other conversation support apparatus 1C via wireless communication.

For example, in FIG. 22, suppose that the conversation support apparatus 1B-1 is a conversation support apparatus 1C-1 including all of the functional units, and that the conversation support apparatuses 1B-2 to 1B-4 are conversation support apparatuses 1C-2 to 1C-4 each including only some of the functional units. Suppose also that the first speaker Sp1 uses the conversation support apparatus 1C-1, the second speaker Sp2 uses the conversation support apparatus 1C-2, the third speaker Sp3 uses the conversation support apparatus 1C-3, and the fourth speaker Sp4 uses the conversation support apparatus 1C-4.
In this case, the conversation support apparatuses 1C-2 to 1C-3 each transmit the M acoustic signals they have collected to the conversation support apparatus 1C-1 via the communication unit 31C and wireless communication. The conversation support apparatus 1C-1 then performs speech recognition on all of the acoustic signals, both those it collected itself and those it received.

The image processing unit 14C then causes an image indicating the direction of the first speaker Sp1 and character data obtained by recognizing the utterances of the second speaker Sp2 to the fourth speaker Sp4 to be displayed on the image display unit 15 of its own apparatus.
The image processing unit 14C generates an image indicating the direction of the second speaker Sp2 and character data obtained by recognizing the utterances of the first speaker Sp1, the third speaker Sp3, and the fourth speaker Sp4. The communication unit 31C then transmits the generated image indicating the direction of the second speaker Sp2 and the character data obtained by recognizing the utterances of the first speaker Sp1, the third speaker Sp3, and the fourth speaker Sp4 to the conversation support apparatus 1C-2 via wireless communication.

Similarly, the communication unit 31C transmits, via wireless communication, the image indicating the direction of the third speaker Sp3 and the character data obtained by recognizing the utterances of the first speaker Sp1, the second speaker Sp2, and the fourth speaker Sp4, both generated by the image processing unit 14C, to the conversation support device 1C-3.
Likewise, the communication unit 31C transmits, via wireless communication, the image indicating the direction of the fourth speaker Sp4 and the character data obtained by recognizing the utterances of the first speaker Sp1 to the third speaker Sp3, both generated by the image processing unit 14C, to the conversation support device 1C-4.
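
The routing described above, in which each terminal receives its own speaker's direction image and only the other speakers' recognized text, can be sketched as follows. This is a minimal illustration rather than the implementation disclosed here: the `Payload` class, the `build_payloads` function, the direction values in degrees, and the sample texts are all assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class Payload:
    """Data the hub device prepares for one terminal (field names are hypothetical)."""
    terminal_id: str
    own_direction_deg: float   # basis for the image indicating this speaker's direction
    texts_from_others: dict    # speaker id -> recognized text of the other speakers


def build_payloads(recognized, directions, terminal_of):
    """Build one payload per terminal, leaving out that terminal's own speaker's text."""
    payloads = []
    for speaker, terminal in terminal_of.items():
        others = {s: t for s, t in recognized.items() if s != speaker}
        payloads.append(Payload(terminal, directions[speaker], others))
    return payloads


# Hypothetical recognition results and directions for the four speakers.
recognized = {"Sp1": "hello", "Sp2": "good morning", "Sp3": "hi", "Sp4": "welcome"}
directions = {"Sp1": 0.0, "Sp2": 90.0, "Sp3": 180.0, "Sp4": 270.0}
terminal_of = {"Sp1": "1C-1", "Sp2": "1C-2", "Sp3": "1C-3", "Sp4": "1C-4"}

payloads = build_payloads(recognized, directions, terminal_of)
for p in payloads:
    print(p.terminal_id, p.own_direction_deg, sorted(p.texts_from_others))
```

In this sketch the payload for 1C-1 would be displayed locally, while the payloads for 1C-2 to 1C-4 would be handed to the communication unit for wireless transmission.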

As described above, the conversation support devices 1B and 1C according to the present embodiment include the communication units 31 and 31C, which communicate with other conversation support devices. The voice input unit (sound collection unit 11 and acoustic signal acquisition units 12B and 12C) inputs the voice signals that the communication unit has received from the other conversation support devices, and the voice recognition unit 13A recognizes, among the voice signals input from the voice input unit, the voice signals of users other than the user corresponding to the display area.

With this configuration, the conversation support devices 1B and 1C according to the present embodiment can perform speech recognition using a plurality of conversation support devices 1B.

[Fourth Embodiment]
The first to third embodiments described examples in which the character presentation image corresponding to each speaker shows an image based on recognition data obtained by recognizing the utterances of the other speakers, but the invention is not limited to this. The present embodiment describes an example in which the displayed image is based on recognition data that covers not only the other speakers' utterances but also the user's own.

FIG. 25 is a block diagram illustrating the configuration of a conversation support device 1D according to the present embodiment. As shown in FIG. 25, the conversation support device 1D includes a sound collection unit 11, an acoustic signal acquisition unit 12B, a voice recognition unit 13A, an image processing unit 14D, an image display unit 15, an input unit 16, a sound source localization unit 21, a sound source separation unit 22, a language information detection unit 23, a translation unit 24, and a communication unit 31D. Functional units that have the same functions as those of the conversation support device 1B described with reference to FIG. 21 are given the same reference numerals, and their description is omitted. Although FIG. 25 shows an example in which the conversation support device 1D is based on the conversation support device 1B shown in FIG. 21, the conversation support device 1D may instead be based on the conversation support device 1 shown in FIG. 1. That is, depending on the application, the conversation support device 1D need not include some of the sound source localization unit 21, the sound source separation unit 22, the language information detection unit 23, and the translation unit 24.

The image processing unit 14D displays, on the image display unit 15, the character data obtained by recognizing the utterance of the speaker who is using the conversation support device 1D, and outputs that character data to the communication unit 31D. Based on operations on the input unit 16, which is a touch panel provided on the image display unit 15, the image processing unit 14D corrects the character data and outputs the corrected character data to the communication unit 31D.

The communication unit 31D transmits the character data input from the image processing unit 14D, in which the utterance of the speaker using the conversation support device 1D has been recognized, to the other conversation support devices 1D via wireless communication. The communication unit 31D likewise transmits the corrected character data input from the image processing unit 14D to the other conversation support devices 1D via wireless communication. After the recognized character data is input, the communication unit 31D may hold its transmission for a predetermined time and determine whether corrected character data has been input. If no corrected character data is input within the predetermined time, the communication unit 31D may transmit the held character data to the other conversation support devices 1D. If corrected character data is input within the predetermined time, the communication unit 31D may transmit only the corrected character data to the other conversation support devices 1D, without transmitting the held character data.
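
The optional hold-then-send behavior of the communication unit 31D can be sketched as a small decision function. This is only an illustration under stated assumptions: the function name `messages_to_send`, its parameters, and the 2-second default hold time are hypothetical, and the text does not specify a value for the predetermined time.

```python
from typing import List, Optional


def messages_to_send(original: str,
                     correction: Optional[str] = None,
                     correction_delay_s: Optional[float] = None,
                     hold_time_s: float = 2.0) -> List[str]:
    """Return, in order, the texts sent to the other devices.

    correction_delay_s is the time between the recognized text arriving and
    the corrected text arriving; None means no correction was made.
    hold_time_s stands in for the "predetermined time" (assumed value).
    """
    if correction is None or correction_delay_s is None:
        # No correction at all: the held text goes out unchanged.
        return [original]
    if correction_delay_s <= hold_time_s:
        # Corrected within the hold period: only the corrected text is sent.
        return [correction]
    # Corrected after the hold period: the original was already sent,
    # so the corrected text follows as a retransmission.
    return [original, correction]


print(messages_to_send("recognized text"))
print(messages_to_send("recognized text", "corrected text", 1.0))
print(messages_to_send("recognized text", "corrected text", 3.5))
```

Modeling the decision as a pure function of arrival times keeps the sketch deterministic; an actual device would presumably drive it from a timer.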

FIG. 26 is a diagram illustrating an example of an image displayed on the image display unit 15 of the conversation support device 1D according to the present embodiment. The example in FIG. 26 uses four conversation support devices 1D and shows the image displayed on the image display unit 15 of the conversation support device 1D located at the position of the conversation support device 1B-3 in FIG. 22. In the following example, the conversation support device 1B-1 in FIG. 22 is the conversation support device 1D-1, used by the first speaker Sp1. Likewise, the conversation support device 1B-2 is the conversation support device 1D-2, used by the second speaker Sp2; the conversation support device 1B-3 is the conversation support device 1D-3, used by the third speaker Sp3; and the conversation support device 1B-4 is the conversation support device 1D-4, used by the fourth speaker Sp4.

In FIG. 26, an image 751 is the character data obtained by recognizing the utterance of the third speaker Sp3. The image 751 is the result of speech recognition of the acoustic signal of the third speaker Sp3 saying "Kinou Kawai-san ni aimashita ka?" ("Did you meet Kawai-san yesterday?"). The third speaker Sp3 meant the name Kawai-san written 河合さん, but the recognition result shows the homophone 川井さん. If the image 751 were displayed only on the image display units 15 of the other conversation support devices 1D-1, 1D-2, and 1D-4 and not on the image display unit 15 of the conversation support device 1D-3, the conversation might break down, because the first speaker Sp1, the second speaker Sp2, and the fourth speaker Sp4 do not know anyone named 川井さん.

For this reason, in the present embodiment, the character data obtained by recognizing the user's own utterance is also displayed on the image display unit 15.
The third speaker Sp3 can thus check the image 751 and, for example, select the image 752 of the misrecognized portion by operating the input unit 16, the touch panel of the image display unit 15.
The image processing unit 14D then displays an image 753 containing other conversions corresponding to the selected image 752, such as 河合さん and 河井さん, for example near the selected image 752 as shown in FIG. 26. The third speaker Sp3 then selects the intended 河合さん from the image 753. The input unit 16 may output information indicating the selected 河合さん to the image processing unit 14D. The communication unit 31D of the conversation support device 1D-3 may then retransmit the character data corrected by the image processing unit 14D to the other conversation support devices 1D.
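
The correction flow with images 752 and 753 can be sketched as a lookup of alternative conversions followed by a replacement. The functions `candidates_for` and `apply_correction`, the `alternatives` table, and the full recognized sentence below are assumptions made for illustration; a real recognizer would presumably supply its own candidate list.

```python
def candidates_for(segment, alternatives):
    """Return the other conversions offered for a selected segment (cf. image 753)."""
    return [c for c in alternatives.get(segment, []) if c != segment]


def apply_correction(text, segment, chosen):
    """Replace the first occurrence of the selected segment with the chosen candidate."""
    if segment not in text:
        raise ValueError("selected segment is not part of the recognized text")
    return text.replace(segment, chosen, 1)


# Hypothetical homophone table for the reading "Kawai-san".
alternatives = {"川井さん": ["川井さん", "河合さん", "河井さん"]}

# Assumed form of the recognized sentence shown as image 751.
recognized = "昨日川井さんに会いましたか？"

options = candidates_for("川井さん", alternatives)         # shown near image 752
corrected = apply_correction(recognized, "川井さん", "河合さん")
print(options)
print(corrected)
```

The corrected string is what would then be passed to the communication unit for retransmission to the other devices.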

Although the example above is applied to a plurality of conversation support devices 1D as in FIG. 22, the invention is not limited to this. It may also be applied to a single conversation support device 1 or 1A as described in the first and second embodiments.
For example, in FIG. 8, the conversation support device 1 or 1A may display the character data obtained by recognizing the utterance of the first speaker Sp1 in the first character presentation image 322A.

As described above, the conversation support device 1D of the present embodiment includes the input unit 16, which selects part of the image displayed on the display unit (image display unit 15). When the part of the image selected by the input unit is a recognition result, the image processing unit 14D displays other recognition candidates corresponding to the selected recognition result on the display unit, corrects the recognition result to the candidate selected by the input unit from among the recognition candidates, and causes the corrected recognition result to be transmitted to the other conversation support devices via the communication unit 31D.
With this configuration, the content of a user's utterance can be presented correctly to the other users.

Although the first to fourth embodiments described examples with two to four speakers, the number of speakers is not limited to this and may be five or more. In that case, the image pattern generation unit 141 generates a display pattern that matches the number of speakers. Alternatively, as described in the third embodiment, five or more speakers can be supported by using one conversation support device per speaker.
The third and fourth embodiments showed examples in which, when a plurality of conversation support devices (1, 1A, 1B, 1C, 1D) are used, the image display unit 15 of each device displays the screen for one person, but the invention is not limited to this. When a plurality of conversation support devices are used, the display pattern shown on each device may be a screen for multiple speakers, for example as shown in FIGS. 6 to 8. For example, according to the third and fourth embodiments, displaying the display pattern shown in FIG. 6 on the image display unit 15 of each conversation support device makes it possible to support four speakers with two conversation support devices.

In the first to fourth embodiments, a tablet terminal or the like was used as an example of the conversation support devices 1, 1A, 1B, and 1D, but the invention is not limited to this. For example, the conversation support devices 1, 1A, 1B, and 1D may be applied to a device that has the image display unit 15 on a table, or to an electronic blackboard or the like. When the conversation support device is made up of a plurality of terminals as described in the third embodiment, the terminals need not be located in the same room. For example, the plurality of conversation support devices 1, 1A, 1B, and 1D may be placed in different rooms, or may be mounted in a plurality of vehicles. When the terminals are connected via a network, the conversation support devices 1, 1A, 1B, and 1D may be located in different countries or regions. This makes it possible to provide hearing support to a plurality of speakers who are in separate locations.

In the first to fourth embodiments, the conversation support devices 1, 1A, 1B, and 1D may display the images showing the recognized text of the speakers' utterances in the image display area so that, for example, the image corresponding to the latest utterance is displayed dark and the images corresponding to past utterances are displayed light. For example, the conversation support devices 1, 1A, 1B, and 1D may display the image corresponding to the latest utterance in bold and the images corresponding to past utterances in a thin typeface. Alternatively, the character size used for the images corresponding to past utterances may be made larger than the character size used for the image corresponding to the latest utterance.
In the first to fourth embodiments, FIG. 14, for example, showed the images of recognized utterances displayed in order from top to bottom, but the display position is not limited to this. In FIG. 14, for example, the conversation support devices 1, 1A, 1B, and 1D may display the image corresponding to the latest utterance at approximately the center of the first character presentation image 522 of the first presentation image 521 and display the image corresponding to the preceding utterance above it.
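
One way to realize the dark/light and bold/thin emphasis described above is to assign each utterance an opacity and a weight by recency. The sketch below is an assumption-laden illustration: the function `style_utterances`, the linear fade, and the minimum opacity of 0.4 are choices made for the example and are not specified in the text.

```python
def style_utterances(utterances, min_opacity=0.4):
    """Return (text, opacity, bold) for each utterance, oldest first.

    The latest utterance is fully opaque and bold; earlier utterances fade
    linearly down to min_opacity (an assumed value) and are not bold.
    """
    n = len(utterances)
    styled = []
    for i, text in enumerate(utterances):
        if n == 1:
            opacity = 1.0
        else:
            opacity = min_opacity + (1.0 - min_opacity) * i / (n - 1)
        styled.append((text, round(opacity, 2), i == n - 1))
    return styled


styled = style_utterances(["first remark", "second remark", "latest remark"])
for text, opacity, bold in styled:
    print(f"{text!r}: opacity={opacity}, bold={bold}")
```

The same per-utterance ranking could drive character size or vertical position instead of opacity.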

In the first to fourth embodiments, the conversation support devices 1, 1A, 1B, and 1D may also vary, according to each speaker's amount of speech, the brightness of the display area in which information is presented for that speaker (for example, the first presentation image 321A) or of the character presentation image within that display area (for example, the first character presentation image 322A). For example, the conversation support devices 1, 1A, 1B, and 1D may detect the number of utterances or the speaking time for each speaker and lower the luminance of the display area or character presentation image below its initial level for a speaker whose detected count or time is less than that of the other speakers. Alternatively, they may raise the luminance of the display area or character presentation image above its initial level for a speaker whose detected count or time is greater than that of the other speakers. This lets users see their own speaking time or number of utterances. A moderator can also use this display to run a meeting effectively, by prompting speakers who have spoken few times or only briefly to speak.
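
The luminance control by amount of speech can be sketched as follows. The text only says "less than" or "more than" the other speakers, so comparing against the group mean is an assumption of this sketch, as are the function name `luminance_by_speaker`, the initial level of 0.7, and the step of 0.2.

```python
def luminance_by_speaker(utterance_seconds, initial=0.7, step=0.2):
    """Map each speaker to a display luminance in the range 0.0 to 1.0.

    Speakers above the group's mean speaking time are shown brighter than the
    initial luminance; speakers below it are shown dimmer (assumed rule).
    """
    if not utterance_seconds:
        return {}
    mean = sum(utterance_seconds.values()) / len(utterance_seconds)
    result = {}
    for speaker, seconds in utterance_seconds.items():
        if seconds > mean:
            level = initial + step
        elif seconds < mean:
            level = initial - step
        else:
            level = initial
        # Clamp to the valid range and round away floating-point noise.
        result[speaker] = round(min(1.0, max(0.0, level)), 2)
    return result


levels = luminance_by_speaker({"Sp1": 120, "Sp2": 30, "Sp3": 60})
print(levels)
```

Passing utterance counts instead of seconds works unchanged, since only the comparison with the mean matters.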

Although the first to fourth embodiments described examples in which the conversation support devices 1, 1A, 1B, and 1D include the voice recognition unit 13 or 13A, the voice recognition unit 13 or 13A may instead be provided, for example, via a network.

A program for realizing the functions of the conversation support devices 1, 1A, 1B, and 1D according to the present invention may be recorded on a computer-readable recording medium, and the various processes described above may be performed by loading the program recorded on the recording medium into a computer system and executing it. The "computer system" here includes an OS and hardware such as peripheral devices. The "computer system" also includes a WWW system equipped with a website providing environment (or display environment). The "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into computer systems. The "computer-readable recording medium" also includes media that hold a program for a certain period of time, such as the volatile memory (RAM) inside a computer system that serves as a server or client when the program is transmitted via a network such as the Internet or a communication line such as a telephone line.

The program may also be transmitted from a computer system that stores it in a storage device or the like to another computer system via a transmission medium or by transmission waves in the transmission medium. Here, the "transmission medium" that transmits the program refers to a medium that has the function of transmitting information, such as a network (communication network) like the Internet or a communication line such as a telephone line. The program may implement only some of the functions described above. It may also be a so-called difference file (difference program), which implements the functions described above in combination with a program already recorded in the computer system.

DESCRIPTION OF SYMBOLS
1, 1A, 1B, 1C, 1D ... conversation support device; 11 ... sound collection unit; 12, 12B, 12C ... acoustic signal acquisition unit; 13, 13A ... voice recognition unit; 14, 14C, 14D ... image processing unit; 15 ... image display unit; 16 ... input unit; 21 ... sound source localization unit; 22 ... sound source separation unit; 23 ... language information detection unit; 24 ... translation unit; 31, 31C, 31D ... communication unit; 101-1 to 101-N, 101-M ... microphone; 141 ... image pattern generation unit; 142 ... display image generation unit; 143 ... image composition unit; 321A, 321B, 321C, 521, 621 ... first presentation image; 322A, 322B, 322C, 522, 622 ... first character presentation image; 323A, 333A, 323B to 324B, 323C to 353C, 523, 533 ... orientation image; 331A, 331B, 331C, 531, 631 ... second presentation image; 332A, 332B, 332C, 532, 632 ... second character presentation image; 341B, 341C, 641 ... third presentation image; 342B, 342C, 642 ... third character presentation image; 351C ... fourth presentation image; 352C ... fourth character presentation image; 524A to 524C, 534A to 534D, 634A, 644A, 625B, 645B ... image of recognized utterance text

Claims

A conversation support device comprising:
a voice input unit that inputs voice signals of two or more users;
a voice recognition unit that recognizes the voice signals input to the voice input unit;
a sound source estimation unit that estimates a sound source direction of the voice signals input to the voice input unit;
a display unit that displays a recognition result recognized by the voice recognition unit; and
an image processing unit that sets a display area corresponding to each user in an image display area of the display unit and, when a voice is detected from a direction different from those of the speakers whose sound sources have already been localized, determines that a new speaker has joined the conference and displays the utterance content between the text display frames of the adjacent, already recognized speakers.

The conversation support device according to claim 1, wherein the image processing unit displays the display color of the display area corresponding to each user, the pattern, the icon displayed in the display area, and the avatar displayed in the display area differently for each user.

The conversation support device according to claim 1 or 2, wherein the image processing unit displays an image based on the sound source direction estimated by the sound source estimation unit in the display area corresponding to each user of the display unit.

The conversation support device according to any one of claims 1 to 3, further comprising a sound source separation unit that separates the voice signals input to the voice input unit for each user,
wherein the image processing unit displays, in the display area of the display unit corresponding to a user, the recognition results of the voice signals separated for each user by the sound source separation unit, other than that of the user corresponding to that display area.

The conversation support device according to claim 1, further comprising a position estimation unit that estimates the positions of the users,
wherein the image processing unit sets or rearranges the display area corresponding to each user in the image display area of the display unit at a position corresponding to the position of that user estimated by the position estimation unit.

The conversation support device according to claim 5, wherein the position estimation unit estimates the positions of the users using the voice signals input to the voice input unit.

The conversation support device according to any one of claims 1 to 6, further comprising a translation unit that translates the recognition result recognized by the voice recognition unit,
wherein the image processing unit displays the translation result translated by the translation unit in the display area corresponding to each user of the display unit.

The conversation support device according to claim 7, further comprising a language information detection unit that detects the language spoken by the user,
wherein the translation unit translates the recognition results of users other than the user corresponding to the display area into the language detected by the language information detection unit.

The conversation support device according to any one of claims 1 to 7, further comprising a communication unit that communicates with other conversation support devices,
wherein the voice input unit inputs the voice signals that the communication unit has received from the other conversation support devices, and
the voice recognition unit recognizes, among the voice signals input from the voice input unit, the voice signals of users other than the user corresponding to the display area.

The conversation support device according to claim 9, further comprising an input unit that selects part of the image displayed on the display unit,
wherein, when the part of the image selected by the input unit is a recognition result, the image processing unit displays other recognition candidates corresponding to the selected recognition result on the display unit, corrects the recognition result to the candidate selected by the input unit from among the recognition candidates, and causes the corrected recognition result to be transmitted to the other conversation support devices via the communication unit.

A method for controlling a conversation support device, comprising:
a voice input procedure in which a voice input unit inputs voice signals of two or more users;
a voice recognition procedure in which a voice recognition unit recognizes the voice signals input by the voice input procedure;
a sound source estimation procedure in which a sound source estimation unit estimates a sound source direction of the voice signals input by the voice input procedure; and
an image processing procedure in which an image processing unit sets a display area corresponding to each user in an image display area of a display unit on which the recognition result recognized by the voice recognition procedure is displayed and, when a voice is detected from a direction different from those of the speakers whose sound sources have already been localized, determines that a new speaker has joined the conference and displays the utterance content between the text display frames of the adjacent, already recognized speakers.

A program for a conversation support device, the program causing a computer of the conversation support device to execute:
a voice input procedure of inputting voice signals of two or more users;
a voice recognition procedure of recognizing the voice signals input by the voice input procedure;
a sound source estimation procedure of estimating a sound source direction of the voice signals input by the voice input procedure; and
an image processing procedure of setting a display area corresponding to each user in an image display area of a display unit on which the recognition result recognized by the voice recognition procedure is displayed and, when a voice is detected from a direction different from those of the speakers whose sound sources have already been localized, determining that a new speaker has joined the conference and displaying the utterance content between the text display frames of the adjacent, already recognized speakers.