JP7429107B2

JP7429107B2 - Speech translation device, speech translation method and its program

Info

Publication number: JP7429107B2
Application number: JP2019196078A
Authority: JP
Inventors: 博基古川; 敦坂口; 剛樹西川
Original assignee: Panasonic Corp; Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Corp; Panasonic Holdings Corp
Priority date: 2019-03-25
Filing date: 2019-10-29
Publication date: 2024-02-07
Anticipated expiration: 2039-10-29
Also published as: CN111739511A; JP2020160429A

Description

The present disclosure relates to a speech translation device, a speech translation method, and a program that uses the speech translation method.

For example, Patent Document 1 discloses an interpretation system that includes: a speech input unit that converts speech uttered by a first-language speaker and by a second-language speaker, who is the conversation partner of the first-language speaker, into speech data and outputs the speech data; an input switch that is operated while the first-language speaker is speaking and is also operated while the first-language speaker is not speaking; and a speech output unit that converts a translation result of the input speech data into speech and outputs the speech.

Japanese Patent No. 3891023

However, with the technology disclosed in Patent Document 1, when the first speaker and the second speaker converse, the input switch must be operated every time either speaker speaks, which makes operation cumbersome. Because the input switch is operated repeatedly during the conversation, the frequency and duration of use of the interpretation system increase.

Furthermore, when the first speaker and the second speaker each operate the interpretation system, the speaker who does not own the system usually does not know how to operate it. That speaker therefore takes time to operate the system, which further lengthens the period of use. As a result, conventional interpretation systems have the problem that energy is consumed by this longer period of use.

Therefore, an object of the present disclosure is to provide a speech translation device, a speech translation method, and a program therefor that can suppress an increase in the energy consumption of the speech translation device by simplifying its operation.

A speech translation device according to one aspect of the present disclosure is a speech translation device for a conversation between a first speaker who speaks in a first language and a second speaker who is the conversation partner of the first speaker and speaks in a second language different from the first language. The device includes: a speech detection unit that detects, from sound input to a speech input unit, speech segments uttered by the first speaker and the second speaker; a display unit that, as a result of speech recognition performed on the speech in a speech segment detected by the speech detection unit, displays a translation result obtained by translating the speech from the first language into the second language and displays a translation result obtained by translating the speech from the second language into the first language; and an utterance instruction unit that, after the first speaker speaks, outputs in the second language via the display unit content prompting the second speaker to speak, after or at the same time as the translation result is displayed, and that, after the second speaker speaks, outputs in the first language via the display unit content prompting the first speaker to speak, after or at the same time as the translation result is displayed.

Note that some specific aspects of these may be realized using a system, a method, an integrated circuit, a computer program, or a computer-readable recording medium such as a CD-ROM, or using any combination of a system, a method, an integrated circuit, a computer program, and a recording medium.

According to the speech translation device and the like of the present disclosure, simplifying the operation makes it possible to suppress an increase in the energy consumption of the speech translation device.

FIG. 1A is a diagram showing the appearance of the speech translation device according to Embodiment 1 and an example of a scene in which the first speaker and the second speaker use the speech translation device when the first speaker speaks.
FIG. 1B is a diagram showing the appearance of the speech translation device according to Embodiment 1 and an example of a scene in which the first speaker and the second speaker use the speech translation device when the second speaker speaks.
FIG. 1C is a diagram showing another example of a scene in which the speech translation device is used when the first speaker and the second speaker converse.
FIG. 2 is a block diagram showing the speech translation device according to Embodiment 1.
FIG. 3 is a flowchart showing the operation of the speech translation device according to Embodiment 1.
FIG. 4 is a block diagram showing the speech translation device according to Embodiment 2.
FIG. 5 is a flowchart showing the operation of the speech translation device according to Embodiment 2.
FIG. 6 is a flowchart showing the operation of the speech translation device according to a modification of Embodiment 2.
FIG. 7 is a block diagram showing the speech translation device according to Embodiment 3.
FIG. 8 is a flowchart showing the operation of the speech translation device according to Embodiment 3.
FIG. 9 is a block diagram showing the speech translation device according to a modification of Embodiment 3.
FIG. 10 is a block diagram showing the speech translation device according to Embodiment 4.
FIG. 11 is a flowchart showing the operation of the speech translation device according to Embodiment 4.

A speech translation device according to one aspect of the present disclosure is a speech translation device for a conversation between a first speaker who speaks in a first language and a second speaker who is the conversation partner of the first speaker and speaks in a second language different from the first language. The device includes: a speech detection unit that detects, from sound input to a speech input unit, speech segments uttered by the first speaker and the second speaker; a display unit that, as a result of speech recognition performed on the speech in a speech segment detected by the speech detection unit, displays a translation result obtained by translating the speech from the first language into the second language and displays a translation result obtained by translating the speech from the second language into the first language; and an utterance instruction unit that, after the first speaker speaks, outputs in the second language via the display unit content prompting the second speaker to speak, and that, after the second speaker speaks, outputs in the first language via the display unit content prompting the first speaker to speak.

With this configuration, detecting the respective speech segments from the conversation between the first speaker and the second speaker makes it possible to obtain a translation result of the detected speech from the first language into the second language, or from the second language into the first language. In other words, the speech translation device can automatically translate the language of the detected speech into the other language for each utterance by the first speaker and the second speaker, without any input operation for translation.

The speech translation device can also output content prompting the second speaker to speak after the first speaker has spoken, and content prompting the first speaker to speak after the second speaker has spoken. The first speaker and the second speaker can thus recognize when to speak, without performing an input operation to start each utterance.

As described above, the speech translation device requires no input operation to start an utterance, to switch languages, and so on, and therefore offers excellent operability. Because users are unlikely to spend time struggling with its operation, an increase in the period of use can be suppressed.

Therefore, simplifying the operation of the speech translation device makes it possible to suppress an increase in its energy consumption.

In particular, because the speech translation device is simple to operate, erroneous operations can also be suppressed.
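The turn-taking behavior described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Turn` type, the `next_prompt` function, and the prompt strings in `PROMPTS` are hypothetical names and values, since the source does not specify the wording of the prompts or any data structures.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical prompt wording, keyed by language code (not from the source).
PROMPTS = {"ja": "お話しください", "en": "Please speak."}


@dataclass
class Turn:
    speaker: int       # 1 = first speaker, 2 = second speaker
    source_lang: str   # language the utterance was spoken in
    target_lang: str   # language the utterance was translated into


def next_prompt(turn: Turn) -> Tuple[int, str]:
    """After `turn` has been translated, decide who is prompted next.

    The other speaker is prompted, in that speaker's own language, which is
    the language the current utterance was just translated into.
    """
    next_speaker = 2 if turn.speaker == 1 else 1
    return next_speaker, PROMPTS[turn.target_lang]
```

Because the prompt language is derived from the translation direction, no language-switching operation is needed between turns in this sketch.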

A speech translation method according to another aspect of the present disclosure is a speech translation method for a conversation between a first speaker who speaks in a first language and a second speaker who is the conversation partner of the first speaker and speaks in a second language different from the first language. The method includes: detecting, from sound input to a speech input unit, speech segments uttered by the first speaker and the second speaker; displaying, on a display unit and as a result of speech recognition performed on the speech in a detected speech segment, a translation result obtained by translating the speech from the first language into the second language and a translation result obtained by translating the speech from the second language into the first language; and outputting, after the first speaker speaks, content prompting the second speaker to speak in the second language via the display unit, and outputting, after the second speaker speaks, content prompting the first speaker to speak in the first language via the display unit.

This speech translation method provides the same effects as the speech translation device described above.

A program according to another aspect of the present disclosure is a program for causing a computer to execute the speech translation method.

This program also provides the same effects as the speech translation device described above.

A speech translation device according to another aspect of the present disclosure further includes a priority utterance input unit that, when the first speaker or the second speaker has spoken and the speech has been recognized, causes speech from that same speaker to be recognized again with priority.

With this configuration, when a speaker (the first speaker or the second speaker) misspeaks, or when hesitant speech is translated partway through, operating the priority utterance input unit gives priority to the speaker who has just spoken, so that speaker gets another opportunity to speak, that is, to rephrase. Accordingly, even after recognition of the speech of one of the two speakers has finished and processing has moved on to recognizing the other speaker's speech, the priority utterance input unit can return processing to recognizing the speech of the former speaker. The speech translation device can therefore reliably acquire the speech of the first speaker and the second speaker and output a translation result based on that speech.
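One way to picture the priority utterance input is as a small piece of turn state that can be rolled back to the previous speaker. The sketch below is an assumption-laden illustration: the `TurnState` class and its method names are hypothetical and do not appear in the source.

```python
class TurnState:
    """Tracks whose speech the device is set up to recognize next (hypothetical)."""

    def __init__(self, first: int = 1) -> None:
        self.active = first   # speaker whose speech is recognized next
        self.last = None      # speaker whose utterance was recognized most recently

    def utterance_recognized(self) -> None:
        # Normal flow: after recognition, the turn passes to the other speaker.
        self.last = self.active
        self.active = 2 if self.active == 1 else 1

    def priority_pressed(self) -> None:
        # Priority utterance input: return the turn to the speaker who just
        # spoke, so that speaker can rephrase.
        if self.last is not None:
            self.active = self.last
```

For example, after the first speaker's utterance is recognized the turn passes to the second speaker; calling `priority_pressed()` at that point hands it back to the first speaker.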

A speech translation device according to another aspect of the present disclosure further includes: a speech input unit to which the speech of the conversation between the first speaker and the second speaker is input; a speech recognition unit that converts the speech in the speech segment detected by the speech detection unit into a text sentence by speech recognition; a translation unit that translates the text sentence converted by the speech recognition unit from the first language into the second language and from the second language into the first language; and a speech output unit that outputs the result translated by the translation unit as speech.

With this configuration, input speech can be recognized and then translated into the other language. That is, the speech translation device can perform the entire process from acquiring the speech of the conversation between the first speaker and the second speaker to outputting the translated result. The speech translation device can therefore translate the speech of the first speaker and the second speaker in both directions without communicating with an external server, and can be used even in environments where communication with an external server is difficult.

In a speech translation device according to another aspect of the present disclosure, a plurality of the speech input units are provided, and the device further includes: a first beamformer unit that steers the sound-collection directivity toward the sound source direction of the first speaker's speech by signal processing the speech input to at least some of the plurality of speech input units; a second beamformer unit that steers the sound-collection directivity toward the sound source direction of the second speaker's speech by signal processing the speech input to at least some of the plurality of speech input units; an input switching unit that switches the acquired signal between the output signal of the first beamformer unit and the output signal of the second beamformer unit; and a sound source direction estimation unit that estimates the sound source direction by signal processing the speech input to the plurality of speech input units. The utterance instruction unit causes the input switching unit to switch between acquiring the output signal of the first beamformer unit and acquiring the output signal of the second beamformer unit.

With this configuration, the sound source direction estimation unit can estimate the direction of the speaker relative to the speech translation device. The input switching unit can therefore switch to whichever of the output signals of the first beamformer unit and the second beamformer unit suits the direction of the speaker. In other words, because the sound-collection directivity of a beamformer unit can be directed toward the sound source, the speech translation device can pick up the speech of the first speaker and the second speaker with reduced ambient noise.
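The switching decision can be sketched as choosing the beamformer whose steering direction lies closest to the estimated sound source direction. This is only an illustration under stated assumptions: the source does not say how the choice is computed, and the function names, the use of degrees, and the nearest-direction rule are all hypothetical.

```python
def angular_distance(a_deg: float, b_deg: float) -> float:
    """Smallest absolute difference between two directions, in degrees (0 to 180)."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)


def select_beamformer(estimated_deg: float,
                      first_deg: float,
                      second_deg: float) -> int:
    """Return 1 or 2: the beamformer steered closest to the estimated direction.

    first_deg / second_deg are the (assumed known) steering directions of the
    first and second beamformer units. Ties go to the first beamformer.
    """
    d1 = angular_distance(estimated_deg, first_deg)
    d2 = angular_distance(estimated_deg, second_deg)
    return 1 if d1 <= d2 else 2
```

With the speakers facing each other (steering directions of 0 and 180 degrees, say), an estimate of 350 degrees selects the first beamformer because directions wrap around at 360.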

In a speech translation device according to another aspect of the present disclosure, a plurality of the speech input units are provided, and the device further includes: a sound source direction estimation unit that estimates the sound source direction by signal processing the speech input to the plurality of speech input units; and a control unit that causes the first language to be displayed in the display area of the display unit corresponding to the position of the first speaker relative to the speech translation device, and causes the second language to be displayed in the display area of the display unit corresponding to the position of the second speaker relative to the speech translation device. The control unit compares a display direction, which is the direction from the display unit of the speech translation device toward the first speaker or the second speaker on the side where content is displayed in one of the display areas, with the sound source direction estimated by the sound source direction estimation unit. When the display direction and the estimated sound source direction substantially match, the control unit causes the speech recognition unit and the translation unit to run; when they differ, the control unit stops the speech recognition unit and the translation unit.

With this configuration, when the display direction of the language displayed in a display area of the display unit substantially matches the sound source direction of the speaker's speech, it is possible to identify whether the speaker is the first speaker, who speaks in the first language, or the second speaker, who speaks in the second language. In that case, the first speaker's speech can be recognized in the first language and the second speaker's speech in the second language. When the display direction and the sound source direction differ, stopping translation of the input speech prevents the input speech from going untranslated or being mistranslated.

The speech translation device can thus reliably recognize speech in the first language and speech in the second language, and can therefore translate the speech reliably. As a result, by suppressing mistranslations and the like, the speech translation device can suppress an increase in its processing load.
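The comparison between display direction and sound source direction amounts to a gate in front of recognition and translation. The sketch below assumes directions are expressed in degrees and that "substantially match" means "within a tolerance"; the `tolerance_deg` value, the function names, and the callback-style `recognize_and_translate` parameter are hypothetical.

```python
from typing import Callable, Optional


def directions_match(display_deg: float, source_deg: float,
                     tolerance_deg: float = 30.0) -> bool:
    """True when the two directions differ by at most tolerance_deg (assumed value)."""
    d = abs(display_deg - source_deg) % 360.0
    return min(d, 360.0 - d) <= tolerance_deg


def handle_speech(display_deg: float, source_deg: float,
                  recognize_and_translate: Callable[[str], str],
                  audio: str) -> Optional[str]:
    """Run recognition and translation only when the directions match.

    Returns the translation result, or None when processing is stopped
    because the speech came from an unexpected direction.
    """
    if directions_match(display_deg, source_deg):
        return recognize_and_translate(audio)
    return None
```

Returning `None` here stands in for "recognition and translation stopped"; a caller could use that signal to re-issue the prompt to the expected speaker.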

In a speech translation device according to another aspect of the present disclosure, when the control unit stops the speech recognition unit and the translation unit, the utterance instruction unit again outputs content prompting an utterance in the instructed language.

With this configuration, even when the display direction and the sound source direction differ, the utterance instruction unit outputs the prompt again, so the intended speaker speaks. The speech translation device can therefore reliably acquire the speech of the intended speaker and translate it more reliably.

In a speech translation device according to another aspect of the present disclosure, when the display direction and the estimated sound source direction differ, the utterance instruction unit again outputs content prompting an utterance in the instructed language after a predetermined period has elapsed since the control unit made the comparison.

With this configuration, leaving a predetermined period after the comparison between the display direction and the sound source direction prevents the speech of the first speaker and the speech of the second speaker from being input mixed together. The prompt is then output again after the predetermined period has elapsed, so the intended speaker speaks. The speech translation device can therefore acquire the speech of the intended speaker more reliably and translate it more reliably.
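The delayed re-prompt can be sketched as "wait, then prompt". This is a minimal illustration only: the function name, the default `wait_seconds` value, and the injectable `sleep` parameter (used so the delay can be replaced in tests) are assumptions; the source specifies only that a predetermined period elapses first.

```python
import time
from typing import Callable


def reprompt_after_mismatch(prompt: Callable[[], None],
                            wait_seconds: float = 1.0,
                            sleep: Callable[[float], None] = time.sleep) -> None:
    """Wait for the predetermined period, then output the prompt again.

    The wait gives the unintended speaker time to stop talking, so the two
    speakers' voices are less likely to be captured together.
    """
    sleep(wait_seconds)
    prompt()
```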

In a speech translation device according to another aspect of the present disclosure, a plurality of the speech input units are provided, and the device further includes: a first beamformer unit that steers the sound-collection directivity toward the sound source direction of the first speaker's speech by signal processing the speech input to at least some of the plurality of speech input units; a second beamformer unit that steers the sound-collection directivity toward the sound source direction of the second speaker's speech by signal processing the speech input to at least some of the plurality of speech input units; and a sound source direction estimation unit that estimates the sound source direction by signal processing the output signal of the first beamformer unit and the output signal of the second beamformer unit.

With this configuration, the sound source direction estimation unit can estimate the direction of the speaker relative to the speech translation device. Because the sound source direction estimation unit processes the output signals of the first beamformer unit and the second beamformer unit, which are suited to the speaker directions, the computational cost of the signal processing can be reduced.

In a speech translation device according to another aspect of the present disclosure, the utterance instruction unit outputs, when the speech translation device starts up, content prompting the first speaker to speak in the first language via the display unit, and, after the first speaker's speech has been translated from the first language into the second language and the translation result has been displayed on the display unit, outputs content prompting the second speaker to speak in the second language via the display unit.

With this configuration, if it is registered in advance that the first speaker speaks in the first language and the second speaker then speaks in the second language, outputting content in the first language that prompts the first speaker to speak when the device starts up lets the first speaker begin speaking. This suppresses mistranslation caused by the second speaker speaking in the second language at startup.

In a speech translation device according to another aspect of the present disclosure, after translation starts, the utterance instruction unit causes the speech output unit to output a voice prompt a predetermined number of times and, after the voice prompt has been output the predetermined number of times, causes the display unit to output a message prompting an utterance.

With this configuration, limiting the voice prompt to the predetermined number of times suppresses an increase in the energy consumption of the speech translation device.
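The capped voice prompt can be sketched as a simple counter check. The function name and the default limit of two voice prompts are hypothetical; the source says only that the number of voice prompts is predetermined.

```python
def prompt_channel(voice_prompts_so_far: int, max_voice_prompts: int = 2) -> str:
    """Choose how the next prompt is delivered.

    Returns "voice" until the predetermined number of voice prompts has been
    output, and "display" (an on-screen message) after that.
    """
    return "voice" if voice_prompts_so_far < max_voice_prompts else "display"
```

Calling it with counts 0, 1, 2, 3 and a limit of 2 yields voice, voice, display, display.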

In a speech translation device according to another aspect of the present disclosure, the speech recognition unit outputs the result of recognizing the speech together with a reliability score for that result. When the reliability score acquired from the speech recognition unit is at or below a threshold, the utterance instruction unit does not translate the speech whose reliability score is at or below the threshold, and instead outputs content prompting an utterance via at least one of the display unit and the speech output unit.

With this configuration, when the reliability score, which indicates the accuracy of speech recognition, is at or below the threshold, the utterance instruction unit outputs the prompt again, so the intended speaker speaks again. The speech translation device can therefore recognize the speech of the intended speaker reliably and translate it more reliably.

In particular, when the speech output unit outputs the prompt as speech, the speaker is more likely to notice that the speech was not recognized correctly.
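The reliability-score check can be sketched as a guard in front of the translation step. The function and parameter names below are hypothetical, and the score scale and threshold value are not given in the source; the only behavior taken from the text is that a score at or below the threshold skips translation and triggers a prompt.

```python
from typing import Callable, Optional


def translate_or_reprompt(text: str, score: float, threshold: float,
                          translate: Callable[[str], str],
                          prompt: Callable[[], None]) -> Optional[str]:
    """Translate recognized text only when its reliability score exceeds the threshold.

    When the score is at or below the threshold, the text is not translated;
    the speaker is prompted to speak again and None is returned.
    """
    if score <= threshold:
        prompt()
        return None
    return translate(text)
```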

Note that some specific aspects of these may be realized using a system, a method, an integrated circuit, a computer program, or a computer-readable recording medium such as a CD-ROM, or using any combination of a system, a method, an integrated circuit, a computer program, or a recording medium.

Each of the embodiments described below shows a specific example of the present disclosure. The numerical values, shapes, materials, constituent elements, arrangement positions of the constituent elements, and so on shown in the following embodiments are examples and are not intended to limit the present disclosure. Among the constituent elements in the following embodiments, those not recited in the independent claims are described as optional constituent elements. The contents of the embodiments can also be combined with one another.

Hereinafter, a speech translation device, a speech translation method, and a program therefor according to one aspect of the present disclosure are described in detail with reference to the drawings.

(Embodiment 1)
<Configuration: Speech translation device 1>
FIG. 1A is a diagram showing the appearance of the speech translation device 1 according to Embodiment 1 and an example of a scene in which the first speaker and the second speaker use the speech translation device 1 when the first speaker speaks. FIG. 1B is a diagram showing the appearance of the speech translation device 1 according to Embodiment 1 and an example of a scene in which the first speaker and the second speaker use the speech translation device 1 when the second speaker speaks.

As shown in FIGS. 1A and 1B, the speech translation device 1 is a device that bidirectionally translates a conversation between a first speaker, who speaks in a first language, and a second speaker, who is the first speaker's conversation partner and speaks in a second language different from the first language, so that the two can converse. In other words, the speech translation device 1 recognizes the language uttered by each of the first speaker and the second speaker, who use two different languages, and translates what each says into the other's language. For example, the speech translation device 1 translates the first language uttered by the first speaker into the second language and outputs the result, and translates the second language uttered by the second speaker into the first language and outputs the result. The first language and the second language are, for example, Japanese, English, French, German, or Chinese.

FIGS. 1A and 1B of the present embodiment illustrate one first speaker and one second speaker conversing face to face. The speech translation device 1 may also be used when a plurality of first speakers and a plurality of second speakers converse.

The first speaker and the second speaker may use the speech translation device 1 to converse face to face, or may converse side by side as shown in FIG. 1C. FIG. 1C is a diagram showing another example of a usage scene of the speech translation device 1 when the first speaker and the second speaker converse. In this case, the speech translation device 1 may change its display mode. As shown in FIGS. 1A, 1B, and 1C, the speech translation device 1 is used in a vertical or horizontal orientation.

The speech translation device 1 is a mobile terminal that the first speaker can carry, such as a smartphone or a tablet terminal.

FIG. 2 is a block diagram showing the speech translation device 1 according to Embodiment 1.

As shown in FIG. 2, the speech translation device 1 includes a voice input unit 21, a voice detection unit 22, a priority speech input unit 24, a speech instruction unit 25, a speech recognition unit 23, a translation unit 26, a display unit 27, an audio output unit 28, and a power supply unit 29.

[Voice input unit 21]
The voice input unit 21 is a microphone that receives the voices of the first speaker and the second speaker as they converse, and is communicably connected to the voice detection unit 22. That is, the voice input unit 21 acquires (picks up) sound, converts the acquired sound into an electrical signal, and outputs the resulting acoustic signal to the voice detection unit 22. The acoustic signal acquired by the voice input unit 21 may be stored in a storage unit or the like.

The voice input unit 21 may also be configured as an adapter. In this case, the voice input unit 21 functions when a microphone is attached to the speech translation device 1, and acquires the acoustic signal picked up by that microphone.

[Voice detection unit 22]
The voice detection unit 22 is a device that detects, from the sound input to the voice input unit 21, the voice sections uttered by the first speaker and the second speaker, and is communicably connected to the voice input unit 21 and the speech recognition unit 23. Specifically, based on the volume indicated by the acoustic signal acquired from the voice input unit 21, the voice detection unit 22 treats the moment the volume rises and the moment the volume falls as voice boundaries, and detects the start point and end point of each voice section in the acoustic signal (end-of-speech detection). Here, a voice section refers to the voice of a single utterance by a speaker, and may include the period from the start point to the end point of that utterance.

The voice detection unit 22 detects the voice sections in the acoustic signal, that is, the individual utterances in the conversation between the first speaker and the second speaker, and outputs voice information indicating the detected voice to the speech recognition unit 23.
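The volume-based detection described for the voice detection unit 22 can be illustrated with a minimal Python sketch. It assumes the acoustic signal has already been reduced to one volume value per frame; the function name `detect_voice_sections`, the `threshold` value, and the `min_silence_frames` parameter are hypothetical and do not appear in the source, which does not specify how the rise and fall in volume are measured.

```python
from typing import List, Optional, Tuple


def detect_voice_sections(
    volumes: List[float],
    threshold: float = 0.1,
    min_silence_frames: int = 3,
) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices of voice sections, end exclusive.

    A section starts at the first frame whose volume reaches `threshold`
    and ends once the volume has stayed below it for `min_silence_frames`
    consecutive frames (an assumed stand-in for end-of-speech detection).
    """
    sections: List[Tuple[int, int]] = []
    start: Optional[int] = None  # frame where the current section began
    silence = 0                  # consecutive quiet frames inside a section
    for i, volume in enumerate(volumes):
        if volume >= threshold:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                # The section ends at the first frame of the quiet run.
                sections.append((start, i - silence + 1))
                start, silence = None, 0
    if start is not None:
        # The signal ended before the silence run was long enough.
        sections.append((start, len(volumes) - silence))
    return sections
```

A short pause inside an utterance (fewer than `min_silence_frames` quiet frames) does not split the section, which mirrors the concern that a hesitating speaker should not be cut off too early.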

[Speech instruction unit 25]
The speech instruction unit 25 is a device that, after the first speaker speaks, outputs content prompting the second speaker to speak, in the second language, via the display unit 27, and that, after the second speaker speaks, outputs content prompting the first speaker to speak, in the first language. That is, so that the first speaker and the second speaker can converse, the speech instruction unit 25 outputs to the display unit 27, at the appropriate timing, speech instruction text information, which is content prompting the first speaker or the second speaker to speak. The speech instruction unit 25 also outputs to the audio output unit 28 speech instruction audio information, which is content prompting the first speaker or the second speaker to speak. In this case, the speech instruction audio information output to the audio output unit 28 has the same content as the speech instruction text information output to the display unit 27. Note that the speech instruction unit 25 need not output the speech instruction audio information to the audio output unit 28; outputting a spoken prompt is not essential.

Here, the speech instruction text information is a text sentence indicating content that prompts the first speaker or the second speaker to speak. The speech instruction audio information is audio indicating content that prompts the first speaker or the second speaker to speak.

The speech instruction unit 25 also outputs an instruction command for the translation unit 26 to translate from the first language into the second language or from the second language into the first language. For example, because the second speaker speaks after the first speaker has spoken, the speech instruction unit 25 outputs to the speech recognition unit 23 an instruction command for recognizing the second speaker's voice in the second language, and outputs to the translation unit 26 an instruction command for translating the recognized speech from the second language into the first language. The same applies when the first speaker speaks.

After one of the first speaker and the second speaker speaks, the speech instruction unit 25 outputs to the display unit 27 speech instruction text information prompting the other speaker to speak. At or after the time the translation unit 26 outputs the translation result of the voice uttered by the one speaker, the speech instruction unit 25 outputs the speech instruction text information to the display unit 27 and the speech instruction audio information to the audio output unit 28.

When the speech instruction unit 25 acquires an instruction command from the priority speech input unit 24, described later, it outputs to the display unit 27 speech instruction text information prompting the most recent speaker to speak again, and outputs the corresponding speech instruction audio information to the audio output unit 28.

When the speech translation device 1 is started, the speech instruction unit 25 outputs content prompting the first speaker to speak, in the first language, via the display unit 27. That is, when the first speaker is the owner of the speech translation device 1, the speech instruction unit 25 prompts the first speaker to begin the conversation. After the first speaker's utterance has been translated from the first language into the second language and the translation result has been displayed on the display unit 27, the speech instruction unit 25 outputs content prompting the second speaker to speak, in the second language, via the display unit 27. After the first speaker's utterance in the first language is translated into the second language, the second speaker speaks in the second language, and that utterance is translated into the first language. Repeating this keeps the conversation between the first speaker and the second speaker going.

After translation starts, the speech instruction unit 25 causes the audio output unit 28 to output a voice prompting speech a specified number of times. That is, because the second speaker may not speak right away or may not have heard the prompt, the speech instruction unit 25 outputs the voice prompt the specified number of times. After the voice prompt has been output the specified number of times, the speech instruction unit 25 causes the display unit 27 to output a message prompting speech. In other words, if outputting the voice prompt the specified number of times has no effect, a message prompting speech is displayed on the display unit 27 in order to reduce power consumption.

The speech instruction unit 25 is communicably connected to the speech recognition unit 23, the priority speech input unit 24, the translation unit 26, the display unit 27, and the audio output unit 28.
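The prompting behavior of the speech instruction unit 25 can be sketched roughly as follows. This is an illustration under stated assumptions, not the disclosed implementation: the language codes `"ja"` and `"en"`, the function names, and the default of three voice prompts are all hypothetical, since the source only speaks of "a specified number of times".

```python
from typing import List


def prompt_language(last_spoken: str, first: str = "ja", second: str = "en") -> str:
    """Language of the prompt issued after an utterance: the other speaker's."""
    return second if last_spoken == first else first


def prompt_channels(unanswered_prompts: int, max_voice_prompts: int = 3) -> List[str]:
    """Channels used for successive prompts while nobody replies.

    Voice prompts are played up to `max_voice_prompts` times. If there is
    still no reply, one on-screen message is shown and audio prompting
    stops, which is the power-saving fallback described in the text.
    """
    channels = ["voice"] * min(unanswered_prompts, max_voice_prompts)
    if unanswered_prompts > max_voice_prompts:
        channels.append("display")
    return channels
```

For example, with the default limit, five unanswered prompt opportunities yield three voice prompts followed by a single display message.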

[Priority speech input unit 24]
The priority speech input unit 24 is a device that, when the first speaker or the second speaker has spoken and the speech has been recognized, can cause the speech recognition unit 23 to again recognize, with priority (or consecutively), speech by that same speaker. That is, the priority speech input unit 24 can give the most recent speaker, whose speech has already been recognized, another opportunity to speak. In other words, even after the device has finished recognizing the voice of one of the first speaker and the second speaker and has moved on to the processing for recognizing the other speaker's voice, the priority speech input unit 24 can return the device to the processing for recognizing the voice of the one speaker.

The priority speech input unit 24 is an operation input unit that accepts input from the operator of the speech translation device 1. The most recent speaker may want to continue speaking, for example, when the speaker misspoke, when hesitant speech was translated partway through, or when the speech translation device 1 may have judged the utterance to be finished because the period in which the voice detection unit 22 detected no voice reached or exceeded a specified length. For this reason, the priority speech input unit 24 causes the speech recognition unit 23 to recognize, and the translation unit 26 to translate, the voice of the most recent speaker with priority. Accordingly, the priority speech input unit 24 outputs to the speech instruction unit 25 an instruction command that causes the speech instruction unit 25 to again output the speech instruction text information and the speech instruction audio information prompting speech. The operator is at least one of the first speaker and the second speaker; in the present embodiment, the operator is mainly the first speaker.

In the present embodiment, the priority speech input unit 24 is a touch sensor provided integrally with the display unit 27 of the speech translation device 1. In this case, the display unit 27 of the speech translation device 1 may display an operation button, serving as the priority speech input unit 24, that accepts an operation by one of the speakers.

In the present embodiment, when the speech recognition unit 23 switches speech recognition from the first language to the second language, a priority button for the first language, serving as the priority speech input unit 24, is displayed on the display unit 27 so that the first language used before the switch can be recognized and translated with priority. Likewise, when the speech recognition unit 23 switches speech recognition from the second language to the first language, a priority button for the second language, serving as the priority speech input unit 24, is displayed on the display unit 27 so that the second language used before the switch can be recognized and translated with priority. Such a priority button is displayed on the display unit 27 at least after translation.
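The interplay between normal turn switching and the priority button can be pictured as a small state machine. The sketch below is one possible reading of the text; the class `TurnState`, its field names, and the language codes are hypothetical and are not taken from the source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnState:
    first: str = "ja"                  # first speaker's language (assumed code)
    second: str = "en"                 # second speaker's language (assumed code)
    listening: str = "ja"              # current recognition language
    last_spoken: Optional[str] = None  # language of the most recent utterance

    def on_translation_done(self) -> str:
        """After an utterance is translated, listen for the other speaker."""
        self.last_spoken = self.listening
        self.listening = self.second if self.listening == self.first else self.first
        return self.listening

    def on_priority_button(self) -> str:
        """Priority button: return to the language of the most recent speaker."""
        if self.last_spoken is not None:
            self.listening = self.last_spoken
        return self.listening
```

Tracking `last_spoken` explicitly makes the button idempotent: pressing it twice still leaves the device listening for the most recent speaker, and pressing it before anyone has spoken changes nothing.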

[Speech recognition unit 23]
The speech recognition unit 23 performs speech recognition on the voice in the voice section detected by the voice detection unit 22 and thereby converts it into a text sentence. Specifically, when the speech recognition unit 23 acquires the voice information detected by the voice detection unit 22, it recognizes the voice indicated by the voice information. For example, if the voice indicated by the voice information is in the first language, it is recognized in the first language; if it is in the second language, it is recognized in the second language. When the speech recognition unit 23 has recognized the voice in the first language, it generates a first text sentence indicating the content of the recognized voice and outputs the generated first text sentence to the translation unit 26. When the speech recognition unit 23 has recognized the voice in the second language, it generates a second text sentence indicating the content of the recognized voice and outputs the generated second text sentence to the translation unit 26.

[Translation unit 26]
The translation unit 26 is a translation device that translates the text sentence converted by the speech recognition unit 23 from the first language into the second language, and from the second language into the first language. Specifically, upon acquiring the first text sentence from the speech recognition unit 23, the translation unit 26 translates it from the first language into the second language. That is, the translation unit 26 generates a second translated text sentence by translating the first text sentence into the second language. Upon acquiring the second text sentence from the speech recognition unit 23, the translation unit 26 translates it from the second language into the first language. That is, the translation unit 26 generates a first translated text sentence by translating the second text sentence into the first language.

Here, the content of the first text sentence in the first language matches the content of the second translated text sentence in the second language. Likewise, the content of the second text sentence in the second language matches the content of the first translated text sentence in the first language.

After generating the second translated text sentence, the translation unit 26 recognizes its content and generates translated speech in the second language representing that content. After generating the first translated text sentence, the translation unit 26 recognizes its content and generates translated speech in the first language representing that content. The translated speech based on the first translated text sentence and the second translated text sentence may instead be generated by the audio output unit 28.

After generating the second translated text sentence or the first translated text sentence, the translation unit 26 outputs it to the display unit 27. After generating the translated speech in the second language or the translated speech in the first language, the translation unit 26 outputs it to the audio output unit 28.

The translation unit 26 is communicably connected to the speech instruction unit 25, the speech recognition unit 23, the display unit 27, and the audio output unit 28.

[Display unit 27]
The display unit 27 is, for example, a monitor such as a liquid crystal panel or an organic EL panel, and is communicably connected to the speech instruction unit 25 and the translation unit 26. Specifically, the display unit 27 is a monitor that, once the voice in the voice section detected by the voice detection unit 22 has been recognized, displays the result of translating that voice from the first language into the second language, and displays the result of translating from the second language into the first language. The display unit 27 displays the first text sentence, the second text sentence, the first translated text sentence, and the second translated text sentence acquired from the translation unit 26. After or at the same time as displaying these text sentences, the display unit 27 displays the speech instruction text information, which is content prompting the first speaker or the second speaker to speak.

The display unit 27 changes the screen layout in which the text sentences are displayed according to the positional relationship of the first speaker and the second speaker relative to the speech translation device 1. For example, as shown in FIGS. 1A and 1B, when the first speaker speaks, the display unit 27 displays the recognized first text sentence in the display area located on the first speaker's side and displays the translated second translated text sentence in the display area located on the second speaker's side. When the second speaker speaks, the display unit 27 displays the recognized second text sentence in the display area located on the second speaker's side and displays the translated first translated text sentence in the display area located on the first speaker's side. In these cases, the display unit 27 displays the first text sentence and the second translated text sentence with their characters oriented upside down relative to each other, and likewise the first translated text sentence and the second text sentence. As shown in FIG. 1C, when the first speaker and the second speaker converse side by side, the display unit 27 displays the first text sentence and the second text sentence so that their characters are oriented in the same direction.
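The layout rule for the display unit 27 can be sketched as a small function that decides, for each speaker's display area, which text to show and how to rotate it. The function name `layout`, the arrangement labels `"facing"` and `"side_by_side"`, and the use of a rotation in degrees are assumptions made for illustration; the source only states that the characters are upside down relative to each other when the speakers face each other and share one direction when they sit side by side.

```python
from typing import Dict, Union

Area = Dict[str, Union[str, int]]


def layout(
    speaker: str,
    recognized: str,
    translated: str,
    arrangement: str = "facing",
) -> Dict[str, Area]:
    """Return the text and rotation (degrees) for each speaker's display area.

    `speaker` is "first" or "second". The recognized text goes to the
    speaker's own area and the translation to the listener's area.
    """
    if speaker not in ("first", "second"):
        raise ValueError("speaker must be 'first' or 'second'")
    if arrangement not in ("facing", "side_by_side"):
        raise ValueError("arrangement must be 'facing' or 'side_by_side'")
    listener = "second" if speaker == "first" else "first"
    # Facing speakers read from opposite edges, so the second speaker's
    # area is rotated by 180 degrees; side by side, both areas are upright.
    rotation = {"first": 0, "second": 180 if arrangement == "facing" else 0}
    return {
        speaker: {"text": recognized, "rotation": rotation[speaker]},
        listener: {"text": translated, "rotation": rotation[listener]},
    }
```

Calling `layout("first", "何をお探しですか？", "What are you looking for?")` places the Japanese text upright on the first speaker's side and the English translation rotated on the second speaker's side, as in FIG. 1A.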

[Audio output unit 28]
The audio output unit 28 is a speaker that acquires from the translation unit 26 the translated speech resulting from translation by the translation unit 26 and outputs the acquired translated speech, and is communicably connected to the translation unit 26 and the speech instruction unit 25. That is, when the first speaker speaks, the audio output unit 28 plays and outputs translated speech with the same content as the second translated text sentence displayed on the display unit 27. When the second speaker speaks, the audio output unit 28 plays and outputs translated speech with the same content as the first translated text sentence displayed on the display unit 27.

Upon acquiring the speech instruction audio information, the audio output unit 28 plays and outputs, to the first speaker or the second speaker, the audio prompting speech that is indicated by the speech instruction audio information. The audio output unit 28 plays and outputs the audio indicated by the speech instruction audio information after outputting the translated speech of the first translated text sentence or the second translated text sentence.

[Power supply unit 29]
The power supply unit 29 is, for example, a primary battery or a secondary battery, and is electrically connected via wiring to the voice input unit 21, the voice detection unit 22, the priority speech input unit 24, the speech instruction unit 25, the speech recognition unit 23, the translation unit 26, the display unit 27, the audio output unit 28, and the like. The power supply unit 29 supplies power to the voice detection unit 22, the priority speech input unit 24, the speech instruction unit 25, the speech recognition unit 23, the translation unit 26, the display unit 27, the audio output unit 28, and the like.

<Operation>
The operation performed by the speech translation device 1 configured as described above will be described with reference to FIG. 3.

FIG. 3 is a flowchart showing the operation of the speech translation device 1 according to Embodiment 1.

In the speech translation device 1, it is set in advance that the first speaker speaks in the first language and that the second speaker speaks in the second language. Here, it is assumed that one of the first speaker and the second speaker has started speaking. When the first speaker starts the speech translation device 1, the speech translation device 1 begins translating the conversation between the first speaker and the second speaker.

First, as shown in FIG. 3, when the first speaker and the second speaker are to converse, the speech translation device 1 is started before either of them speaks. The speech translation device 1 acquires sound (S11) and generates an acoustic signal indicating the acquired sound. In the present embodiment, when one speaker starts speaking, the speech translation device 1 acquires the voice uttered by that speaker. As shown in FIG. 1A, when the one speaker is the first speaker and says "何をお探しですか？" ("What are you looking for?"), the voice input unit 21 acquires this voice. The voice input unit 21 acquires the sound, converts the acquired sound into an electrical signal, and outputs the resulting acoustic signal to the voice detection unit 22.

Next, upon acquiring the acoustic signal from the voice input unit 21, the voice detection unit 22 detects the voice section of the one speaker from the sound indicated by the acoustic signal (S12) and extracts the detected voice as the voice of the one speaker. For example, as shown in FIG. 1A, the voice detection unit 22 detects, from the sound input to the voice input unit 21, the voice section in which the first speaker says "何をお探しですか？" and extracts the detected voice. The voice detection unit 22 outputs voice information indicating the extracted voice of the one speaker to the speech recognition unit 23.

The speech instruction unit 25 outputs to the speech recognition unit 23 an instruction command for performing speech recognition in the language spoken by the one speaker, and outputs to the translation unit 26 an instruction command for translating the recognized speech from the one language into the other language. That is, the speech instruction unit 25 outputs an instruction command for switching the recognition language of the speech recognition unit 23 so that the speech recognition unit 23 can recognize the language spoken by the one speaker. The speech instruction unit 25 also outputs an instruction command for switching the translation language so that the translation unit 26 can translate into the desired language based on the language recognized by the speech recognition unit 23.

For example, upon acquiring the instruction command, the speech recognition unit 23 switches the recognition language from the second language to the first language, or from the first language to the second language. Upon acquiring the instruction command, the translation unit 26 switches the translation language from the second language to the first language, or from the first language to the second language.

Next, upon acquiring the instruction command and the voice information, the speech recognition unit 23 recognizes the voice represented by the voice information (S13). For example, if the one speaker's language is the first language, the speech recognition unit 23 selects the first language as the recognition language and recognizes the voice in the first language. That is, it converts the voice represented by the voice information into a text sentence in the first language and outputs the resulting first text sentence to the translation unit 26. If the one speaker's language is the second language, the speech recognition unit 23 selects the second language as the recognition language and recognizes the voice in the second language. That is, it converts the voice into a text sentence in the second language and outputs the resulting second text sentence to the translation unit 26.

For example, as shown in FIG. 1A, the speech recognition unit 23 converts the voice "何をお探しですか？" represented by the voice information into the first text sentence "何をお探しですか？".

Next, upon acquiring the text sentence from the speech recognition unit 23, the translation unit 26 translates it from one of the first and second languages into the other (S14). That is, if the text sentence is a first text sentence in the first language, the translation unit 26 translates it into the second language and generates a second translated text sentence as the result. If the text sentence is a second text sentence in the second language, the translation unit 26 translates it into the first language and generates a first translated text sentence as the result. For example, as shown in FIG. 1A, the translation unit 26 translates the first text sentence "何をお探しですか？" in the first language into the second language and generates the second translated text sentence "What are you looking for?".

Next, the translation unit 26 outputs the generated second translated text sentence in the second language, or the first translated text sentence in the first language, to the display unit 27. The display unit 27 displays the second translated text sentence or the first translated text sentence (S15). For example, as shown in FIG. 1A, the display unit 27 displays the second translated text sentence "What are you looking for?".

In addition, when the translation unit 26 generates the second translated text sentence, it generates translated speech in the second language by converting that sentence into speech. When it generates the first translated text sentence, it generates translated speech in the first language by converting that sentence into speech. The translation unit 26 outputs the generated translated speech in the second language or in the first language to the voice output unit 28. The voice output unit 28 outputs the translated speech in the second language or in the first language (S16). For example, as shown in FIG. 1A, the voice output unit 28 outputs the second translated text sentence "What are you looking for?" as speech. Steps S15 and S16 may be performed at the same time or in the reverse order.
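Steps S13 to S16 form a simple chain: recognize, translate, display, and speak. A minimal sketch of one turn follows, assuming the recognizer, translator, and synthesizer are supplied as plain callables. The function `run_turn` and the toy stand-ins below are hypothetical illustrations, not the engines of the speech translation device 1.

```python
from typing import Callable, Tuple


def run_turn(
    voice: bytes,
    source_lang: str,
    target_lang: str,
    recognize: Callable[[bytes, str], str],
    translate: Callable[[str, str, str], str],
    synthesize: Callable[[str, str], bytes],
) -> Tuple[str, str, bytes]:
    """Process one detected voice section (illustrative sketch only)."""
    text = recognize(voice, source_lang)                    # S13: voice -> text sentence
    translated = translate(text, source_lang, target_lang)  # S14: text -> translated text
    displayed = translated                                  # S15: text handed to the display
    spoken = synthesize(translated, target_lang)            # S16: translated speech
    return text, displayed, spoken


# Toy stand-ins so the sketch runs end to end (all hypothetical).
def toy_recognize(voice: bytes, lang: str) -> str:
    return voice.decode("utf-8")


def toy_translate(text: str, src: str, dst: str) -> str:
    table = {("ja", "en", "何をお探しですか？"): "What are you looking for?"}
    return table.get((src, dst, text), text)


def toy_synthesize(text: str, lang: str) -> bytes:
    return text.encode("utf-8")


result = run_turn(
    "何をお探しですか？".encode("utf-8"), "ja", "en",
    toy_recognize, toy_translate, toy_synthesize,
)
```

Because S15 and S16 may run simultaneously or in either order, the sketch computes both from the same translated text and imposes no dependency between them.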

Next, the utterance instruction unit 25 determines whether it has acquired an instruction command from the priority utterance input unit 24 (S17). For example, when the one speaker wants to speak again, the operator of the speech translation device 1 operates the priority utterance input unit 24. On accepting the operation, the priority utterance input unit 24 outputs an instruction command to the utterance instruction unit 25.

When the utterance instruction unit 25 has acquired an instruction command from the priority utterance input unit 24 (YES in S17), the speech recognition unit 23 and the translation unit 26 return to recognizing and translating speech uttered by the one speaker, even if they have finished or interrupted the recognition and translation processing for the one speaker, or have already moved on to processing for recognizing the other speaker's speech. So that the speech of the one speaker, whose most recent utterance was recognized, is recognized with priority, the utterance instruction unit 25 outputs to the display unit 27 utterance instruction text information that prompts the one speaker to speak again. The display unit 27 displays the utterance instruction text information acquired from the utterance instruction unit 25 (S18). For example, the display unit 27 displays the utterance instruction text information "もう一度発話して下さい" ("Please speak again").

In addition, when the utterance instruction unit 25 has acquired an instruction command from the priority utterance input unit 24, it outputs to the voice output unit 28 utterance instruction voice information that prompts the one speaker to speak. The voice output unit 28 outputs the utterance instruction voice information acquired from the utterance instruction unit 25 as speech (S19). For example, the voice output unit 28 outputs the utterance instruction voice information "もう一度発話して下さい" ("Please speak again") as speech.

In this case, the speech translation device 1 may display or speak a message such as "Thank you for your patience." to the other speaker, or may output nothing. Steps S18 and S19 may be performed simultaneously or in the reverse order.

The utterance instruction unit 25 may also cause the voice output unit 28 to output the utterance instruction voice information a specified number of times. After the utterance instruction voice information has been output the specified number of times, the utterance instruction unit 25 may cause the display unit 27 to output the message of the utterance instruction voice information.

The speech translation device 1 then ends the processing. When the one speaker speaks again, the speech translation device 1 starts processing again from step S11.

On the other hand, when the utterance instruction unit 25 does not acquire an instruction command from the priority utterance input unit 24 (NO in S17), it outputs to the display unit 27 utterance instruction text information that prompts the other speaker to speak. This is the case where, for example, the one speaker does not need to speak again because the speech was recognized correctly. The display unit 27 displays the utterance instruction text information acquired from the utterance instruction unit 25 (S21). For example, as shown in FIG. 1A, the display unit 27 displays the utterance instruction text information "Your Turn!".

In addition, when the utterance instruction unit 25 does not acquire an instruction command from the priority utterance input unit 24, it outputs to the voice output unit 28 utterance instruction voice information that prompts the other speaker to speak. The voice output unit 28 outputs the utterance instruction voice information acquired from the utterance instruction unit 25 as speech (S22). For example, the voice output unit 28 outputs the utterance instruction voice information "Your Turn!" as speech. Steps S21 and S22 may be performed simultaneously or in the reverse order.

The utterance instruction unit 25 may also cause the voice output unit 28 to output the voice prompting speech a specified number of times. After the voice prompting speech has been output the specified number of times, the utterance instruction unit 25 may cause the display unit 27 to output a message prompting speech.
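The branch at S17 and the optional "voice a specified number of times, then a displayed message" behavior can be sketched as two small functions. The speaker labels, the prompt kinds `"speak_again"` and `"your_turn"`, and the function names are hypothetical labels chosen for illustration.

```python
from typing import Iterator, Tuple


def other_speaker(speaker: str) -> str:
    """Return the conversation partner of the given speaker label."""
    return "second" if speaker == "first" else "first"


def next_prompt(last_speaker: str, priority_pressed: bool) -> Tuple[str, str]:
    """Decide who is prompted after a turn (sketch of the S17 branch)."""
    if priority_pressed:
        # YES in S17: the same speaker is asked to speak again (S18, S19).
        return last_speaker, "speak_again"
    # NO in S17: the other speaker is told it is their turn (S21, S22).
    return other_speaker(last_speaker), "your_turn"


def prompt_outputs(voice_repeats: int) -> Iterator[str]:
    """Yield the output channel for each prompt: voice N times, then the display."""
    for _ in range(voice_repeats):
        yield "voice"
    yield "display"


decision = next_prompt("first", priority_pressed=False)
channels = list(prompt_outputs(2))
```

Here `decision` prompts the second speaker, and `channels` contains two voice prompts followed by one displayed message.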

In this way, once the first speaker has operated the speech translation device 1 at the start, the speech translation device 1 can translate the conversation between the first speaker and the second speaker.

The same processing applies when the other speaker speaks in response to the one speaker's utterance, so its description is omitted.

<Effects>
Next, the effects of the speech translation device 1 according to this embodiment are described.

As described above, the speech translation device 1 according to this embodiment is a speech translation device 1 for a conversation between a first speaker who speaks a first language and a second speaker who is the first speaker's conversation partner and speaks a second language different from the first language. It includes: a voice detection unit 22 that detects, from the sound input to the voice input unit 21, the voice sections in which the first speaker and the second speaker speak; a display unit 27 that, once the voice in a voice section detected by the voice detection unit 22 has been recognized, displays the result of translating that voice from the first language into the second language, and displays the result of translating it from the second language into the first language; and an utterance instruction unit 25 that outputs, in the second language via the display unit 27, content prompting the second speaker to speak after the first speaker has spoken, and outputs, in the first language via the display unit 27, content prompting the first speaker to speak after the second speaker has spoken.

With this configuration, by detecting each voice section in the conversation between the first speaker and the second speaker, the device can obtain the result of translating the detected voice from the first language into the second language, or from the second language into the first language. In other words, the speech translation device 1 can automatically translate the language of the detected voice into the other language for each utterance of the first speaker and the second speaker, without any input operation for translation.

The speech translation device 1 can also output content prompting the second speaker to speak after the first speaker has spoken, and content prompting the first speaker to speak after the second speaker has spoken. As a result, the first speaker and the second speaker can recognize when to speak without performing an input operation to start each utterance.

Thus, the speech translation device 1 requires no input operation to start an utterance and no input operation to switch languages, and so offers excellent operability. Because users are unlikely to lose time operating the speech translation device 1, an increase in the time for which it is used can be suppressed.

Therefore, by simplifying operation, the speech translation device 1 can suppress an increase in its energy consumption. In particular, because operation is simplified, erroneous operations can also be suppressed.

The speech translation method according to this embodiment is a speech translation method for a conversation between a first speaker who speaks a first language and a second speaker who is the first speaker's conversation partner and speaks a second language different from the first language. The method includes: detecting, from the sound input to the voice input unit 21, the voice sections in which the first speaker and the second speaker speak; recognizing the voice in a detected voice section and displaying, on the display unit 27, the result of translating that voice from the first language into the second language and the result of translating it from the second language into the first language; and outputting, in the second language via the display unit 27, content prompting the second speaker to speak after the first speaker has spoken, and outputting, in the first language via the display unit 27, content prompting the first speaker to speak after the second speaker has spoken.

This speech translation method provides the same effects as the speech translation device 1 described above.

The program according to this embodiment is a program for causing a computer to execute the speech translation method.

This program also provides the same effects as the speech translation device 1 described above.

The speech translation device 1 according to this embodiment further includes a priority utterance input unit 24 that, after the first speaker or the second speaker has spoken and the speech has been recognized, causes the utterance of that same speaker to be recognized again with priority.

With this configuration, when a speaker (the first speaker or the second speaker) misspeaks, or when hesitant speech is translated partway through, operating the priority utterance input unit 24 gives priority to the speaker who spoke, so that speaker gets another opportunity to speak (can rephrase). Thus, even after recognition of the voice of one of the first and second speakers has finished and the device has moved on to processing for recognizing the other speaker's voice, the priority utterance input unit 24 can return the device to processing that recognizes the voice of the one speaker. As a result, the speech translation device 1 can reliably acquire the voices of the first speaker and the second speaker and can output translation results based on those voices.

The speech translation device 1 according to this embodiment further includes: a voice input unit 21 to which the voice of the conversation between the first speaker and the second speaker is input; a speech recognition unit 23 that converts the voice in a voice section detected by the voice detection unit 22 into a text sentence by speech recognition; a translation unit 26 that translates the text sentence produced by the speech recognition unit 23 from the first language into the second language, and from the second language into the first language; and a voice output unit 28 that outputs the result translated by the translation unit 26 as speech.

With this configuration, the input voice can be recognized and then translated into the other language. In other words, the speech translation device 1 can perform the whole process from acquiring the voices of the conversation between the first speaker and the second speaker to outputting the translation results. The speech translation device 1 can therefore translate the voices of the first speaker and the second speaker in both directions without communicating with an external server, and can be used even in environments where communication with an external server is difficult.

In the speech translation device 1 according to this embodiment, when the speech translation device 1 starts up, the utterance instruction unit 25 outputs, in the first language via the display unit 27, content prompting the first speaker to speak. After the first speaker's utterance has been translated from the first language into the second language and the translation result has been displayed on the display unit 27, the utterance instruction unit 25 outputs, in the second language via the display unit 27, content prompting the second speaker to speak.

With this configuration, if it is registered in advance that the second speaker speaks in the second language after the first speaker speaks in the first language, then outputting content in the first language that prompts the first speaker to speak when the speech translation device 1 starts up lets the first speaker begin speaking. This suppresses mistranslation caused by the second speaker speaking in the second language at startup.

In the speech translation device 1 according to this embodiment, after translation starts, the utterance instruction unit 25 causes the voice output unit 28 to output the voice prompting speech a specified number of times and, after that voice has been output the specified number of times, causes the display unit 27 to output a message prompting speech.

With this configuration, limiting the voice prompting speech to the specified number of times suppresses an increase in the energy consumption of the speech translation device 1.

(Embodiment 2)
<Configuration>
The configuration of the speech translation device 1a according to this embodiment is described with reference to FIG. 4.

FIG. 4 is a block diagram showing the speech translation device 1a according to Embodiment 2.

This embodiment differs from Embodiment 1 in that the sound source direction is estimated.

Unless otherwise stated, the rest of the configuration in this embodiment is the same as in Embodiment 1. Identical components are given the same reference signs, and their detailed description is omitted.

As shown in FIG. 4, in addition to the voice detection unit 22, the priority utterance input unit 24, the utterance instruction unit 25, the speech recognition unit 23, the translation unit 26, the display unit 27, the voice output unit 28, and the power supply unit 29, the speech translation device 1a includes a plurality of voice input units 21 and a sound source direction estimation unit 31.

[Plurality of voice input units 21]
The plurality of voice input units 21 constitute a microphone array. Specifically, the microphone array consists of two or more microphone units spaced apart from one another; it picks up voice and obtains an acoustic signal by converting the picked-up voice into an electrical signal.

The plurality of voice input units 21 output the acquired acoustic signals to the sound source direction estimation unit 31. At least one of the plurality of voice input units 21 also outputs its acoustic signal to the voice detection unit 22. In this embodiment, one voice input unit 21 is communicably connected to the voice detection unit 22 and outputs its acoustic signal to the voice detection unit 22.

In this embodiment, two voice input units 21 are provided in the speech translation device 1a. One voice input unit 21 is placed apart from the other voice input unit 21 by a distance of no more than half the wavelength of the voice.
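As a rough worked example of the half-wavelength condition, the maximum spacing for a given frequency f is c / (2f), where c is the speed of sound. The disclosure does not state which frequency or speed of sound is used. The sketch below assumes c = 343 m/s (air at about 20 °C) and an illustrative upper frequency of 4 kHz, and the function name is hypothetical.

```python
def max_mic_spacing_m(frequency_hz: float, speed_of_sound_m_s: float = 343.0) -> float:
    """Return half the wavelength, in meters, for the given frequency.

    Both the default speed of sound and the choice of frequency are
    assumptions made for illustration, not values from the disclosure.
    """
    if frequency_hz <= 0:
        raise ValueError("frequency_hz must be positive")
    wavelength_m = speed_of_sound_m_s / frequency_hz
    return wavelength_m / 2.0


spacing_at_4khz = max_mic_spacing_m(4000.0)  # 343 / 4000 / 2 = 0.042875 m
```

Under these assumptions, microphones spaced about 4.3 cm apart or closer satisfy the condition for components up to 4 kHz. Keeping the spacing within half a wavelength keeps the phase difference between the two microphones unambiguous.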

[Sound source direction estimation unit 31]
The sound source direction estimation unit 31 estimates the sound source direction by applying signal processing to the voice input to the plurality of voice input units 21. Specifically, upon acquiring the voice information from the voice detection unit 22 and the acoustic signals from the plurality of voice input units 21, the sound source direction estimation unit 31 calculates the time difference (phase difference) of the voice arriving at each of the voice input units 21 that make up the microphone array, and estimates the sound source direction by, for example, a delay time estimation method. In other words, when the voice detection unit 22 detects a voice section, the voice of the first speaker or the second speaker has been input to the voice input units 21, so the sound source direction estimation unit 31 uses the acquisition of the voice information as the trigger to start estimating the sound source direction.

The sound source direction estimation unit 31 outputs sound source direction information indicating the estimated sound source direction to the utterance instruction unit 25.
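The text names only "a delay time estimation method" as an example. One common way to realize the idea with two microphones is to find the lag that maximizes the cross-correlation of the two signals and convert that delay into an angle with θ = arcsin(c·τ / d). The sketch below uses that approach as a stand-in, not as the method of the disclosure. The function names, the far-field assumption, the 343 m/s speed of sound, and the sample values are all assumptions.

```python
import math
from typing import Sequence


def estimate_delay_samples(a: Sequence[float], b: Sequence[float], max_lag: int) -> int:
    """Return the lag (in samples) by which signal b trails signal a.

    A brute-force cross-correlation search over [-max_lag, max_lag].
    """
    n = min(len(a), len(b))
    best_lag, best_score = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += a[i] * b[j]
        if score > best_score:
            best_score, best_lag = score, lag
    return best_lag


def direction_deg(lag_samples: int, sample_rate_hz: float, spacing_m: float,
                  speed_of_sound_m_s: float = 343.0) -> float:
    """Convert a delay into an angle from broadside (far-field assumption)."""
    tau = lag_samples / sample_rate_hz
    ratio = speed_of_sound_m_s * tau / spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp against rounding and noise
    return math.degrees(math.asin(ratio))


# Toy signals: the same pulse reaches microphone b two samples after microphone a.
mic_a = [0.0] * 32
mic_b = [0.0] * 32
mic_a[10] = 1.0
mic_b[12] = 1.0

lag = estimate_delay_samples(mic_a, mic_b, max_lag=5)
angle = direction_deg(lag, sample_rate_hz=48000.0, spacing_m=0.04)
```

A positive lag means the sound reached microphone a first, so the source lies on microphone a's side. The sign of the angle is what would distinguish the first speaker's side from the second speaker's side.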

[Utterance instruction unit 25]
The utterance instruction unit 25 has a control unit 31a that controls how content is displayed on the display unit 27. Specifically, the control unit 31a displays the first language in the display area of the display unit 27 that corresponds to the position of the first speaker relative to the speech translation device 1a, and displays the second language in the display area of the display unit 27 that corresponds to the position of the second speaker relative to the speech translation device 1a. For example, as shown in FIG. 1A, the display area corresponding to the position of the first speaker is the display area on the first speaker's side of the display unit 27, where Japanese is displayed. The display area corresponding to the position of the second speaker is the display area on the second speaker's side of the display unit 27, where English is displayed.

The control unit 31a compares the display direction with the sound source direction estimated by the sound source direction estimation unit 31. The display direction is the direction from the display unit 27 of the speech translation device 1a toward the first speaker or the second speaker, namely toward the side on which content is displayed in one of the display areas of the display unit 27. When the display direction and the sound source direction substantially match, the control unit 31a causes the speech recognition unit 23 and the translation unit 26 to run. For example, as shown in FIG. 1A, when the first speaker speaks, the first text sentence representing the content of the first speaker's voice input to the speech translation device 1a is displayed in the display area on the first speaker's side (the side facing the first speaker). In this case, the display direction is the direction from the display unit 27 toward the first speaker, and the sound source direction estimated by the sound source direction estimation unit 31 is also the direction from the display unit 27 toward the first speaker.

On the other hand, when the display direction and the sound source direction differ, the control unit 31a stops the speech recognition unit 23 and the translation unit 26. Even if the first speaker speaks and the first text sentence representing the content of the first speaker's voice is displayed in the display area on the first speaker's side, the display direction and the estimated sound source direction do not match when the sound source direction estimated by the sound source direction estimation unit 31 points from the display unit 27 toward the second speaker. Examples include the case where the first speaker, after speaking, continues to speak without operating the priority utterance input unit 24, and the case where the voice input unit 21 picks up ambient sound unrelated to the conversation.

When the control unit 31a stops the speech recognition unit 23 and the translation unit 26, the utterance instruction unit 25 again outputs content prompting speech in the instructed language. For example, because the display direction and the estimated sound source direction do not match, it is unclear which speaker spoke, so the speech recognition unit 23 cannot tell whether to recognize the voice in the first language or in the second language. Also, if the first speaker speaks but the voice cannot be recognized, it cannot be translated either. For these reasons, the control unit 31a stops the speech recognition unit 23 and the translation unit 26.
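The comparison made by the control unit 31a can be sketched as an angular-difference check with a tolerance. The text only says the directions "substantially match". Representing both directions as angles in degrees, the 15-degree default tolerance, and the function names are assumptions made for illustration.

```python
def directions_match(display_deg: float, source_deg: float,
                     tolerance_deg: float = 15.0) -> bool:
    """Return True when two directions agree within the tolerance.

    The difference is wrapped into [-180, 180) so that, for example,
    350 degrees and 0 degrees count as 10 degrees apart.
    """
    diff = (display_deg - source_deg + 180.0) % 360.0 - 180.0
    return abs(diff) <= tolerance_deg


def control_action(display_deg: float, source_deg: float) -> str:
    """Sketch of the control decision: run the pipeline or stop and re-prompt."""
    if directions_match(display_deg, source_deg):
        return "run_recognition_and_translation"
    # Mismatch: it is unclear who spoke, so stop and prompt again.
    return "stop_and_prompt_again"


same_side = control_action(display_deg=0.0, source_deg=350.0)
opposite_side = control_action(display_deg=0.0, source_deg=180.0)
```

With these inputs, `same_side` selects recognition and translation, while `opposite_side` stops the pipeline and falls back to prompting again in the instructed language.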

<Operation>
The operation of the speech translation device 1a configured as described above is described with reference to FIG. 5.

図５は、実施の形態２における音声翻訳装置１ａの動作を示すフローチャートである。 FIG. 5 is a flowchart showing the operation of the speech translation device 1a in the second embodiment.

図５と同様の処理については、同一の符号を付し、説明を適宜省略する。 Processes similar to those in FIG. 5 are denoted by the same reference numerals, and description thereof will be omitted as appropriate.

音声翻訳装置１ａは、音を取得し（Ｓ１１）、取得した音を示す音響信号を生成する。 The speech translation device 1a acquires a sound (S11) and generates an acoustic signal indicating the acquired sound.

次に、音源方向推定部３１は、音声検出部２２から音声情報を取得したかどうかを判定する（Ｓ１２ａ）。 Next, the sound source direction estimation unit 31 determines whether or not audio information has been acquired from the audio detection unit 22 (S12a).

音源方向推定部３１が音声検出部２２から音声情報を取得しない場合は（Ｓ１２ａでＮＯ）、音声検出部２２が音響信号から音声を検出できない場合であるため、音源方向推定部３１は、音声情報を取得できない。つまり、第１話者及び第２話者が会話していない場合である。この場合、ステップＳ１２ａの処理を繰り返す。 If the sound source direction estimation unit 31 does not acquire audio information from the audio detection unit 22 (NO in S12a), the audio detection unit 22 has not been able to detect speech in the acoustic signal, so the sound source direction estimation unit 31 cannot acquire audio information. That is, the first speaker and the second speaker are not having a conversation. In this case, the process of step S12a is repeated.

音源方向推定部３１が音声検出部２２から音声情報を取得した場合（Ｓ１２ａでＹＥＳ）、第１話者及び第２話者の少なくとも一方が発話した場合である。この場合、音源方向推定部３１は、複数の音声入力部２１のそれぞれから取得した音響信号に含まれる音声の時間差（位相差）を算出し、音源方向を推定する（Ｓ３１）。音源方向推定部３１は、推定した結果である音源方向を示す音源方向情報を発話指示部２５に出力する。 When the sound source direction estimation unit 31 acquires audio information from the audio detection unit 22 (YES in S12a), this is a case where at least one of the first speaker and the second speaker speaks. In this case, the sound source direction estimating unit 31 calculates the time difference (phase difference) of sounds included in the acoustic signals acquired from each of the plurality of audio input units 21, and estimates the sound source direction (S31). The sound source direction estimation unit 31 outputs sound source direction information indicating the sound source direction, which is the result of the estimation, to the speech instruction unit 25.
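
The time-difference estimate in step S31 can be illustrated with a minimal sketch. The description does not specify an algorithm, so the cross-correlation search below is an assumption, as are the names best_lag and estimate_direction_deg, the two-microphone geometry, and the speed-of-sound constant.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for room-temperature air


def best_lag(a, b, max_lag):
    """Return the lag (in samples) by which b trails a, found by
    maximizing the cross-correlation over lags in [-max_lag, max_lag]."""
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, x in enumerate(a):
            j = i + lag
            if 0 <= j < len(b):
                score += x * b[j]
        if score > best_score:
            best, best_score = lag, score
    return best


def estimate_direction_deg(mic1, mic2, sample_rate, mic_distance):
    """Estimate the arrival angle in degrees from the inter-microphone delay.
    0 means broadside; positive means the sound reached mic1 first."""
    max_lag = int(math.ceil(mic_distance / SPEED_OF_SOUND * sample_rate))
    delay = best_lag(mic1, mic2, max_lag) / sample_rate
    # Clamp so that measurement noise cannot push asin out of its domain.
    s = max(-1.0, min(1.0, delay * SPEED_OF_SOUND / mic_distance))
    return math.degrees(math.asin(s))
```

A sign convention like this one is enough to tell which side of the device the speech came from, which is all the later comparison in step S32 needs.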

次に、音源方向推定部３１の制御部３１ａは、表示方向と、推定した音源方向とが実質的に一致しているかどうかを判定する（Ｓ３２）。 Next, the control unit 31a of the sound source direction estimation unit 31 determines whether the display direction and the estimated sound source direction substantially match (S32).

制御部３１ａは、表示方向と音源方向とが異なる場合（Ｓ３２でＮＯ）、音声認識部２３及び翻訳部２６を停止させる。制御部３１ａが音声認識部２３及び翻訳部２６を停止させる場合、発話指示部２５は、再度、指示した言語による発話を促す内容を出力する。 If the display direction and the sound source direction are different (NO in S32), the control unit 31a stops the speech recognition unit 23 and the translation unit 26. When the control unit 31a stops the speech recognition unit 23 and the translation unit 26, the speech instruction unit 25 again outputs content encouraging speech in the instructed language.

具体的には、発話指示部２５は、一方の話者に発話を促す内容である発話指示テキスト情報を表示部２７に出力する。表示部２７は、発話指示部２５から取得した発話指示テキスト情報を表示する（Ｓ３３）。 Specifically, the speech instruction section 25 outputs speech instruction text information that prompts one speaker to speak to the display section 27. The display section 27 displays the speech instruction text information acquired from the speech instruction section 25 (S33).

また、発話指示部２５は、一方の話者に発話を促す内容である発話指示音声情報を音声出力部２８に出力する。音声出力部２８は、発話指示部２５から取得した発話指示音声情報を音声により出力する（Ｓ３４）。 Furthermore, the speech instruction section 25 outputs speech instruction voice information that prompts one speaker to speak to the voice output section 28 . The audio output unit 28 outputs the speech instruction audio information acquired from the speech instruction unit 25 as a sound (S34).

そして、音声翻訳装置１ａは、処理を終了する。これにより、一方の話者は再度、発話を行うことで、音声翻訳装置１ａは、ステップＳ１１から処理を開始する。 Then, the speech translation device 1a ends the process. As a result, one of the speakers speaks again, and the speech translation device 1a starts processing from step S11.

制御部３１ａは、表示方向と音源方向とが実質的に一致する場合（Ｓ３２でＹＥＳ）、音声認識部２３及び翻訳部２６を実行させる。そして、音声翻訳装置１ａは、ステップＳ１３に進み、図３と同様の処理を行う。 When the display direction and the sound source direction substantially match (YES in S32), the control unit 31a causes the speech recognition unit 23 and the translation unit 26 to execute. The speech translation device 1a then proceeds to step S13 and performs the same process as in FIG. 3.
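
The branch at step S32 can be sketched as follows. The description only says the directions "substantially match", so the 30-degree tolerance, the angle representation, and the names angle_diff_deg and decide are hypothetical choices for illustration.

```python
def angle_diff_deg(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def decide(display_dir_deg, source_dir_deg, tolerance_deg=30.0):
    """Mirror the S32 branch: run recognition and translation when the
    directions substantially match, otherwise stop and prompt again."""
    if angle_diff_deg(display_dir_deg, source_dir_deg) <= tolerance_deg:
        return "recognize_and_translate"  # YES in S32: continue to S13
    return "stop_and_prompt"              # NO in S32: continue to S33, S34
```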

＜作用効果＞ <Effect>
次に、本実施の形態における音声翻訳装置１ａの作用効果について説明する。 Next, the effects of the speech translation device 1a in this embodiment will be explained.

以上のように、本実施の形態における音声翻訳装置１ａにおいて、音声入力部２１は、複数設けられる。また、音声翻訳装置１ａは、さらに、複数の音声入力部２１に入力される音声を信号処理することにより、音源方向を推定する音源方向推定部３１と、当該音声翻訳装置１ａに対する第１話者の位置に対応する表示部２７の表示領域に第１言語を表示させ、当該音声翻訳装置１ａに対する第２話者の位置に対応する表示部２７の表示領域に第２言語を表示させる制御部３１ａとを備える。そして、制御部３１ａは、当該音声翻訳装置１ａの表示部２７から第１話者又は第２話者に向かう表示方向であって、表示部２７のいずれかの表示領域に表示する側の表示方向と、音源方向推定部３１が推定した音源方向とを比較し、表示方向と音源方向とが実質的に一致する場合、音声認識部２３及び翻訳部２６を実行させ、表示方向と音源方向とが異なる場合、音声認識部２３及び翻訳部２６を停止させる。 As described above, in the speech translation device 1a according to the present embodiment, a plurality of speech input units 21 are provided. The speech translation device 1a further includes a sound source direction estimation unit 31 that estimates a sound source direction by signal processing the speech input to the plurality of speech input units 21, and a control unit 31a that displays the first language in the display area of the display unit 27 corresponding to the position of the first speaker relative to the speech translation device 1a and displays the second language in the display area of the display unit 27 corresponding to the position of the second speaker relative to the speech translation device 1a. The control unit 31a compares the display direction, which is the direction from the display unit 27 of the speech translation device 1a toward the first speaker or the second speaker on the side of whichever display area of the display unit 27 is being displayed, with the sound source direction estimated by the sound source direction estimation unit 31. If the display direction and the sound source direction substantially match, the control unit 31a causes the speech recognition unit 23 and the translation unit 26 to execute; if the display direction and the sound source direction differ, it stops the speech recognition unit 23 and the translation unit 26.

これによれば、表示部２７の表示領域に表示された言語の表示方向と、話者の発話による音声の音源方向とが実質的に一致する場合、話者が第１言語で発話する第１話者か第２言語で発話する第２話者かを特定することができる。この場合、第１話者の音声を第１言語で音声認識することができ、第２話者の音声を第２言語で音声認識することができる。また、表示方向と音源方向とが異なる場合、入力された音声の翻訳を停止することで、入力された音声が翻訳されない又は誤翻訳されてしまうことを抑制することができる。 According to this, when the display direction of the language displayed in the display area of the display unit 27 and the sound source direction of the speech uttered by the speaker substantially match, it is possible to identify whether the speaker is the first speaker, who speaks in the first language, or the second speaker, who speaks in the second language. In this case, the first speaker's speech can be recognized in the first language, and the second speaker's speech can be recognized in the second language. Further, when the display direction and the sound source direction are different, stopping the translation of the input speech prevents the input speech from going untranslated or being mistranslated.

これにより、音声翻訳装置１ａは、第１言語の音声及び第２言語の音声を確実に音声認識することができるため、確実に音声を翻訳することができる。その結果、この音声翻訳装置１ａでは、誤翻訳等を抑制することで音声翻訳装置１ａの処理量の増大を抑制することができる。 Thereby, the speech translation device 1a can reliably recognize the first language speech and the second language speech, and therefore can reliably translate the speech. As a result, this speech translation device 1a can suppress an increase in the processing amount of the speech translation device 1a by suppressing mistranslations and the like.

本実施の形態における音声翻訳装置１ａにおいて、制御部３１ａが音声認識部２３及び翻訳部２６を停止させる場合、発話指示部２５は、再度、指示した言語による発話を促す内容を出力する。 In the speech translation device 1a according to the present embodiment, when the control section 31a stops the speech recognition section 23 and the translation section 26, the speech instruction section 25 again outputs content encouraging speech in the instructed language.

これによれば、表示方向と音源方向とが異なる場合でも、発話指示部２５が再度、発話を促す内容を出力することで、対象となる話者が発話する。このため、音声翻訳装置１ａは、対象となる話者の音声を確実に取得することができるため、より確実に音声を翻訳することができる。 According to this, even if the display direction and the sound source direction are different, the speech instruction unit 25 outputs the content encouraging speech again, so that the target speaker speaks. Therefore, the speech translation device 1a can reliably acquire the speech of the target speaker, and therefore can translate the speech more reliably.

本実施の形態における音声翻訳装置１ａにおいても、実施の形態１等と同様の作用効果を奏する。 The speech translation device 1a in this embodiment also has the same effects as in the first embodiment.

（実施の形態２の変形例） (Modification of Embodiment 2)
本変形例における他の構成は、特に明記しない場合は、実施の形態１と同様であり、同一の構成については同一の符号を付して構成に関する詳細な説明を省略する。 Unless otherwise specified, other configurations in this modification are the same as those in Embodiment 1, and the same configurations are denoted by the same reference numerals and detailed explanations regarding the configurations will be omitted.

このように構成される音声翻訳装置１ａが行う動作について、図６を用いて説明する。 The operation performed by the speech translation device 1a configured in this way will be explained using FIG. 6.

図６は、実施の形態２の変形例における音声翻訳装置１ａの動作を示すフローチャートである。 FIG. 6 is a flowchart showing the operation of the speech translation device 1a in a modification of the second embodiment.

音声翻訳装置１ａの処理において、ステップＳ１１～Ｓ３１の処理を経たのち、ステップＳ３２でＮＯの場合、制御部３１ａは、表示方向と音源方向との比較をしてから規定期間が経過したかどうかを判定する（Ｓ３２ａ）。 In the processing of the speech translation device 1a, after the processing of steps S11 to S31, if the result of step S32 is NO, the control unit 31a determines whether a specified period has elapsed since it compared the display direction and the sound source direction (S32a).

制御部３１ａは、表示方向と音源方向との比較をしてから規定期間が経過していない場合（Ｓ３２ａでＮＯ）、処理をステップＳ３２ａに戻す。 If the specified period has not elapsed since the comparison between the display direction and the sound source direction (NO in S32a), the control unit 31a returns the process to step S32a.

制御部３１ａは、表示方向と音源方向との比較をしてから規定期間が経過している場合（Ｓ３２ａでＹＥＳ）、処理をステップＳ３３に進め、図５と同様の処理を行う。 If the specified period has elapsed since the comparison between the display direction and the sound source direction (YES in S32a), the control unit 31a advances the process to step S33 and performs the same process as in FIG. 5.

このように、本変形例における音声翻訳装置１ａにおいて、表示方向と音源方向とが異なる場合、発話指示部２５は、制御部３１ａが比較をしてから規定期間が経過した後に、再度、指示した言語による発話を促す内容を出力する。 In this way, in the speech translation device 1a according to the present modification, when the display direction and the sound source direction are different, the speech instruction section 25 again outputs content prompting speech in the instructed language after a specified period has elapsed since the control section 31a made the comparison.

これによれば、表示方向と音源方向との比較をしてから規定期間を空けることで、第１話者と第２話者との音声が混在して入力されることを抑制することができる。これにより、規定期間経過後、再度、発話を促す内容を出力することで、対象となる話者が発話する。このため、音声翻訳装置１ａは、対象となる話者の音声をより確実に取得することができるため、より確実に音声を翻訳することができる。 According to this, by leaving a specified period after comparing the display direction and the sound source direction, it is possible to suppress the voices of the first speaker and the second speaker from being input together. . As a result, after the predetermined period of time has elapsed, the target speaker speaks by outputting the content encouraging him to speak again. Therefore, the speech translation device 1a can more reliably acquire the speech of the target speaker, and therefore can translate the speech more reliably.
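
The hold-off in step S32a can be sketched as a polling loop. The function name wait_then_prompt, the polling interval, and the injected now and sleep callables are assumptions made so the sketch can be exercised without real delays; the description specifies neither the length of the period nor how it is measured.

```python
import time


def wait_then_prompt(prompt, hold_seconds, poll_seconds=0.01,
                     now=time.monotonic, sleep=time.sleep):
    """Wait until hold_seconds have elapsed since the direction comparison,
    then issue the prompt (S33 and S34). The comparison is assumed to have
    happened at the moment this function is called."""
    compared_at = now()
    while now() - compared_at < hold_seconds:  # NO in S32a: check again
        sleep(poll_seconds)
    prompt()                                   # YES in S32a: prompt again
```

Using a monotonic clock avoids the wait being shortened or lengthened by wall-clock adjustments.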

本変形例における音声翻訳装置１ａにおいても、実施の形態２と同様の作用効果を奏する。 The speech translation device 1a in this modification also provides the same effects as in the second embodiment.

（実施の形態３） (Embodiment 3)
＜構成＞ <Configuration>
本実施の形態の音声翻訳装置１ｂの構成を、図７を用いて説明する。 The configuration of the speech translation device 1b of this embodiment will be explained using FIG. 7.

図７は、実施の形態３における音声翻訳装置１ｂを示すブロック図である。 FIG. 7 is a block diagram showing a speech translation device 1b in the third embodiment.

本実施の形態では、音源方向を推定する点で、実施の形態１等と相違する。 This embodiment differs from Embodiment 1 etc. in that the direction of the sound source is estimated.

本実施の形態における他の構成は、特に明記しない場合は、実施の形態１等と同様であり、同一の構成については同一の符号を付して構成に関する詳細な説明を省略する。 Unless otherwise specified, other configurations in this embodiment are the same as those in Embodiment 1, etc., and the same configurations are denoted by the same reference numerals and detailed explanations regarding the configurations will be omitted.

音声翻訳装置１ｂは、音声検出部２２、優先発話入力部２４、発話指示部２５、音声認識部２３、翻訳部２６、表示部２７、音声出力部２８、電源部２９及び音源方向推定部３１の他に、複数の音声入力部２１と、第１ビームフォーマ部４１と、第２ビームフォーマ部４２と、入力切換部３２とを備える。 The speech translation device 1b includes a speech detection section 22, a priority speech input section 24, a speech instruction section 25, a speech recognition section 23, a translation section 26, a display section 27, a speech output section 28, a power supply section 29, and a sound source direction estimation section 31. In addition, it includes a plurality of audio input sections 21, a first beam former section 41, a second beam former section 42, and an input switching section 32.

［複数の音声入力部２１］ [Multiple audio input units 21]
複数の音声入力部２１は、マイクロフォンアレイを構成する。複数の音声入力部２１のそれぞれは、取得した音響信号を第１ビームフォーマ部４１及び第２ビームフォーマ部４２に出力する。本実施の形態では、２つの音声入力部２１を用いている例を示す。 The plurality of audio input units 21 constitute a microphone array. Each of the plurality of audio input units 21 outputs the acquired acoustic signal to the first beamformer unit 41 and the second beamformer unit 42. In this embodiment, an example is shown in which two audio input units 21 are used.

［第１ビームフォーマ部４１及び第２ビームフォーマ部４２］ [First beamformer section 41 and second beamformer section 42]
第１ビームフォーマ部４１は、複数の音声入力部２１のうちの少なくとも一部の音声入力部２１に入力される音声の音響信号を信号処理することにより、第１話者による音声の音源方向に収音の指向性を制御する。また、第２ビームフォーマ部４２は、複数の音声入力部２１のうちの少なくとも一部の音声入力部２１に入力される音声の音響信号を信号処理することにより、第２話者による音声の音源方向に収音の指向性を制御する。本実施の形態では、第１ビームフォーマ部４１及び第２ビームフォーマ部４２は、複数の音声入力部２１のそれぞれから取得した音響信号を信号処理する。 The first beamformer section 41 controls the directivity of sound collection toward the sound source direction of the first speaker's speech by signal processing the acoustic signals of the speech input to at least some of the plurality of audio input sections 21. Likewise, the second beamformer section 42 controls the directivity of sound collection toward the sound source direction of the second speaker's speech by signal processing the acoustic signals of the speech input to at least some of the plurality of audio input sections 21. In this embodiment, the first beamformer section 41 and the second beamformer section 42 perform signal processing on the acoustic signals obtained from each of the plurality of audio input sections 21.

これにより、第１ビームフォーマ部４１及び第２ビームフォーマ部４２は、所定方向に収音の指向性を制御することで、所定方向以外の音の入力を抑制する。所定方向は、例えば、第１話者及び第２話者がそれぞれ発話する音声のそれぞれの音源方向である。 Thereby, the first beamformer section 41 and the second beamformer section 42 suppress input of sound in directions other than the predetermined direction by controlling the directivity of sound collection in the predetermined direction. The predetermined directions are, for example, respective sound source directions of voices uttered by the first speaker and the second speaker.

本実施の形態では、第１ビームフォーマ部４１は、第１話者側に配置され、複数の音声入力部２１のそれぞれと通信可能に接続され、第２ビームフォーマ部４２は、第２話者側に配置され、複数の音声入力部２１のそれぞれと通信可能に接続される。第１ビームフォーマ部４１及び第２ビームフォーマ部４２のそれぞれは、複数の音声入力部２１のそれぞれから取得した音響信号を信号処理した結果である音響処理信号を、入力切換部３２に出力する。 In this embodiment, the first beamformer section 41 is arranged on the first speaker side and is communicably connected to each of the plurality of audio input sections 21, and the second beamformer section 42 is arranged on the side of the second speaker. It is disposed on the side and is communicably connected to each of the plurality of audio input units 21 . Each of the first beamformer section 41 and the second beamformer section 42 outputs, to the input switching section 32, an acoustic processing signal that is a result of signal processing of the acoustic signal acquired from each of the plurality of audio input sections 21.
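
One common way to give two microphones a preferred pickup direction is delay-and-sum beamforming. The description does not state which beamforming method the first beamformer section 41 and the second beamformer section 42 use, so the sketch below is an illustrative stand-in; the name delay_and_sum and the integer sample lag are assumptions.

```python
def delay_and_sum(mic1, mic2, steer_lag):
    """Align mic2 to mic1 by steer_lag samples and average the two channels.
    Sound arriving with exactly that inter-microphone lag adds coherently;
    sound arriving from other directions is attenuated."""
    out = []
    for i in range(len(mic1)):
        j = i + steer_lag
        other = mic2[j] if 0 <= j < len(mic2) else 0.0
        out.append(0.5 * (mic1[i] + other))
    return out
```

Under this sketch, two beamformers would simply be two calls with opposite steering lags, one favoring each speaker's side.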

［発話指示部２５］ [Speech instruction section 25]
発話指示部２５は、入力切換部３２に、第１ビームフォーマ部４１の出力信号を取得するか、第２ビームフォーマ部４２の出力信号を取得するかを切換えさせる。具体的には、発話指示部２５は、音源方向推定部３１から推定した結果である音源方向を示す音源方向情報を取得すると、音源方向情報に示される音源方向と、ビームフォーマ部の収音の指向性である所定方向とを比較する。発話指示部２５は、音源方向と所定方向とが実質的に一致する又は近しい方向のビームフォーマ部を選択する。 The speech instruction section 25 causes the input switching section 32 to switch between acquiring the output signal of the first beamformer section 41 and acquiring the output signal of the second beamformer section 42. Specifically, when the speech instruction section 25 acquires, from the sound source direction estimation section 31, sound source direction information indicating the estimated sound source direction, it compares the sound source direction indicated by that information with the predetermined direction, which is the sound-collection directivity of each beamformer section. The speech instruction section 25 selects the beamformer section whose predetermined direction substantially matches or is close to the sound source direction.

発話指示部２５は、第１ビームフォーマ部４１及び第２ビームフォーマ部４２から選択したビームフォーマ部の出力信号を出力させるように、入力切換部３２に切換コマンドを出力する。 The speech instruction section 25 outputs a switching command to the input switching section 32 so as to output the output signal of the beamformer section selected from the first beamformer section 41 and the second beamformer section 42 .
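
The selection described above amounts to a nearest-direction lookup. In this sketch the function name select_beamformer, the dictionary keys, and the example steering angles are all hypothetical; the description does not give concrete directions.

```python
def select_beamformer(source_dir_deg, beam_dirs_deg):
    """Return the key of the beamformer whose predetermined direction is
    closest to the estimated sound source direction (angles in degrees)."""
    def diff(a, b):
        d = abs(a - b) % 360.0
        return min(d, 360.0 - d)

    return min(beam_dirs_deg, key=lambda k: diff(source_dir_deg, beam_dirs_deg[k]))


# Hypothetical layout: the first speaker sits at 0 degrees and the
# second speaker sits opposite, at 180 degrees.
BEAM_DIRS = {"first_beamformer_41": 0.0, "second_beamformer_42": 180.0}
```

The returned key would then be carried in the switching command sent to the input switching section 32.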

［入力切換部３２］ [Input switching section 32]
入力切換部３２は、第１ビームフォーマ部４１の出力信号、及び、第２ビームフォーマ部４２の出力信号を取得し、音声検出部２２に出力する出力信号を切換える装置である。入力切換部３２は、取得する信号を、第１ビームフォーマ部４１の出力信号、又は、第２ビームフォーマ部４２の出力信号に切換える。具体的には、入力切換部３２は、発話指示部２５からの切換コマンドを取得することで、第１ビームフォーマ部４１の出力信号から第２ビームフォーマ部４２の出力信号、又は、第２ビームフォーマ部４２の出力信号から第１ビームフォーマ部４１の出力信号に切換える。入力切換部３２は、切換コマンドによって、第１ビームフォーマ部４１の出力信号を音声検出部２２に出力したり、第２ビームフォーマ部４２の出力信号を音声検出部２２に出力したりする。 The input switching section 32 is a device that acquires the output signal of the first beamformer section 41 and the output signal of the second beamformer section 42 and switches which output signal is output to the audio detection section 22. The input switching section 32 switches the signal to be acquired to the output signal of the first beamformer section 41 or the output signal of the second beamformer section 42. Specifically, on acquiring a switching command from the speech instruction section 25, the input switching section 32 switches from the output signal of the first beamformer section 41 to the output signal of the second beamformer section 42, or from the output signal of the second beamformer section 42 to the output signal of the first beamformer section 41. Depending on the switching command, the input switching section 32 outputs either the output signal of the first beamformer section 41 or the output signal of the second beamformer section 42 to the audio detection section 22.

入力切換部３２は、第１ビームフォーマ部４１、第２ビームフォーマ部４２、音声検出部２２及び発話指示部２５と通信可能に接続される。 The input switching section 32 is communicably connected to the first beamformer section 41, the second beamformer section 42, the voice detection section 22, and the speech instruction section 25.

＜動作＞ <Operation>
以上のように構成される音声翻訳装置１ｂが行う動作について説明する。 The operation performed by the speech translation device 1b configured as described above will be explained.

図８は、実施の形態３における音声翻訳装置１ｂの動作を示すフローチャートである。 FIG. 8 is a flowchart showing the operation of the speech translation device 1b in the third embodiment.

図５等と同様の処理については、同一の符号を付し、説明を適宜省略する。 Processes similar to those in FIG. 5 and the like are denoted by the same reference numerals, and description thereof will be omitted as appropriate.

図８に示すように、音声翻訳装置１ｂの処理において、ステップＳ１１、Ｓ１２ａ、Ｓ３１及びＳ３２の処理を経たのち、制御部３１ａが表示方向と音源方向とが実質的に一致すると判定した場合（Ｓ３２でＹＥＳ）、発話指示部２５は、入力切換部３２に切換コマンドを出力する（Ｓ５１）。 As shown in FIG. 8, in the processing of the speech translation device 1b, after the processing of steps S11, S12a, S31, and S32, if the control unit 31a determines that the display direction and the sound source direction substantially match (S32 (YES), the speech instruction section 25 outputs a switching command to the input switching section 32 (S51).

具体的には、第１話者と第２話者とが発話するうえで、２つの音声入力部２１において、第１ビームフォーマ部４１は、第２話者の発話よりも第１話者の発話に対して高い感度を有し、第２ビームフォーマ部４２は、第１話者の発話よりも第２話者の発話に対して高い感度を有する。 Specifically, when the first speaker and the second speaker speak, with the two audio input units 21, the first beamformer unit 41 has higher sensitivity to the first speaker's speech than to the second speaker's speech, and the second beamformer unit 42 has higher sensitivity to the second speaker's speech than to the first speaker's speech.

このため、表示方向が第１話者側の表示部２７の表示領域であれば、第１ビームフォーマ部４１の方が第１話者の発話に対して高い感度を有するため、発話指示部２５は、第１ビームフォーマ部４１の出力信号を出力させるように、入力切換部３２に切換コマンドを出力する。この場合、入力切換部３２は、切換コマンドを取得すると、第１ビームフォーマ部４１の出力信号を出力する。 Therefore, if the display direction is the display area of the display section 27 on the side of the first speaker, the first beamformer section 41 has higher sensitivity to the first speaker's utterance, so the utterance instruction section 25 outputs a switching command to the input switching unit 32 so as to output the output signal of the first beamformer unit 41. In this case, the input switching section 32 outputs the output signal of the first beamformer section 41 upon acquiring the switching command.

また、表示方向が第２話者側の表示部２７の表示領域であれば、第２ビームフォーマ部４２の方が第２話者の発話に対して高い感度を有するため、発話指示部２５は、第２ビームフォーマ部４２の出力信号を出力させるように、入力切換部３２に切換コマンドを出力する。この場合、入力切換部３２は、切換コマンドを取得すると、第２ビームフォーマ部４２の出力信号を出力する。 Furthermore, if the display direction is the display area of the display section 27 on the second speaker's side, the second beamformer section 42 has higher sensitivity to the second speaker's speech, so the speech instruction section 25 , outputs a switching command to the input switching unit 32 so as to output the output signal of the second beamformer unit 42. In this case, the input switching section 32 outputs the output signal of the second beamformer section 42 upon acquiring the switching command.

そして、音声翻訳装置１ｂは、ステップＳ１２に進み、図５と同様の処理を行う。 The speech translation device 1b then proceeds to step S12 and performs the same process as in FIG. 5.

＜作用効果＞ <Effect>
次に、本実施の形態における音声翻訳装置１ｂの作用効果について説明する。 Next, the effects of the speech translation device 1b in this embodiment will be explained.

以上のように、本実施の形態における音声翻訳装置１ｂにおいて、音声入力部２１は、複数設けられる。また、音声翻訳装置１ｂは、さらに、複数の音声入力部２１のうちの少なくとも一部の音声入力部２１に入力される音声を信号処理することにより、第１話者による音声の音源方向に収音の指向性を制御する第１ビームフォーマ部４１と、複数の音声入力部２１のうちの少なくとも一部の音声入力部２１に入力される音声を信号処理することにより、第２話者による音声の音源方向に収音の指向性を制御する第２ビームフォーマ部４２と、取得する信号を、第１ビームフォーマ部４１の出力信号、又は、第２ビームフォーマ部４２の出力信号に切換える入力切換部３２と、複数の音声入力部２１に入力される音声を信号処理することにより、音源方向を推定する音源方向推定部３１とを備える。そして、発話指示部２５は、入力切換部３２に、第１ビームフォーマ部４１の出力信号を取得するか、第２ビームフォーマ部４２の出力信号を取得するかを切換えさせる。 As described above, in the speech translation device 1b according to the present embodiment, a plurality of speech input units 21 are provided. The speech translation device 1b further includes: a first beamformer section 41 that controls the directivity of sound collection toward the sound source direction of the first speaker's speech by signal processing the speech input to at least some of the plurality of speech input units 21; a second beamformer section 42 that controls the directivity of sound collection toward the sound source direction of the second speaker's speech by signal processing the speech input to at least some of the plurality of speech input units 21; an input switching section 32 that switches the signal to be acquired to the output signal of the first beamformer section 41 or the output signal of the second beamformer section 42; and a sound source direction estimation section 31 that estimates the sound source direction by signal processing the speech input to the plurality of speech input units 21. The speech instruction section 25 causes the input switching section 32 to switch between acquiring the output signal of the first beamformer section 41 and acquiring the output signal of the second beamformer section 42.

これによれば、音源方向推定部３１によって、音声翻訳装置１ｂに対する相対的な話者の方向を推定することができる。このため、入力切換部３２は、話者の方向に適した第１ビームフォーマ部４１の出力信号及び第２ビームフォーマ部４２の出力信号のいずれかに切換えることができる。つまり、音源方向にビームフォーマ部の収音の指向性を向けることができるため、音声翻訳装置１ｂでは、第１話者及び第２話者の音声について、周囲ノイズを低減して収音することができる。 According to this, the direction of the speaker relative to the speech translation device 1b can be estimated by the sound source direction estimation unit 31. Therefore, the input switching section 32 can switch to whichever of the output signal of the first beamformer section 41 and the output signal of the second beamformer section 42 suits the direction of the speaker. In other words, because the sound-collection directivity of the beamformer section can be directed toward the sound source, the speech translation device 1b can pick up the speech of the first speaker and the second speaker with reduced ambient noise.

本実施の形態における音声翻訳装置１ｂにおいても、実施の形態１等と同様の作用効果を奏する。 The speech translation device 1b in this embodiment also has the same effects as in the first embodiment.

（実施の形態３の変形例） (Modification of Embodiment 3)
本変形例の音声翻訳装置１ｃを、図９を用いて説明する。 The speech translation device 1c of this modification will be explained using FIG. 9.

図９は、実施の形態３の変形例における音声翻訳装置１ｃを示すブロック図である。 FIG. 9 is a block diagram showing a speech translation device 1c in a modification of the third embodiment.

本変形例における他の構成は、特に明記しない場合は、実施の形態１等と同様であり、同一の構成については同一の符号を付して構成に関する詳細な説明を省略する。 Unless otherwise specified, other configurations in this modification are the same as those in Embodiment 1, etc., and the same configurations are denoted by the same reference numerals and detailed explanations regarding the configurations will be omitted.

図９に示すように、第１ビームフォーマ部４１及び第２ビームフォーマ部４２は、複数の音声入力部２１のそれぞれと通信可能に接続され、かつ、音源方向推定部３１及び入力切換部３２と通信可能に接続される。 As shown in FIG. 9, the first beamformer section 41 and the second beamformer section 42 are communicably connected to each of the plurality of audio input sections 21, and are connected to the sound source direction estimation section 31 and the input switching section 32. Connected for communication.

第１ビームフォーマ部４１及び第２ビームフォーマ部４２には、複数の音声入力部２１のそれぞれからの音響信号が入力される。第１ビームフォーマ部４１及び第２ビームフォーマ部４２は、入力されたそれぞれの音響信号を信号処理することにより、信号処理した結果であるそれぞれの音響処理信号を、音源方向推定部３１及び入力切換部３２のそれぞれに出力する。 Acoustic signals from each of the plurality of audio input sections 21 are input to the first beamformer section 41 and the second beamformer section 42. The first beamformer section 41 and the second beamformer section 42 perform signal processing on each of the input acoustic signals and output the resulting acoustic processed signals to each of the sound source direction estimation section 31 and the input switching section 32.

つまり、本変形例では、複数の音声入力部２１のそれぞれは、第１ビームフォーマ部４１及び第２ビームフォーマ部４２と通信可能に接続され、音源方向推定部３１とは通信可能に接続されていない。 That is, in this modification, each of the plurality of audio input units 21 is communicably connected to the first beamformer unit 41 and the second beamformer unit 42, but is not communicably connected to the sound source direction estimation unit 31.

このように、音源方向推定部３１には、第１ビームフォーマ部４１及び第２ビームフォーマ部４２によって、話者による音声の音源方向に収音の指向性を高めた音響信号が入力される。 In this way, the sound source direction estimating unit 31 receives an acoustic signal with enhanced directivity of sound collection in the direction of the sound source of the speaker's voice by the first beam former unit 41 and the second beam former unit 42 .

このような、本変形例における音声翻訳装置１ｃにおいて、音声入力部２１は、複数設けられる。また、音声翻訳装置１ｃは、さらに、複数の音声入力部２１のうちの少なくとも一部の音声入力部２１に入力される音声を信号処理することにより、第１話者による音声の音源方向に収音の指向性を制御する第１ビームフォーマ部４１と、複数の音声入力部２１のうちの少なくとも一部の音声入力部２１に入力される音声を信号処理することにより、第２話者による音声の音源方向に収音の指向性を制御する第２ビームフォーマ部４２と、第１ビームフォーマ部４１の出力信号、及び、第２ビームフォーマ部４２の出力信号を信号処理することにより、音源方向を推定する音源方向推定部３１とを備える。 In the speech translation device 1c according to this modification, a plurality of speech input sections 21 are provided. The speech translation device 1c further includes: a first beamformer section 41 that controls the directivity of sound collection toward the sound source direction of the first speaker's speech by signal processing the speech input to at least some of the plurality of speech input sections 21; a second beamformer section 42 that controls the directivity of sound collection toward the sound source direction of the second speaker's speech by signal processing the speech input to at least some of the plurality of speech input sections 21; and a sound source direction estimation section 31 that estimates the sound source direction by signal processing the output signal of the first beamformer section 41 and the output signal of the second beamformer section 42.

これによれば、音源方向推定部３１によって、音声翻訳装置１ｃに対する相対的な話者の方向を推定することができる。このため、音源方向推定部３１は、話者の方向に適した第１ビームフォーマ部４１の出力信号及び第２ビームフォーマ部４２の出力信号を信号処理するため、信号処理による演算コストを低下させることができる。 According to this, the sound source direction estimation unit 31 can estimate the direction of the speaker relative to the speech translation device 1c. Because the sound source direction estimation unit 31 performs signal processing on the output signal of the first beamformer unit 41 and the output signal of the second beamformer unit 42, which are suited to the direction of the speaker, the computational cost of the signal processing can be reduced.

本変形例における音声翻訳装置１ｃにおいても、上述の実施の形態１等と同様の作用効果を奏する。 The speech translation device 1c in this modification also provides the same effects as in the first embodiment described above.

（実施の形態４） (Embodiment 4)
＜構成＞ <Configuration>
本実施の形態の音声翻訳装置１ｄの構成を、図１０を用いて説明する。 The configuration of the speech translation device 1d of this embodiment will be explained using FIG. 10.

図１０は、実施の形態４における音声翻訳装置１ｄを示すブロック図である。 FIG. 10 is a block diagram showing a speech translation device 1d in the fourth embodiment.

本実施の形態では、音声翻訳装置１ｄがスコア算出部４３を有する点で、実施の形態１等と相違する。 This embodiment differs from Embodiment 1 in that the speech translation device 1d includes a score calculation unit 43.

本実施の形態における構成は、特に明記しない場合は、実施の形態１等と同様であり、同一の構成については同一の符号を付して構成に関する詳細な説明を省略する。 Unless otherwise specified, the configuration in this embodiment is the same as that in Embodiment 1, etc., and the same components are denoted by the same reference numerals, and detailed explanations regarding the configuration will be omitted.

図１０に示すように、音声翻訳装置１ｄの音声認識部２３は、スコア算出部４３を備える。 As shown in FIG. 10, the speech recognition section 23 of the speech translation device 1d includes a score calculation section 43.

［スコア算出部４３］ [Score calculation unit 43]
スコア算出部４３は、音声を音声認識した結果、及び、当該結果の信頼性スコアを算出し、算出した信頼性スコアを、発話指示部２５に出力する。信頼性スコアは、音声検出部２２から取得した音声情報に示される音声を音声認識したときの、音声認識の精度（類似度）を示す。例えば、スコア算出部４３は、音声情報に示される音声を変換したテキスト文と、音声情報に示される音声とを比較し、テキスト文と当該音声との類似度を表す信頼性スコアを算出する。 The score calculation unit 43 calculates the result of speech recognition and the reliability score of that result, and outputs the calculated reliability score to the speech instruction unit 25. The reliability score indicates the accuracy (similarity) of speech recognition when the speech indicated by the speech information acquired from the speech detection unit 22 is recognized. For example, the score calculation unit 43 compares the text sentence obtained by converting the speech indicated by the speech information with that speech, and calculates a reliability score representing the degree of similarity between the text sentence and the speech.

なお、スコア算出部４３は、音声認識部２３に備えられていなくてもよく、音声認識部２３と独立した別の装置であってもよい。 Note that the score calculation section 43 does not need to be included in the speech recognition section 23, and may be a separate device independent from the speech recognition section 23.

[Speech instruction unit 25]
The speech instruction unit 25 determines the accuracy of speech recognition by evaluating the reliability score acquired from the score calculation unit 43 of the speech recognition unit 23. Specifically, the speech instruction unit 25 determines whether the reliability score acquired from the score calculation unit 43 of the speech recognition unit 23 is less than or equal to a threshold. When the reliability score is less than or equal to the threshold, the speech instruction unit 25 does not translate the speech whose reliability score is less than or equal to the threshold, and instead outputs content prompting an utterance via at least one of the display unit 27 and the speech output unit 28. When the reliability score is higher than the threshold, the speech instruction unit 25 proceeds with translation of the speech.

<Operation>
The operation performed by the speech translation device 1d configured as described above will now be described.

FIG. 11 is a flowchart showing the operation of the speech translation device 1d in Embodiment 4.

Processes that are the same as those in the figures are given the same reference numerals, and their description is omitted as appropriate.

In the processing of the speech translation device 1d, after steps S11 to S13, the score calculation unit 43 of the speech recognition unit 23 calculates the reliability score of the speech recognition result and outputs the calculated reliability score to the speech instruction unit 25 (S61).

Next, upon acquiring the reliability score from the score calculation unit 43 of the speech recognition unit 23, the speech instruction unit 25 determines whether the acquired reliability score is less than or equal to the threshold (S62).

If the reliability score is less than or equal to the threshold (YES in S62), the speech instruction unit 25 does not translate the speech whose reliability score is less than or equal to the threshold and outputs, via the display unit 27, speech instruction text information, which is content prompting the speaker to speak again (S18). The speech translation device 1d then proceeds to step S19 and performs the same processing as in FIG. 3 and the like.

If the reliability score is higher than the threshold (NO in S62), the speech instruction unit 25 proceeds to step S14 and performs the same processing as in FIG. 3 and the like.
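Steps S61 and S62 amount to a single comparison followed by a branch. The following is a minimal sketch, assuming the reliability score is a number for which larger means more reliable. The names `NextStep` and `decide_next_step` are hypothetical, and the threshold value used in the usage note is an illustrative assumption, not a value from this disclosure.

```python
from enum import Enum


class NextStep(Enum):
    PROMPT_AGAIN = "S18"         # output content prompting the speaker to speak again
    PROCEED_TO_TRANSLATION = "S14"  # continue with the FIG. 3 processing


def decide_next_step(reliability_score: float, threshold: float) -> NextStep:
    """Step S62: a score less than or equal to the threshold means no translation."""
    if reliability_score <= threshold:
        # YES in S62: skip translation of this speech and prompt the speaker again.
        return NextStep.PROMPT_AGAIN
    # NO in S62: the score is higher than the threshold, so processing continues.
    return NextStep.PROCEED_TO_TRANSLATION
```

With a hypothetical threshold of 0.6, `decide_next_step(0.6, 0.6)` selects the prompt in S18, because the condition is "less than or equal to", while any score above 0.6 proceeds to S14.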

<Effects>
Next, the effects of the speech translation device 1d in this embodiment will be described.

As described above, in the speech translation device 1d according to this embodiment, the speech recognition unit 23 outputs the result of performing speech recognition on the speech and a reliability score for that result. When the reliability score acquired from the speech recognition unit 23 is less than or equal to the threshold, the speech instruction unit 25 does not translate the speech whose reliability score is less than or equal to the threshold, and outputs content prompting an utterance via at least one of the display unit 27 and the speech output unit 28.

With this configuration, if the reliability score indicating the accuracy of speech recognition is less than or equal to the threshold, the speech instruction unit 25 outputs content prompting an utterance again, so that the target speaker speaks again. The speech translation device 1d can therefore recognize the target speaker's speech reliably and, as a result, translate the speech more reliably.

In particular, if the speech output unit 28 outputs the content prompting an utterance as audio, the speaker can more easily notice that the speech was not recognized correctly.

The speech translation device 1d in this embodiment also provides the same effects as Embodiment 1 and the like described above.

(Other variations, etc.)
Although the present disclosure has been described above based on Embodiments 1 to 4 and Embodiments 2 and 4, the present disclosure is not limited to these Embodiments 1 to 4, Embodiments 2 and 4, and the like.

For example, in the speech translation device, speech translation method, and program according to each of Embodiments 1 to 4 and Embodiments 2 and 4, the speech of each of the first speaker and the one or more second speakers may be stored on a cloud server by transmitting it to the cloud server via a network. Alternatively, only the first text sentence and the second text sentence obtained by recognizing the respective speech may be transmitted to the cloud server via the network and stored there.

Further, in the speech translation device, speech translation method, and program according to each of Embodiments 1 to 4 and the modifications of Embodiments 2 and 4, the speech recognition unit and the translation unit need not be installed in the speech translation device. In this case, the speech recognition unit and the translation unit may be engines installed on a cloud server. The speech translation device may transmit the acquired speech information to the cloud server, and may acquire from the cloud server the text sentence, the translated text sentence, and the translated speech that result from the cloud server performing speech recognition and translation based on the speech information.
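One way to picture this variation is to hide the recognition and translation engines behind an interface, so that the device only forwards speech information and receives the three results. The following is a rough sketch under that assumption. All names (`EngineOutput`, `RecognitionTranslationEngine`, `StubCloudEngine`, `SpeechTranslationDevice`) are hypothetical, and the stub stands in for a cloud-hosted engine without making any network request.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class EngineOutput:
    text: str                 # text sentence from speech recognition
    translated_text: str      # translated text sentence
    translated_speech: bytes  # translated speech (audio payload)


class RecognitionTranslationEngine(Protocol):
    """Anything that turns speech information into the three results."""

    def process(self, speech_info: bytes, source_lang: str, target_lang: str) -> EngineOutput:
        ...


class StubCloudEngine:
    """Placeholder for a cloud-hosted engine; a real one would call a remote service."""

    def process(self, speech_info: bytes, source_lang: str, target_lang: str) -> EngineOutput:
        text = f"<{len(speech_info)} bytes of {source_lang} speech>"
        translated = f"[{target_lang}] {text}"
        return EngineOutput(text, translated, translated.encode("utf-8"))


class SpeechTranslationDevice:
    """Device side: holds no engines itself, only forwards speech information."""

    def __init__(self, engine: RecognitionTranslationEngine) -> None:
        self._engine = engine

    def handle_speech(self, speech_info: bytes, source_lang: str, target_lang: str) -> EngineOutput:
        return self._engine.process(speech_info, source_lang, target_lang)
```

Because the device depends only on the interface, an on-device engine and a cloud engine could be swapped without changing the rest of the device logic.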

Further, the speech translation method according to each of Embodiments 1 to 4 and the modifications of Embodiments 2 and 4 is realized by a program using a computer, and such a program may be stored in a storage device.

Further, each processing unit included in the speech translation device, speech translation method, and program according to each of Embodiments 1 to 4 and the modifications of Embodiments 2 and 4 is typically realized as an LSI, which is an integrated circuit. These may each be made into an individual chip, or a single chip may include some or all of them.

Further, circuit integration is not limited to LSI and may be realized with a dedicated circuit or a general-purpose processor. An FPGA (Field Programmable Gate Array) that can be programmed after the LSI is manufactured, or a reconfigurable processor in which the connections and settings of circuit cells inside the LSI can be reconfigured, may also be used.

Note that in each of Embodiments 1 to 4 and the modifications of Embodiments 2 and 4, each component may be configured with dedicated hardware or may be realized by executing a software program suitable for that component. Each component may be realized by a program execution unit, such as a CPU or a processor, reading and executing a software program recorded on a recording medium such as a hard disk or a semiconductor memory.

Further, the numbers used above are all given as examples to specifically explain the present disclosure, and Embodiments 1 to 4 and the modifications of Embodiments 2 and 4 of the present disclosure are not limited to the exemplified numbers.

Further, the division of functional blocks in the block diagrams is an example. A plurality of functional blocks may be realized as one functional block, one functional block may be divided into a plurality of blocks, or some functions may be moved to another functional block. The functions of a plurality of functional blocks having similar functions may also be processed by a single piece of hardware or software in parallel or by time division.

Further, the order in which the steps in the flowcharts are executed is given as an example to specifically explain the present disclosure, and an order other than the above may be used. Some of the above steps may also be executed simultaneously (in parallel) with other steps.

In addition, the present disclosure also includes forms obtained by applying various modifications conceivable to those skilled in the art to Embodiments 1 to 4 and the modifications of Embodiments 2 and 4, as well as forms realized by arbitrarily combining the components and functions of Embodiments 1 to 4 and the modifications of Embodiments 2 and 4 without departing from the spirit of the present disclosure.

The present disclosure can be applied to a speech translation device, a speech translation method, and a program thereof that are used by a plurality of speakers of different languages to communicate through conversation.

1, 1a, 1b, 1c, 1d Speech translation device
21 Speech input unit
22 Speech detection unit
23 Speech recognition unit
24 Priority speech input unit
25 Speech instruction unit
26 Translation unit
27 Display unit
28 Speech output unit
31 Sound source direction estimation unit
31a Control unit
32 Input switching unit
41 First beamformer unit
42 Second beamformer unit

Claims

A speech translation device for carrying out a conversation between a first speaker who speaks in a first language and a second speaker who is a conversation partner of the first speaker and who speaks in a second language different from the first language, the speech translation device comprising:
a speech detection unit that detects, from sound input to a speech input unit, speech sections in which the first speaker and the second speaker speak;
a display unit that, through speech recognition performed on the speech in a speech section detected by the speech detection unit, displays a translation result obtained by translating the first language indicated by the speech into the second language and displays a translation result obtained by translating the second language into the first language; and
a speech instruction unit that, after the first speaker has spoken, outputs content prompting the second speaker to speak, in the second language via the display unit, after or at the same time as the translation result is displayed, and that, after the second speaker has spoken, outputs content prompting the first speaker to speak, in the first language via the display unit, after or at the same time as the translation result is displayed.

The speech translation device according to claim 1, further comprising a priority speech input unit that, when speech uttered by the first speaker or the second speaker has undergone speech recognition, again performs speech recognition with priority on speech uttered by the first speaker or the second speaker whose speech was recognized.

The speech translation device according to claim 1 or 2, further comprising:
a speech input unit into which the speech of the conversation between the first speaker and the second speaker is input;
a speech recognition unit that converts the speech in the speech section detected by the speech detection unit into a text sentence by speech recognition;
a translation unit that translates the text sentence converted by the speech recognition unit from the first language into the second language and from the second language into the first language; and
a speech output unit that outputs the result translated by the translation unit as speech.

The speech translation device according to claim 3, wherein a plurality of the speech input units are provided,
the speech translation device further comprising:
a first beamformer unit that controls the directivity of sound collection toward the sound source direction of the speech of the first speaker by performing signal processing on the speech input to at least some of the plurality of speech input units;
a second beamformer unit that controls the directivity of sound collection toward the sound source direction of the speech of the second speaker by performing signal processing on the speech input to at least some of the plurality of speech input units;
an input switching unit that switches the signal to be acquired to the output signal of the first beamformer unit or the output signal of the second beamformer unit; and
a sound source direction estimation unit that estimates a sound source direction by performing signal processing on the speech input to the plurality of speech input units,
wherein the speech instruction unit causes the input switching unit to switch between acquiring the output signal of the first beamformer unit and acquiring the output signal of the second beamformer unit.

The speech translation device according to claim 3, wherein a plurality of the speech input units are provided,
the speech translation device further comprising:
a sound source direction estimation unit that estimates a sound source direction by performing signal processing on the speech input to the plurality of speech input units; and
a control unit that displays the first language in a display area of the display unit corresponding to the position of the first speaker relative to the speech translation device, and displays the second language in a display area of the display unit corresponding to the position of the second speaker relative to the speech translation device,
wherein the control unit:
compares a display direction, which is a direction from the display unit of the speech translation device toward the first speaker or the second speaker and is the direction in which one of the display areas of the display unit displays, with the sound source direction estimated by the sound source direction estimation unit;
causes the speech recognition unit and the translation unit to operate when the display direction and the estimated sound source direction substantially match; and
stops the speech recognition unit and the translation unit when the display direction and the estimated sound source direction differ.

The speech translation device according to claim 5, wherein, when the control unit stops the speech recognition unit and the translation unit, the speech instruction unit again outputs content prompting speech in the instructed language.

The speech translation device according to claim 5 or 6, wherein, when the display direction and the estimated sound source direction differ, the speech instruction unit again outputs content prompting speech in the instructed language after a prescribed period has elapsed since the comparison by the control unit.

The speech translation device according to claim 3, wherein a plurality of the speech input units are provided,
the speech translation device further comprising:
a first beamformer unit that controls the directivity of sound collection toward the sound source direction of the speech of the first speaker by performing signal processing on the speech input to at least some of the plurality of speech input units;
a second beamformer unit that controls the directivity of sound collection toward the sound source direction of the speech of the second speaker by performing signal processing on the speech input to at least some of the plurality of speech input units; and
a sound source direction estimation unit that estimates a sound source direction by performing signal processing on the output signal of the first beamformer unit and the output signal of the second beamformer unit.

The speech translation device according to any one of claims 1 to 8, wherein the speech instruction unit:
when the speech translation device is activated, outputs content prompting the first speaker to speak, in the first language via the display unit; and
after the speech uttered by the first speaker has been translated from the first language into the second language and the translation result has been displayed on the display unit, outputs content prompting the second speaker to speak, in the second language via the display unit.

The speech translation device according to any one of claims 3 to 8, wherein the speech instruction unit:
after translation starts, causes the speech output unit to output speech prompting an utterance a prescribed number of times; and
after the speech prompting an utterance has been output the prescribed number of times, causes the display unit to output a message prompting an utterance.

The speech translation device according to claim 3, wherein
the speech recognition unit outputs a result of speech recognition of the speech and a reliability score of the result, and
when the reliability score acquired from the speech recognition unit is less than or equal to a threshold, the speech instruction unit does not translate the speech whose reliability score is less than or equal to the threshold and outputs content prompting an utterance via at least one of the display unit and the speech output unit.

A speech translation method for carrying out a conversation between a first speaker who speaks in a first language and a second speaker who is a conversation partner of the first speaker and who speaks in a second language different from the first language, the speech translation method comprising:
detecting, from sound input to a speech input unit, speech sections in which the first speaker and the second speaker speak;
displaying, on a display unit, through speech recognition performed on the speech in a detected speech section, a translation result obtained by translating the first language indicated by the speech into the second language and a translation result obtained by translating the second language into the first language; and
after the first speaker has spoken, outputting content prompting the second speaker to speak, in the second language via the display unit, after or at the same time as the translation result is displayed, and, after the second speaker has spoken, outputting content prompting the first speaker to speak, in the first language via the display unit, after or at the same time as the translation result is displayed.

A program for causing a computer to execute the speech translation method according to claim 12.