JP7106120B2

JP7106120B2 - Voice dialog device and voice dialog system

Info

Publication number: JP7106120B2
Application number: JP2018219515A
Authority: JP
Inventors: 彰則伊藤
Original assignee: Tohoku University NUC
Current assignee: Tohoku University NUC
Priority date: 2018-11-22
Filing date: 2018-11-22
Publication date: 2022-07-26
Anticipated expiration: 2038-11-22
Also published as: JP2020086096A

Description

Application of Article 30, Paragraph 2 of the Patent Act. Date of posting on website: November 11, 2018. Website address: https://link.springer.com/chapter/10.1007/978-3-030-03748-2_9

The present invention relates to a voice interaction device that outputs a synthesized response voice in response to a user's uttered voice.

In recent years, voice interaction systems have been developed that recognize speech uttered by a user and respond to it by voice. Such systems are used, for example, for information retrieval on smartphones, smart speakers that play music, robots, and entertainment.

FIG. 7 is a schematic diagram for explaining problems of conventional voice interaction systems. As shown in FIG. 7(a), a conventional voice interaction system 81 has the problem of erroneously recognizing its own response voice 84, emitted from a speaker 82, as an uttered voice 85 of a user 89. To prevent this, many conventional voice interaction systems do not perform voice recognition while the response voice 84 is being emitted, and enable the voice recognition function only after playback of the response voice 84 has finished. A conventional voice interaction system can therefore be said to assume that the system and the user speak alternately.

On the other hand, as devices equipped with such voice interaction systems become widespread, new problems are arising. As shown in FIG. 7(b), for example, when a plurality of voice interaction systems 81 (81A, 81B) exist in the same environment, the other voice interaction system 81B may mistake the response voice 84 generated by one voice interaction system 81A for the uttered voice 85 of the user 89, operate on it, and reproduce an erroneous response voice 86 as a result of the misrecognition.

Techniques for solving such problems include, for example, the technique of Patent Document 1. Patent Document 1 discloses a method in which a measuring means using light or the like determines whether a user has approached a device equipped with a voice recognition system, and voice recognition is performed only when the user has approached.

Patent Document 1: JP-A-2003-44089

However, the method of Patent Document 1 has the problem that voice recognition is possible only when the user approaches the voice interaction system from a specific direction and to within a specific distance. In particular, when voice from the various directions in which users speak must be recognized, as with smart speakers and robots, the distance to the recognition target must be measured in every direction around the smart speaker or robot. This not only makes the device expensive but also makes it larger, because a distance measuring device must be mounted. Furthermore, in the case of a mobile robot or the like, the environment around the robot changes greatly and the recognition target changes from moment to moment, so the recognition target is difficult to identify.

The present invention has been made in view of these problems of the prior art, and its object is to provide a voice interaction device capable of preventing erroneous recognition of response voices even when various devices equipped with voice interaction systems are reproducing response voices in the same environment.

The present invention for achieving the above object includes, for example, the following aspects.
(Item 1)
A voice interaction device (10) that outputs a response voice from a speaker (5) based on a response voice signal synthesized by voice interaction means (4) in response to a user's uttered voice, the voice interaction device comprising:
identification information embedding means (1) for embedding, in a frequency band outside the audible band of the response voice signal, identification information indicating that the signal is the response voice signal;
identification information determination means (2) for determining whether or not the identification information is included in a frequency band outside the audible band of an input voice signal input from a microphone (6); and
response voice exclusion means (3) for outputting, to the voice interaction means (4), a voice signal obtained by excluding at least the response voice signal from the input voice signal when it is determined that the identification information is included.
(Item 2)
The voice interaction device according to Item 1, wherein the response voice exclusion means (3) outputs the input voice signal to the voice interaction means (4) when the input voice signal does not include the identification information.
(Item 3)
The voice interaction device according to Item 1 or 2, wherein the identification information determination means (2) comprises:
first band limiting means (21, 26) for limiting the frequency band of the input voice signal;
first power calculation means (22, 27) for calculating the power of the input voice signal and the power of the band-limited input voice signal; and
determination means (23, 28) for determining whether or not the identification information is included in the input voice signal based on the ratio between the power of the input voice signal and the power of the band-limited input voice signal.
(Item 4)
The voice interaction device according to any one of Items 1 to 3, wherein the response voice exclusion means (3) mutes the input voice signal when the input voice signal includes the identification information.
(Item 5)
The voice interaction device according to Item 4, wherein the response voice exclusion means (3) comprises:
second band limiting means (31) for limiting the frequency band of the input voice signal and outputting the input voice signal; and
first switching means (32) for switching between and outputting the muted input voice signal and the input voice signal based on the determination result for the identification information.
(Item 6)
The voice interaction device according to Item 4 or 5, wherein the identification information embedding means (1) comprises first superimposing means (12) for superimposing a signal having a frequency outside the audible band on the response voice signal.
(Item 7)
The voice interaction device according to any one of Items 1 to 3, wherein, when the input voice signal includes the identification information, the response voice exclusion means (3) outputs, to the voice interaction means (4), a voice signal obtained by subtracting the response voice signal from the input voice signal.
(Item 8)
The voice interaction device according to Item 7, wherein the response voice exclusion means (3) comprises:
third band limiting means (34) for limiting the frequency band of the input voice signal and outputting the input voice signal;
demodulation means (35) for multiplying the input voice signal by a carrier signal to generate a demodulated signal;
response voice estimation means (37) for estimating the response voice signal from the demodulated signal;
response voice subtraction means (38) for subtracting the estimated response voice signal from the input voice signal; and
second switching means (39) for switching between and outputting the output of the response voice subtraction means (38) and the input voice signal based on the determination result for the identification information.
(Item 9)
The voice interaction device according to Item 7 or 8, wherein the identification information embedding means (1) comprises:
fourth band limiting means (15) for limiting the frequency band of the response voice signal;
modulation means (16) for generating a modulated signal by multiplying the band-limited response voice signal by a carrier signal; and
second superimposing means (17) for embedding the modulated signal, as the identification information, in a frequency band of the response voice signal that includes the carrier frequency of the carrier signal, by superimposing the modulated signal on the response voice signal,
and wherein the upper and lower limit frequencies of the frequency band of the modulated signal are frequencies outside the audible band.
(Item 10)
The voice interaction device according to any one of Items 1 to 9, wherein the audible band is a frequency band ranging from 20 Hz to 15 kHz.
(Item 11)
A program for causing a computer to function as each means of the voice interaction device according to any one of Items 1 to 10.
(Item 12)
A voice interaction system comprising:
the voice interaction device (10) according to any one of Items 1 to 10;
a microphone (6) that outputs the input voice signal to the voice interaction device (10) based on input voice; and
a speaker (5) that outputs the response voice based on the response voice signal input from the voice interaction device (10).
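
As an illustration of the modulation-based embedding in Item 9, the following Python sketch multiplies an already band-limited response voice signal by a carrier and adds the result to the response voice signal. It is a minimal sketch under stated assumptions, not the claimed implementation: the 48 kHz sampling rate, the 17.5 kHz carrier, the 2 kHz baseband limit, the gain, the use of sample-wise addition for superposition, and all function names are hypothetical values chosen for illustration.

```python
import math

def modulated_band(carrier_hz, baseband_limit_hz):
    """Band occupied by the product of a carrier and a signal limited to baseband_limit_hz."""
    return (carrier_hz - baseband_limit_hz, carrier_hz + baseband_limit_hz)

def modulate(band_limited_response, sample_rate_hz=48_000, carrier_hz=17_500.0):
    """Multiply each band-limited sample by a cosine carrier (role of modulation means 16)."""
    return [
        s * math.cos(2.0 * math.pi * carrier_hz * n / sample_rate_hz)
        for n, s in enumerate(band_limited_response)
    ]

def embed_modulated(response, band_limited_response, gain=0.1,
                    sample_rate_hz=48_000, carrier_hz=17_500.0):
    """Add the scaled modulated signal to the response (role of second superimposing means 17).

    Sample-wise addition and the gain value are assumptions of this sketch.
    """
    modulated = modulate(band_limited_response, sample_rate_hz, carrier_hz)
    return [r + gain * m for r, m in zip(response, modulated)]

# With the assumed 17.5 kHz carrier and 2 kHz baseband limit, both band edges
# lie above 15 kHz, that is, outside the 20 Hz to 15 kHz audible band of Item 10.
low_hz, high_hz = modulated_band(17_500.0, 2_000.0)
```

Under these assumed values, a 1 kHz component of the response voice appears as two sidebands at 16.5 kHz and 18.5 kHz, so the embedded signal stays within the band required by Item 9.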

According to the present invention, it is possible to provide a voice interaction device capable of preventing erroneous recognition of a response voice.

FIG. 1 is a block diagram for explaining the schematic configuration of a voice interaction system 100 according to one embodiment of the present invention.
FIG. 2 is a block diagram for explaining an exemplary configuration of voice interaction means 4 according to one embodiment.
FIG. 3 is a block diagram for explaining the configuration of a voice interaction device 10A according to a first embodiment of the present invention.
FIG. 4 is a block diagram for explaining the configuration of a voice interaction device 10B according to a second embodiment of the present invention.
FIG. 5 is a block diagram showing the hardware configuration of the voice interaction device 10 when each means of the voice interaction device 10 is implemented as software.
FIG. 6 is an example of the spectrum of a voice signal when a modulated signal is embedded in the response voice signal as the identification information.
FIG. 7 is a schematic diagram for explaining problems of conventional voice interaction systems.

Embodiments of the present invention will now be described in detail with reference to the accompanying drawings. In the following description and drawings, the same reference numerals denote the same or similar components, and redundant description of them is omitted.

In the following description, uttered voice means the voice uttered by a user of the voice interaction system or the voice interaction device.

Response voice means the voice output from the voice interaction system or the voice interaction device via a voice output means such as a speaker. The response voice can be voice produced by a voice interaction engine applying, to the user's uttered voice, the series of processes of speech recognition (Automatic Speech Recognition: ASR), language understanding, dialogue management, response sentence generation, and speech synthesis (Text To Speech: TTS).

Input voice means the voice input to the voice interaction device via a voice input means such as a microphone. Depending on the environment or situation in which the voice interaction system or the voice interaction device is used, the input voice may mainly contain the user's uttered voice, or may mainly contain a response voice produced by a voice interaction engine. The input voice may also contain a mixture of both uttered voice and response voice.

[Outline configuration]
FIG. 1 is a block diagram for explaining the schematic configuration of a voice interaction system 100 according to one embodiment of the present invention.

The voice interaction system 100 according to one embodiment includes a voice interaction device 10, a speaker 5, and a microphone 6.

The voice interaction device 10 according to one embodiment is a device that outputs a response voice in response to a user's uttered voice. The response voice is output from the speaker 5 based on a response voice signal synthesized by the voice interaction means 4.

The speaker 5 outputs the response voice based on the response voice signal input from the voice interaction device 10. Preferably, the speaker 5 can output sound with frequencies up to 22 kHz.

The microphone 6 outputs an input voice signal to the voice interaction device 10 based on the input voice. Preferably, the input voice signal output from the microphone 6 is sampled at a sampling frequency of 40 kHz or higher so that frequency components up to at least 20 kHz are included.

The voice interaction device 10 includes identification information embedding means 1, identification information determination means 2, response voice exclusion means 3, and voice interaction means 4. In the voice interaction device 10 according to one embodiment, each means is implemented by an electronic circuit produced for each function of that means.

The identification information embedding means 1 embeds identification information, indicating that the signal is a response voice signal, in a frequency band outside the audible band of the response voice signal. Because the identification information is embedded in a frequency band outside the audible band of the response voice, the user can listen to the response voice from the voice interaction system or the voice interaction device through the speaker 5 without being concerned about whether the identification information is present.

Preferably, the audible band is the frequency band from 20 Hz to 15 kHz. Of the frequency bands outside the audible band on the low-frequency side and the high-frequency side, the identification information is preferably embedded in the band on the high-frequency side. Preferably, this high-frequency band is the band from 15 kHz to 20 kHz.
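
The preferred band plan, together with the 40 kHz sampling requirement stated for the microphone 6, can be checked with a short Python sketch. The constant and function names are hypothetical and serve only as an illustration; the rule applied is the usual sampling criterion that the sampling frequency must be at least twice the highest frequency to be represented.

```python
# Band plan taken from the description; the names are illustrative only.
AUDIBLE_BAND_HZ = (20.0, 15_000.0)   # band treated as audible
ID_BAND_HZ = (15_000.0, 20_000.0)    # preferred band for the identification information

def min_sampling_rate(max_freq_hz: float) -> float:
    """Sampling criterion: at least twice the highest frequency to be represented."""
    return 2.0 * max_freq_hz

def can_capture_id_band(sampling_rate_hz: float) -> bool:
    """True if a signal sampled at this rate can represent the whole identification band."""
    return sampling_rate_hz >= min_sampling_rate(ID_BAND_HZ[1])

print(min_sampling_rate(ID_BAND_HZ[1]))  # 40000.0
print(can_capture_id_band(16_000.0))     # False
print(can_capture_id_band(48_000.0))     # True
```

A 16 kHz capture path, which is common for speech recognition front ends, would therefore discard the identification band, whereas a 48 kHz path retains it.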

The identification information determination means 2 determines whether or not the identification information is included in the frequency band outside the audible band of the input voice signal input from the microphone 6. If the identification information is not included, it is judged that the input voice signal does not contain a response voice from the voice interaction means 4, that is, that the input voice mainly contains the user's uttered voice. If the identification information is included, it is judged that the input voice signal contains a response voice from the voice interaction means 4.

When it is determined that the input voice signal includes the identification information, the response voice exclusion means 3 outputs, to the voice interaction means 4, a voice signal obtained by excluding at least the response voice signal from the input voice signal. Excluding at least the response voice signal from the input voice signal covers both muting the input voice signal and generating a voice signal by subtracting the response voice signal from the input voice signal. For example, muting the input voice signal means excluding the entire input voice signal as the response voice signal.

On the other hand, when it is determined that the input voice signal does not include the identification information, the response voice exclusion means 3 outputs the input voice signal to the voice interaction means 4.
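
The three outcomes described for the response voice exclusion means 3 (pass through, mute, or subtract) can be summarized in a short Python sketch. The function name, the frame-as-list representation, and the optional `estimated_response` argument are assumptions made for illustration; how the response voice is estimated is outside the scope of this sketch.

```python
def exclude_response_voice(input_frame, id_detected, estimated_response=None):
    """Return the frame that would be handed to the voice interaction means.

    id_detected: True when the identification information was found in the frame.
    estimated_response: hypothetical estimate of the response voice contained in
    the frame; when it is None, the whole frame is muted instead.
    """
    if not id_detected:
        # No identification information: the frame is treated as user speech.
        return list(input_frame)
    if estimated_response is None:
        # Muting: the entire frame is excluded as response voice.
        return [0.0] * len(input_frame)
    # Subtraction: remove the estimated response voice sample by sample.
    return [x - r for x, r in zip(input_frame, estimated_response)]
```

For example, `exclude_response_voice([0.5, -0.25], True)` yields a muted frame, while passing an estimate such as `[0.25, -0.25]` yields the residual `[0.25, 0.0]`.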

FIG. 2 is a block diagram for explaining an exemplary configuration of the voice interaction means 4 according to one embodiment. The voice interaction means 4 according to one embodiment includes voice recognition means 41, language understanding means 42, dialogue management means 43, response sentence generation means 44, and voice synthesis means 45.

The voice interaction means 4 generates a response voice by applying, to the input voice signal, the series of processes of speech recognition (Automatic Speech Recognition: ASR), language understanding, dialogue management, response sentence generation, and speech synthesis (Text To Speech: TTS). For example, a known voice interaction engine from a known voice interaction system can be used as the voice interaction means 4. The voice interaction means 4 can also be implemented wholly or partly as artificial intelligence.

As described above, the voice interaction device 10 according to one embodiment switches the voice signal to be processed by the voice interaction means 4 based on the identification information indicating a response voice signal. When the input voice signal does not contain the identification information, the voice interaction device 10 determines that the input voice mainly contains the user's uttered voice, and the voice interaction means 4 generates a response voice to the input voice. On the other hand, when the input voice signal contains the identification information, the voice interaction device 10 determines that the input voice contains a response voice output from the voice interaction means 4 of the device itself or of another device, and excludes at least the response voice signal from the input voice signal.

When the input voice signal contains the identification information, either the input voice signal is muted and no response voice to the input voice is generated, or a voice signal is generated by subtracting the response voice signal from the input voice signal and a response voice to that generated voice signal is generated. In this way, the voice interaction device 10 can prevent erroneous recognition of response voices.

Further, when the voice interaction device 10 determines that the input voice contains a response voice output from the voice interaction means 4 of the device itself or of another device, it excludes at least the response voice signal from the input voice signal. As a result, the voice interaction device 10 can prevent misrecognition caused by its own speech, in which it erroneously recognizes the response voice it emits as a voice uttered by the user. In addition, even when a plurality of devices exist in the same environment, the voice interaction device 10 can also prevent the form of misrecognition in which a response voice emitted by another device is erroneously recognized as a voice uttered by the user.

Further, the identification information indicating a response voice signal is embedded in a frequency band outside the audible band of the response voice. The user can therefore listen to the response voice from the voice interaction system or the voice interaction device through the speaker 5 without noticing whether the identification information is present, and can use the voice interaction system or the voice interaction device comfortably without being troubled by erroneous recognition of response voices.

[First Embodiment]
In the first embodiment, the voice interaction device 10A mutes the input voice signal when the input voice signal contains the identification information.

The first embodiment described below takes as an example the case in which, of the frequency bands outside the audible band on the low-frequency side and the high-frequency side, the identification information is embedded in the band on the high-frequency side. Unless otherwise noted, components of the voice interaction device 10A according to the first embodiment that are common to the voice interaction device 10 according to one embodiment are the same as in the voice interaction device 10, and redundant description is omitted.

FIG. 3 is a block diagram for explaining the configuration of the voice interaction device 10A according to the first embodiment of the present invention.

The voice interaction device 10A according to the first embodiment includes identification information embedding means 1A, identification information determination means 2, response voice exclusion means 3A, and voice interaction means 4.

The identification information embedding means 1A includes oscillating means 11 and superimposing means 12. The oscillating means 11 oscillates a signal having a frequency outside the audible band. The superimposing means 12 superimposes the signal having a frequency outside the audible band, output from the oscillating means 11, on the response voice signal output from the voice interaction means 4 as the identification information.

The frequency of the signal oscillated by the oscillating means 11 is a frequency outside the audible band, preferably a frequency in the range of 15 kHz to 20 kHz. For example, the oscillating means 11 outputs a 20 kHz sine wave. For example, the oscillating means 11 can be an oscillator, and the superimposing means 12 can be a multiplier.

The response voice signal on which the signal having a frequency outside the audible band has been superimposed is then output from the superimposing means 12 to the speaker 5, and the speaker 5 outputs the response voice based on the response voice signal.
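
The embedding step of the first embodiment can be sketched in Python as follows. This is an illustration under stated assumptions rather than the embodiment itself: the 48 kHz sampling rate, the tone amplitude, and the function name are hypothetical, and the superposition is modeled here as sample-wise addition of the 20 kHz sine wave to the response voice signal.

```python
import math

def embed_identification_tone(response, sample_rate_hz=48_000,
                              tone_hz=20_000.0, tone_amplitude=0.05):
    """Return the response samples with an out-of-band sine tone superimposed.

    The tone plays the role of the signal from the oscillating means 11;
    adding it sample by sample is an assumption of this sketch.
    """
    return [
        s + tone_amplitude * math.sin(2.0 * math.pi * tone_hz * n / sample_rate_hz)
        for n, s in enumerate(response)
    ]

# Embedding into one 5 ms frame of silence leaves only the identification tone.
tagged = embed_identification_tone([0.0] * 240)
```

The tone amplitude would in practice be chosen so that the tone survives the speaker, the room, and the microphone while staying well below the level of the response voice; the value used here is arbitrary.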

The identification information determination means 2 includes band limiting means 21, power calculation means 22 (22A, 22B), and determination means 23.

The band limiting means 21 limits the frequency band of the input voice signal input from the microphone 6, thereby extracting the frequency components around the superimposed identification information. Preferably, the band limiting means 21 limits the input voice signal to the frequency band from 15 kHz to 20 kHz. For example, the band limiting means 21 can be a band-pass filter or a high-pass filter.

The power calculation means 22A calculates the power of the input voice signal band-limited by the band limiting means 21, and the power calculation means 22B calculates the power of the input voice signal input from the microphone 6. That is, the power calculation means 22A calculates the power of the frequency components of the input voice signal that correspond to the identification information, and the power calculation means 22B calculates the power of the entire input voice signal. For example, the power calculation means 22 (22A, 22B) calculates the power using a known method of calculating power spectral density.

The determination means 23 determines whether or not the input voice signal contains the identification information based on the ratio between the power calculated by the power calculation means 22A and the power calculated by the power calculation means 22B. Preferably, the determination means 23 determines that the identification information is superimposed on the input voice signal when the ratio between the power calculated by the power calculation means 22A and the power calculated by the power calculation means 22B is greater than a predetermined threshold.

The determination result is output to the response voice exclusion means 3A. For example, the determination result can be expressed as a Boolean value of "0" or "1", where the value "0" means that the identification information is superimposed on the input voice signal and the value "1" means that the identification information is not superimposed on the input voice signal. For example, the determination means 23 can be a comparator.
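
The determination described above (band limiting, two power calculations, and a threshold on their ratio) can be sketched in Python using a naive discrete Fourier transform on one frame. This is a minimal sketch under assumptions: the ratio is taken as band-limited power divided by total power, the threshold of 1e-3 and the 48 kHz sampling rate are arbitrary, the function names are hypothetical, and a real device would more likely use a filter or an FFT routine than this O(N²) loop. The return value follows the Boolean convention of the description, "0" when the identification information is present and "1" when it is not.

```python
import cmath
import math

def band_power_ratio(frame, sample_rate_hz, band_hz=(15_000.0, 20_000.0)):
    """Ratio of the mean power inside band_hz to the mean power of the whole frame."""
    n = len(frame)
    if n == 0:
        return 0.0
    total_power = sum(s * s for s in frame) / n
    if total_power == 0.0:
        return 0.0  # a silent frame carries no identification information
    band_power = 0.0
    for k in range(n):
        # Bins k and n - k of a real signal represent the same frequency.
        freq_hz = min(k, n - k) * sample_rate_hz / n
        if band_hz[0] <= freq_hz <= band_hz[1]:
            x_k = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                      for i, s in enumerate(frame))
            band_power += abs(x_k) ** 2 / (n * n)  # Parseval: share of mean power
    return band_power / total_power

def discriminate(frame, sample_rate_hz=48_000, threshold=1e-3):
    """Return 0 if the identification information is judged present, otherwise 1."""
    return 0 if band_power_ratio(frame, sample_rate_hz) > threshold else 1
```

With a 240-sample frame at 48 kHz, a 1 kHz tone alone gives a ratio near zero, while the same tone with a weak 20 kHz component added gives a ratio of roughly 0.01, which exceeds the assumed threshold.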

応答音声除外手段３Ａは、帯域制限手段３１と、切替手段３２とを備える。任意の構成として、応答音声除外手段３Ａは、ダウンサンプル手段３３をさらに備えることができる。 The response voice excluding means 3A includes band limiting means 31 and switching means 32 . As an optional configuration, the response voice exclusion means 3A can further comprise a down-sampling means 33 .

帯域制限手段３１は、マイクロフォン６から入力される入力音声信号の周波数帯域を制限する。これにより、入力音声信号中に含まれている識別情報を除去する。好ましくは、帯域制限手段３１は、入力音声信号を１５ｋＨｚ以下の周波数帯域に制限する。例示的には、帯域制限手段３１は、帯域阻止フィルタ（Band-Elimination Filter）または低域通過フィルタ（Low-Pass Filter）とすることができる。 Band limiting means 31 limits the frequency band of the input audio signal input from microphone 6 . This removes the identification information contained in the input audio signal. Preferably, the band limiting means 31 limits the input audio signal to a frequency band of 15 kHz or less. Illustratively, the band-limiting means 31 can be a band-elimination filter or a low-pass filter.

切替手段３２は、判別手段２３から入力される識別情報の判別結果に基づいて、ミュートされた入力音声信号と、入力音声信号とを切り替えて出力する。 The switching means 32 switches and outputs the muted input audio signal and the input audio signal based on the determination result of the identification information input from the determining means 23 .

入力音声信号に識別情報が重畳されていない場合には、利用者の発話音声が入力音声に主に含まれているので、切替手段３２は、帯域制限手段３１から入力される音声信号を、音声対話手段４に出力する。 When the identification information is not superimposed on the input audio signal, the input audio mainly contains the user's uttered voice, so the switching means 32 outputs the audio signal input from the band limiting means 31 to the voice dialogue means 4.

一方で、入力音声信号に識別情報が重畳されている場合には、自己のまたは他の機器の音声対話手段４から出力された応答音声が入力音声に含まれている。よって、誤認識を防止するために、切替手段３２は、帯域制限手段３１から入力される入力音声信号をミュートして、音声対話手段４に出力する。 On the other hand, when the identification information is superimposed on the input voice signal, the input voice includes the response voice output from the voice interaction means 4 of the device itself or another device. Therefore, in order to prevent erroneous recognition, the switching means 32 mutes the input voice signal input from the band limiting means 31 and outputs it to the voice dialogue means 4 .

例示的には、切替手段３２は乗算器とすることができる。例示的には、識別情報の判別結果を上記した仕様のＢｏｏｌｅａｎ値として表す場合、切替手段３２は、帯域制限手段３１から入力される入力音声信号と、判別手段２３から入力される識別情報の判別結果を表す信号とを乗算することにより、後段の音声対話手段４に適切な出力を提供することができる。 Illustratively, the switching means 32 can be a multiplier. Illustratively, when the determination result of the identification information is expressed as a Boolean value with the above specification, the switching means 32 multiplies the input audio signal input from the band limiting means 31 by the signal representing the determination result input from the determination means 23, and can thereby provide an appropriate output to the voice dialogue means 4 in the subsequent stage.
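The multiplier form of the switching means works because of the chosen encoding: "0" (identification information present) zeroes every sample, and "1" (absent) passes the frame unchanged. A minimal sketch, in which the function name `switch_by_multiplication` and the list-of-samples frame representation are assumptions made for illustration:

```python
def switch_by_multiplication(band_limited_frame, determination_result):
    """Mute or pass a frame by multiplying it with the determination result.

    determination_result follows the specification in the text:
      0 -> identification information is superimposed (frame is muted)
      1 -> identification information is not superimposed (frame passes)
    """
    if determination_result not in (0, 1):
        raise ValueError("determination_result must be 0 or 1")
    return [sample * determination_result for sample in band_limited_frame]
```

With the opposite encoding (1 = present), the same multiplier would need the flag inverted first, which is presumably why the text fixes the meaning of "0" and "1" before introducing the multiplier.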

任意の構成として、ダウンサンプル手段３３は、切替手段３２の出力側に接続され、切替手段３２から出力される音声信号を、所定のサンプリング周波数でダウンサンプルして、後段の音声対話手段４に出力する。例示的には、サンプリング周波数は１６ｋＨｚとすることができる。 As an optional configuration, the down-sampling means 33 is connected to the output side of the switching means 32, down-samples the audio signal output from the switching means 32 at a predetermined sampling frequency, and outputs it to the voice dialogue means 4 in the subsequent stage. Illustratively, the sampling frequency may be 16 kHz.

音声対話手段４は、一実施形態に係る音声対話装置１０と同様の構成とすることができる。 The voice dialogue means 4 can have the same configuration as the voice dialogue device 10 according to one embodiment.

以上、第１の実施形態に係る音声対話装置１０Ａによると、応答音声信号であることを示す識別情報に基づいて、音声対話手段４にて処理する音声信号を切り替えることができる。 As described above, according to the voice interaction device 10A according to the first embodiment, the voice signal to be processed by the voice interaction means 4 can be switched based on the identification information indicating that it is the response voice signal.

入力音声信号に識別情報が含まれている場合には、入力音声信号がミュートされて入力音声に対する応答音声は生成されない。これにより、音声対話装置１０Ａは、応答音声の誤認識を防止することができる。 If the input audio signal contains identification information, the input audio signal is muted and no response audio is generated for the input audio. Thereby, the voice interaction device 10A can prevent erroneous recognition of the response voice.

また、音声対話装置１０Ａは、自己発話による誤認識を防止することができるし、複数台が同じ環境内に存在している状況であっても、他の機器が発する応答音声を利用者による発話音声として誤って認識するという態様の誤認識も防止することができる。 Further, the voice dialogue device 10A can prevent erroneous recognition caused by its own utterances, and even when a plurality of devices are present in the same environment, it can also prevent the kind of erroneous recognition in which a response voice emitted by another device is mistaken for a user's uttered voice.

［第２の実施形態］ [Second embodiment]
第２の実施形態では、音声対話装置１０Ｂは、入力音声信号に識別情報が含まれている場合に、入力音声信号から応答音声信号を差し引いた音声信号を、音声対話手段４に出力する。 In the second embodiment, when the input audio signal contains the identification information, the voice dialogue device 10B outputs an audio signal obtained by subtracting the response voice signal from the input audio signal to the voice dialogue means 4.

以下において説明する第２の実施形態では、可聴帯域外の低周波数側または高周波数側の周波数帯域のうち、高周波数側の周波数帯域に識別情報を埋め込む場合を一例として説明する。また、第２の実施形態に係る音声対話装置１０Ｂの構成のうち、一実施形態に係る音声対話装置１０と共通する構成は、特に言及しない限り、一実施形態に係る音声対話装置１０と同様であるので、重複する説明は省略する。 In the second embodiment described below, a case will be described as an example in which identification information is embedded in a frequency band on the high frequency side of frequency bands on the low frequency side or on the high frequency side outside the audible band. Further, of the configuration of the voice dialogue device 10B according to the second embodiment, the configuration common to the voice dialogue device 10 according to one embodiment is the same as that of the voice dialogue device 10 according to one embodiment unless otherwise specified. Therefore, redundant description is omitted.

図４は、本発明の第２の実施形態に係る音声対話装置１０Ｂの構成を説明するためのブロック図である。 FIG. 4 is a block diagram for explaining the configuration of a voice interaction device 10B according to the second embodiment of the invention.

第２の実施形態に係る音声対話装置１０Ｂは、識別情報埋込手段１Ｂと、識別情報判別手段２と、応答音声除外手段３Ｂと、音声対話手段４とを備える。 The voice dialogue device 10B according to the second embodiment includes identification information embedding means 1B, identification information determination means 2, response voice exclusion means 3B, and voice dialogue means 4.

識別情報埋込手段１Ｂは、帯域制限手段１５と、変調手段１６と、重畳手段１７とを備える。任意の構成として、識別情報埋込手段１Ｂは、アップサンプル手段１４をさらに備えることができる。 The identification information embedding means 1B includes band limiting means 15, modulating means 16, and superimposing means 17. As an optional configuration, the identification information embedding means 1B can further include up-sampling means 14.

アップサンプル手段１４は、音声対話手段４から出力される応答音声信号を、所定のサンプリング周波数でアップサンプルして、後段の帯域制限手段１５および重畳手段１７に出力する。例示的には、サンプリング周波数は４８ｋＨｚまたは４４．１ｋＨｚとすることができる。好ましくは、アップサンプルする際のサンプリング周波数は、４０ｋＨｚ以上のサンプリング周波数である。これは、少なくとも２０ｋＨｚの周波数成分が含まれるようにするためである。 The up-sampling means 14 up-samples the response voice signal output from the voice dialogue means 4 at a predetermined sampling frequency, and outputs it to the band-limiting means 15 and the superimposing means 17 in the latter stage. Illustratively, the sampling frequency can be 48 kHz or 44.1 kHz. Preferably, the sampling frequency for upsampling is a sampling frequency of 40 kHz or higher. This is to include frequency components of at least 20 kHz.
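The 40 kHz lower bound follows from the sampling theorem. As a short worked check, where the symbols $f_s$ and $f_{\max}$ are notation introduced here and not used in the source:

```latex
% A sampled signal can represent frequencies only up to half the sampling rate:
f_{\max} \le \frac{f_s}{2}
\quad\Longrightarrow\quad
f_s \ge 2 f_{\max} = 2 \times 20\,\mathrm{kHz} = 40\,\mathrm{kHz}.
% Both illustrative rates satisfy the bound:
\frac{44.1\,\mathrm{kHz}}{2} = 22.05\,\mathrm{kHz} \ge 20\,\mathrm{kHz},
\qquad
\frac{48\,\mathrm{kHz}}{2} = 24\,\mathrm{kHz} \ge 20\,\mathrm{kHz}.
```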

帯域制限手段１５は、応答音声信号の周波数帯域を制限する。これにより、応答音声信号の周波数帯域は、可聴帯域の主要な周波数帯域に制限される。好ましくは、帯域制限手段１５は、３ｋＨｚ以下の周波数帯域に制限する。例示的には、帯域制限手段１５は、２ｋＨｚ以下の周波数帯域に制限する。例示的には、帯域制限手段１５は、低域通過フィルタまたは帯域阻止フィルタとすることができる。 Band limiting means 15 limits the frequency band of the response voice signal. Thereby, the frequency band of the response voice signal is restricted to the main frequency band of the audible band. Preferably, the band limiting means 15 limits the frequency band to 3 kHz or less. Illustratively, the band limiting means 15 limits the frequency band to 2 kHz or less. Illustratively, the band-limiting means 15 can be a low-pass filter or a band-stop filter.

変調手段１６は、帯域制限手段１５により帯域が制限された応答音声信号にキャリア信号を乗算することにより、変調信号を生成する。好ましくは、キャリア信号のキャリア周波数は、１６ｋＨｚ～２０ｋＨｚの範囲の周波数とすることができる。例示的には、キャリア信号は１８ｋＨｚの正弦波である。例示的には、変調手段１６は、キャリア信号を発振する発振器（図示せず）と、乗算器とを用いて構成することができる。 The modulating means 16 multiplies the response voice signal band-limited by the band limiting means 15 by the carrier signal to generate a modulated signal. Preferably, the carrier frequency of the carrier signal can be a frequency in the range of 16 kHz to 20 kHz. Illustratively, the carrier signal is an 18 kHz sine wave. Illustratively, the modulating means 16 can be configured using an oscillator (not shown) that oscillates a carrier signal and a multiplier.

好ましくは、変調信号の周波数帯域は可聴帯域外の周波数である。変調信号の周波数帯域は、キャリア信号のキャリア周波数と、帯域が制限された応答音声信号の周波数帯域とに基づいて定められる。例示的には、キャリア周波数が１８ｋＨｚであり、応答音声信号の周波数帯域が２ｋＨｚ以下である場合には、変調信号の周波数帯域は、１６ｋＨｚ～２０ｋＨｚ（１８±２ｋＨｚ）となる。可聴帯域は、好ましくは２０Ｈｚ～１５ｋＨｚの周波数帯域であるので、例示するこの変調信号の周波数帯域は、可聴帯域外の周波数となっている。 Preferably, the frequency band of the modulated signal lies outside the audible band. The frequency band of the modulated signal is determined by the carrier frequency of the carrier signal and the frequency band of the band-limited response voice signal. Illustratively, when the carrier frequency is 18 kHz and the frequency band of the response voice signal is 2 kHz or less, the frequency band of the modulated signal is 16 kHz to 20 kHz (18±2 kHz). Since the audible band is preferably taken to be the frequency band from 20 Hz to 15 kHz, the frequency band of this example modulated signal lies outside the audible band.
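The 18±2 kHz figure can be checked with the product-to-sum identity. The symbols $f_c$ (carrier frequency), $f$ (a frequency component of the band-limited response voice), and $B$ (its bandwidth) are notation introduced here for illustration:

```latex
% One component of the band-limited response voice multiplied by a sine carrier:
\cos(2\pi f t)\,\sin(2\pi f_c t)
  = \tfrac{1}{2}\left[\sin\bigl(2\pi (f_c + f)\,t\bigr) + \sin\bigl(2\pi (f_c - f)\,t\bigr)\right],
  \qquad 0 \le f \le B.
% Each component therefore appears at f_c - f and f_c + f, so the modulated
% signal occupies
f_c - B \;\le\; f' \;\le\; f_c + B .
% With f_c = 18 kHz and B = 2 kHz:
16\,\mathrm{kHz} \;\le\; f' \;\le\; 20\,\mathrm{kHz}.
```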

重畳手段１７は、応答音声信号に、変調手段１６から出力される変調信号を重畳する。これにより、応答音声信号に、識別情報として変調信号が埋め込まれる。変調信号が埋め込まれる周波数帯域は、応答音声信号の、キャリア周波数を含む周波数帯域である。例示的には、重畳手段１７は乗算器とすることができる。 The superimposing means 17 superimposes the modulated signal output from the modulating means 16 on the response voice signal. As a result, the modulated signal is embedded in the response voice signal as identification information. The frequency band in which the modulated signal is embedded is the frequency band containing the carrier frequency of the response voice signal. Illustratively, the superimposing means 17 can be a multiplier.

例示的には、キャリア周波数が１８ｋＨｚであり、応答音声信号の周波数帯域が２ｋＨｚ以下であるので、応答音声信号の１６ｋＨｚ～２０ｋＨｚ（１８±２ｋＨｚ）の周波数帯域に、変調信号が埋め込まれる。 Exemplarily, since the carrier frequency is 18 kHz and the frequency band of the response voice signal is 2 kHz or less, the modulated signal is embedded in the frequency band of 16 kHz to 20 kHz (18±2 kHz) of the response voice signal.

その後、可聴帯域外の周波数を有する変調信号が重畳された応答音声信号は、重畳手段１７からスピーカ５に出力され、スピーカ５が応答音声信号に基づいて応答音声を出力する。 After that, the response voice signal superimposed with the modulated signal having a frequency outside the audible band is output from the superimposing means 17 to the speaker 5, and the speaker 5 outputs the response voice based on the response voice signal.
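The embedding path (band limiting means 15, modulating means 16, superimposing means 17) can be sketched as follows. This is a rough illustration, not the embodiment itself: the Hamming-windowed FIR low-pass filter, its length of 101 taps, and the names `lowpass_fir`, `fir_filter`, `embed_identification`, and `tone_amplitude` are assumptions, and superimposition is modelled as sample-wise addition. The input is assumed to be already sampled at 48 kHz.

```python
import math


def lowpass_fir(cutoff_hz, fs, num_taps=101):
    """Hamming-windowed sinc low-pass taps (odd length, unity DC gain)."""
    mid = (num_taps - 1) / 2
    fc = cutoff_hz / fs
    taps = []
    for i in range(num_taps):
        t = i - mid
        ideal = 2 * fc if t == 0 else math.sin(2 * math.pi * fc * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [h / gain for h in taps]


def fir_filter(signal, taps):
    """Causal FIR filtering with zero initial state; output length equals input length."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k < 0:
                break
            acc += h * signal[n - k]
        out.append(acc)
    return out


def embed_identification(response, fs=48000, cutoff_hz=2000.0, carrier_hz=18000.0):
    """Band-limit the response, modulate it onto the carrier, and add it back."""
    narrow = fir_filter(response, lowpass_fir(cutoff_hz, fs))            # means 15
    modulated = [x * math.sin(2 * math.pi * carrier_hz * i / fs)          # means 16
                 for i, x in enumerate(narrow)]
    return [r + m for r, m in zip(response, modulated)]                  # means 17


def tone_amplitude(signal, freq_hz, fs):
    """Amplitude of one sinusoidal component, measured by correlation."""
    n = len(signal)
    re = sum(x * math.cos(2 * math.pi * freq_hz * i / fs) for i, x in enumerate(signal))
    im = sum(x * math.sin(2 * math.pi * freq_hz * i / fs) for i, x in enumerate(signal))
    return 2 * math.hypot(re, im) / n
```

Feeding a 1 kHz tone through `embed_identification` yields additional components at 17 kHz and 19 kHz. The linear-phase filter delays the embedded copy by 50 samples relative to the audible response, which a real design would need to take into account.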

識別情報判別手段２は、第１の実施形態に係る識別情報判別手段２と同様の構成とすることができる。 The identification information determination means 2 can have the same configuration as the identification information determination means 2 according to the first embodiment.

応答音声除外手段３Ｂは、帯域制限手段３４と、復調手段３５と、応答音声推定手段３７と、応答音声差引手段３８と、切替手段３９とを備える。任意の構成として、応答音声除外手段３Ｂは、ダウンサンプル手段３６（３６Ａ，３６Ｂ）をさらに備えることができる。 The response voice exclusion means 3B includes band limiting means 34, demodulation means 35, response voice estimation means 37, response voice subtraction means 38, and switching means 39. As an optional configuration, the response voice exclusion means 3B can further comprise down-sampling means 36 (36A, 36B).

帯域制限手段３４は、マイクロフォン６から入力される入力音声信号の周波数帯域を制限する。これにより、入力音声信号の周波数帯域は、可聴帯域の主要な周波数帯域に制限される。好ましくは、帯域制限手段３４は、３ｋＨｚ以下の周波数帯域に制限する。例示的には、帯域制限手段３４は、２ｋＨｚ以下の周波数帯域に制限する。例示的には、帯域制限手段３４は、低域通過フィルタまたは帯域阻止フィルタとすることができる。 Band limiting means 34 limits the frequency band of the input audio signal input from microphone 6 . Thereby, the frequency band of the input audio signal is limited to the main frequency band of the audible band. Preferably, the band limiting means 34 limits the frequency band to 3 kHz or less. Illustratively, the band limiting means 34 limits the frequency band to 2 kHz or less. Illustratively, band-limiting means 34 may be a low-pass filter or a band-stop filter.

復調手段３５は、マイクロフォン６から入力される入力音声信号にキャリア信号を乗算することにより、復調信号を生成する。好ましくは、キャリア信号のキャリア周波数は、１６ｋＨｚ～２０ｋＨｚの範囲の周波数とすることができる。例示的には、キャリア信号は１８ｋＨｚの正弦波である。例示的には、復調手段３５は、キャリア信号を発振する発振器（図示せず）と、乗算器と、検波器（図示せず）とを用いて構成することができる。検波器には、包絡線検波を行う検波器を用いることができる。 The demodulator 35 multiplies the input audio signal input from the microphone 6 by the carrier signal to generate a demodulated signal. Preferably, the carrier frequency of the carrier signal can be a frequency in the range of 16 kHz to 20 kHz. Illustratively, the carrier signal is an 18 kHz sine wave. Illustratively, the demodulation means 35 can be configured using an oscillator (not shown) that oscillates a carrier signal, a multiplier, and a detector (not shown). A detector that performs envelope detection can be used as the detector.
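As a rough sketch of the demodulation step, the code below multiplies the microphone signal by the carrier and low-pass filters the product (coherent product detection). Several points are assumptions rather than the embodiment: the carrier phase at the receiver is taken to match the transmitter exactly, a low-pass filter stands in for the detector (the text mentions envelope detection instead), and the names `lowpass_fir`, `fir_filter`, and `demodulate` are hypothetical.

```python
import math


def lowpass_fir(cutoff_hz, fs, num_taps=101):
    """Hamming-windowed sinc low-pass taps (odd length, unity DC gain)."""
    mid = (num_taps - 1) / 2
    fc = cutoff_hz / fs
    taps = []
    for i in range(num_taps):
        t = i - mid
        ideal = 2 * fc if t == 0 else math.sin(2 * math.pi * fc * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [h / gain for h in taps]


def fir_filter(signal, taps):
    """Causal FIR filtering with zero initial state; output length equals input length."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k < 0:
                break
            acc += h * signal[n - k]
        out.append(acc)
    return out


def demodulate(mic_signal, fs=48000, carrier_hz=18000.0, cutoff_hz=2000.0):
    """Shift the band around the carrier back to baseband and keep 0..cutoff_hz.

    The factor 2 compensates for the halving caused by sin^2 = (1 - cos) / 2.
    """
    mixed = [2.0 * x * math.sin(2 * math.pi * carrier_hz * i / fs)
             for i, x in enumerate(mic_signal)]
    return fir_filter(mixed, lowpass_fir(cutoff_hz, fs))
```

Under these assumptions the output is the narrowband response voice, delayed by the 50-sample group delay of the 101-tap filter. In a real room the carrier phase is unknown, which is one reason a phase-insensitive detector is attractive.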

ダウンサンプル手段３６Ａは、帯域制限手段３４の出力側に接続される。ダウンサンプル手段３６Ａは、帯域制限手段３４から出力される、帯域が制限された入力音声信号を、所定のサンプリング周波数でダウンサンプルして、後段の応答音声差引手段３８および切替手段３９に出力する。例示的には、サンプリング周波数は１６ｋＨｚとすることができる。 The down-sampling means 36A is connected to the output side of the band limiting means 34. The down-sampling means 36A down-samples the band-limited input audio signal output from the band limiting means 34 at a predetermined sampling frequency, and outputs it to the response voice subtracting means 38 and the switching means 39 in the subsequent stage. Illustratively, the sampling frequency may be 16 kHz.

ダウンサンプル手段３６Ｂは、復調手段３５の出力側に接続される。ダウンサンプル手段３６Ｂは、復調手段３５から出力される復調信号を、所定のサンプリング周波数でダウンサンプルして、後段の応答音声推定手段３７に出力する。例示的には、サンプリング周波数は１６ｋＨｚとすることができる。 The down-sampling means 36B is connected to the output side of the demodulation means 35. The down-sampling means 36B down-samples the demodulated signal output from the demodulation means 35 at a predetermined sampling frequency, and outputs it to the response voice estimation means 37 in the subsequent stage. Illustratively, the sampling frequency may be 16 kHz.

応答音声推定手段３７は、復調信号から応答音声信号を推定する。自己のまたは他の機器の識別情報埋込手段１Ｂにおいて、変調信号は、可聴帯域の主要な周波数帯域に制限された応答音声信号に基づいて生成されている。よって、マイクロフォン６から入力される入力音声信号が、自己のまたは他の機器の識別情報埋込手段１Ｂから出力された応答音声信号を含んでいる場合は、応答音声推定手段３７が処理対象とする復調信号も、周波数帯域が可聴帯域の主要な周波数帯域に制限されている。応答音声推定手段３７は、この周波数帯域が制限された狭帯域の復調信号から、入力音声信号に含まれていると期待される応答音声信号の全体を推定する。 A response voice estimation means 37 estimates a response voice signal from the demodulated signal. In its own or another device's identification information embedding means 1B, the modulated signal is generated based on the response voice signal limited to the main frequency band of the audible band. Therefore, when the input voice signal input from the microphone 6 contains the response voice signal output from the identification information embedding means 1B of its own device or another device, the response voice estimation means 37 treats it as the object of processing. The frequency band of the demodulated signal is also limited to the main frequency band of the audible band. The response speech estimator 37 estimates the entire response speech signal expected to be included in the input speech signal from the narrowband demodulated signal whose frequency band is limited.

例示的には、復調信号は、２ｋＨｚ以下の周波数帯域に制限されている。応答音声推定手段３７は、この０～２ｋＨｚの周波数帯域を有する復調信号のスペクトルから、例えば０～８ｋＨｚの周波数帯域を有するスペクトルを推定する。 Illustratively, the demodulated signal is restricted to a frequency band of 2 kHz or less. The response speech estimation means 37 estimates a spectrum having a frequency band of 0-8 kHz, for example, from the spectrum of the demodulated signal having a frequency band of 0-2 kHz.

狭帯域のスペクトルから広帯域のスペクトルを推定する方法には公知の種々の方法があり、応答音声推定手段３７には、これら種々の方法を適宜採用することができる。例えば、狭帯域スペクトルから広帯域スペクトルを推定する方法として、電話回線にて使用する０．３～３．４ｋＨｚの音声周波数帯域に関する通話品質向上技術を適用することができる。 There are various known methods for estimating the broadband spectrum from the narrowband spectrum, and the response speech estimation means 37 can appropriately employ these various methods. For example, as a method of estimating a broadband spectrum from a narrowband spectrum, it is possible to apply speech quality improvement technology related to the voice frequency band of 0.3 to 3.4 kHz used in telephone lines.

識別情報埋込手段１Ｂにおいて、音声対話手段４から出力される応答音声信号は、帯域制限手段１５により、可聴帯域の主要な周波数帯域（例示的には、２ｋＨｚ以下の周波数帯域）に制限されている。よって、マイクロフォン６から入力される入力音声信号が、自己のまたは他の機器の識別情報埋込手段１Ｂから出力された応答音声信号を含んでいる場合は、狭帯域のスペクトルから広帯域のスペクトルを推定する方法により、自己のまたは他の機器の音声対話手段４から出力されたと推定される、応答音声信号の全体を推定することが可能となる。 In the identification information embedding means 1B, the response voice signal output from the voice dialogue means 4 is limited by the band limiting means 15 to the main frequency band of the audible band (illustratively, a frequency band of 2 kHz or less). Therefore, when the input audio signal input from the microphone 6 contains a response voice signal output from the identification information embedding means 1B of the device itself or of another device, a method of estimating a wideband spectrum from a narrowband spectrum makes it possible to estimate the entire response voice signal presumed to have been output from the voice dialogue means 4 of the device itself or of another device.

応答音声差引手段３８は、応答音声推定手段３７により推定された応答音声信号を、入力音声信号から差し引く。これにより、自己のまたは他の機器の音声対話手段４から出力されたと推定される応答音声信号を、入力音声信号から差し引くことができる。例示的には、応答音声差引手段３８は減算器とすることができる。 The response voice subtracting means 38 subtracts the response voice signal estimated by the response voice estimating means 37 from the input voice signal. As a result, the response voice signal presumed to have been output from the voice interaction means 4 of the device itself or another device can be subtracted from the input voice signal. Illustratively, response voice subtractor 38 may be a subtractor.

推定された応答音声信号を入力音声信号から差し引いた音声信号には、利用者の発話音声に関する信号が主に含まれている。差し引いた音声信号に含まれている、利用者の発話音声以外の成分としては、例えば周囲の環境音に関する音声信号や、ノイズに関する音声信号が含まれている。 A speech signal obtained by subtracting the estimated response speech signal from the input speech signal mainly contains a signal related to the user's uttered speech. Components other than the user's uttered voice contained in the subtracted audio signal include, for example, an audio signal related to ambient environmental sounds and an audio signal related to noise.

切替手段３９は、判別手段２３から入力される識別情報の判別結果に基づいて、応答音声差引手段３８が出力する、推定された応答音声信号を入力音声信号から差し引いた音声信号と、入力音声信号とを切り替えて出力する。 Based on the determination result of the identification information input from the determination means 23, the switching means 39 switches between and outputs two signals: the audio signal output by the response voice subtracting means 38, which is obtained by subtracting the estimated response voice signal from the input audio signal, and the input audio signal itself.

入力音声信号に識別情報が重畳されていない場合には、利用者の発話音声が入力音声に主に含まれているので、切替手段３９は、マイクロフォン６から入力される入力音声信号を、音声対話手段４に出力する。 When the identification information is not superimposed on the input audio signal, the input audio mainly contains the user's uttered voice, so the switching means 39 outputs the input audio signal input from the microphone 6 to the voice dialogue means 4.

一方で、入力音声信号に識別情報が重畳されている場合には、自己のまたは他の機器の音声対話手段４から出力された応答音声が入力音声に含まれている。よって、誤認識を防止するために、切替手段３９は、推定された応答音声信号を入力音声信号から差し引いた音声信号を、音声対話手段４に出力する。この際に切替手段３９が音声対話手段４に出力する音声信号には、利用者の発話音声に関する音声信号が主に含まれている。 On the other hand, when the identification information is superimposed on the input voice signal, the input voice includes the response voice output from the voice interaction means 4 of the device itself or another device. Therefore, in order to prevent erroneous recognition, the switching means 39 outputs a voice signal obtained by subtracting the estimated response voice signal from the input voice signal to the voice interaction means 4 . At this time, the voice signal output by the switching means 39 to the voice dialogue means 4 mainly includes voice signals relating to the user's uttered voice.
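The two cases handled by the response voice subtracting means 38 and the switching means 39 can be summarized in a short sketch. The function name `exclude_response`, the Python `bool` flag (used here in place of the "0"/"1" encoding), and the assumption that both frames are already time-aligned and of equal length are illustrative choices, not details taken from the embodiment.

```python
def exclude_response(input_frame, estimated_response_frame, identification_present):
    """Select the signal passed on to the voice dialogue means.

    identification_present=False: the input frame is passed through unchanged.
    identification_present=True:  the estimated response voice is subtracted
                                  sample by sample (role of means 38).
    """
    if len(input_frame) != len(estimated_response_frame):
        raise ValueError("frames must have the same length")
    if not identification_present:
        return list(input_frame)
    return [x - r for x, r in zip(input_frame, estimated_response_frame)]
```

If the estimate were perfect, the subtraction would leave only the user's voice plus ambient sound and noise. In practice, the quality of the narrowband-to-wideband estimate and the time alignment of the two frames bound how much of the response voice is actually removed.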

以上、第２の実施形態に係る音声対話装置１０Ｂによると、応答音声信号であることを示す識別情報に基づいて、音声対話手段４にて処理する音声信号を切り替えることができる。 As described above, according to the voice dialogue apparatus 10B according to the second embodiment, the voice signal to be processed by the voice dialogue means 4 can be switched based on the identification information indicating that it is the response voice signal.

入力音声信号に識別情報が含まれている場合には、推定された応答音声信号を入力音声信号から差し引いた音声信号を、音声対話手段４に出力する。これにより、音声対話手段４には、利用者の発話音声に関する音声信号を主に含む音声信号が入力される。これにより、音声対話装置１０Ｂは、応答音声の誤認識を防止することができる。 If the input voice signal contains the identification information, the voice signal obtained by subtracting the estimated response voice signal from the input voice signal is output to the voice dialogue means 4 . As a result, an audio signal mainly including an audio signal relating to the user's uttered voice is input to the audio dialogue means 4 . As a result, the voice interaction device 10B can prevent erroneous recognition of the response voice.

また、音声対話装置１０Ｂは、自己発話による誤認識を防止することができるし、複数台が同じ環境内に存在している状況であっても、他の機器が発する応答音声を利用者による発話音声として誤って認識するという態様の誤認識も防止することができる。 Further, the voice dialogue device 10B can prevent erroneous recognition caused by its own utterances, and even when a plurality of devices are present in the same environment, it can also prevent the kind of erroneous recognition in which a response voice emitted by another device is mistaken for a user's uttered voice.

そのうえ、音声対話装置１０Ｂは、バージ・イン（barge in）と呼ばれる、音声対話装置１０Ｂの応答と利用者の発話とが重なった場合であっても、応答音声の誤認識を防止することができる。 In addition, the speech dialogue device 10B can prevent erroneous recognition of the response speech even when the response of the speech dialogue device 10B overlaps with the user's utterance, which is called barge in. .

音声対話装置１０Ｂは、入力音声信号に含まれていると期待される応答音声信号の全体を推定し、推定した応答音声信号を入力音声信号から差し引いて、利用者の発話音声に関する音声信号を主に含む音声信号を音声対話手段４に入力する。これにより、音声対話装置１０Ｂは、バージ・インの状況であっても、利用者の発話音声に関する音声信号を主に含む音声信号を音声対話手段４に入力することができ、利用者の発話音声に対して適切な応答音声を返答することができる。 The voice dialogue device 10B estimates the entire response voice signal expected to be contained in the input audio signal, subtracts the estimated response voice signal from the input audio signal, and inputs an audio signal that mainly contains the user's uttered voice to the voice dialogue means 4. As a result, even in a barge-in situation, the voice dialogue device 10B can input an audio signal that mainly contains the user's uttered voice to the voice dialogue means 4, and can return an appropriate response voice to the user's utterance.

［第３の実施形態］ [Third embodiment]
第１および第２の実施形態では、音声対話装置１０（１０Ａ，１０Ｂ）が備える各手段は、各手段が備える各機能毎に作製された電子回路によりハードウェアとして実現されている。第３の実施形態では、音声対話装置１０（１０Ａ，１０Ｂ）が備える各手段の少なくとも一部の機能を、ソフトウェアとして実現する。 In the first and second embodiments, each means of the voice dialogue device 10 (10A, 10B) is implemented as hardware by electronic circuits built for each function of that means. In the third embodiment, at least some of the functions of the means of the voice dialogue device 10 (10A, 10B) are implemented as software.

図５は、音声対話装置１０の各手段をソフトウェアとして実現する場合の、音声対話装置１０のハードウェア構成を示すブロック図である。 FIG. 5 is a block diagram showing the hardware configuration of voice interaction device 10 when each means of voice interaction device 10 is implemented as software.

図５に示すように、音声対話システム１００は、音声対話装置１０と、スピーカ５と、マイクロフォン６とを備える。任意の構成として、音声対話システム１００は、入力部９６と、出力部９７とを備えることができる。例示的には、音声対話システム１００はスマートフォンで構成することができる。例示的には、音声対話装置１０は汎用コンピュータで構成することもできる。任意の機能として、音声対話装置１０は、ネットワーク９９を介して外部サーバ（図示せず）と接続することもできる。 As shown in FIG. 5, the voice dialogue system 100 includes the voice dialogue device 10, a speaker 5, and a microphone 6. As an optional configuration, the voice dialogue system 100 can include an input unit 96 and an output unit 97. Illustratively, the voice dialogue system 100 can be configured as a smartphone. Illustratively, the voice dialogue device 10 can also be configured as a general-purpose computer. As an optional function, the voice dialogue device 10 can also connect to an external server (not shown) via a network 99.

音声対話装置１０は、データ処理を行うＣＰＵ９１と、データ処理の作業領域に使用するメモリ９２と、処理データを記録する記録部９３と、各部の間でデータを伝送するバス９４と、外部機器とのデータの入出力を行うインタフェース部９５（以下、Ｉ／Ｆ部と記す）とを備えている。 The voice interaction apparatus 10 includes a CPU 91 that performs data processing, a memory 92 that is used as a work area for data processing, a recording unit 93 that records processed data, a bus 94 that transmits data between each unit, and an external device. and an interface section 95 (hereinafter referred to as an I/F section) for inputting and outputting data.

入力部９６および出力部９７は、音声対話装置１０に接続されている。例示的には、入力部９６はキーボードまたはマウス等の入力装置であり、出力部９７は液晶ディスプレイ等の表示装置である。 The input section 96 and the output section 97 are connected to the voice interaction device 10 . Illustratively, the input unit 96 is an input device such as a keyboard or mouse, and the output unit 97 is a display device such as a liquid crystal display.

音声対話装置１０は、第１および第２の実施形態において図１～図４を用いて説明した音声対話装置１０の各手段が行う処理を行うためのプログラムを、例えば実行形式（例えばプログラミング言語からコンパイラにより変換されて生成される）で記録部９３またはメモリ９２に予め記録している。音声対話装置１０は、記録部９３またはメモリ９２に記録したプログラムを使用して処理を行う。または、プログラムは、例えばＤＶＤ－ＲＯＭやＵＳＢメモリ等の、コンピュータ読み取り可能であって非一時的な有形の記録媒体９８から記録部９３またはメモリ９２にインストールされてもよいし、別所に配置された外部サーバ（図示せず）からネットワーク９９を介して記録部９３またはメモリ９２にインストールされてもよい。 The voice dialogue device 10 has a program for performing the processing carried out by each means of the voice dialogue device 10, described in the first and second embodiments with reference to FIGS. 1 to 4, recorded in advance in the recording unit 93 or the memory 92, for example in executable form (for example, generated by converting a program written in a programming language with a compiler). The voice dialogue device 10 performs processing using the program recorded in the recording unit 93 or the memory 92. Alternatively, the program may be installed in the recording unit 93 or the memory 92 from a computer-readable, non-transitory, tangible recording medium 98 such as a DVD-ROM or USB memory, or may be installed in the recording unit 93 or the memory 92 via the network 99 from an external server (not shown) located elsewhere.

第１および第２の実施形態において音声対話装置１０の各手段によって行われていた処理は、本実施形態では、記録部９３またはメモリ９２に格納されたプログラムに基づいて、ＣＰＵ９１が行う。ＣＰＵ９１はメモリ９２を作業領域として必要なデータ（処理途中の中間データ等）を一時記憶し、記録部９３に演算結果等の長期保存するデータを適宜記録する。 In the present embodiment, the CPU 91 performs the processing performed by each means of the voice interaction apparatus 10 in the first and second embodiments based on the program stored in the recording unit 93 or the memory 92 . The CPU 91 temporarily stores necessary data (intermediate data during processing, etc.) using the memory 92 as a work area, and appropriately records data to be stored for a long time, such as calculation results, in the recording unit 93 .

なお、第１および第２の実施形態において図１～図４を用いて説明した音声対話装置１０では、各手段は電子回路によりハードウェアとして実現されているが、本実施形態では、音声対話装置１０が備える各手段の少なくとも一部の機能は、ＣＰＵ９１によりソフトウェア的に実現されている。 Note that in the voice dialogue device 10 described in the first and second embodiments with reference to FIGS. 1 to 4, each means is implemented as hardware by electronic circuits, whereas in the present embodiment at least some of the functions of the means of the voice dialogue device 10 are implemented in software by the CPU 91.

［その他の形態］ [Other forms]
以上、本発明を特定の実施形態によって説明したが、本発明は上記した実施形態に限定されるものではない。 Although the present invention has been described in terms of specific embodiments, the present invention is not limited to the embodiments described above.

上記実施形態では、可聴帯域外の低周波数側または高周波数側の周波数帯域のうち、高周波数側の周波数帯域に識別情報を埋め込む場合を一例として説明しているが、識別情報を埋め込む周波数帯域は、可聴帯域外の低周波数側であってもよい。 In the above-described embodiment, the case where the identification information is embedded in the frequency band on the high frequency side of the frequency bands on the low frequency side or the high frequency side outside the audible band is described as an example, but the frequency band in which the identification information is embedded is , may be on the low frequency side outside the audible band.

上記第１の実施形態では、ダウンサンプル手段３３は切替手段３２の出力側に接続されているが、ダウンサンプル手段３３は、帯域制限手段３１の出力側に接続されていてもよい。同様に、上記第２の実施形態では、ダウンサンプル手段３６Ａは帯域制限手段３４の出力側に接続され、ダウンサンプル手段３６Ｂは復調手段３５の出力側に接続されているが、これら２台のダウンサンプル手段３６（３６Ａ，３６Ｂ）に替えて、１台のダウンサンプル手段３６を切替手段３９の出力側に接続してもよい。 Although the down-sampling means 33 is connected to the output side of the switching means 32 in the first embodiment, the down-sampling means 33 may be connected to the output side of the band limiting means 31 . Similarly, in the second embodiment, the downsampling means 36A is connected to the output side of the band limiting means 34, and the downsampling means 36B is connected to the output side of the demodulation means 35. A single down-sampling means 36 may be connected to the output side of the switching means 39 instead of the sampling means 36 (36A, 36B).

上記実施形態では、音声対話装置１０は音声対話手段４を備えているが、音声対話手段４は音声対話装置１０内に備えられる必要はない。例えば、外部サーバ（図示せず）が音声対話手段４を備えており、音声対話装置１０がネットワーク９９を介して外部サーバの音声対話手段４と接続されていてもよい。すなわち、音声対話手段４はクラウド化されていてもよい。また、クラウド化にあたり、音声対話手段４は、音声対話手段４が備える構成の全てがクラウド化される必要はなく、全部または一部がクラウド化されてもよい。 In the above embodiments, the voice dialogue device 10 includes the voice dialogue means 4, but the voice dialogue means 4 need not be provided within the voice dialogue device 10. For example, an external server (not shown) may include the voice dialogue means 4, and the voice dialogue device 10 may be connected to the voice dialogue means 4 of the external server via the network 99. That is, the voice dialogue means 4 may be hosted in the cloud. In that case, not every component of the voice dialogue means 4 needs to be moved to the cloud; all or only part of it may be.

上記実施形態では、音声対話装置１０は一体の装置として実現されているが、音声対話装置１０は一体の装置である必要はない。音声対話装置１０の各手段が別所に配置され、これらがネットワークで接続されていてもよい。音声対話装置１０の各手段をソフトウェアとして実現する場合も同様に、ＣＰＵ９１、メモリ９２、記録部９３等が別所に配置され、これらがネットワークで接続されていてもよい。 In the above embodiment, the voice interaction device 10 is implemented as an integrated device, but the voice interaction device 10 need not be an integrated device. Each means of the voice interaction device 10 may be arranged separately and connected by a network. Similarly, when each means of the voice interaction apparatus 10 is implemented as software, the CPU 91, memory 92, recording unit 93, etc. may be arranged separately and connected via a network.

また、上記した音声対話装置１０の各手段をソフトウェアとして実現する場合において、音声対話装置１０の各手段が行う処理は、単一のＣＰＵ９１で実行されているが、これら各手段が行う処理は、単一のＣＰＵ９１で実行される必要は必ずしもなく、複数のＣＰＵで分散して処理されてもよい。また、ＣＰＵ９１に代えて、ＦＰＧＡ（Field Programmable Gate Array）が処理を行ってもよいし、例えばＧＰＵ（Graphics Processing Unit）をアクセラレータとして用いて、ＣＰＵ９１が行う並列演算処理を補助してもよい。すなわちＣＰＵ９１が行う処理とは、ＣＰＵまたはＦＰＧＡが、ＧＰＵ等のアクセラレータを用いて行う処理も含むことを意味する。 Further, in the case where each means of the voice dialogue device 10 described above is realized as software, the processing performed by each means of the voice dialogue device 10 is executed by a single CPU 91, but the processing performed by each means is as follows: It does not necessarily have to be executed by a single CPU 91, and may be processed in a distributed manner by a plurality of CPUs. Also, instead of the CPU 91, an FPGA (Field Programmable Gate Array) may perform processing, or a GPU (Graphics Processing Unit) may be used as an accelerator to assist the parallel arithmetic processing performed by the CPU 91, for example. That is, the processing performed by the CPU 91 includes processing performed by the CPU or FPGA using an accelerator such as a GPU.

以下に、本発明の実施例を示し、本発明の特徴をより明確にする。 Examples of the present invention are shown below to further clarify the features of the present invention.

図６に、識別情報として変調信号を応答音声信号に埋め込んだ場合の音声信号のスペクトルの一例を示す。 FIG. 6 shows an example of the spectrum of the voice signal when the modulated signal is embedded in the response voice signal as the identification information.

図６（ａ）は、音声対話エンジンが出力する応答音声信号を４８ｋＨｚにアップサンプリングした信号のスペクトルである。図６（ｂ）は、図６（ａ）に示す信号に情報埋込を行った信号のスペクトルである。 FIG. 6(a) shows the spectrum of a signal obtained by upsampling the response voice signal output by the voice dialogue engine to 48 kHz. FIG. 6(b) shows the spectrum of the signal obtained by embedding information in the signal shown in FIG. 6(a).

情報埋込は次の手順で行った。まず、２ｋＨｚのローパスフィルタにより、図６（ａ）に示す信号を２ｋＨｚ以下の周波数帯域に制限することにより、２ｋＨｚ以下に帯域制限された応答音声信号を得た。次に、この２ｋＨｚ以下に帯域制限された応答音声信号を、１８ｋＨｚのキャリア周波数で変調することにより、変調信号を得た。最後に、この変調信号を図６（ａ）に示す信号に重畳することにより情報埋込を行い、図６（ｂ）に示す信号のスペクトルを得た。 Information embedding was performed by the following procedure. First, the signal shown in FIG. 6(a) was limited to a frequency band of 2 kHz or less with a 2 kHz low-pass filter, yielding a response voice signal band-limited to 2 kHz or less. Next, this band-limited response voice signal was modulated with a carrier frequency of 18 kHz to obtain a modulated signal. Finally, information embedding was performed by superimposing this modulated signal on the signal shown in FIG. 6(a), and the spectrum shown in FIG. 6(b) was obtained.

Referring to FIG. 6(b), frequency components with a high sound level are confirmed to exist around 18 kHz, which corresponds to the carrier frequency.

As shown in FIG. 6(b), the frequency components around 18 kHz are separated from the frequency components in the human audible band of 20 Hz to 15 kHz. It was therefore confirmed that the frequency components around 18 kHz can be used as identification information indicating whether or not a signal is the response voice signal output by the voice dialogue engine.
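One way to use the components around 18 kHz as identification information is to compare the power in that band with the total power of the microphone signal, in the spirit of the power-ratio determination recited in the claims. The sketch below is illustrative only: the 16 kHz to 20 kHz band (carrier ± 2 kHz), the threshold, the frame length, and the names `band_power` and `contains_identification` are assumptions, and the band-limited power is obtained by summing DFT bins rather than by an explicit band-limiting filter.

```python
import cmath
import math

FS = 48_000              # sampling rate (Hz), from the text
BAND = (16_000, 20_000)  # assumed band: 18 kHz carrier +/- 2 kHz


def band_power(frame, fs, lo, hi):
    """Mean power of `frame` carried by the DFT bins within [lo, hi] Hz."""
    n = len(frame)
    total = 0.0
    for k in range(n // 2 + 1):
        if lo <= k * fs / n <= hi:
            coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                        for i, x in enumerate(frame))
            # For a real signal, every bin except DC and Nyquist has a
            # mirror-image bin with the same magnitude, so count it twice.
            weight = 1 if k == 0 or 2 * k == n else 2
            total += weight * abs(coeff) ** 2 / n ** 2
    return total


def contains_identification(frame, fs=FS, threshold=0.01):
    """Return True if the band around the carrier holds a notable power share.

    threshold is a hypothetical value; it would need tuning on real audio.
    """
    full = sum(x * x for x in frame) / len(frame)
    if full == 0.0:
        return False  # silent frame: nothing to detect
    return band_power(frame, fs, *BAND) / full > threshold
```

A frame that contains only audible-band speech yields a ratio near zero, whereas a frame picked up from the speaker with the embedded signal yields a clearly larger ratio, so a simple threshold separates the two cases in this idealized setting.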

Furthermore, because of the modulation performed at the time of information embedding, the frequency components around 18 kHz correspond to the response voice signal output from the voice dialogue engine. Therefore, by demodulating the frequency components around 18 kHz with the same carrier frequency as in the modulation, the response voice signal can be restored. The restored signal has the frequency components of the response voice signal band-limited to 2 kHz or less. It was thus confirmed that the response voice signal output by the voice dialogue engine, as it was before the 2 kHz low-pass filter was applied, can be estimated using a known method for estimating a wideband spectrum from a narrowband spectrum.
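The demodulation step can be sketched as follows. This is a simplified illustration, not the patented implementation: it assumes coherent demodulation, that is, the receiver's carrier has the same frequency and phase as the embedded one, which an acoustic path between speaker and microphone would not guarantee. The names `lowpass` and `demodulate` and the 201-tap filter are hypothetical. Multiplying by 2·cos(2π·18 kHz·t) moves the embedded components back to 0 to 2 kHz, and the low-pass filter removes everything else, including the audible-band content that the multiplication shifts up toward the carrier. The wideband estimation from the narrowband result is not shown.

```python
import math

FS = 48_000       # sampling rate (Hz), from the text
CUTOFF = 2_000    # bandwidth of the embedded signal (Hz), from the text
CARRIER = 18_000  # carrier frequency (Hz), from the text


def lowpass(signal, cutoff_hz, fs, num_taps=201):
    """Hamming-windowed-sinc FIR low-pass; output is aligned with the input.

    num_taps is an assumed (odd) filter length, not a value from the text.
    """
    half = num_taps // 2
    fc = cutoff_hz / fs  # cutoff in cycles per sample
    taps = []
    for n in range(-half, half + 1):
        ideal = 2 * fc if n == 0 else math.sin(2 * math.pi * fc * n) / (math.pi * n)
        taps.append(ideal * (0.54 + 0.46 * math.cos(math.pi * n / half)))
    scale = sum(taps)  # normalize to unity gain at DC
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, tap in enumerate(taps):
            idx = i + half - j
            if 0 <= idx < len(signal):  # samples outside the signal count as zero
                acc += tap * signal[idx]
        out.append(acc / scale)
    return out


def demodulate(mic):
    """Recover the band-limited response signal from the 18 kHz region."""
    # Mix with the carrier: the embedded sidebands return to baseband.
    mixed = [2 * x * math.cos(2 * math.pi * CARRIER * n / FS)
             for n, x in enumerate(mic)]
    # Keep only the 0 to 2 kHz region that held the embedded signal.
    return lowpass(mixed, CUTOFF, FS)
```

The recovered signal is scaled by whatever level was used at embedding time, so a practical system would have to compensate for that level before subtracting an estimated response voice from the microphone input.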

100 Voice dialogue system
1 (1A, 1B) Identification information embedding means
2 Identification information determination means
3 (3A, 3B) Response voice exclusion means
4 Voice dialogue means
5 Speaker
6 Microphone
10 (10A, 10B) Voice dialogue device
11 Oscillation means
12 Superimposing means
14 Up-sampling means
15 Band-limiting means
16 Modulating means
17 Superimposing means
21 Band-limiting means
22 (22A, 22B) Power calculating means
23 Discriminating means
31 Band-limiting means
32 Switching means
33 Down-sampling means
34 Band-limiting means
35 Demodulation means
36 (36A, 36B) Down-sampling means
37 Response voice estimation means
38 Response voice subtraction means
39 Switching means
81 (81A, 81B) Conventional voice dialogue system
82 Speaker
83 Microphone
84 Response voice
85 Uttered voice
86 Erroneous response voice caused by misrecognition
89 User

Claims

1. A voice interaction device for outputting a response voice from a speaker based on a response voice signal synthesized by voice interaction means in response to a user's uttered voice, the device comprising:
identification information embedding means for embedding, in a frequency band outside the audible band of the response voice signal, identification information indicating that the signal is the response voice signal;
identification information determination means for determining whether or not the identification information is included in a frequency band outside the audible band of an input voice signal input from a microphone; and
response voice exclusion means for outputting, to the voice interaction means, a voice signal obtained by excluding at least the response voice signal from the input voice signal when it is determined that the identification information is included,
wherein said response voice exclusion means outputs, to said voice interaction means, a voice signal obtained by subtracting said response voice signal from said input voice signal when said identification information is included in said input voice signal, and
wherein the response voice exclusion means comprises:
third band limiting means for limiting the frequency band of the input voice signal and outputting the input voice signal;
demodulation means for multiplying the input voice signal by a carrier signal to generate a demodulated signal;
response voice estimation means for estimating a response voice signal from the demodulated signal;
response voice subtraction means for subtracting the estimated response voice signal from the input voice signal; and
second switching means for switching between the output of the response voice subtraction means and the input voice signal based on the determination result of the identification information.

2. A voice interaction device for outputting a response voice from a speaker based on a response voice signal synthesized by voice interaction means in response to a user's uttered voice, the device comprising:
identification information embedding means for embedding, in a frequency band outside the audible band of the response voice signal, identification information indicating that the signal is the response voice signal;
identification information determination means for determining whether or not the identification information is included in a frequency band outside the audible band of an input voice signal input from a microphone; and
response voice exclusion means for outputting, to the voice interaction means, a voice signal obtained by excluding at least the response voice signal from the input voice signal when it is determined that the identification information is included,
wherein said response voice exclusion means outputs, to said voice interaction means, a voice signal obtained by subtracting said response voice signal from said input voice signal when said identification information is included in said input voice signal,
wherein the identification information embedding means comprises:
fourth band limiting means for limiting the frequency band of the response voice signal;
modulating means for generating a modulated signal by multiplying the band-limited response voice signal by a carrier signal; and
second superimposing means for embedding the modulated signal as the identification information in a frequency band of the response voice signal including the carrier frequency of the carrier signal by superimposing the modulated signal on the response voice signal, and
wherein the upper and lower limits of the frequency band of the modulated signal are frequencies outside the audible band.

3. The voice interaction apparatus according to claim 1, wherein said response voice exclusion means outputs said input voice signal to said voice interaction means when said input voice signal does not contain said identification information.

4. The voice interaction device according to any one of claims 1 to 3, wherein the identification information determination means comprises:
first band limiting means for limiting the frequency band of the input voice signal;
first power calculation means for calculating the power of the input voice signal and the power of the band-limited input voice signal; and
determining means for determining whether or not the identification information is included in the input voice signal based on the ratio of the power of the input voice signal to the power of the band-limited input voice signal.

5. The voice interaction device according to any one of claims 1 to 4, wherein said audible band is a frequency band in the range of 20 Hz to 15 kHz.

6. A program for causing a computer to function as each means of the voice interaction device according to any one of claims 1 to 5.

7. A voice dialogue system comprising:
the voice interaction device according to any one of claims 1 to 5;
a microphone that outputs the input voice signal to the voice interaction device based on input voice; and
a speaker that outputs the response voice based on the response voice signal input from the voice interaction device.