JP2006098534A

JP2006098534A - Device and method for speech processing

Info

Publication number: JP2006098534A
Application number: JP2004282410A
Authority: JP
Inventors: Ryushi Funayama; 竜士船山
Original assignee: Toyota Motor Corp
Current assignee: Toyota Motor Corp
Priority date: 2004-09-28
Filing date: 2004-09-28
Publication date: 2006-04-13

Abstract

PROBLEM TO BE SOLVED: To provide a device and a method for speech processing that can reliably extract only the speech uttered by a person, using a simple configuration. SOLUTION: A speech processing device that processes speech, such as dialogue between a person and a device, is equipped with a sound-collecting means 10; a specific frequency passing means 20 that passes a specific frequency region of the speech collected by the sound-collecting means 10; a speech processing means 22 that performs speech processing based on the speech passed through the specific frequency passing means 20; and a sound generating means 11 that generates sound based on a signal from a sound generation controller 22. The specific frequency region of the specific frequency passing means 20 is set to a frequency region different from that of the sound generated by the sound generating means 11. COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to a speech processing device and a speech processing method for processing speech such as dialogue between a person and a device.

In recent years, systems have been developed that perform various kinds of processing while conversing with the user, for example for destination setting in car navigation or for communication with a personal robot. Such a system recognizes speech input from a microphone, outputs from a speaker conversation content composed according to the recognized speech, and thereby carries on a dialogue with the user.

When speech recognition is performed in this kind of dialogue, the sound output by the system enters the microphone together with the speech uttered by the user. Consequently, if the system and the user produce sound at the same time, recognition accuracy drops and the user's speech can no longer be recognized correctly. It has therefore been necessary to keep the system's sound and the user's speech from overlapping, for example by having the user wait until the system's audio output has finished before speaking, or by having the user press a switch to interrupt the system's audio output before speaking. Alternatively, a separate device such as a headset has been needed so that sound from the speaker does not enter the microphone.

Patent Document 1 discloses an input speech extraction device that extracts only the speech uttered by the user even when that speech overlaps with a sound signal output by the device. This device extracts only the user's speech by subtracting a sound signal prepared in advance on the device side from the speech signal input to the microphone.

Patent Document 1: JP 10-307595 A

However, with the input speech extraction device described above, when the user's speech and the sound signal output by the device share frequency components, the user's speech alone cannot be extracted for those shared components, so recognition accuracy drops. If, to avoid this, the frequency components of the sound signal output by the device are restricted according to the frequency characteristics of the user's speech, part of the frequency range becomes unusable and the output speech sounds unnatural. In addition, the device must constantly run an algorithmic process that subtracts the sound signal prepared on the device side from the speech signal input to the microphone, which makes the processing cumbersome. Furthermore, the device cannot perform this processing unless the sound signal to be output is known in advance.

An object of the present invention is therefore to provide a speech processing device and a speech processing method that can reliably extract only the speech uttered by a person with a simple configuration.

A speech processing device according to the present invention comprises: sound-collecting means; specific frequency passing means that passes a specific frequency region of the speech collected by the sound-collecting means; speech processing means that performs speech processing based on the speech that has passed through the specific frequency passing means; and sound generating means that generates sound based on a signal from a sound generation control device, wherein the specific frequency region of the specific frequency passing means is set to a frequency region different from the frequency region of the sound generated by the sound generating means.

This speech processing device applies various kinds of speech processing, such as speech recognition, to speech uttered by a person, and generates various sounds, such as conversational speech, for the person. To do so, the device collects the person's speech with the sound-collecting means and generates, from the sound generating means, sound based on the signal from the sound generation control device. During collection, sound produced by the sound generating means may also be picked up. The specific frequency passing means passes only a specific frequency region of the collected speech, one that differs from the frequency region of the sound generated by the sound generating means. Therefore, when the person's speech and the sound from the sound generating means occur at the same time, the speech collected by the sound-collecting means contains frequency components of both, but the speech that has passed through the specific frequency passing means does not contain the sound from the sound generating means. In other words, only frequency components of the person's speech are extracted.

The speech processing means then performs speech processing such as speech recognition on the speech that has passed through the specific frequency passing means. Because processing is applied only to frequency components of the person's speech, processing accuracy improves. In particular, when the speech processing is speech recognition, recognition accuracy improves and the person's speech can be recognized correctly. Thus, with a simple configuration that merely passes, out of the collected speech, a frequency region different from that of the sound generated by the sound generating means, this device can reliably extract only the speech uttered by the person. As a result, even when a person speaks while the sound generating means is producing sound, the person's speech can be processed just as well as when the person is speaking alone.

In the above speech processing device of the present invention, second specific frequency passing means that passes a second specific frequency region of the sound based on the signal from the sound generation control device may be provided between the sound generation control device and the sound generating means, with the specific frequency region of the specific frequency passing means and the second specific frequency region of the second specific frequency passing means set to different frequency regions.

In this speech processing device, the second specific frequency passing means passes, out of the sound based on the signal from the sound generation control device, only the second specific frequency region, which differs from the specific frequency region of the specific frequency passing means. The sound generating means then generates only the sound that has passed through the second specific frequency passing means. The sound generated by the sound generating means therefore contains frequency components only in the second specific frequency region, whereas speech uttered by a person contains components across the whole range from low to high frequencies. Consequently, when the person's speech and the sound from the sound generating means occur at the same time, the two overlap in the second specific frequency region, but only the person's speech is present in the frequency regions outside it (which include the specific frequency region). The speech that has passed through the specific frequency passing means thus never contains sound generated by the sound generating means, so only the speech uttered by the person can be extracted even more reliably.
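The separation argument above can be illustrated with a small numeric sketch. Everything concrete here is an assumption made for illustration: spectra are modeled as lists of per-bin magnitudes, overlapping sounds are modeled by simply adding magnitudes (phase is ignored), and the band layout and the names `apply_mask`, `first_mask`, and `second_mask` are hypothetical rather than taken from the patent.

```python
# Illustrative sketch only: spectra are plain lists of per-bin magnitudes.
# The band layout (bins 0-3 for the microphone path, bins 4-7 for the
# speaker path) is an arbitrary assumption, not a value from the patent.

def apply_mask(spectrum, mask):
    """Keep the bins where mask is 1 and zero the rest."""
    return [s * m for s, m in zip(spectrum, mask)]

first_mask = [1, 1, 1, 1, 0, 0, 0, 0]        # microphone-side pass region
second_mask = [1 - m for m in first_mask]     # speaker-side pass region

user_voice = [5, 4, 3, 2, 2, 3, 1, 1]         # human speech spans all bins
device_raw = [1, 2, 6, 6, 4, 3, 2, 1]         # device output before filtering

# The speaker emits only the second region.
device_out = apply_mask(device_raw, second_mask)

# Both sounds reach the microphone together (simplified as a sum of magnitudes).
mic_input = [u + d for u, d in zip(user_voice, device_out)]

# The microphone-side filter keeps only the first region.
recognizer_input = apply_mask(mic_input, first_mask)

# What remains is the user's voice restricted to the first region.
assert recognizer_input == apply_mask(user_voice, first_mask)
```

In this toy model the device sound is removed completely because the two masks share no bin; with real filters the rejection would be limited by how sharply the pass bands roll off.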

In the above speech processing device of the present invention, the specific frequency region of the specific frequency passing means and the second specific frequency region of the second specific frequency passing means are preferably set alternately.

In this speech processing device, the specific frequency region that passes the collected speech and the second specific frequency region that passes the sound based on the signal from the sound generation control device are set alternately, so that both regions can be laid out over a wide frequency range from low to high frequencies. That is, rather than each region being concentrated in one frequency band, both can be distributed in separate pieces across a wide range. Speech processing can therefore draw on a wide range of frequency components of the person's speech, which improves processing accuracy (recognition accuracy, when the speech processing is speech recognition). Likewise, the sound generated by the sound generating means is made up of components from a wide frequency range, so a wide variety of sounds can be generated (natural-sounding speech, when speech is generated).

The above speech processing device of the present invention may include person identifying means that identifies the person who uttered the speech collected by the sound-collecting means, with the specific frequency region of the specific frequency passing means changed according to the characteristics of the speech of the person identified by the person identifying means.

In this speech processing device, the person identifying means identifies the person who uttered the speech, and the specific frequency region passed by the specific frequency passing means is set according to the characteristics of that person's speech. The frequency characteristics of speech differ from person to person (for example, they differ clearly between a person whose voice is stronger in high tones and one whose voice is stronger in low tones). When speech processing such as speech recognition is performed, processing is therefore centered on the frequency regions characteristic of that person, which improves processing accuracy. Because only the specific frequency region is used rather than the entire frequency range, less information is available for speech processing; however, because the characteristic frequency regions are the ones processed, speech processing accuracy (in particular, recognition accuracy) does not drop.
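One way to picture speaker-dependent band selection is the sketch below. It is a hypothetical illustration, not the method of the patent: the per-band energy profiles, the rule "take the bands with the most energy", the assignment of the remaining bands to the speaker path, and the names `select_pass_bands` and `PROFILES` are all assumptions.

```python
def select_pass_bands(band_energy, n_pass):
    """Pick the n_pass bands where this speaker has the most energy.

    Returns (pass_bands, stop_bands) as sorted lists of band indices.
    Handing the stop bands to the speaker path is an assumption of this sketch.
    """
    ranked = sorted(range(len(band_energy)),
                    key=lambda b: band_energy[b], reverse=True)
    pass_bands = sorted(ranked[:n_pass])
    stop_bands = sorted(ranked[n_pass:])
    return pass_bands, stop_bands

# Hypothetical stored profiles: average energy per band for two kinds of voice.
PROFILES = {
    "low_voice": [9, 8, 6, 3, 2, 1],
    "high_voice": [1, 2, 3, 6, 8, 9],
}

mic_bands, speaker_bands = select_pass_bands(PROFILES["low_voice"], n_pass=3)
print(mic_bands, speaker_bands)  # [0, 1, 2] [3, 4, 5]
```

With these made-up profiles, a low voice keeps the three lowest bands on the microphone path and a high voice keeps the three highest, so recognition in each case works on the bands that carry most of that speaker's energy.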

In the above speech processing device of the present invention, the person identifying means may be feature detecting means that detects features of the speech collected by the sound-collecting means.

In this speech processing device, the person uttering the speech is identified by detecting features of the speech collected by the sound-collecting means. The features may be detected directly from the collected speech, or from the speech components of the collected speech that the specific frequency passing means has passed in the specific frequency region.

In the above speech processing device of the present invention, the specific frequency region of the specific frequency passing means and the second specific frequency region of the second specific frequency passing means are preferably switched as time passes.

In this speech processing device, the specific frequency region that passes the collected speech and the second specific frequency region that passes the sound based on the signal from the sound generation control device are switched as time passes. That is, given a frequency region A and a different frequency region B, at one time the specific frequency region is A and the second specific frequency region is B; at the next time the specific frequency region is B and the second specific frequency region is A; and at the time after that the specific frequency region is again A and the second specific frequency region is B. If the switching interval is made extremely short, frequency components thinned out at one time are filled in an extremely short time later, so all frequency components are obtained within an extremely short period. The sound generated by the sound generating means therefore contains all frequency components of the sound based on the signal from the sound generation control device, and sound (in particular, speech) that does not seem unnatural to the listener can be output. Likewise, the speech processed by the speech processing means contains all frequency components of the speech uttered by the person, so speech processing accuracy (in particular, recognition accuracy) also improves.
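The A/B swapping described above can be sketched as follows. The frame-based timing, the 8-bin masks, and the function name `masks_at` are assumptions made for illustration; the patent text does not specify a switching interval or a band layout.

```python
def masks_at(frame_index, region_a, region_b):
    """Return (first_mask, second_mask) for a frame; the roles swap every frame."""
    if frame_index % 2 == 0:
        return region_a, region_b
    return region_b, region_a

region_a = [1, 1, 0, 0, 1, 1, 0, 0]      # frequency region A (hypothetical layout)
region_b = [1 - m for m in region_a]      # frequency region B, the complement of A

covered_by_mic = [0] * len(region_a)
for t in range(2):
    first_mask, second_mask = masks_at(t, region_a, region_b)
    # Within any single frame the two paths never share a bin.
    assert all(f * s == 0 for f, s in zip(first_mask, second_mask))
    # Accumulate which bins the microphone path has passed so far.
    covered_by_mic = [c | f for c, f in zip(covered_by_mic, first_mask)]

# After two consecutive frames the microphone path has seen every bin.
assert covered_by_mic == [1] * len(region_a)
```

The same accumulation applies to the speaker path, which is why a short enough switching interval lets both the recognizer and the listener receive the full frequency range over time while the two paths stay separated at every instant.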

A speech processing method according to the present invention includes: a sound-collecting step; a specific frequency passing step of passing a specific frequency region of the speech collected in the sound-collecting step; a speech processing step of performing speech processing based on the speech that has passed through the specific frequency passing step; and a sound generating step of generating sound based on a signal from a sound generation control device, wherein the specific frequency region in the specific frequency passing step is set to a frequency region different from the frequency region of the sound generated in the sound generating step.

The above speech processing method of the present invention may include a second specific frequency passing step of passing a second specific frequency region of the sound based on the signal from the sound generation control device, with the specific frequency region in the specific frequency passing step and the second specific frequency region in the second specific frequency passing step set to different frequency regions.

In the above speech processing method of the present invention, the specific frequency region in the specific frequency passing step and the second specific frequency region in the second specific frequency passing step are preferably set alternately.

The above speech processing method of the present invention may include a person identifying step of identifying the person who uttered the speech collected in the sound-collecting step, with the specific frequency region in the specific frequency passing step changed according to the characteristics of the speech of the person identified in the person identifying step.

In the above speech processing method of the present invention, the person identifying step may be configured to detect features of the speech collected in the sound-collecting step.

In the above speech processing method of the present invention, the specific frequency region in the specific frequency passing step and the second specific frequency region in the second specific frequency passing step are preferably switched as time passes.

Each of the speech processing methods described above has the same operation and effects as the corresponding speech processing device described above.

According to the present invention, only the speech uttered by a person can be reliably extracted with a simple configuration that merely passes, out of the collected speech, a frequency region different from the frequency region of the output sound and performs speech processing based on the speech so passed.

Embodiments of the speech processing device and the speech processing method according to the present invention will now be described with reference to the drawings.

In the present embodiment, the invention is applied to a speech processing device in a car navigation system capable of conversational communication with the user for destination setting and the like. The speech processing device according to the present embodiment recognizes speech collected by a microphone, outputs from a speaker the speech of conversation content composed according to the recognized speech, and thereby carries on a dialogue with the user. In particular, the device performs speech recognition using only those frequency components of the collected speech that lie in a first specific frequency region (corresponding to the specific frequency region), and outputs from the speaker only those frequency components of the conversation speech that lie in a second specific frequency region. There are three embodiments: the first is the basic configuration of the speech processing device, the second adds a function that changes the first specific frequency region according to the characteristics of the user's speech, and the third adds a function that swaps the first and second specific frequency regions at extremely short intervals.

The overall configuration of the speech processing device 1 according to the present embodiment will be described with reference to FIG. 1. FIG. 1 is an overall configuration diagram of the speech processing device according to the present embodiment.

The speech processing device 1 is configured as one device in a car navigation system and is used to converse with the user for destination setting and the like. The speech processing device 1 may also be used as audio output means for, for example, route guidance announcements. The speech processing device 1 mainly comprises a microphone 10, a speaker 11, and a speech processing unit 12. The microphone 10 collects speech uttered by the user and other sound, and the speech processing unit 12 recognizes the user's speech from the collected sound. The speech processing unit 12 then composes the conversation content to be uttered in response to the recognized speech and generates the corresponding utterance speech, which is output from the speaker 11. In the present embodiment, the microphone 10 corresponds to the sound-collecting means recited in the claims, and the speaker 11 corresponds to the sound generating means recited in the claims.

The configuration of the speech processing device 1A according to the first embodiment will be described with reference to FIGS. 2 to 5. FIG. 2 is a configuration diagram of the speech processing device according to the first embodiment. FIG. 3 is an explanatory diagram of the processing in the speech processing device of FIG. 1 when only the user utters speech. FIG. 4 shows the speech signal collected by the microphone when the speech uttered by the user overlaps with the speech output by the speech processing device. FIG. 5 is an explanatory diagram of the processing in the speech processing device of FIG. 1 when the speech uttered by the user overlaps with the speech output by the speech processing device.

The speech processing device 1A is the basic configuration of the speech processing device according to the present embodiment. To converse with the user, it has a speech recognition function, a conversation composition function, an utterance speech generation function, a speech output function, and so on. In particular, even when the user's speech and the output speech occur at the same time, the speech processing device 1A can reliably extract only the user's speech and reliably recognize it. To this end, the speech processing device 1A comprises the microphone 10, the speaker 11, and a speech processing unit 12A, and the speech processing unit 12A includes a first filter 20, a second filter 21, and a speech control device 22.

In the first embodiment, the first filter 20 corresponds to the specific frequency passing means recited in the claims, the second filter 21 corresponds to the second specific frequency passing means recited in the claims, and the speech control device 22 corresponds to the speech processing means and the sound generation control device recited in the claims.

The microphone 10 collects sound carried as vibrations of the air (speech uttered by a person and speech output by the device). The microphone 10 converts the collected sound into the frequency domain, for example by a fast Fourier transform, and sends the first filter 20 an original input speech signal consisting of an electrical intensity for each frequency. The speaker 11 receives an output speech signal from the second filter 21. The speaker 11 converts this output speech signal, which consists of an electrical intensity for each frequency, for example by an inverse fast Fourier transform, and outputs the corresponding speech. In the graphs of the user's speech and the device's speech shown in FIG. 3 and other figures, the horizontal axis is frequency and the vertical axis is intensity, and each graph shows the frequency characteristics of the speech at one instant. These frequency characteristics change over time, differ from person to person, and also vary with what is being said.

The first filter 20 and the second filter 21 are each a kind of band-pass filter in which pass bands and stop bands are arranged at a constant, narrow frequency spacing from low to high frequencies, giving the comb-shaped frequency characteristics shown in FIG. 3. The pass bands of the first filter 20 and those of the second filter 21 have the same width and alternate without overlapping. That is, over the entire frequency range, each pass band of one filter coincides with a stop band of the other, so the two filters pass different frequency bands. The widths of the pass bands and stop bands are determined experimentally and set so that speech remains sufficiently recognizable and the output speech does not sound unnatural. In the filter frequency characteristics shown in FIG. 3 and other figures, the horizontal axis is frequency and the vertical axis takes the value 0 (the corresponding frequency component is blocked) or 1 (the corresponding frequency component is passed).
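A pair of complementary comb characteristics of this kind can be written down as 0/1 masks over frequency bins. The sketch below is illustrative only: the bin count, the band width of two bins, and the function name `comb_masks` are assumptions, since the text says only that the actual widths are found by experiment.

```python
def comb_masks(n_bins, band_width):
    """Return (first_mask, second_mask) with alternating pass bands of equal width.

    Each mask holds 1 for a bin that is passed and 0 for a bin that is blocked,
    matching the 0/1 vertical axis of the filter characteristics.
    """
    if band_width <= 0:
        raise ValueError("band_width must be positive")
    first = [1 if (k // band_width) % 2 == 0 else 0 for k in range(n_bins)]
    second = [1 - m for m in first]
    return first, second

first_mask, second_mask = comb_masks(n_bins=12, band_width=2)
print(first_mask)   # [1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0]
print(second_mask)  # [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
```

Every bin is passed by exactly one of the two masks, which is the property that keeps the speaker output out of the recognizer input while both paths still span low to high frequencies.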

When the original input speech signal is input from the microphone 10, the first filter 20 passes only the frequency components in the comb-shaped first specific frequency region shown in FIG. 3 and sends the resulting input speech signal, consisting of the frequency components in the first specific frequency region, to the speech control device 22. Likewise, when an original output speech signal is input from the speech control device 22, the second filter 21 passes only the frequency components in the comb-shaped second specific frequency region shown in FIG. 3 and sends the resulting output speech signal, consisting of the frequency components in the second specific frequency region, to the speaker 11.

The voice control device 22 comprises a CPU (Central Processing Unit), a ROM (Read Only Memory), a RAM (Random Access Memory), and the like. The voice control device 22 receives the input speech signal corresponding to the speech uttered by the user and outputs an original output speech signal for producing an appropriate spoken response to that speech. To this end, various programs for realizing each function are stored in the ROM of the voice control device 22; these programs are loaded into the RAM and executed by the CPU.

On receiving the input speech signal from the first filter 20, the voice control device 22 performs speech recognition processing based on this signal. Because the input speech signal has passed through the first specific frequency region of the first filter 20, it consists of frequency components thinned out in a comb shape. For this reason, the voice control device 22 stores dictionary data created in advance by training on speech signals of each item of language information that were thinned out with the first specific frequency region. The voice control device 22 converts the input speech signal into language information based on this dictionary data and thereby recognizes the user's speech.

Furthermore, the voice control device 22 composes a conversation according to the recognized language information. To this end, the voice control device 22 stores a database in which specific words contained in the user's utterances are associated with the conversation content to be output by the apparatus. For example, in destination setting, the word "destination" uttered by the user is associated with conversation content such as "Please tell me the address of the destination" and "Please enter the telephone number of the destination."
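A minimal sketch of such a keyword-to-response database follows. The specification does not describe how the database is stored or searched, so the dictionary `RESPONSES`, the function `compose_reply`, and the substring matching are all assumptions made for illustration.

```python
# Hypothetical database: keyword in the user's utterance -> candidate replies.
RESPONSES = {
    "destination": [
        "Please tell me the address of the destination.",
        "Please enter the telephone number of the destination.",
    ],
}


def compose_reply(recognized_text, responses=RESPONSES):
    """Return the replies associated with the first keyword found in the text."""
    text = recognized_text.lower()
    for keyword, prompts in responses.items():
        if keyword in text:
            return prompts
    return []  # no registered keyword was recognized


print(compose_reply("Destination setting"))
```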

Subsequently, the voice control device 22 performs speech synthesis on the composed conversation and generates an original output speech signal. This original output speech signal contains frequency components over the full range from low to high frequencies. Instead of performing speech synthesis in the apparatus, original output speech signals corresponding to the conversation content may be stored in the apparatus in advance.

Next, the operation of the speech processing apparatus 1A when the user sets a destination will be described. The case where only the user speaks is described with reference to FIG. 3, and the case where the user and the apparatus speak simultaneously is described with reference to FIG. 5.

First, the case where only the user speaks will be described. For example, when the user says "destination setting" in order to set a destination, the microphone 10 collects the speech. The collected speech undergoes frequency conversion and is passed through the first filter 20 as the original input speech signal. The input speech signal that has passed through the first specific frequency region of the first filter 20 is transmitted to the voice control device 22. This input speech signal contains only frequency components of the speech uttered by the user, and those components are thinned out in a comb shape.

The voice control device 22 obtains language information from the input speech signal based on the dictionary data and recognizes the speech. Although frequency components have been thinned out of the original input speech signal, the dictionary data was created from speech signals thinned out in the same way, so speech recognition works adequately. The voice control device 22 then sets the conversation content according to this language information. Here, in response to "destination", conversation content such as "Please tell me the address of the destination" is set. Subsequently, the voice control device 22 generates an original output speech signal according to that conversation content.

This original output speech signal is passed through the second filter 21. The output speech signal that has passed through the second specific frequency region of the second filter 21 is transmitted to the speaker 11. The speaker 11 converts the output speech signal and outputs it as speech. Although the output speech signal has had frequency components thinned out in a comb shape, the thinning interval is narrow and the remaining components are distributed over all frequencies from low to high, so the speech does not sound unnatural to the user during the dialogue.

Next, the case where the user and the speech processing apparatus 1A speak at the same time will be described. As shown in FIG. 4, when the user and the speech processing apparatus 1A speak simultaneously, the frequency components of the apparatus's speech are superimposed on those of the user's speech, so the spectrum differs from that of the user's speech alone. Because the frequency components output by the speech processing apparatus 1A are thinned out in a comb shape, the overlapping portions occur intermittently, in every other frequency band.

For example, suppose that the user starts saying "Chuo-ku, Tokyo ..." while the speech processing apparatus 1A is outputting "Please tell me the address of the destination." The frequency components of the two utterances are superimposed and enter the microphone 10. The microphone 10 collects the superimposed speech, which then undergoes frequency conversion. The resulting original input speech signal is passed through the first filter 20, and an input speech signal consisting of the frequency components that passed through the first specific frequency region is transmitted to the voice control device 22. Because this input speech signal consists only of frequency components in the first specific frequency region, it contains no components in the second specific frequency region. In other words, the frequency components of the speech output by the speech processing apparatus 1A (that is, of the output speech signal that passed through the second filter 21) cannot pass through the first filter 20, so the input speech signal does not contain the apparatus's own output. The input speech signal therefore contains only frequency components of the speech uttered by the user, and the voice control device 22 can perform speech recognition with the same accuracy as in the case where only the user speaks. The operation from the voice control device 22 onward is the same as described above, and its description is omitted.
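The separation argument above can be checked with a small numerical sketch. It assumes ideal 0/1 filters and treats the sound at the microphone as a simple sum of per-bin values, which is a simplification; the spectra, the band width `BAND`, and all names are made up for illustration.

```python
def apply_mask(spectrum, mask):
    """Keep only the bins where the mask is 1 (ideal filter assumption)."""
    return [s * m for s, m in zip(spectrum, mask)]


BAND = 2   # hypothetical band width, in bins
N = 12     # hypothetical number of frequency bins
first_mask = [1 if (k // BAND) % 2 == 0 else 0 for k in range(N)]
second_mask = [1 - m for m in first_mask]

# Made-up per-bin values for the user's speech and the unfiltered device speech.
user = [5.0, 4.0, 3.0, 6.0, 2.0, 1.0, 4.0, 3.0, 2.0, 2.0, 1.0, 1.0]
device_raw = [3.0] * N

device_out = apply_mask(device_raw, second_mask)    # what the speaker emits
at_mic = [u + d for u, d in zip(user, device_out)]  # both sounds superimposed
recognizer_in = apply_mask(at_mic, first_mask)      # after the first filter

# The recognizer sees the same signal as when the user speaks alone.
print(recognizer_in == apply_mask(user, first_mask))
```

Under these idealized assumptions the device's contribution vanishes entirely; a real filter with finite stop-band attenuation would leave some residue.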

With the speech processing apparatus 1A, even when the user and the apparatus speak at the same time, only the speech uttered by the user is extracted, so the user's speech can be recognized with high accuracy. This enables a smooth dialogue between the user and the speech processing apparatus 1A. To reliably extract only the user's speech, the speech processing apparatus 1A needs only a simple configuration: the first filter 20 is placed between the microphone 10 and the voice control device 22, the second filter 21 is placed between the voice control device 22 and the speaker 11, and the first specific frequency region of the first filter 20 and the second specific frequency region of the second filter 21 are set so that they do not overlap.

Furthermore, in the speech processing apparatus 1A, the pass bands of the specific frequency regions of the filters 20 and 21 are arranged alternately over all frequencies. As a result, recognition accuracy is improved when speech is recognized, and speech that does not sound unnatural can be produced when speech is output.

The configuration of the speech processing apparatus 1B according to the second embodiment will be described with reference to FIGS. 6 and 7. FIG. 6 is a configuration diagram of the speech processing apparatus according to the second embodiment. FIG. 7 shows an example of filter frequency characteristics adapted to the characteristics of the user's voice, where (a) is the frequency characteristic of the first filter and (b) is the frequency characteristic of the second filter.

The speech processing apparatus 1B is substantially the same as the speech processing apparatus 1A according to the first embodiment, except that it sets the first specific frequency region and the second specific frequency region according to the user taking part in the dialogue. To this end, the speech processing apparatus 1B includes the microphone 10, the speaker 11, and a speech processing unit 12B, and the speech processing unit 12B has a first variable filter 30, a second variable filter 31, a voice feature detection device 32, a specific frequency region database 33, a specific frequency region selection device 34, and a voice control device 35.

In the second embodiment, the first variable filter 30 corresponds to the specific frequency passing means recited in the claims, the second variable filter 31 corresponds to the second specific frequency passing means recited in the claims, the voice feature detection device 32 corresponds to the person specifying means and the feature detection device recited in the claims, and the voice control device 35 corresponds to the speech processing means and the sound generation control device recited in the claims.

Voices differ from person to person, and so do their frequency characteristics. For example, the frequency components of a low-voiced person are concentrated in the low-frequency range, and those of a high-voiced person are concentrated in the high-frequency range. Consequently, when recognizing speech, performing recognition mainly on the components in the frequency range that characterizes the voice yields more information and higher recognition accuracy. The speech processing apparatus 1B therefore identifies the user from the characteristics of the voice and sets the first specific frequency region and the second specific frequency region according to the identified user.

The first variable filter 30 and the second variable filter 31 are filters whose pass bands, spanning low to high frequencies, can be changed according to a first selection signal and a second selection signal, respectively, from the specific frequency region selection device 34. The pass bands of the first variable filter 30 and those of the second variable filter 31 are arranged alternately so that they do not overlap, but their widths vary according to the user.

When a speech signal is input from the microphone 10, the first variable filter 30 passes only the components in the first specific frequency region and transmits the resulting input speech signal to the voice control device 35. Likewise, when the original output speech signal is input from the voice control device 35, the second variable filter 31 passes only the components in the second specific frequency region and transmits the resulting output speech signal to the speaker 11. Until the user is identified, the first variable filter 30 and the second variable filter 31 are set to their initial specific frequency regions (for example, the specific frequency regions of the first filter 20 and the second filter 21 of the first embodiment).

The voice feature detection device 32 receives the input speech signal that has passed through the initial first specific frequency region of the first variable filter 30, detects the voice features from that signal, and identifies the user. To this end, the voice feature detection device 32 stores the voice features (frequency characteristics) of each user who may use the vehicle in which this car navigation system is installed. These features are obtained in advance by frequency analysis of each user's speech signal after it has been thinned out with the initial first specific frequency region. The voice feature detection device 32 compares the frequency characteristics of the input speech signal with the stored voice features of each user and identifies the user who spoke. Once the user is identified, the voice feature detection device 32 transmits information on that user as a user signal to the specific frequency region selection device 34 and the voice control device 35.

The specific frequency region database 33 stores a first specific frequency region and a second specific frequency region for each user who may use the vehicle in which this car navigation system is installed. The first specific frequency region for each user is obtained by frequency analysis of that user's speech signal: based on the analysis result, wide pass bands are placed in the frequency range that characterizes the user's voice and narrow pass bands are placed in the other frequency ranges. For example, for a user whose voice is characterized by low frequencies, wide pass bands are set in the low-frequency range and narrow pass bands in the high-frequency range, as shown in FIG. 7(a). The second specific frequency region, in turn, has stop bands where the first specific frequency region has pass bands, and pass bands where the first specific frequency region has stop bands. For the same low-voiced user, narrow pass bands are set in the low-frequency range and wide pass bands in the high-frequency range, as shown in FIG. 7(b). The specific frequency region database 33 may also be configured to store only the first specific frequency region for each user.

Pass bands are provided in the high-frequency range as well as the low-frequency range even for a user whose voice is characterized by low frequencies, because such a user produces mainly low-frequency sound but also some high-frequency sound. Making the first specific frequency region cover the range from low to high frequencies as the recognition target improves speech recognition accuracy. Similarly, making the second specific frequency region cover the range from low to high frequencies for the output keeps the output speech from sounding unnatural.

On receiving the user signal from the voice feature detection device 32, the specific frequency region selection device 34 looks up the specific frequency region database 33 using the user indicated by the user signal as a key and selects the first and second specific frequency regions for that user. The specific frequency region selection device 34 then transmits a first selection signal indicating the first specific frequency region (information on its multiple pass bands) to the first variable filter 30, and a second selection signal indicating the second specific frequency region (information on its multiple pass bands) to the second variable filter 31. If the specific frequency region database 33 stores only the first specific frequency region for each user, the specific frequency region selection device 34 selects the first specific frequency region for the user and sets the second specific frequency region based on it.
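When only the first specific frequency region is stored, the second can be derived as its complement. The sketch below assumes pass bands are represented as (low, high) pairs in Hz within a range from 0 to `f_max`; this representation, the function `complement_bands`, and the example values are hypothetical and are not taken from the specification.

```python
def complement_bands(pass_bands, f_max):
    """Return the gaps between pass_bands within [0, f_max] as new pass bands."""
    result = []
    cursor = 0
    for low, high in sorted(pass_bands):
        if low > cursor:
            result.append((cursor, low))  # gap before this band
        cursor = max(cursor, high)
    if cursor < f_max:
        result.append((cursor, f_max))    # remaining range above the last band
    return result


# Hypothetical low-voiced user: wide pass bands at low frequencies,
# narrow pass bands at high frequencies.
first_region = [(0, 300), (400, 700), (1000, 1100), (2000, 2100)]
second_region = complement_bands(first_region, f_max=4000)
print(second_region)
```

In this example the derived second region is narrow at low frequencies and wide at high frequencies, matching the relationship between FIG. 7(a) and FIG. 7(b).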

The voice control device 35 has substantially the same configuration as the voice control device 22 according to the first embodiment, but part of its processing depends on the identified user. The voice control device 35 stores dictionary data created in advance by training on speech signals of each item of language information thinned out with the initial first specific frequency region, as well as dictionary data created in advance by training on speech signals thinned out with the first specific frequency region for each user. Until it receives a user signal, the voice control device 35 converts the input speech signal into language information based on the dictionary data for the initial first specific frequency region and recognizes the user's speech. After receiving a user signal, the voice control device 35 converts the input speech signal into language information based on the dictionary data for the user indicated by the user signal and recognizes the user's speech.

Furthermore, on receiving the user signal, the voice control device 35 may compose conversation content tailored to the user indicated by the user signal. To this end, the voice control device 35 stores in advance data such as each user's preferences for the dialogue (for example, dialogue in a female voice or in the Osaka dialect) and user-specific conversation content (for example, Japanese restaurants when the destination is a place to eat, Japanese-style inns when the destination is lodging, and gas stations of a designated brand when the destination is a gas station).

The operation of the speech processing apparatus 1B will be described with reference to FIG. 6. Before the dialogue between the user and the speech processing apparatus 1B starts, the first variable filter 30 is set to the initial first specific frequency region and the second variable filter 31 is set to the initial second specific frequency region. For example, when the user says "destination setting" in order to set a destination, the microphone 10 collects the speech. The collected speech undergoes frequency conversion and is passed through the first variable filter 30 as the original input speech signal. The input speech signal that has passed through the initial first specific frequency region of the first variable filter 30 is transmitted to the voice control device 35 and the voice feature detection device 32. At this point, the voice control device 35 operates as in the first embodiment, using the input speech signal that passed through the initial first specific frequency region and the initial dictionary data. The original output speech signal transmitted from the voice control device 35 is passed through the second variable filter 31, and the output speech signal that has passed through the initial second specific frequency region of the second variable filter 31 is transmitted to the speaker 11. The speaker 11 converts the output speech signal and outputs it as speech.

The voice feature detection device 32 compares the frequency characteristics of the input speech signal that passed through the initial first specific frequency region with the stored frequency characteristics of each user's voice, and detects the one that is similar. The voice feature detection device 32 then identifies the user from the detection result and transmits a user signal indicating that user to the specific frequency region selection device 34 and the voice control device 35.
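The comparison step can be sketched as a nearest-profile search. The specification does not say which similarity measure is used, so cosine similarity is an assumption here, as are the names `identify_user` and `profiles` and the made-up spectra.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two spectra (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify_user(spectrum, profiles):
    """Return the name of the stored profile most similar to the spectrum."""
    return max(profiles, key=lambda name: cosine_similarity(spectrum, profiles[name]))


# Made-up stored frequency characteristics for two registered users.
profiles = {
    "user_a": [9.0, 6.0, 2.0, 1.0],  # energy concentrated at low frequencies
    "user_b": [1.0, 2.0, 6.0, 9.0],  # energy concentrated at high frequencies
}
print(identify_user([8.0, 5.0, 3.0, 1.0], profiles))
```

A practical system would likely also need a threshold for rejecting unregistered speakers, which this sketch omits.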

Based on the user signal, the specific frequency region selection device 34 selects from the specific frequency region database 33 the first and second specific frequency regions for the identified user. The specific frequency region selection device 34 then transmits a first selection signal indicating the selected first specific frequency region to the first variable filter 30 and a second selection signal indicating the second specific frequency region to the second variable filter 31.

Based on the first selection signal, the first variable filter 30 changes its pass bands from the initial first specific frequency region to the first specific frequency region for the user. The original input speech signal is passed through the first variable filter 30, and the input speech signal that has passed through the user-specific first specific frequency region is transmitted to the voice control device 35. Although this input speech signal has been thinned out from the original input speech signal, it mainly contains the frequency components in the range that characterizes the user's voice. The voice control device 35 then operates as in the first embodiment, using this input speech signal and the dictionary data for the user. Because recognition is based on the frequency components that characterize the user, the recognition accuracy of the voice control device 35 is higher than when the input speech signal passed through the initial first specific frequency region is used.

Likewise, based on the second selection signal, the second variable filter 31 changes its pass bands from the initial second specific frequency region to the second specific frequency region for the user. The original output speech signal transmitted from the voice control device 35 is passed through the second variable filter 31, and the output speech signal that has passed through the user-specific second specific frequency region is transmitted to the speaker 11. The speaker 11 converts the output speech signal and outputs it as speech. Because the output speech contains frequency components over the full range from low to high frequencies, it sounds almost entirely natural.

The speech processing apparatus 1B provides the same effects as the speech processing apparatus 1A and, in addition, performs speech recognition that takes the characteristics of the user's voice into account, so speech recognition accuracy is further improved.

The configuration of the speech processing apparatus 1C according to the third embodiment will be described with reference to FIGS. 8 to 11. FIG. 8 is a configuration diagram of the speech processing apparatus according to the third embodiment. FIG. 9 is an explanatory diagram of the processing in the speech processing apparatus of FIG. 8 when the first filter is switched to the sound collection side and the second filter to the speech output side. FIG. 10 is an explanatory diagram of the processing in the speech processing apparatus of FIG. 8 when the second filter is switched to the sound collection side and the first filter to the speech output side. FIG. 11 shows an example of the frequency characteristics of the speech output from the speech processing apparatus of FIG. 8.

The speech processing apparatus 1C is substantially the same as the speech processing apparatus 1A according to the first embodiment, except that it swaps the first filter and the second filter at very short intervals. To this end, the speech processing apparatus 1C includes the microphone 10, the speaker 11, and a speech processing unit 12C, and the speech processing unit 12C has a first filter 40, a second filter 41, a switching timing generation device 42, a first switching control device 43, a second switching control device 44, and a voice control device 45.

In the third embodiment, the first filter 40 or the second filter 41 corresponds to the specific frequency passing means recited in the claims, the first filter 40 or the second filter 41 corresponds to the second specific frequency passing means recited in the claims, and the voice control device 45 corresponds to the speech processing means and the sound generation control device recited in the claims.

The first filter 40 and the second filter 41 are the same as the first filter 20 and the second filter 21 according to the first embodiment. The first filter 40 is connected to the first switching control device 43 on its input side and to the second switching control device 44 on its output side. The second filter 41, in contrast, is connected to the second switching control device 44 on its input side and to the first switching control device 43 on its output side.

The switching timing generation device 42 transmits a switching timing signal to the first switching control device 43, the second switching control device 44, and the voice control device 45 at fixed intervals. The switching timing signal indicates which of the sound collection side and the sound output side each of the first filter 40 and the second filter 41 is to be connected to. The interval at which the switching timing signal is transmitted is determined by experiment and is set to a time short enough that the sound output from the speech processing apparatus 1C does not sound unnatural.

The first switching control device 43 is disposed between the microphone 10 and the speaker 11 on one side and the first filter 40 and the second filter 41 on the other: the microphone 10 and the speaker 11 are connected to one side of it, and the first filter 40 and the second filter 41 to the other side. When the switching timing signal instructs that the first filter 40 be connected to the sound collection side (and/or that the second filter 41 be connected to the sound output side), the first switching control device 43 connects the first filter 40 to the microphone 10 and the second filter 41 to the speaker 11 (see FIG. 9). When the switching timing signal instructs that the first filter 40 be connected to the sound output side (and/or that the second filter 41 be connected to the sound collection side), it connects the second filter 41 to the microphone 10 and the first filter 40 to the speaker 11 (see FIG. 10).

The second switching control device 44 is disposed between the first filter 40 and the second filter 41 on one side and the voice control device 45 on the other: the first filter 40 and the second filter 41 are connected to one side of it, and the input terminal and the output terminal of the voice control device 45 to the other side. When the switching timing signal instructs that the first filter 40 be connected to the sound collection side, the second switching control device 44 connects the first filter 40 to the input terminal of the voice control device 45 and the second filter 41 to the output terminal of the voice control device 45 (see FIG. 9). When the switching timing signal instructs that the first filter 40 be connected to the sound output side, it connects the second filter 41 to the input terminal of the voice control device 45 and the first filter 40 to the output terminal of the voice control device 45 (see FIG. 10).
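The routing performed by the two switching control devices can be sketched as follows. This is a minimal illustration only: the names `Side`, `route`, and `timing_signals` are hypothetical, and the strict alternation of the timing signal is an assumption drawn from the statement that the filters are swapped at fixed short intervals.

```python
from enum import Enum

class Side(Enum):
    COLLECT = "collect"  # first filter 40 on the sound collection side (FIG. 9)
    OUTPUT = "output"    # first filter 40 on the sound output side (FIG. 10)

def route(signal):
    """Return which filter sits on the microphone path and which on the speaker path."""
    if signal is Side.COLLECT:
        return {"mic_path": "filter40", "speaker_path": "filter41"}
    return {"mic_path": "filter41", "speaker_path": "filter40"}

def timing_signals(n, start=Side.COLLECT):
    """Yield n timing signals, alternating once per switching interval (assumed)."""
    current = start
    for _ in range(n):
        yield current
        current = Side.OUTPUT if current is Side.COLLECT else Side.COLLECT

# Routing over four consecutive switching intervals, starting from the FIG. 9 state.
routes = [route(s) for s in timing_signals(4)]
```

In every interval the microphone path and the speaker path use different filters, which is what keeps the collected band and the emitted band apart.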

The voice control device 45 has substantially the same configuration as the voice control device 22 according to the first embodiment, but part of its speech recognition processing differs. Because the first filter 40 and the second filter 41 are swapped in the speech processing apparatus 1C, the input speech signal supplied to the voice control device 45 is the original input speech signal thinned out in a comb pattern, and the frequency bands that are thinned out change at fixed intervals. The dictionary data used to convert the input speech signal into language information must therefore be switched as well. For this purpose, the voice control device 45 stores first dictionary data, created in advance by learning from speech signals of each item of language information thinned out by the first specific frequency region of the first filter 40, and second dictionary data, created in advance by learning from speech signals of each item of language information thinned out by the second specific frequency region of the second filter 41. When the switching timing signal instructs that the first filter 40 be connected to the sound collection side, the voice control device 45 converts the input speech signal into language information based on the first dictionary data. When the switching timing signal instructs that the first filter 40 be connected to the sound output side, it converts the input speech signal into language information based on the second dictionary data. In this way it recognizes the user's speech.
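The dictionary switching described above amounts to choosing the dictionary that matches whichever filter is currently on the sound collection side. A minimal sketch, in which the function names and the frame representation are hypothetical and no actual recognition is performed:

```python
def dictionary_for(first_filter_on_collection_side):
    """Pick the dictionary trained on speech thinned by the active collection-side filter."""
    return "first" if first_filter_on_collection_side else "second"

def recognition_plan(frames):
    """Pair each (frame, first_filter_on_collection_side) entry with a dictionary key."""
    return [(frame, dictionary_for(flag)) for frame, flag in frames]

# Three consecutive switching intervals; the flag mirrors the switching timing signal.
frames = [("frame0", True), ("frame1", False), ("frame2", True)]
plan = recognition_plan(frames)
```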

The operation of the speech processing apparatus 1C will be described with reference to FIGS. 8 to 11. In the initial state, the first switching control device 43 connects the first filter 40 to the microphone 10 and the second filter 41 to the speaker 11, and the second switching control device 44 connects the first filter 40 to the input terminal of the voice control device 45 and the second filter 41 to the output terminal of the voice control device 45 (see FIG. 9).

When a dialogue between the user and the speech processing apparatus 1C starts, the switching timing generation device 42 generates, at very short intervals, a switching timing signal indicating the connection destinations of the first filter 40 and the second filter 41, and transmits it to the first switching control device 43, the second switching control device 44, and the voice control device 45.

When the switching timing signal instructs that the first filter 40 be connected to the sound output side, the first switching control device 43 switches the connection destination of the first filter 40 from the microphone 10 to the speaker 11 and that of the second filter 41 from the speaker 11 to the microphone 10. The second switching control device 44 switches the connection destination of the first filter 40 from the input terminal to the output terminal of the voice control device 45 and that of the second filter 41 from the output terminal to the input terminal of the voice control device 45 (see FIG. 10).

The original input speech signal, collected by the microphone 10 and frequency-converted, is transmitted to the first switching control device 43 and passed through the second filter 41 via the first switching control device 43. The input speech signal that has passed through the second specific frequency region of the second filter 41 is transmitted to the voice control device 45 via the second switching control device 44.

At this time, the voice control device 45 performs speech recognition using the input speech signal that has passed through the second specific frequency region together with the second dictionary data. The voice control device 45 then operates in the same way as in the first embodiment and transmits the original output speech signal to the second switching control device 44.

The original output speech signal transmitted from the voice control device 45 is passed through the first filter 40 via the second switching control device 44. The output speech signal that has passed through the first specific frequency region of the first filter 40 is transmitted to the speaker 11. The speaker 11 converts the output speech signal and outputs it as sound.

When the switching timing signal instructs that the first filter 40 be connected to the sound collection side, the first switching control device 43 switches the connection destination of the first filter 40 from the speaker 11 to the microphone 10 and that of the second filter 41 from the microphone 10 to the speaker 11. The second switching control device 44 switches the connection destination of the first filter 40 from the output terminal to the input terminal of the voice control device 45 and that of the second filter 41 from the input terminal to the output terminal of the voice control device 45 (see FIG. 9).

The original input speech signal, collected by the microphone 10 and frequency-converted, is transmitted to the first switching control device 43 and passed through the first filter 40 via the first switching control device 43. The input speech signal that has passed through the first specific frequency region of the first filter 40 is transmitted to the voice control device 45 via the second switching control device 44.

At this time, the voice control device 45 performs speech recognition using the input speech signal that has passed through the first specific frequency region together with the first dictionary data. The voice control device 45 then operates in the same way as in the first embodiment and transmits the original output speech signal to the second switching control device 44.

The original output speech signal transmitted from the voice control device 45 is passed through the second filter 41 via the second switching control device 44. The output speech signal that has passed through the second specific frequency region of the second filter 41 is transmitted to the speaker 11. The speaker 11 converts the output speech signal and outputs it as sound.

In the speech processing apparatus 1C, the above operations are repeated at very short intervals. The original output speech signal from the voice control device 45 therefore passes through the first filter 40 and the second filter 41 alternately, following each swap of the two filters. If the switching is performed fast enough, the first filter 40 and the second filter 41 are swapped before the frequency components of the original output speech signal generated by the voice control device 45 have changed appreciably. As shown in FIG. 11, the original output speech signal OS output at a given instant passes through the second filter 41 and becomes the output speech signal OS2, thinned out by the second specific frequency region, and the speaker 11 outputs sound corresponding to OS2. The original output speech signal output at the next instant has hardly changed from OS; it passes through the first filter 40 and becomes the output speech signal OS1, thinned out by the first specific frequency region, and the speaker 11 outputs sound corresponding to OS1. As a result, what the user hears corresponds to a speech signal OS3 in which the output speech signals OS1 and OS2 are combined. The speech signal OS3 is substantially the same as the original output speech signal containing all frequency components.
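Why OS1 and OS2 together approximate OS can be illustrated with two complementary comb masks over frequency bins. This sketch makes simplifying assumptions that are not in the source: the spectrum is taken to be identical in the two consecutive intervals, the perceived combination is modeled as a plain sum of the two filtered spectra, and the bin count, band width, and magnitudes are made-up values.

```python
def comb_mask(n_bins, band_width, offset):
    """Return 1 for passed bins and 0 for blocked bins, alternating every band_width bins."""
    return [1 if ((k // band_width) + offset) % 2 == 0 else 0 for k in range(n_bins)]

n_bins = 16
mask1 = comb_mask(n_bins, band_width=2, offset=0)  # first specific frequency region
mask2 = comb_mask(n_bins, band_width=2, offset=1)  # second specific frequency region

# Made-up magnitude spectrum standing in for the original output speech signal OS.
os_spectrum = [float(k + 1) for k in range(n_bins)]

os1 = [m * x for m, x in zip(mask1, os_spectrum)]  # OS after the first filter
os2 = [m * x for m, x in zip(mask2, os_spectrum)]  # OS after the second filter
os3 = [a + b for a, b in zip(os1, os2)]            # combination heard by the user
```

Because every bin is passed by exactly one of the two masks, `os3` equals `os_spectrum` under these assumptions. In the real apparatus the match is only approximate, since the signal changes slightly between intervals.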

On the input side as well, speech recognition is performed alternately on the input speech signal thinned out by the first specific frequency region and the input speech signal thinned out by the second specific frequency region, which is nearly equivalent to performing speech recognition on the original input speech signal containing all frequency components.

According to the speech processing apparatus 1C, the same effects as those of the speech processing apparatus 1A are obtained; in addition, sound that does not seem unnatural to the user can be output, and speech recognition accuracy is improved.

Embodiments of the present invention have been described above, but the present invention is not limited to these embodiments and can be implemented in various forms.

For example, although the embodiments are applied to a car navigation system, the invention is also applicable to other systems that perform interactive speech processing, such as a robot capable of conversing with a user. In the embodiments the voice control device performs speech recognition, conversation composition, and utterance speech generation, but the processing performed may be changed according to the purpose of the system to which the invention is applied. The invention is applicable not only when the system outputs a single type of voice but also when it outputs several types, for example when several personal robots each have a different tone of voice, or when several agents exist in a car navigation system and each has a different tone of voice.

In the embodiments the second filter is provided between the voice control device and the speaker. Alternatively, the second filter may be omitted and the voice control device may be configured to generate sound consisting only of frequency components in the second specific frequency region, equivalent to sound that has passed through the second filter.

In the embodiments, conversational speech for dialogue with the user is used as the sound output from the speech processing apparatus, but the invention is applicable to various sounds, such as mechanical sounds (for example, a buzz or a beep) and music.

In the embodiments the first specific frequency region and the second specific frequency region are arranged alternately at a predetermined frequency interval, but the first specific frequency region or the second specific frequency region may instead be grouped into particular frequency bands. For example, a mechanical sound such as a buzz or a beep occupies a specific narrow frequency band, so the second specific frequency region is placed in that band and the first specific frequency region is placed in the remaining bands.
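The grouped arrangement for a narrow-band mechanical sound can be sketched as taking the complement of the beep band. The band edges and the overall range below are hypothetical values chosen only for illustration; the source gives no figures.

```python
def complement(band, full=(0, 8000)):
    """Return the parts of the full range (in Hz) that lie outside the given (low, high) band."""
    low, high = band
    regions = []
    if full[0] < low:
        regions.append((full[0], low))
    if high < full[1]:
        regions.append((high, full[1]))
    return regions

# Hypothetical beep band used as the second specific frequency region.
second_region = (950, 1050)
# Everything else becomes the first specific frequency region (sound collection side).
first_region = complement(second_region)
```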

In the first embodiment the pass bands and stop bands of the first filter and the second filter are arranged alternately with equal widths, but the widths may differ. Moreover, as long as the pass bands of the two filters do not overlap, their stop bands may overlap.
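The only constraint stated here is that the pass bands of the two filters must not overlap. A small check of that condition, using hypothetical band edges in Hz and treating each band as a half-open interval (both are assumptions of this sketch):

```python
def passbands_overlap(bands_a, bands_b):
    """Return True if any (low, high) band in bands_a intersects any band in bands_b."""
    return any(low_a < high_b and low_b < high_a
               for low_a, high_a in bands_a
               for low_b, high_b in bands_b)

# Unequal widths, with 300-400 Hz and 600-700 Hz blocked by both filters.
first_pass = [(100, 300), (700, 900)]
second_pass = [(400, 600), (1000, 1500)]
valid = not passbands_overlap(first_pass, second_pass)

# A second filter whose pass band intrudes on 100-300 Hz violates the constraint.
invalid = passbands_overlap(first_pass, [(250, 450)])
```

The gaps blocked by both filters correspond to the permitted case in which stop bands overlap.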

In the second embodiment a person is identified by detecting features of the collected speech, but the person may be identified by various other methods: for example, by having an ID number assigned to each individual entered and collating it, by capturing an image of the person's face and recognizing the face image, or by using a fingerprint or an iris.

FIG. 1 is an overall configuration diagram of the speech processing apparatus according to the present embodiment.
FIG. 2 is a configuration diagram of the speech processing apparatus according to the first embodiment.
FIG. 3 is an explanatory diagram of the processing in the speech processing apparatus of FIG. 1 when only the user utters speech.
FIG. 4 is a diagram showing the speech signal collected by the microphone when the speech uttered by the user and the sound output by the speech processing apparatus overlap.
FIG. 5 is an explanatory diagram of the processing in the speech processing apparatus of FIG. 1 when the speech uttered by the user and the sound output by the speech processing apparatus overlap.
FIG. 6 is a configuration diagram of the speech processing apparatus according to the second embodiment.
FIG. 7 shows an example of filter frequency characteristics set according to the characteristics of the user's voice, where (a) is the frequency characteristic of the first filter and (b) is the frequency characteristic of the second filter.
FIG. 8 is a configuration diagram of the speech processing apparatus according to the third embodiment.
FIG. 9 is an explanatory diagram of the processing in the speech processing apparatus of FIG. 8 when the first filter is switched to the sound collection side and the second filter to the sound output side.
FIG. 10 is an explanatory diagram of the processing in the speech processing apparatus of FIG. 8 when the second filter is switched to the sound collection side and the first filter to the sound output side.
FIG. 11 shows an example of the frequency characteristics of the sound output from the speech processing apparatus of FIG. 8.

Explanation of symbols

1, 1A, 1B, 1C: speech processing apparatus; 10: microphone; 11: speaker; 12, 12A, 12B, 12C: speech processing unit; 20, 40: first filter; 21, 41: second filter; 22, 35, 45: voice control device; 30: first variable filter; 31: second variable filter; 32: voice feature detection device; 33: specific frequency region database; 34: specific frequency region selection device; 42: switching timing generation device; 43: first switching control device; 44: second switching control device

Claims

A speech processing apparatus comprising:
sound collecting means;
specific frequency passing means for passing a specific frequency region of the sound collected by the sound collecting means;
speech processing means for performing speech processing based on the sound that has passed through the specific frequency passing means; and
sound generating means for generating sound based on a signal from a sound generation control device,
wherein the specific frequency region of the specific frequency passing means is set to a frequency region different from the frequency region of the sound generated by the sound generating means.

The speech processing apparatus according to claim 1, further comprising, between the sound generation control device and the sound generating means, second specific frequency passing means for passing a second specific frequency region of the sound based on the signal from the sound generation control device,
wherein the specific frequency region of the specific frequency passing means and the second specific frequency region of the second specific frequency passing means are set to different frequency regions.

The audio processing apparatus according to claim 2, wherein the specific frequency region in the specific frequency passing means and the second specific frequency region in the second specific frequency passing means are alternately set.

The speech processing apparatus according to any one of claims 1 to 3, further comprising person specifying means for specifying the person who uttered the sound collected by the sound collecting means,
wherein the specific frequency region of the specific frequency passing means is changed according to the characteristics of the voice of the person specified by the person specifying means.

The voice processing apparatus according to claim 4, wherein the person specifying unit is a feature detection unit that detects a feature of the voice collected by the sound collection unit.

The speech processing apparatus according to claim 2, wherein the specific frequency region of the specific frequency passing means and the second specific frequency region of the second specific frequency passing means are switched with the passage of time.

A speech processing method comprising:
a sound collecting step;
a specific frequency passing step of passing a specific frequency region of the sound collected in the sound collecting step;
a speech processing step of performing speech processing based on the sound that has passed through the specific frequency passing step; and
a sound generating step of generating sound based on a signal from a sound generation control device,
wherein the specific frequency region in the specific frequency passing step is set to a frequency region different from the frequency region of the sound generated in the sound generating step.

The speech processing method according to claim 7, further comprising a second specific frequency passing step of passing a second specific frequency region of the sound based on the signal from the sound generation control device,
wherein the specific frequency region in the specific frequency passing step and the second specific frequency region in the second specific frequency passing step are set to different frequency regions.

The audio processing method according to claim 8, wherein the specific frequency region in the specific frequency passing step and the second specific frequency region in the second specific frequency passing step are alternately set.

The speech processing method according to any one of claims 7 to 9, further comprising a person specifying step of specifying the person who uttered the sound collected in the sound collecting step,
wherein the specific frequency region in the specific frequency passing step is changed according to the characteristics of the voice of the person specified in the person specifying step.

The voice processing method according to claim 10, wherein the person specifying step detects a feature of the voice collected in the sound collecting step.

The speech processing method according to claim 8, wherein the specific frequency region in the specific frequency passing step and the second specific frequency region in the second specific frequency passing step are switched with the passage of time.