JP2005157086A

JP2005157086A - Speech recognition device

Info

Publication number: JP2005157086A
Application number: JP2003397451A
Authority: JP
Inventors: Koichiro Mizushima; 考一郎水島
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 2003-11-27
Filing date: 2003-11-27
Publication date: 2005-06-16

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition device capable of reducing misrecognition even when the voices of a plurality of speakers are present.

SOLUTION: The speech recognition device comprises: microphones 101a and 101b which pick up a voice that a speaker speaks; amplifiers 102a and 102b which amplify it; a speaker's voice emphasis part 403 which emphasizes the speaker's voice; power calculators 107a and 107b provided to judge whether a detected speaker's voice is a portion of a conversation; speech detectors 108a and 108b; a speech recognition part 113 which recognizes a detected speaker's voice when a conversation detector 109 or the speech detector 108a or 108b detects the speaker's voice and outputs a predetermined signal when the word recognized by the speech recognition matches a predetermined keyword; and a recognition controller 110 which deters the speech recognition part 113 from performing keyword detection when the conversation detector 109 detects a conversation.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to a speech recognition apparatus that recognizes the speech of unspecified speakers in order to, for example, control other connected apparatuses.

Conventionally, as a speech recognition apparatus designed for a plurality of speakers, an apparatus is known in which a plurality of microphones are placed near the respective speakers and the signals picked up through these microphones are used (see, for example, Patent Document 1). FIG. 6 is a conceptual diagram showing the configuration of the conventional speech recognition apparatus disclosed in Patent Document 1. The conventional speech recognition apparatus shown in FIG. 6 operates as described below.

First, speaker voice, that is, the voice of a speaker, is picked up by microphones 601a to 601c and converted by them into picked-up signals, which are electrical signals. The picked-up signals obtained by the microphones 601a to 601c are amplified by amplifiers 602a to 602c and then band-limited by filters 603a to 603c, respectively. A comparator 604 then detects which of the filters 603a to 603c outputs the highest-level band-limited signal (hereinafter, this filter is assumed to be filter 603a). A CPU (Central Processing Unit) 607 controls a voice switching unit 605 according to the detection result so that this signal becomes the "voice input" to be recognized, and a speech recognition unit 606 performs speech recognition on the voice input. The signals from the microphones 601b and 601c, that is, the microphones other than the one corresponding to the filter 603a that output the voice input (in this case, microphone 601a), were judged to be "noise input", meaning signals not subject to speech recognition.

Further, the spectrum of the noise input was subtracted from the spectrum of the voice input to reduce the noise component contained in the voice input, and the signal with the noise component removed was input to the speech recognition unit 606. The speech recognition unit 606 detected whether the voice input matched a predetermined keyword, so that speech recognition could be started without using a speech recognition start switch.

Patent Document 1: JP-A-11-52976 (pages 1-5, FIG. 1)
Non-Patent Document 1: The Institute of Electronics, Information and Communication Engineers (ed.), "Acoustic Systems and Digital Processing", Chapter 7, "Signal Processing in Microphone Systems", pp. 173-218

However, because such a conventional speech recognition apparatus performs keyword detection at all times while waiting for speech recognition, it also performs keyword detection on conversations between speakers, for which keyword detection is unnecessary, and misrecognition occurs as a result.

Moreover, during speech recognition, misrecognition occurs when another speaker's voice is superimposed on the speaker voice to be recognized. Because the apparatus cannot determine whether a given misrecognition was caused by such superimposition, it also could not tell the speaker that speech recognition had failed because another speaker's voice was superimposed. As a result, misrecognition recurred even when speech recognition was repeated.

The present invention has been made to solve these problems, and provides a speech recognition apparatus capable of reducing the occurrence of misrecognition even when the voices of a plurality of speakers are present.

The speech recognition apparatus of the present invention comprises: a plurality of sound pickup means for picking up sound that includes speaker voice uttered by a speaker; speaker voice enhancement means for enhancing the speaker voice candidates of a plurality of speakers when the sound picked up by each sound pickup means includes speaker voice candidates of a plurality of speakers; power calculation means for calculating the power of the speaker voice candidates enhanced by the speaker voice enhancement means; speech detection means for detecting speaker voice among the speaker voice candidates on the basis of the power calculated by the power calculation means; conversation detection means for detecting whether the speech detection means has detected the speaker voices of a plurality of speakers within a fixed period in the past; speech recognition means for performing speech recognition on the detected speaker voice when the speech detection means detects speaker voice, and for outputting a predetermined signal when a word recognized by the speech recognition matches a predetermined keyword; and recognition control means for, when the conversation detection means detects that the speech detection means has detected the speaker voices of a plurality of speakers within the fixed period in the past, suppressing the speech recognition means from performing speech recognition on speaker voice uttered within the fixed period in which the speaker voices of the plurality of speakers were detected.

With this configuration, the conversation detection means is used to determine whether a detected speaker voice is part of a conversation, and keyword detection is not performed when the speaker voice is judged to be part of a conversation. A speech recognition apparatus capable of reducing the occurrence of misrecognition even when the voices of a plurality of speakers are present can therefore be realized.
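The conversation check can be illustrated with a minimal Python sketch. It is not the patented implementation: the class name `ConversationDetector`, the 5-second window, and the speaker labels are assumptions introduced here, and speech events are assumed to be reported in time order.

```python
from collections import deque


class ConversationDetector:
    """Hypothetical sketch: flags a conversation when two or more
    distinct speakers were detected within the past `window_s` seconds."""

    def __init__(self, window_s=5.0):  # window length is an assumed value
        self.window_s = window_s
        self.events = deque()  # (time_s, speaker_id), oldest first

    def report_speech(self, time_s, speaker_id):
        # Called whenever a speech detector reports speaker voice.
        self.events.append((time_s, speaker_id))

    def conversation_detected(self, now_s):
        # Drop detections older than the look-back window.
        while self.events and now_s - self.events[0][0] > self.window_s:
            self.events.popleft()
        speakers = {speaker for _, speaker in self.events}
        return len(speakers) >= 2


def keyword_detection_allowed(detector, now_s):
    # Recognition control: suppress keyword detection during a conversation.
    return not detector.conversation_detected(now_s)
```

A caller would report each detection with `report_speech` and consult `keyword_detection_allowed` before passing audio to the recognizer.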

The speech recognition apparatus of the present invention further comprises guidance output means for generating, when the speech recognition means did not perform speech recognition on the speaker voice because the conversation detection means detected that the speaker voices of a plurality of speakers were present within the fixed period in the past, a signal notifying that speech recognition was not performed because of the presence of the speaker voices of a plurality of speakers.

With this configuration, the guidance output means makes it possible to report that a conversation was detected and that speech recognition was therefore not performed, so a speech recognition apparatus with improved operability can be realized.

In another aspect, the speech recognition apparatus of the present invention comprises: a plurality of sound pickup means for picking up sound that includes speaker voice uttered by a speaker and generating picked-up signals; speaker voice enhancement means for enhancing the speaker voice candidates of a plurality of speakers when the sound picked up by each sound pickup means includes speaker voice candidates of a plurality of speakers; correlation calculation means for calculating correlation values between the picked-up signals generated by the plurality of sound pickup means while varying the delay time between the picked-up signals; speaker direction detection means for identifying a speaker by detecting the direction in which the speaker is located, on the basis of whether the delay time that maximizes the correlation value between the picked-up signals falls within a predetermined time range, and for detecting speaker voice among the speaker voice candidates of the identified speaker; conversation detection means for detecting whether the speaker direction detection means has detected the speaker voices of a plurality of speakers within a fixed period in the past; speech recognition means for performing speech recognition on the detected speaker voice when the speaker direction detection means detects speaker voice, and for outputting a predetermined signal when a word recognized by the speech recognition matches a predetermined keyword; and recognition control means for, when the conversation detection means detects that the speaker direction detection means has detected the speaker voices of a plurality of speakers within the fixed period in the past, suppressing the speech recognition means from performing speech recognition on speaker voice uttered within the fixed period in which the speaker voices of the plurality of speakers were detected.

With this configuration, the conversation detection means determines whether a detected speaker voice is part of a conversation, and keyword detection is not performed when the speaker voice is judged to be part of a conversation. A speech recognition apparatus can therefore be realized that reduces the occurrence of misrecognition even when the voices of a plurality of speakers are present and that also identifies the speaker more accurately.
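The correlation-based direction detection can be illustrated with a minimal Python sketch, assuming integer-sample lags and a plain (unnormalized) cross-correlation. The function names, the lag ranges, and the speaker labels are hypothetical and are not taken from the source.

```python
def best_lag(x, y, max_lag):
    """Return the lag d (in samples) that maximizes sum(x[n] * y[n + d]).

    A positive result means y is a delayed copy of x.
    """
    best_d, best_val = 0, float("-inf")
    for d in range(-max_lag, max_lag + 1):
        val = 0.0
        for n in range(len(x)):
            m = n + d
            if 0 <= m < len(y):
                val += x[n] * y[m]
        if val > best_val:
            best_d, best_val = d, val
    return best_d


def speaker_from_lag(lag, lag_ranges):
    """Map a lag to a speaker if it falls inside that speaker's range.

    lag_ranges is an assumed calibration table: name -> (low, high).
    """
    for name, (low, high) in lag_ranges.items():
        if low <= lag <= high:
            return name
    return None  # lag matches no known speaker direction
```

In this sketch a sound that reaches the first microphone before the second gives a positive lag, so each seat would be assigned its own lag range during calibration.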

The present invention can provide a speech recognition apparatus capable of reducing the occurrence of misrecognition even when the voices of a plurality of speakers are present.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

(First embodiment)
FIG. 1 is a conceptual diagram showing the block configuration of the speech recognition apparatus according to the first embodiment of the present invention. In FIG. 1, the speech recognition apparatus 100 performs speech recognition inside the cabin of a vehicle 200 and outputs to the outside a signal corresponding to the recognized keyword. It includes: a plurality of sound pickup means for picking up sound that includes speaker voice uttered by a speaker (FIG. 1 shows one sound pickup means consisting of a microphone 101a and an amplifier 102a, and another consisting of a microphone 101b and an amplifier 102b); speaker voice enhancement means 103 (shown in FIG. 1 as, and hereinafter called, the speaker voice enhancement unit) for enhancing the speaker voice candidate of one speaker for each sound pickup means when the sound picked up by each sound pickup means includes speaker voice candidates of a plurality of speakers; power calculation means 107a and 107b (shown in FIG. 1 as, and hereinafter called, the power calculators) for calculating the power of the speaker voice candidates enhanced by the speaker voice enhancement unit 103; speech detection means 108a and 108b (shown in FIG. 1 as, and hereinafter called, the speech detectors) for detecting speaker voice among the speaker voice candidates on the basis of the power calculated by the power calculators 107a and 107b; conversation detection means 109 (shown in FIG. 1 as, and hereinafter called, the conversation detector) for detecting whether the speech detectors 108a and 108b have detected the speaker voices of a plurality of speakers within a fixed period in the past; speech recognition means 113 (shown in FIG. 1 as, and hereinafter called, the speech recognition unit) for performing speech recognition on the detected speaker voice when the speech detectors 108a and 108b detect speaker voice, and for outputting a predetermined signal when a word recognized by the speech recognition matches a predetermined keyword; and recognition control means 110 (shown in FIG. 1 as, and hereinafter called, the recognition controller) for, when the conversation detector 109 detects that the speech detectors 108a and 108b have detected the speaker voices of a plurality of speakers within the fixed period in the past, suppressing the speech recognition unit 113 from performing speech recognition on speaker voice uttered within the fixed period in which the speaker voices of the plurality of speakers were detected.

The microphones 101a and 101b constituting the sound pickup means are installed, for example, in the vehicle 200 at positions where they can readily pick up the speaker voice of a person sitting in each seat. They pick up sound that includes the speaker voice uttered by each speaker and generate picked-up signals. The amplifiers 102a and 102b constituting the sound pickup means amplify the picked-up signals output from the microphones 101a and 101b to generate amplified picked-up signals. However, when the level of the picked-up signals output from the microphones 101a and 101b is sufficiently high, the amplifiers 102a and 102b may be omitted.

The speaker voice enhancement unit 103 further includes: a delay unit 104a that delays the amplified picked-up signal output from the amplifier 102a to generate a delayed signal; a delay unit 104b that delays the amplified picked-up signal output from the amplifier 102b to generate a delayed signal; an adder 105a that adds the amplified picked-up signal from the amplifier 102a and the delayed signal from the delay unit 104b; an adder 105b that adds the amplified picked-up signal from the amplifier 102b and the delayed signal from the delay unit 104a; an equalizer 106a that changes the frequency spectrum of the output signal of the adder 105a according to a predetermined rule; and an equalizer 106b that changes the frequency spectrum of the output signal of the adder 105b according to a predetermined rule. The output of the equalizer 106a is input to the power calculator 107a and a buffer 112, and the output of the equalizer 106b is input to the power calculator 107b and the buffer 112.

The adder 105a inverts the sign of the delayed signal generated by the delay unit 104b and adds it to the amplified picked-up signal generated by the amplifier 102a; likewise, the adder 105b inverts the sign of the delayed signal generated by the delay unit 104a and adds it to the amplified picked-up signal generated by the amplifier 102b (hereinafter, the addition process). This removes the delayed component and enhances a specific speaker voice (for example, the speaker voice of the speaker closest to the microphone). The equalizers 106a and 106b are provided to restore the low-frequency components, whose level is reduced by the addition process, to (or close to) the original frequency spectrum.

As shown in FIG. 1, the speech recognition apparatus 100 may further include a guidance output unit 114 that outputs a predetermined message or the like under the control of the recognition controller 110. When the speech recognition unit 113 did not perform speech recognition on the speaker voice because the conversation detector 109 detected that the speaker voices of a plurality of speakers were present within the fixed period in the past (that is, that a conversation took place), the guidance output unit 114 may generate a signal notifying that speech recognition was not performed because of the presence of the speaker voices of a plurality of speakers. In this case, an amplifier 115 and a loudspeaker 116 may be connected to the speech recognition apparatus 100 so that the output signal from the guidance output unit 114 is amplified by the amplifier 115 and output through the loudspeaker 116.

As shown in FIG. 1, the speech recognition apparatus 100 may further include the buffer 112, which stores the output signals from the speaker voice enhancement unit 103; an input selector 111, which selects data stored in the buffer 112 under the control of the recognition controller 110; and the guidance output unit 114, which outputs a predetermined message or the like under the control of the recognition controller 110. In this arrangement, data stored in the buffer 112 is selected under the control of the recognition controller 110, and the speech recognition unit 113 performs speech recognition on the selected data.

The operation of the speech recognition apparatus 100 according to the first embodiment of the present invention will now be described with reference to FIGS. 2 and 3. As shown in FIG. 2, the speech recognition apparatus 100 operates in two modes: a standby mode and a speech recognition mode. In the standby mode S210, the apparatus detects that a keyword serving as the trigger for starting speech recognition has been uttered. Speech recognition can thus be started by utterance alone, without requiring the user to signal the start by any other means or method, such as pressing a speech recognition start button.

The speech recognition mode S220 is the mode entered when the keyword that triggers the start of speech recognition is detected in the standby mode S210, and is the mode in which voice commands can be recognized. When operation in the speech recognition mode S220 ends, the speech recognition apparatus 100 returns to the standby mode S210.
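The two-mode behavior can be sketched as a small state machine. This is an illustrative Python sketch only: the names `Mode` and `next_mode` and the two boolean flags are assumptions introduced here, with the step labels S210 and S220 taken from the description.

```python
from enum import Enum


class Mode(Enum):
    STANDBY = "S210"      # waiting for the trigger keyword
    RECOGNITION = "S220"  # recognizing voice commands


def next_mode(mode, keyword_detected=False, recognition_finished=False):
    """Return the mode that follows `mode` given the latest events."""
    if mode is Mode.STANDBY and keyword_detected:
        return Mode.RECOGNITION
    if mode is Mode.RECOGNITION and recognition_finished:
        return Mode.STANDBY
    return mode  # no transition: stay in the current mode
```

Keeping the transition logic in one pure function makes it easy to check that no event other than the keyword can move the apparatus out of standby.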

In the standby mode S210, the recognition controller 110 first determines whether an end condition is satisfied (S211). Here, the end condition is a condition, defined in advance, that must be satisfied for termination processing to be performed, such as a command to power off the speech recognition apparatus. If it is determined in step S211 that the end condition is not satisfied, the process proceeds to step S212.

If it is determined in step S211 that the end condition is not satisfied, the speaker voice candidates in the picked-up signals generated by the microphone 101a and the microphone 101b are enhanced (S212). The operation performed in step S212 is hereinafter called the "speaker voice enhancement operation"; its details are described with reference to FIG. 3.

First, speaker voice is picked up and picked-up signals are generated by the microphones 101a and 101b (S301). The speaker voices uttered by speaker A and speaker B in the vehicle 200 are superimposed on each other and picked up by both the microphone 101a and the microphone 101b. When sound sources other than speaker voice are present, such as vehicle running noise, the microphones 101a and 101b also pick up the sound from those sources.

The picked-up signals are amplified to a predetermined signal level by the amplifier 102a and the amplifier 102b. The signals obtained immediately after the microphones 101a and 101b pick up speaker voice are generally analog, but the subsequent signal processing is more efficient when it is performed on digital signals. In the following, the picked-up signals are therefore assumed to be output as digital signals. A so-called AD (Analogue to Digital) converter can be used for the conversion from analog to digital; AD converters are well known and are not described here. The sampling frequency for the conversion need only cover the frequency band of speech, and a range of 10 to 48 kHz is generally used.

Once the picked-up signals of the speaker voice have been generated in step S301, delay addition processing is performed to give directivity toward a specific speaker (S302). In the configuration of the present invention, this processing is performed by having the delay units 104a and 104b delay the picked-up signals (or amplified picked-up signals) to generate delayed signals, inverting the sign of the delayed signals, and superimposing them on the picked-up signals (or amplified picked-up signals).

The method of giving directivity is not limited to the one above; for example, several of the methods described in Non-Patent Document 1 can be applied. Specifically, these include the so-called delay-and-sum array, in which a plurality of microphones are arranged and the sound arriving from a specific direction is brought into phase, added, and thereby enhanced. In the first embodiment of the present invention, the operation is described using a subtractive array, in which the delay times of the picked-up signals from a plurality of microphones are set so that signals arriving from a specific direction cancel each other, thereby enhancing sound from other directions.

The adder 105a of the speaker voice enhancement unit 103 enhances the signal derived from sound arriving from the direction of speaker A (hereinafter, the A-speaker arrival signal) by canceling the signal derived from sound arriving from the direction of speaker B (hereinafter, the B-speaker arrival signal). Because the distance between speaker B and the microphone 101a is longer than the distance between speaker B and the microphone 101b, the voice uttered by speaker B reaches the microphone 101b first and reaches the microphone 101a later.

Accordingly, the difference between the times at which the voice uttered by speaker B reaches the microphone 101a and the microphone 101b is set in the delay unit 104b, so that the time difference between the picked-up signals of speaker B's voice from the microphones 101a and 101b is eliminated at the input of the adder 105a. The adder 105a then adds the polarity-inverted output signal of the delay unit 104b to the output signal of the amplifier 102a. As a result, the B-speaker arrival signal is canceled in the output signal of the adder 105a, whereas the A-speaker arrival signal is canceled to a lesser degree, so the A-speaker arrival signal is enhanced relative to the B-speaker arrival signal. The delay unit may be implemented with an FIR (Finite Impulse Response) filter as described in Non-Patent Document 1; a first-order or second-order all-pass IIR (Infinite Impulse Response) filter is also advantageous because of its small computational cost.
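The cancellation performed by the adder 105a can be illustrated with a minimal Python sketch. It assumes an integer-sample delay and equal signal levels at both microphones, which a real delay unit (FIR or all-pass IIR) would not be limited to; the function names are hypothetical.

```python
def delay(signal, d):
    """Delay `signal` by d >= 0 samples, zero-padded to the same length."""
    n = len(signal)
    d = min(d, n)
    return [0.0] * d + list(signal[: n - d])


def subtractive_beam(primary, other, d):
    """Subtract the delayed `other` microphone signal from `primary`.

    With d equal to the inter-microphone delay of the unwanted speaker,
    that speaker's component cancels (models delay unit 104b, the sign
    inversion, and adder 105a under the stated assumptions).
    """
    delayed = delay(other, d)
    return [p - q for p, q in zip(primary, delayed)]
```

For example, if speaker B's voice reaches microphone 101b two samples before microphone 101a, calling `subtractive_beam(mic_a, mic_b, 2)` cancels B's component while leaving a non-zero residue of speaker A's voice.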

The adder 105b enhances the B-speaker arrival signal by canceling the A-speaker arrival signal. The processing is the same as that performed by the adder 105a described above, except that the signal to be canceled is the A-speaker arrival signal. As a result, the A-speaker arrival signal is canceled in the output signal of the adder 105b, whereas the B-speaker arrival signal is largely preserved, so the B-speaker arrival signal is enhanced relative to the A-speaker arrival signal. The processing from step S302 onward can be carried out sample by sample, but frame processing, in which the processing is carried out for each frame of fixed length, requires less computation and is advantageous. The following description therefore assumes frame processing.

Once the delay addition has been performed in step S302 and the speaker voices from speakers A and B have been enhanced, the equalizer 106a performs processing to return the frequency spectra of the speaker voices of speakers A and B, which were changed by the delay addition, to their original frequency spectra (hereinafter, the "equalizing" process) (S303). For example, the equalizing process corrects the change in the frequency spectrum of the A-speaker arrival signal.

This correction is needed because, among other reasons, subtracting the output signal of the microphone 101b from the output signal of the microphone 101a in the adder 105a gives the A-speaker arrival signal, in the output signal of the adder 105a, a frequency spectrum whose sensitivity decreases toward lower frequencies. To correct this change in the frequency spectrum, the equalizer 106a raises the level more at lower frequencies, flattening the frequency characteristic of the sensitivity to sound arriving from the direction of speaker A. The equalizer 106a may be implemented by dividing the input signal into signals in a plurality of frequency bands, adjusting the gain of each band, and then adding the band signals together, but the usual method is to correct the frequency spectrum with an IIR filter of about second order.
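The low-frequency loss can be made explicit with a short derivation. The symbols are assumptions introduced for illustration and do not appear in the source: s(t) is speaker A's voice as received at the microphone 101a, τ_A is the additional travel time of that voice to the microphone 101b, τ_B is the delay set in the delay unit 104b, and level differences between the two microphones are ignored.

```latex
\begin{aligned}
y_a(t) &= s(t) - s\bigl(t - (\tau_A + \tau_B)\bigr), \\
\lvert H_A(\omega) \rvert &= \bigl\lvert 1 - e^{-j\omega(\tau_A + \tau_B)} \bigr\rvert
 = 2\left\lvert \sin\frac{\omega(\tau_A + \tau_B)}{2} \right\rvert
 \approx \omega(\tau_A + \tau_B) \quad \text{for } \omega(\tau_A + \tau_B) \ll 1 .
\end{aligned}
```

Under these assumptions the sensitivity toward speaker A falls roughly in proportion to frequency at the low end (about 6 dB per octave), which is the slope an equalizer such as 106a would need to offset.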

In the same way as the equalizer 106a, the equalizer 106b performs equalizing that corrects the change in the frequency spectrum of the signal arriving from the direction of speaker B caused by the processing in the adder 105b. This correction flattens the frequency characteristic of the sensitivity to voice arriving from the direction of speaker B.

Through the above processing, a signal in which the speaker A arrival signal is enhanced is obtained as the output signal of the equalizer 106a, and a signal in which the speaker B arrival signal is enhanced is obtained as the output signal of the equalizer 106b. The speaker voice enhancement unit 103 thus outputs both enhanced signals.

Once the speaker voice candidates have been enhanced in step S212, the speaker voice candidates are detected (S213). Specifically, the following processing is performed. First, the power calculator 107a calculates the power of the output signal of the equalizer 106a. The power may be calculated over all frequency components of the output signal of the equalizer 106a, but it is preferable to limit the signal to the band in which voice is dominant and then square the amplitude. The power calculator 107a outputs the signal obtained by squaring the amplitude and then smoothing it over time. A time constant of about 100 ms to 1 s is appropriate for the time smoothing. The power calculator 107b calculates power from the output signal of the equalizer 106b in the same way as the power calculator 107a.
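The power calculation can be sketched as follows. This is a minimal illustration that assumes first-order exponential smoothing, which is one common way to realize time smoothing with a given time constant. The specification does not prescribe a smoothing method, and the preferred band limiting is omitted here. The function name `smoothed_power` is hypothetical.

```python
import math


def smoothed_power(samples, sample_rate_hz, time_constant_s=0.2):
    """Square each sample and smooth the result with a first-order low-pass.

    time_constant_s would be chosen in the 100 ms to 1 s range mentioned above.
    Band limiting to the speech band is assumed to have been done beforehand.
    """
    alpha = math.exp(-1.0 / (time_constant_s * sample_rate_hz))
    power = 0.0
    out = []
    for x in samples:
        power = alpha * power + (1.0 - alpha) * x * x
        out.append(power)
    return out


# A constant-amplitude input settles to its squared amplitude.
trace = smoothed_power([1.0] * 2000, sample_rate_hz=1000, time_constant_s=0.1)
```

A longer time constant gives a steadier power estimate at the cost of a slower response to the onset of speech.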

Based on the power calculated by the power calculators 107a and 107b, the voice detector 108a detects whether the voice of speaker A is present. One detection method is to observe the output of the power calculator 107a and judge that a speaker voice candidate is present when the output exceeds a predetermined reference value. Another method is to calculate a smoothed power by further smoothing the output of the power calculator 107a over time, compare the output of the power calculator 107a with the smoothed power, and judge that a speaker voice candidate is present when the output exceeds the smoothed power by more than a predetermined relative reference value.

Because its reference value is set relatively, the latter method is effective when ambient noise is present. A further method for detecting a speaker voice candidate compares the output of the power calculator 107a with the output of the power calculator 107b and judges that a speaker voice candidate is present when the output of the power calculator 107a exceeds the output of the power calculator 107b by more than a predetermined relative reference value. This method exploits the fact that, when only speaker A is speaking, the output of the power calculator 107a is large relative to the output of the power calculator 107b, and it is effective in improving detection accuracy when only speaker A is speaking.

The voice detector 108b performs the same processing as the voice detector 108a described above and detects whether a speaker voice candidate of speaker B is present.
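The three detection criteria described above can be sketched as simple comparisons. The sketch assumes that "exceeds by a relative reference value" means exceeding by a multiplicative ratio; the specification does not say whether the relative reference is a ratio or a difference, and the function names are hypothetical.

```python
def detect_absolute(power, reference):
    """Criterion 1: the power exceeds a fixed reference value."""
    return power > reference


def detect_relative_to_smoothed(power, smoothed, ratio):
    """Criterion 2: the power exceeds its own long-term smoothed value by a ratio.

    The threshold follows the noise floor, which helps under ambient noise.
    """
    return power > ratio * smoothed


def detect_relative_to_other_channel(power_own, power_other, ratio):
    """Criterion 3: this channel's power exceeds the other channel's by a ratio.

    This favors the case where only the speaker of this channel is talking.
    """
    return power_own > ratio * power_other


# Channel a is ten times louder than channel b, so speaker A is judged present.
candidate_a = detect_relative_to_other_channel(1.0, 0.1, ratio=4.0)
candidate_b = detect_relative_to_other_channel(0.1, 1.0, ratio=4.0)
```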

Once speaker voice candidates have been detected in step S213, it is determined whether a detected candidate is a speaker voice (S214). If a speaker voice is detected by at least one of the voice detector 108a and the voice detector 108b, processing proceeds to step S215; if neither detects one, processing returns to step S211.

If it is determined in step S214 that the detected candidate is a speaker voice, the conversation detector 109 performs processing to detect a conversation between speakers that includes that speaker voice (S215). Specifically, when the speaker voice of one speaker (for example, speaker A) is detected, the conversation detector checks whether the speaker voice of another speaker (in this case, speaker B) was picked up within a fixed time. A fixed time of 10 to 30 seconds is appropriate.

After the conversation detection processing of step S215, it is determined whether a conversation was detected (S216). If a conversation was detected, the voice judged to be a speaker voice in step S214 is part of a conversation between speaker A and speaker B and is therefore not treated as a target for keyword detection, and processing returns to step S211. If no conversation was detected, the voice is judged to be a solitary utterance by speaker A or speaker B, and processing proceeds to step S217.
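The conversation check of steps S215 and S216 can be sketched as a record of when each speaker was last heard. The class name `ConversationDetector`, the use of timestamps in seconds, and the 20-second default (a value inside the 10 to 30 second range given above) are illustrative assumptions.

```python
class ConversationDetector:
    """Flags a conversation when two different speakers are heard within a window."""

    def __init__(self, window_s=20.0):
        self.window_s = window_s
        self.last_heard = {}  # speaker label -> time of most recent detected voice

    def update(self, speaker, now_s):
        """Record a detected voice and report whether it belongs to a conversation.

        Returns True if some other speaker was heard within the last window_s
        seconds, in which case keyword detection would be skipped.
        """
        self.last_heard[speaker] = now_s
        return any(
            other != speaker and now_s - heard_at <= self.window_s
            for other, heard_at in self.last_heard.items()
        )


detector = ConversationDetector(window_s=20.0)
first = detector.update("A", 0.0)    # A speaks alone: not a conversation
second = detector.update("B", 5.0)   # B answers within 20 s: conversation
third = detector.update("A", 100.0)  # long silence since B spoke: solitary utterance
```

Only a solitary utterance, such as `first` and `third` here, would be passed on to keyword detection.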

If the voice is judged in step S216 to be a solitary utterance, keyword detection is performed on the uttered voice (S217). Keyword detection uses speech recognition; any known speech recognition technique may be used, and its details are omitted because they are not essential to the present invention. The keywords to be detected are set in advance and may be phrases that signify the start of speech recognition, for example 「音声認識開始」 or 「音声認識スタート」 (both meaning "start voice recognition").

Keyword detection is controlled by the recognition controller 110. Specifically, if it is determined in step S214 that the voice detector 108a detected a speaker voice, the input selector 111 selects the output of the equalizer 106a stored in the buffer 112. If the voice detector 108b detected a voice, the input selector 111 selects the output of the equalizer 106b stored in the buffer 112. The speech recognition unit 113 then performs keyword detection on the output of the input selector 111.

After keyword detection in step S217, it is determined whether a preset keyword was detected (S218). If one was detected, the speech recognition apparatus 100 moves to the speech recognition mode S220; otherwise, processing returns to step S211.

In the speech recognition mode S220, the speech recognition apparatus 100 accepts speech recognition commands. First, it is determined whether a condition for ending the speech recognition mode S220 is satisfied (S221). If a mode end condition is satisfied, for example because a predetermined keyword indicating the end of speech recognition was detected, the apparatus moves to the standby mode S210. If no mode end condition is satisfied, processing proceeds to step S222.

Steps S222 to S224 perform the same processing as steps S212 to S214, respectively. If a speaker voice is detected, processing proceeds to step S225; otherwise, processing returns to step S221.

If it is determined in step S224 that a speaker voice was detected, detection processing is performed for predetermined keywords such as speech recognition commands, place names, and device settings (S225). The keyword detection processing may use any known speech recognition technique, and its details are omitted because they are not essential to the present invention. The keyword detection processing is controlled by the recognition controller 110. If the voice detector 108a detected a speaker voice in step S223, the input selector 111 selects the output of the equalizer 106a stored in the buffer 112. If the voice detector 108b detected a voice, the input selector 111 selects the output of the equalizer 106b stored in the buffer 112.

After the keyword detection processing of step S225, the speech recognition unit 113 determines whether the speaker voice subjected to that processing matches a predetermined keyword (S226). If it matches, processing proceeds to step S227; if it does not, processing proceeds to step S228.

If it is determined in step S226 that a keyword was detected, a signal corresponding to the keyword (hereinafter, the "corresponding signal") is output or transmitted to a device mounted in the vehicle, a center connected via a network, or the like (S227), and processing returns to step S221. If it is determined that no keyword was detected, processing proceeds to step S228.

In step S228, it is determined whether an interfering speaker is present. If both the voice detector 108a and the voice detector 108b detected a voice in step S223, it is judged that the keyword could not be detected because several speakers were speaking, and processing proceeds to step S229. Otherwise, processing returns to step S221.

In step S229, the user is informed that speech recognition is not possible because several speakers are present. The information can be conveyed by displaying text or pictures on a monitor or the like, or by voice. This embodiment conveys it by voice: the guidance output unit 114 outputs an audio signal, which is amplified by the amplifier 115 and then output from the loudspeaker 116. The spoken message may be, for example, "Voice recognition is in progress; everyone other than the driver, please remain quiet." Any message will do as long as it tells the vehicle occupants why speech recognition failed. Conveying this information means that, even if speech recognition fails because another speaker is present, only the speaker operating the speech recognition can be expected to speak when recognition is attempted again, which makes recognition possible. When step S229 ends, processing returns to step S221.
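The two-mode flow described in steps S211 to S229 can be summarized as a small state-transition function. This is a simplified sketch under stated assumptions: the function `step`, its arguments, the action values, and the `START_KEYWORD` string are all hypothetical, and the voice detection, conversation detection, and keyword matching results are taken as inputs rather than computed.

```python
STANDBY = "standby"
RECOGNITION = "recognition"
START_KEYWORD = "start voice recognition"  # hypothetical start keyword


def step(mode, voice_a, voice_b, conversation, keyword, end_condition=False):
    """Return (next_mode, action) for one pass through the flow.

    keyword is the matched preset keyword, or None if nothing matched.
    action is None, ("send", keyword), or "guidance".
    """
    voice = voice_a or voice_b
    if mode == STANDBY:
        # S214, S216, S218: a voice, no conversation, and the start keyword.
        if voice and not conversation and keyword == START_KEYWORD:
            return RECOGNITION, None
        return STANDBY, None
    if end_condition:            # S221: leave the speech recognition mode
        return STANDBY, None
    if not voice:                # S224: nothing to recognize
        return RECOGNITION, None
    if keyword is not None:      # S226 -> S227: output the corresponding signal
        return RECOGNITION, ("send", keyword)
    if voice_a and voice_b:      # S228 -> S229: interfering speaker, give guidance
        return RECOGNITION, "guidance"
    return RECOGNITION, None


# In standby, the start keyword is ignored while a conversation is in progress.
ignored = step(STANDBY, True, True, True, START_KEYWORD)
started = step(STANDBY, True, False, False, START_KEYWORD)
```

Keeping the decision logic free of signal processing makes the branch conditions easy to check against the flowchart.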

As described above, the speech recognition apparatus according to the first embodiment of the present invention picks up speaker voices with directivity toward each of several speaker directions and includes a conversation detector that uses the voice detection results obtained from each directional output. When the occupants are conversing with each other, the apparatus does not perform keyword detection while waiting for speech recognition to start, which suppresses false activation of speech recognition.

In addition, because a guidance output unit is provided so that the apparatus can report that a conversation was detected and speech recognition was not performed, operability is improved.

(Second Embodiment)
FIG. 4 is a block diagram of a speech recognition apparatus according to a second embodiment of the present invention. Like the speech recognition apparatus 100 according to the first embodiment, the speech recognition apparatus 400 shown in FIG. 4 performs speech recognition inside the cabin of a vehicle 200 and outputs to the outside a corresponding signal for each recognized keyword. The apparatus 400 includes: a plurality of sound collecting means, each of which picks up sound including a speaker voice uttered by a speaker and generates a collected sound signal (FIG. 4 shows one sound collecting means consisting of the microphone 101a and the amplifier 102a and another consisting of the microphone 101b and the amplifier 102b); speaker voice enhancing means (shown in FIG. 4 as, and hereinafter called, the speaker voice enhancement unit) 403 that, when the sound picked up by each sound collecting means contains speaker voice candidates of several speakers, enhances the speaker voice candidate of one speaker for each sound collecting means; correlation calculating means (hereinafter, the correlation calculator) 401 that calculates the correlation value between the collected sound signals generated by the sound collecting means while varying the delay time between those signals; speaker direction detecting means (hereinafter, the speaker direction detector) 402 that identifies a speaker by detecting the direction in which the speaker is located, based on whether the delay time at which the correlation value between the collected sound signals is greatest lies within a predetermined time range, and that detects the speaker voice among the speaker voice candidates of the identified speaker; conversation detecting means (hereinafter, the conversation detector) 409 that detects whether the speaker direction detector 402 has detected the speaker voices of several speakers within a fixed past time; speech recognizing means (hereinafter, the speech recognition unit) 113 that, when the speaker direction detector 402 detects a speaker voice, performs speech recognition on the detected speaker voice and outputs a predetermined signal when the recognized word matches a predetermined keyword; and recognition control means (hereinafter, the recognition controller) 110 that, when the conversation detector 409 detects that the speaker direction detector 402 has detected the speaker voices of several speakers within the fixed past time, prevents the speech recognition unit 113 from performing speech recognition on speaker voices uttered within that fixed time.

Components of the speech recognition apparatus 400 according to the second embodiment that are the same as components of the speech recognition apparatus 100 according to the first embodiment are given the same reference numerals, and their description is omitted. The speaker voice enhancement unit 403 of the second embodiment has the same function as the speaker voice enhancement unit 103 of the first embodiment, except that the delay times of its delay units 404a and 404b can be set while the speech recognition apparatus 400 is operating; its description is therefore omitted.

As shown in FIG. 4, the speech recognition apparatus 400 may further include a delay time setting unit 408 that sets the delay times of the delay units 404a and 404b, and may update the times set in the delay units 404a and 404b according to the speaker direction detected by the speaker direction detector 402.

The operation of the speech recognition apparatus 400 according to the second embodiment will now be described with reference to FIG. 5. As shown in FIG. 5, the speech recognition apparatus 400 operates in two modes, a standby mode and a speech recognition mode, like the speech recognition apparatus 100 according to the first embodiment. In the standby mode S510, the apparatus detects the utterance of a keyword that triggers the start of speech recognition, so that speech recognition can be started by utterance alone, without requiring the user to signal the start by some other means such as pressing a speech recognition start button. Steps that perform the same processing as steps described for the first embodiment are given the same numbers, and their description is omitted.

After the speaker voice in the collected sound signals has been enhanced in step S212, the correlation calculator 401 calculates the cross-correlation function between the output signal of the amplifier 102a and the output signal of the amplifier 102b (S5131). One way to calculate the cross-correlation function is, in the time domain, to shift the output signal of the amplifier 102a relative to the output signal of the amplifier 102b by a time lag τ and calculate the correlation coefficient between the shifted signal and the other signal. Another way is to transform the output signals of the amplifiers 102a and 102b into the frequency domain by a known method such as the FFT (Fast Fourier Transform), calculate their cross spectrum, and then return to the time domain by a known method such as the inverse FFT. The latter method has the advantage of requiring less computation.
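The time-domain calculation described above can be sketched in plain Python. The sketch computes a normalized correlation (without mean removal) for each integer lag and picks the lag with the largest value. The function names, the sign convention for the lag, and the restriction to integer lags are assumptions made for illustration; the FFT-based variant is not shown.

```python
import math


def cross_correlation(x, y, max_lag):
    """Return {lag: normalized correlation} for lags in [-max_lag, max_lag].

    A positive lag means y is compared after being advanced by lag samples,
    so a peak at lag k suggests that y lags x by k samples.
    """
    n = min(len(x), len(y))
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        pairs = [(x[i], y[i + lag]) for i in range(n) if 0 <= i + lag < n]
        num = sum(a * b for a, b in pairs)
        den = math.sqrt(sum(a * a for a, _ in pairs) * sum(b * b for _, b in pairs))
        result[lag] = num / den if den > 0 else 0.0
    return result


def peak_lag(x, y, max_lag):
    """Return the lag at which the normalized correlation is largest."""
    corr = cross_correlation(x, y, max_lag)
    return max(corr, key=corr.get)


# The second signal is the first one delayed by three samples.
sig_a = [0.0, 0.0, 1.0, 2.0, -1.0, 0.5, 0.0, 0.0, 0.0, 0.0]
sig_b = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, -1.0, 0.5, 0.0]
lag = peak_lag(sig_a, sig_b, max_lag=4)
```

The direct form costs on the order of the frame length times the number of lags per frame, which is the cost the FFT-based method mentioned above reduces.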

Once the correlation function has been calculated in step S5131, the speaker direction is detected, which identifies the speaker, and the speaker voice among the speaker voice candidates of the identified speaker is detected (S5132). The speaker direction is detected according to whether the time lag τ at which the cross-correlation function calculated in step S5131 takes its maximum value lies within the allowable range of time differences corresponding to a predetermined speaker direction. If it lies within the predetermined allowable range, the speaker direction detector 402 judges that a speaker is present in that direction. It is appropriate to set the allowable range to the time differences corresponding to the geometrically calculated speaker direction ±10° to ±30°. The time lag τ can thus serve as information specifying the direction in which the speaker is located.
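The mapping between a speaker direction and its allowable lag range can be sketched as follows. The sketch assumes a far-field model with two microphones, in which the inter-microphone delay is d·sin(θ)/c for a source at angle θ from the broadside direction. The specification only says the range is calculated geometrically, so this model, the speed-of-sound constant, and the function names are assumptions.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at room temperature


def lag_for_angle(angle_deg, mic_spacing_m, sample_rate_hz):
    """Far-field inter-microphone delay, in samples, for a source at angle_deg."""
    delay_s = mic_spacing_m * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND_M_S
    return delay_s * sample_rate_hz


def speaker_in_direction(peak_lag, angle_deg, tolerance_deg,
                         mic_spacing_m, sample_rate_hz):
    """Judge whether the correlation peak lag falls in the range for a direction."""
    low_angle = max(-90.0, angle_deg - tolerance_deg)
    high_angle = min(90.0, angle_deg + tolerance_deg)
    low = lag_for_angle(low_angle, mic_spacing_m, sample_rate_hz)
    high = lag_for_angle(high_angle, mic_spacing_m, sample_rate_hz)
    return low <= peak_lag <= high


# Speaker A is expected at +30 degrees, with a tolerance of 20 degrees.
is_speaker_a = speaker_in_direction(8, 30.0, 20.0,
                                    mic_spacing_m=0.343, sample_rate_hz=16000)
```

Because the sine function is monotonic between -90° and +90°, an angular tolerance maps directly onto a single contiguous lag interval.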

When the time lag at which the correlation peaked when a speaker voice was detected in step S214 (hereinafter, the "correlation time") differs from the preset correlation time, the newly detected correlation time is set in the delay units 404a and 404b via the speaker direction detector 402 (S530). Setting the correlation time in the speaker direction detector 402 is equivalent to setting information on the speaker direction. This processing need not be performed for every processing frame. It is sufficient to perform it at suitable intervals longer than the processing frame length, using a long-term average of the detected speaker direction; for example, it may be performed only when the speech recognition apparatus shuts down or starts up. Because this processing is not essential, a configuration that omits step S530 is also possible.

Once the speaker voice candidates have been enhanced in step S222, the correlation function between the output of the amplifier 102a and the output of the amplifier 102b is calculated in the same way as in step S5131 (S5231). Once the correlation function has been calculated in step S5231, the speaker direction is detected in the same way as in step S5132 (S5232). Once the speaker direction has been detected, the processing from step S224 onward is performed as described for the first embodiment.

As described above, the speech recognition apparatus according to the second embodiment of the present invention detects the speaker direction from the correlation between signals picked up by several microphones and uses that direction to determine whether the occupants are conversing with each other. When they are, the apparatus does not perform keyword detection while waiting for speech recognition to start, which suppresses false activation of speech recognition.

The speech recognition apparatus according to the present invention has the effect of reducing misrecognition when several speakers are present and is useful, for example, as an operating device that uses speech recognition to operate in-vehicle equipment.

FIG. 1 is a conceptual diagram showing the block configuration of the speech recognition apparatus according to the first embodiment of the present invention.
FIG. 2 is a flowchart explaining the operation of the speech recognition apparatus according to the first embodiment of the present invention.
FIG. 3 is a flowchart explaining the speaker voice enhancement operation of the speech recognition apparatus according to the first embodiment of the present invention.
FIG. 4 is a conceptual diagram showing the block configuration of the speech recognition apparatus according to the second embodiment of the present invention.
FIG. 5 is a flowchart explaining the operation of the speech recognition apparatus according to the second embodiment of the present invention.
FIG. 6 is a conceptual diagram showing the block configuration of a conventional speech recognition apparatus.

Explanation of symbols

100, 400, 600 Speech recognition apparatus
101a, 101b, 601a, 601b, 601c Microphone
102a, 102b, 115, 602a, 602b, 602c Amplifier
103, 403 Speaker voice enhancement unit
104a, 104b, 404a, 404b Delay unit
105a, 105b Adder
106a, 106b Equalizer
107a, 107b Power calculator
108a, 108b Voice detector
109, 409 Conversation detector
110 Recognition controller
111 Input selector
112 Buffer
113, 606 Speech recognition unit
114 Guidance output unit
116 Loudspeaker
200 Vehicle
401 Correlation calculator
402 Speaker direction detector
408 Delay time setting unit
603a, 603b, 603c Filter
604 Comparator
605 Voice switching unit
607 CPU

Claims

A speech recognition apparatus comprising: a plurality of sound collecting means for collecting sound including a speaker voice uttered by a speaker; speaker voice enhancing means for enhancing a speaker's speaker voice candidate when the sound collected by each of the sound collecting means includes speaker voice candidates of a plurality of speakers; power calculating means for calculating the power of the speaker voice candidate enhanced by the speaker voice enhancing means; voice detecting means for detecting a speaker voice among the speaker voice candidates based on the power calculated by the power calculating means; conversation detecting means for detecting whether the voice detecting means has detected speaker voices of a plurality of speakers within a predetermined past time; speech recognizing means for performing, when the voice detecting means detects a speaker voice, speech recognition on the detected speaker voice and outputting a predetermined signal when a word recognized by the speech recognition matches a predetermined keyword; and recognition control means for preventing, when the conversation detecting means detects that the voice detecting means has detected speaker voices of a plurality of speakers within the predetermined past time, the speech recognizing means from performing speech recognition on speaker voices uttered within the predetermined time in which the speaker voices of the plurality of speakers were detected.

The speech recognition apparatus according to claim 1, further comprising guidance output means for generating, when the conversation detecting means detects that speaker voices of a plurality of speakers were present within the predetermined past time and the speech recognizing means does not perform speech recognition on the speaker voice, a signal notifying that speech recognition was not performed because speaker voices of a plurality of speakers were present.

A speech recognition apparatus comprising: a plurality of sound collecting means for collecting sound including a speaker voice uttered by a speaker and generating a collected sound signal; speaker voice enhancing means for enhancing speaker voice candidates when the sound collected by each of the sound collecting means includes speaker voice candidates of a plurality of speakers; correlation calculating means for calculating a correlation value between the collected sound signals generated by the plurality of sound collecting means while varying a delay time between the collected sound signals; speaker direction detecting means for identifying a speaker by detecting the direction in which the speaker is located, based on whether the delay time at which the correlation value between the collected sound signals is greatest lies within a predetermined time range, and for detecting a speaker voice among the speaker voice candidates of the identified speaker; conversation detecting means for detecting whether the speaker direction detecting means has detected speaker voices of a plurality of speakers within a predetermined past time; speech recognizing means for performing, when the speaker direction detecting means detects a speaker voice, speech recognition on the detected speaker voice and outputting a predetermined signal when a word recognized by the speech recognition matches a predetermined keyword; and recognition control means for preventing, when the conversation detecting means detects that the speaker direction detecting means has detected speaker voices of a plurality of speakers within the predetermined past time, the speech recognizing means from performing speech recognition on speaker voices uttered within the predetermined time in which the speaker voices of the plurality of speakers were detected.

The speech recognition apparatus according to claim 3, further comprising guidance output means for generating, when the conversation detecting means detects that speaker voices of a plurality of speakers were present within the predetermined past time and the speech recognizing means does not perform speech recognition on the speaker voice, a signal notifying that speech recognition was not performed because speaker voices of a plurality of speakers were present.