JP2007219207A

JP2007219207A - Speech recognition device

Info

Publication number: JP2007219207A
Application number: JP2006040397A
Authority: JP
Inventors: Hiroshi Shibata; 浩柴田
Original assignee: Denso Ten Ltd
Current assignee: Denso Ten Ltd
Priority date: 2006-02-17
Filing date: 2006-02-17
Publication date: 2007-08-30

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition device that can perform speech recognition appropriately even when a person other than the voice input target person is present in its use environment.

SOLUTION: The speech recognition device, which recognizes speech uttered by the voice input target person (driver), comprises first determination means for determining whether the driver has spoken, second determination means for determining whether a person other than the voice input target person (a passenger) has spoken, and condition establishment determination means for determining whether a speech recognition start condition is satisfied. The speech recognition start condition includes that the driver has spoken and that the passenger does not speak immediately after the driver's utterance.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a speech recognition device, and more particularly to a speech recognition device that employs speech recognition technology to recognize speech uttered by a user.

Speech recognition devices are used in various fields; for example, they are employed in in-vehicle equipment such as navigation systems. Speech uttered by the driver is picked up by a microphone, which converts it into an electrical signal. The converted signal is analyzed by a voice processing unit, the command uttered by the driver is recognized, and the navigation device operates according to the recognition result.

Because a high recognition rate is very important in a speech recognition device, it is also important to specify the processing section, that is, the period during which recognition processing is applied to the speech picked up by the microphone. If recognition were performed continuously without specifying a processing section, conversation with passengers, music from the car stereo, noise, and the like could cause malfunctions.
For this reason, many conventional speech recognition devices are provided with a talk switch that the user operates when inputting speech. There are two types of talk switch: the utterance start switch and the press-to-talk switch.

The utterance start switch is operated by the user immediately before starting to speak. Once it is operated, recognition processing is applied to the speech picked up by the microphone from that moment on.
The press-to-talk switch, on the other hand, is held down by the user from the start to the end of the utterance, and recognition processing is applied to the speech picked up by the microphone while the switch is held down.

With such speech recognition devices, however, the user must operate the talk switch every time he or she speaks, which makes operation very cumbersome. In particular, having the driver operate a talk switch while driving is far from desirable. If a speech recognition device is adopted at all, manual operation should ideally be unnecessary.

As a technique for solving this problem, Patent Document 1 below, for example, discloses detecting the user's appearance state, such as the user's face turning toward the microphone, the user's lips moving, or the user's gaze being directed at the microphone, to determine whether the user is speaking, and starting speech recognition when it is determined that the user has begun to speak.

However, when a person other than the voice input target person is present in the use environment, for example when there is a passenger in the vehicle, the device cannot distinguish whether the user is talking to that other person or speaking toward the microphone for voice input. Speech recognition processing may therefore be started even though it is not needed, leading to malfunctions. For example, speech recognition processing could be started by a conversation with a passenger.
Patent Document 1: Japanese Patent Laid-Open No. 11-352987

Means for solving the problems and their effects

The present invention has been made in view of the above problem, and its object is to provide a speech recognition device that can achieve appropriate speech recognition even when a person other than the voice input target person is present in its use environment.

To achieve the above object, a speech recognition device (1) according to the present invention is a speech recognition device that recognizes speech uttered by a voice input target person, comprising: first determination means for determining whether the voice input target person has spoken; second determination means for determining whether a person other than the voice input target person has spoken; and condition establishment determination means for determining whether a speech recognition start condition, under which speech recognition is started, is satisfied. The speech recognition start condition includes (a) that the first determination means determines that the voice input target person has spoken, and (b) that the second determination means does not determine that a person other than the voice input target person has spoken until a predetermined period has elapsed from the utterance by the voice input target person.

With the speech recognition device (1), the speech recognition start condition includes the voice input target person (for example, the driver) speaking, so an utterance by the voice input target person can serve as the trigger that starts speech recognition. Speech recognition can therefore be started at the timing the voice input target person desires, without requiring a manual operation such as pressing a switch.

The speech recognition start condition also includes that no person other than the voice input target person (for example, a passenger) speaks until a predetermined period has elapsed from the utterance by the voice input target person (for example, until 2 seconds have elapsed from the end of the utterance). Consequently, if a person other than the voice input target person speaks immediately after the voice input target person does, speech recognition is not started.

When someone other than the voice input target person speaks immediately after the voice input target person, it is highly likely that the two are having a conversation. Because speech recognition is not started when the utterance by the voice input target person is likely to be part of a conversation with another person, speech recognition can be prevented from starting when it is not needed.

For example, speech recognition is not started in the following case.
1. The driver, who is the voice input target person, speaks.
2. Within 2 seconds of utterance 1, a passenger, who is not the voice input target person, speaks.
3. After utterance 2, the driver speaks.
4. Within 2 seconds of utterance 3, the passenger speaks.
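
The start condition of device (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Utterance` record, the speaker labels `"driver"` and `"passenger"`, and the function name `start_condition_met` are hypothetical, and the window is measured from the end of the driver's utterance, following the 2-second example given above.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker: str  # "driver" or "passenger" (labels assumed for this sketch)
    start: float  # onset time in seconds
    end: float    # end time in seconds


def start_condition_met(utterances, index, window=2.0):
    """Check the start condition for the utterance at `index`.

    (a) The utterance must come from the driver.
    (b) Nobody else may start speaking within `window` seconds after it ends.
    """
    target = utterances[index]
    if target.speaker != "driver":
        return False
    for other in utterances:
        if other.speaker == "driver":
            continue
        if target.end <= other.start < target.end + window:
            return False  # a reply followed quickly: treat it as conversation
    return True


# The four-step conversation listed above: neither driver utterance qualifies.
conversation = [
    Utterance("driver", 0.0, 1.0),
    Utterance("passenger", 1.5, 2.5),
    Utterance("driver", 3.0, 4.0),
    Utterance("passenger", 4.8, 6.0),
]
# A lone command that nobody answers within 2 seconds.
command = [Utterance("driver", 0.0, 1.2)]

print(start_condition_met(conversation, 0), start_condition_met(command, 0))
```

Because condition (b) can only be evaluated once the window has passed, a real device has to defer the decision until then; this sketch sidesteps that by looking at a complete list of utterances after the fact.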

A speech recognition device (2) according to the present invention is the speech recognition device (1) further comprising speaker determination means for determining whether the speaker is the voice input target person based on speaker-specific information contained in the speech input through voice input means, wherein the determination by the first determination means and the determination by the second determination means are made based on the result of the determination by the speaker determination means.

With the speech recognition device (2), whether the speaker is the voice input target person is determined based on speaker-specific information contained in the speech input through the voice input means, and based on that result it is determined whether the voice input target person has spoken and whether a person other than the voice input target person has spoken. Examples of speaker-specific information include voiceprints and formants (resonance frequencies of the vocal tract).

If the speaker is judged to be (or likely to be) the voice input target person, it is determined that the voice input target person has spoken; if the speaker is judged not to be (or unlikely to be) the voice input target person, it is determined that a person other than the voice input target person has spoken. These determinations can therefore be made properly as long as speaker-specific information for the voice of the voice input target person is available. For example, given voiceprint data for the driver, who is the voice input target person, utterances by the driver and utterances by a passenger can be distinguished properly.
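
One way to picture this judgment is as a similarity test between features of the incoming speech and the enrolled data of the voice input target person. The source does not say how the comparison is made, so everything in the following sketch is an assumption: the feature vectors are abstract stand-ins for voiceprint data, cosine similarity is a substituted illustrative measure, and `classify_speaker` and the threshold of 0.9 are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_speaker(features, enrolled, threshold=0.9):
    """Return "target" if the features are close enough to the enrolled data.

    The threshold is an assumed tuning value, not a figure from the source.
    """
    if cosine_similarity(features, enrolled) >= threshold:
        return "target"
    return "other"


enrolled_driver = [0.9, 0.1, 0.3]    # stand-in for the stored driver data
close_to_driver = [0.88, 0.12, 0.31]
someone_else = [0.1, 0.9, 0.2]

print(classify_speaker(close_to_driver, enrolled_driver))
print(classify_speaker(someone_else, enrolled_driver))
```

A "target" result feeds the first determination means and an "other" result feeds the second, which is why a single judgment about the speaker is enough to drive both.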

A speech recognition device (3) according to the present invention is the speech recognition device (1) further comprising speaker determination means for determining whether the speaker is the voice input target person based on the sound source direction obtained from the speech input through voice input means, wherein the determination by the first determination means and the determination by the second determination means are made based on the result of the determination by the speaker determination means.

With the speech recognition device (3), whether the speaker is the voice input target person is determined based on the sound source direction obtained from the speech input through the voice input means, and based on that result it is determined whether the voice input target person has spoken and whether a person other than the voice input target person has spoken.

If the speaker is judged to be (or likely to be) the voice input target person, it is determined that the voice input target person has spoken; if the speaker is judged not to be (or unlikely to be) the voice input target person, it is determined that a person other than the voice input target person has spoken. These determinations can therefore be made properly as long as data indicating where the voice input target person is located is available. For example, if the driver is the voice input target person, data indicating the position of the driver's seat makes it possible to distinguish properly between utterances by the driver and utterances by a passenger.

A speech recognition device (4) according to the present invention is the speech recognition device (1) further comprising speaker determination means for determining whether the speaker is the voice input target person based on the appearance state, obtained from an image input through image input means, of the voice input target person, of a person other than the voice input target person, or of both, wherein the determination by the first determination means and the determination by the second determination means are made based on the result of the determination by the speaker determination means.

With the speech recognition device (4), whether the speaker is the voice input target person is determined based on the appearance state, obtained from the image input through the image input means, of the voice input target person, of a person other than the voice input target person, or of both, and based on that result it is determined whether the voice input target person has spoken and whether a person other than the voice input target person has spoken.

An appearance state such as the face turning toward the microphone or the lips moving indicates that the person is likely to be the speaker. For example, if the lips of the voice input target person are moving when an utterance is detected at the microphone, the speaker can be judged to be the voice input target person; if the lips of a person other than the voice input target person are moving when an utterance is detected, the speaker can be judged to be a person other than the voice input target person. Likewise, if the lips of the voice input target person are not moving when an utterance is detected at the microphone, the speaker can be judged to be a person other than the voice input target person.

If the speaker is judged to be (or likely to be) the voice input target person, it is determined that the voice input target person has spoken; if the speaker is judged not to be (or unlikely to be) the voice input target person, it is determined that a person other than the voice input target person has spoken. These determinations can therefore be made properly by monitoring the appearance state of the voice input target person, of a person other than the voice input target person, or of both. For example, by monitoring the face of the driver, who is the voice input target person, utterances by the driver and utterances by a passenger can be distinguished properly.

A speech recognition device (5) according to the present invention is any one of the speech recognition devices (1) to (4) further comprising hold period setting means for setting a hold period, during which the speech recognition start condition is not allowed to be satisfied, when the second determination means determines that a person other than the voice input target person has spoken before the predetermined period has elapsed from the utterance by the voice input target person.

Even if no person other than the voice input target person speaks before the predetermined period elapses from an utterance by the voice input target person, that utterance is likely to be part of a conversation, rather than directed at the device to be operated, if the two had been conversing up to that point.

For example, consider the following case.
1. The driver, who is the voice input target person, speaks.
2. Within 2 seconds of utterance 1, a passenger, who is not the voice input target person, speaks.
3. Within 2 seconds of utterance 2, the driver speaks.
4. No passenger speaks even after 2 seconds have elapsed from utterance 3.
The driver's utterance in step 3 is likely to be part of the conversation rather than directed at the device to be operated.

With the speech recognition device (5), when it is determined that a person other than the voice input target person has spoken before the predetermined period has elapsed from the utterance by the voice input target person (that is, when the two are likely to be having a conversation), a hold period (for example, 10 seconds) is set during which the speech recognition start condition is not allowed to be satisfied. This keeps the driver's utterance in step 3 above from starting speech recognition, and so prevents speech recognition from being started unnecessarily.
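
The effect of the hold period can be sketched as follows. This is a hedged illustration, not the device's actual logic: utterances are given as hypothetical `(speaker, start, end)` tuples, the function name `recognition_triggers` is invented, and the hold period is assumed to begin at the onset of the passenger's reply, since the source does not state where it is measured from.

```python
def recognition_triggers(utterances, window=2.0, hold=10.0):
    """Return the indices of driver utterances that would start recognition.

    `utterances` is a time-ordered list of (speaker, start, end) tuples.
    A reply within `window` seconds of a driver utterance marks a conversation
    and blocks the start condition until `hold` seconds after that reply.
    """
    triggers = []
    hold_until = float("-inf")
    for index, (speaker, start, end) in enumerate(utterances):
        if speaker != "driver":
            continue
        replies = [s for who, s, _ in utterances
                   if who != "driver" and end <= s < end + window]
        if replies:
            # Conversation detected: no trigger, and begin (or extend) the hold.
            hold_until = max(hold_until, min(replies) + hold)
            continue
        if start < hold_until:
            continue  # still inside the hold period
        triggers.append(index)
    return triggers


# Steps 1 to 4 of the case above, followed by a later stand-alone command.
timeline = [
    ("driver", 0.0, 1.0),
    ("passenger", 1.5, 2.5),
    ("driver", 3.0, 4.0),    # unanswered, but still part of the conversation
    ("driver", 20.0, 21.0),  # well after the hold period has expired
]

print(recognition_triggers(timeline))            # with the 10-second hold
print(recognition_triggers(timeline, hold=0.0))  # without a hold period
```

Comparing the two calls shows what the hold period adds: without it, the unanswered utterance at index 2 would start recognition even though it closes a conversation.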

An embodiment of the speech recognition device according to the present invention will now be described with reference to the drawings. FIG. 1 is a block diagram schematically showing the main part of a navigation system that employs the speech recognition device according to embodiment (1). In the figure, reference numeral 1 denotes the speech recognition device. The speech recognition device 1 comprises an A/D converter 2 that converts the audio signal from a microphone 6 into a digital signal; a FIFO (first-in, first-out) buffer memory 3 that stores several seconds of the digitized audio signal obtained from the microphone 6; an EEPROM 4 that stores speaker-specific information (for example, voiceprint data) contained in the voice of the driver, who is the voice input target person; and a voice processing unit 5 that has a CPU, ROM, RAM, and the like.
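
The role of the FIFO buffer memory 3 can be illustrated with a short sketch. It is only an illustration under assumed values: the source says the buffer holds several seconds of digitized audio but gives no sample rate or exact length, so `SAMPLE_RATE_HZ`, `BUFFER_SECONDS`, `make_buffer`, and `push_samples` are hypothetical names and numbers.

```python
from collections import deque

SAMPLE_RATE_HZ = 16_000  # assumed; not specified in the source
BUFFER_SECONDS = 3       # "several seconds" in the source; 3 is an assumption


def make_buffer(sample_rate_hz=SAMPLE_RATE_HZ, seconds=BUFFER_SECONDS):
    """Create a FIFO buffer that keeps only the most recent samples."""
    return deque(maxlen=sample_rate_hz * seconds)


def push_samples(buffer, samples):
    """Append new samples; the oldest ones are discarded once the buffer is full."""
    buffer.extend(samples)


# Tiny demonstration with a 4 Hz "sample rate" and a 2-second buffer (8 samples).
demo = make_buffer(sample_rate_hz=4, seconds=2)
push_samples(demo, range(10))
print(list(demo))  # the two oldest samples have been dropped
```

Keeping the last few seconds of audio matters because the start condition can only be judged some time after the driver has finished speaking; the buffered signal lets recognition be applied to that earlier utterance.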

The voice processing unit 5 is equipped with, among other things, a function (speaker recognition function) that determines whether the speaker is the driver (the voice input target person) based on the speaker-specific information contained in the speech picked up by the microphone 6 and the speaker-specific information stored in the EEPROM 4, and a function that performs speech recognition processing on the audio signal from the microphone 6. A signal corresponding to the voice command obtained through recognition processing in the voice processing unit 5 is sent to the navigation device 7 via in-vehicle communication.
As shown in FIG. 2, the microphone 6 is installed at approximately the center in front of the driver's seat 9 and the passenger seat 10 of the vehicle 8, so that it can properly pick up both the speech uttered by the driver and the speech uttered by a passenger.

Next, processing operation [1] performed by the voice processing unit 5 of the speech recognition device 1 according to embodiment (1) will be described with reference to the flowchart shown in FIG. 3. Processing operation [1] is performed in response to an activation request from the navigation device 7 or to the ignition switch being turned ON.

First, it is determined whether speech has been input at the microphone 6 (for example, whether there has been audio input at or above a certain volume) (step S1). If so, speaker recognition processing is performed based on the speaker-specific information contained in the input speech and the speaker-specific information stored in the EEPROM 4 (that of the driver's voice) (step S2), and it is determined whether the speaker is the driver, who is the voice input target person (step S3). If it is determined that there is no audio input, the process returns to step S1.

If it is determined in step S3 that the speaker is the driver, that is, that one of the speech recognition start conditions is satisfied, it is next determined whether the audio input at the microphone 6 has ended (for example, whether audio input at or above a certain volume has ceased) (step S4). If it is determined that the speaker is not the driver, the process returns to step S1.

If it is determined in step S4 that the audio input has ended (that is, the driver has finished speaking), it is next determined whether there has been audio input at the microphone 6 (that is, whether there has been a new, separate audio input) (step S5).
If it is determined that there has been audio input, speaker recognition processing is performed based on the speaker-specific information contained in the input speech and the speaker-specific information stored in the EEPROM 4 (that of the driver's voice) (step S6), and it is determined whether the speaker is a passenger (a person other than the voice input target person) (step S7).

If it is determined that the speaker is not a passenger, it is next determined whether 2 seconds have elapsed since the driver finished speaking (step S8). If it is determined in step S5 that there is no audio input, the process also proceeds to step S8 and the same determination is made. Here, 2 seconds are determined to have elapsed since the driver finished speaking only when no passenger has spoken during those 2 seconds, in other words, when no passenger has spoken immediately after the driver's utterance.
When a passenger speaks immediately after the driver, it is highly likely that the two are having a conversation. Put another way, a driver's utterance that is not immediately followed by a passenger's utterance is likely to be an expression of intent to operate the navigation device 7 by voice rather than part of a conversation.

If it is determined in step S8 that 2 seconds have elapsed since the driver finished speaking, that is, that one of the speech recognition start conditions is satisfied, the audio signal is read from the buffer memory 3 and recognition processing is performed on it (step S9), and a signal corresponding to the voice command obtained by the recognition processing is sent to the navigation device 7 via in-vehicle communication (step S10).

It is then determined whether a speech recognition end condition is satisfied (step S11). If the end condition is satisfied, the process returns to step S1; if it is not, the process returns to step S9 and speech recognition processing continues. An example of a speech recognition end condition is that no audio input has been detected continuously for a certain length of time. If, on the other hand, it is determined in step S8 that 2 seconds have not elapsed since the driver finished speaking, the process returns to step S5.

If it is determined in step S7 that the speaker is a passenger, that is, that a passenger has spoken before 2 seconds have elapsed since the driver finished speaking, the driver's utterance is likely to be part of a conversation between the two (in other words, it is unlikely to be an expression of intent to operate the navigation device 7 by voice). The speech recognition start condition is therefore treated as not satisfied, and the process returns to step S1.
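
The flow of steps S1 to S8 described above can be sketched as a small state machine. The sketch makes several assumptions that are not in the source: audio is represented as time-stamped frames already labelled with a speaker (`None` for silence), so the volume check and the speaker recognition of steps S1 to S3 and S5 to S7 are folded into that label, and the state names and the function `run_flow` are hypothetical. Steps S9 to S11 (recognition and its end condition) are left out.

```python
def run_flow(frames, window=2.0):
    """Return the time at which recognition (step S9) would start, or None.

    `frames` is an iterable of (time_s, speaker) pairs, where speaker is
    None (no input), "driver", or "passenger".
    """
    state = "WAIT_FOR_DRIVER"  # steps S1 to S3
    driver_end = None
    for time_s, speaker in frames:
        if state == "WAIT_FOR_DRIVER":
            if speaker == "driver":
                state = "DRIVER_SPEAKING"
        elif state == "DRIVER_SPEAKING":  # step S4: wait for the input to end
            if speaker is None:
                driver_end = time_s
                state = "WATCH_FOR_REPLY"
        elif state == "WATCH_FOR_REPLY":  # steps S5 to S8
            if speaker == "passenger":
                state = "WAIT_FOR_DRIVER"  # step S7: conversation, back to S1
            elif time_s - driver_end >= window:
                return time_s              # step S8 passed: go on to step S9
    return None


# Frames every 0.5 s. The driver speaks, then nobody answers.
command_frames = [(0.0, "driver"), (0.5, "driver"), (1.0, None), (1.5, None),
                  (2.0, None), (2.5, None), (3.0, None)]
# The driver speaks, then a passenger answers within 2 seconds.
conversation_frames = [(0.0, "driver"), (0.5, "driver"), (1.0, None),
                       (1.5, "passenger"), (2.0, "passenger"), (2.5, None),
                       (3.0, None), (3.5, None)]

print(run_flow(command_frames), run_flow(conversation_frames))
```

As in the flowchart, a further utterance by the driver during the watch period does not reset the timer; only a passenger's utterance sends the flow back to the start.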

With the speech recognition device according to embodiment (1), the speech recognition start condition includes the voice input target person (the driver) having spoken, so an utterance by the voice input target person can serve as the trigger that starts speech recognition. Speech recognition can therefore be started at the timing the voice input target person desires, without requiring a manual operation such as pressing a switch.

The speech recognition start condition also includes that no person other than the voice input target person (a passenger) speaks until 2 seconds have elapsed from the end of the utterance by the voice input target person. Consequently, if a person other than the voice input target person speaks immediately after the voice input target person does, speech recognition is not started.

When someone other than the voice input target person speaks immediately after the voice input target person, it is highly likely that the two are having a conversation. Because speech recognition is not started when the utterance by the voice input target person is likely to be part of a conversation with another person, speech recognition can be prevented from starting when it is not needed.

FIG. 4 is a block diagram schematically showing the main part of a navigation system that employs the speech recognition device according to embodiment (2). In the figure, reference numeral 21 denotes the speech recognition device. The speech recognition device 21 comprises A/D converters 22 and 23 that convert the audio signals from microphones 26 and 27 into digital signals; a FIFO buffer memory 24 that stores several seconds of the digitized audio signals obtained from the microphones 26 and 27; and a voice processing unit 25 that has a CPU, ROM, RAM, and the like.

The voice processing unit 25 is equipped with, among other things, a function that identifies the sound source direction from the speech picked up by the microphones 26 and 27, and a function that performs speech recognition processing on the audio signals from the microphones 26 and 27. A signal corresponding to the voice command obtained through recognition processing in the voice processing unit 25 is sent to the navigation device 7 via in-vehicle communication.

The microphones 26 and 27 are directional. As shown in FIG. 5, the microphone 26 is installed in front of the passenger seat 10 of the vehicle 8 with its directivity facing the driver's seat 9, and the microphone 27 is installed in front of the driver's seat 9 of the vehicle 8 with its directivity facing the passenger seat 10. With this arrangement, speech uttered by the driver is properly picked up by the microphone 26 and speech uttered by the passenger is properly picked up by the microphone 27.

Next, the processing operation [2] performed by the voice processing unit 25 in the speech recognition device 21 according to Embodiment (2) will be described with reference to the flowchart shown in FIG. 6. This processing operation [2] is performed in response to an activation request from the navigation device 7 or to the ignition switch being turned ON.

First, it is determined whether a voice has been input through the microphones 26 and 27 (for example, whether there has been a voice input at or above a certain volume) (step S21). If it is determined that there has been a voice input, the sound source direction is identified from the voices input through the microphones 26 and 27 (step S22), and it is determined whether the speaker is the driver, that is, the voice input target person (step S23). If, on the other hand, it is determined that there has been no voice input, the process returns directly to step S21.

When the volume input to the microphone 26 is higher than the volume input to the microphone 27, the sound source direction can be determined to be the direction of the driver's seat 9. Conversely, when the volume input to the microphone 26 is lower than the volume input to the microphone 27, the sound source direction can be determined to be the direction of the passenger seat 10.
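The volume comparison described above can be sketched as follows. This is only an illustration: the use of an RMS level as the measure of "volume", the function names, and the "undetermined" result for equal levels are assumptions, since the description does not say how volume is measured or how ties are handled.

```python
import math


def rms(samples):
    """Root-mean-square level of one frame of samples (assumed volume measure)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def sound_source_direction(mic26_frame, mic27_frame):
    """Compare the levels at microphone 26 (driver side) and 27 (passenger side)."""
    level_26 = rms(mic26_frame)
    level_27 = rms(mic27_frame)
    if level_26 > level_27:
        return "driver"        # louder at microphone 26: driver's seat 9
    if level_26 < level_27:
        return "passenger"     # louder at microphone 27: passenger seat 10
    return "undetermined"      # equal levels: not covered by the description
```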

If it is determined in step S23 that the speaker is the driver, that is, that one of the speech recognition start conditions is satisfied, it is next determined whether the voice input through the microphones 26 and 27 has ended (for example, whether voice input at or above a certain volume has ceased) (step S24). If, on the other hand, it is determined that the speaker is not the driver, the process returns directly to step S21.

If it is determined in step S24 that the voice input has ended (that is, the driver has finished speaking), it is next determined whether there has been a voice input through the microphones 26 and 27 (that is, whether another voice input has newly occurred) (step S25).
If it is determined that there has been a voice input, the sound source direction is identified from the voices input through the microphones 26 and 27 (step S26), and it is determined whether the speaker is the passenger (a person other than the voice input target person) (step S27).

If it is determined that the speaker is not the passenger, it is next determined whether 2 seconds have elapsed since the driver finished speaking (step S28). If it is determined in step S25 that there has been no voice input, the process likewise proceeds to step S28 and the same determination is made. Here, 2 seconds are determined to have elapsed since the driver finished speaking only when the passenger has not spoken during those 2 seconds, in other words, when the passenger has not spoken immediately after the driver.
When the passenger speaks immediately after the driver, it is highly likely that the two are having a conversation. Put another way, a driver's utterance that is not immediately followed by an utterance from the passenger is likely not conversation but an expression of intent to operate the navigation device 7 by voice.

If it is determined in step S28 that 2 seconds have elapsed since the driver finished speaking, that is, that one of the speech recognition start conditions is satisfied, the audio signal is next read from the buffer memory 24 and recognition processing is performed on it (step S29), and a signal corresponding to the voice command obtained through the recognition processing is transmitted to the navigation device 7 via in-vehicle communication (step S30).

Thereafter, it is determined whether the speech recognition end condition is satisfied (step S31). If it is satisfied, the process returns to step S21; if it is not, the process returns to step S29 and speech recognition continues. One example of a speech recognition end condition is that no voice input is detected for a certain period of time or longer. If, on the other hand, it is determined in step S28 that 2 seconds have not elapsed since the driver finished speaking, the process returns to step S25.

If it is determined in step S27 that the speaker is the passenger, that is, that the passenger spoke before 2 seconds had elapsed since the driver finished speaking, the driver's utterance is likely part of a conversation between the two (that is, it is unlikely to be an expression of intent to operate the navigation device 7 by voice). The speech recognition start condition is therefore treated as not satisfied, and the process returns to step S21.
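The start-condition judgment of steps S21 to S28 can be sketched, in simplified form, as a pass over a log of already-labelled utterances. This is not the real-time loop of the flowchart: the `Utterance` record, the function name, and the assumption that utterances do not overlap are all introduced here for illustration. Only the 2-second window comes from the description.

```python
from dataclasses import dataclass

WINDOW_SECONDS = 2.0  # wait after the driver stops speaking (from the description)


@dataclass
class Utterance:
    speaker: str   # "driver" or "passenger" (result of steps S22/S26)
    start: float   # start time in seconds
    end: float     # end time in seconds


def find_command_utterance(utterances, window=WINDOW_SECONDS):
    """Return the first driver utterance that meets the start condition, or None."""
    ordered = sorted(utterances, key=lambda u: u.start)
    for i, utt in enumerate(ordered):
        if utt.speaker != "driver":       # S23: not the driver, keep waiting
            continue
        deadline = utt.end + window       # S28: 2 s after the driver finishes
        passenger_replied = any(          # S25 to S27: passenger speaks in time?
            later.speaker == "passenger" and later.start < deadline
            for later in ordered[i + 1:]
        )
        if not passenger_replied:
            return utt                    # start condition met, go on to S29
    return None                           # every driver utterance looked like talk
```

For example, a driver utterance from 0 s to 1 s followed by a passenger utterance starting at 1.5 s is rejected, while the same driver utterance followed by silence is accepted.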

FIG. 7 is a block diagram schematically showing the main part of a navigation system employing the speech recognition device according to Embodiment (3). In the figure, reference numeral 31 denotes the speech recognition device. The speech recognition device 31 comprises an A/D converter 32 that converts the audio signal from the microphone 36 into a digital signal, a FIFO-type buffer memory 33 that stores several seconds' worth of the (digitized) audio signal obtained from the microphone 36, a voice processing unit 34 having a CPU, ROM, RAM, and the like, and an image processing unit 35 that has a CPU, ROM, RAM, and the like and processes image data from a CCD camera 37.

The voice processing unit 34 has, among others, a function for performing speech recognition on the audio signal from the microphone 36. A signal corresponding to the voice command obtained through recognition in the voice processing unit 34 is transmitted to the navigation device 7 via in-vehicle communication. The image processing unit 35 has, among others, a function for monitoring the appearance of the driver and the passenger (in particular, the movement of their lips) based on the image data obtained from the CCD camera 37 and for identifying whether the speaker is the driver, who is the voice input target person, or the passenger.

The microphone 36 is directional. As shown in FIG. 8, it is installed in front of the driver's seat 9 of the vehicle 8 with its directivity facing the driver's seat 9, so that speech uttered by the driver can be properly picked up. The CCD camera 37 is installed at the front center between the driver's seat 9 and the passenger seat 10, so that the appearance of the driver and the passenger can be captured.

Next, the processing operation [3] performed by the voice processing unit 34 in the speech recognition device 31 according to Embodiment (3) will be described with reference to the flowchart shown in FIG. 9. This processing operation [3] is performed in response to an activation request from the navigation device 7 or to the ignition switch being turned ON.

First, it is determined whether a voice has been input through the microphone 36 (for example, whether there has been a voice input at or above a certain volume) (step S41). If it is determined that there has been a voice input, the image processing unit 35 is requested to transmit speaker identification information (step S42), and based on the speaker identification information sent from the image processing unit 35, it is determined whether the speaker is the driver, that is, the voice input target person (step S43). If, on the other hand, it is determined that there has been no voice input, the process returns directly to step S41.

If it is determined in step S43 that the speaker is the driver, that is, that one of the speech recognition start conditions is satisfied, it is next determined whether the voice input through the microphone 36 has ended (for example, whether voice input at or above a certain volume has ceased) (step S44). If, on the other hand, it is determined that the speaker is not the driver, the process returns directly to step S41.

If it is determined in step S44 that the voice input has ended (that is, the driver has finished speaking), it is next determined whether there has been a voice input through the microphone 36 (that is, whether another voice input has newly occurred) (step S45).
If it is determined that there has been a voice input, the image processing unit 35 is requested to transmit speaker identification information (step S46), and based on the speaker identification information sent from the image processing unit 35, it is determined whether the speaker is the passenger (a person other than the voice input target person) (step S47).

If it is determined that the speaker is not the passenger, it is next determined whether 2 seconds have elapsed since the driver finished speaking (step S48). If it is determined in step S45 that there has been no voice input, the process likewise proceeds to step S48 and the same determination is made. Here, 2 seconds are determined to have elapsed since the driver finished speaking only when the passenger has not spoken during those 2 seconds, in other words, when the passenger has not spoken immediately after the driver.
When the passenger speaks immediately after the driver, it is highly likely that the two are having a conversation. Put another way, a driver's utterance that is not immediately followed by an utterance from the passenger is likely not conversation but an expression of intent to operate the navigation device 7 by voice.

If it is determined in step S48 that 2 seconds have elapsed since the driver finished speaking, that is, that one of the speech recognition start conditions is satisfied, the audio signal is next read from the buffer memory 33 and recognition processing is performed on it (step S49), and a signal corresponding to the voice command obtained through the recognition processing is transmitted to the navigation device 7 via in-vehicle communication (step S50).

Thereafter, it is determined whether the speech recognition end condition is satisfied (step S51). If it is satisfied, the process returns to step S41; if it is not, the process returns to step S49 and speech recognition continues. One example of a speech recognition end condition is that no voice input is detected for a certain period of time or longer. If, on the other hand, it is determined in step S48 that 2 seconds have not elapsed since the driver finished speaking, the process returns to step S45.

If it is determined in step S47 that the speaker is the passenger, that is, that the passenger spoke before 2 seconds had elapsed since the driver finished speaking, the driver's utterance is likely part of a conversation between the two (that is, it is unlikely to be an expression of intent to operate the navigation device 7 by voice). The speech recognition start condition is therefore treated as not satisfied, and the process returns to step S41.

In the speech recognition devices according to Embodiments (1) to (3) above, when the passenger speaks immediately after the driver, who is the voice input target person (a "Y" determination in step S7, S27, or S47), the speech recognition start condition is treated as not satisfied and the process returns to step S1, S21, or S41. In a speech recognition device according to another embodiment, however, a hold period (for example, 10 seconds) may be provided: as shown in FIG. 10, for example, when the passenger speaks immediately after the driver, the process proceeds to step S7A and is put on hold. Such a hold period is provided to prevent speech recognition from being started by conversation in cases such as the following.

Even if the passenger does not speak immediately after the driver, if the two had been conversing until then, the driver's utterance at that point is likely a response to the passenger rather than something addressed to the navigation device 7. For example, the following case is conceivable.
1. The driver speaks.
2. The passenger speaks within 2 seconds of utterance 1.
3. The driver speaks within 2 seconds of utterance 2.
4. The passenger does not speak even after 2 seconds have elapsed since utterance 3.
The driver's utterance in 3 above is likely a response to the passenger rather than something addressed to the navigation device 7.
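The effect of the hold period on the four-step case above can be sketched as follows. The `Utterance` record and the function name are hypothetical, utterances are assumed not to overlap, and the choice to start the hold period at the moment the passenger begins to reply is an assumption: the description gives the example length of 10 seconds but does not state when the period begins.

```python
from dataclasses import dataclass

WINDOW_SECONDS = 2.0   # wait after the driver stops speaking (from the description)
HOLD_SECONDS = 10.0    # example hold period given in the description


@dataclass
class Utterance:
    speaker: str   # "driver" or "passenger"
    start: float   # start time in seconds
    end: float     # end time in seconds


def command_utterances(utterances, window=WINDOW_SECONDS, hold=HOLD_SECONDS):
    """Return the driver utterances that would start speech recognition."""
    ordered = sorted(utterances, key=lambda u: u.start)
    hold_until = float("-inf")
    accepted = []
    for i, utt in enumerate(ordered):
        if utt.speaker != "driver":
            continue
        if utt.start < hold_until:
            continue  # inside the hold period: the start condition cannot be met
        deadline = utt.end + window
        reply = next(
            (later for later in ordered[i + 1:]
             if later.speaker == "passenger" and later.start < deadline),
            None,
        )
        if reply is None:
            accepted.append(utt)             # no reply in time: treat as a command
        else:
            hold_until = reply.start + hold  # assumed start point of the hold period
    return accepted
```

With `hold=0.0` the sketch behaves like Embodiments (1) to (3) and accepts utterance 3 of the case above; with the 10-second hold period, utterance 3 falls inside the hold period and is ignored.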

FIG. 1 is a block diagram schematically showing the main part of a navigation system employing the speech recognition device according to Embodiment (1) of the present invention.
FIG. 2 is an explanatory diagram showing where the microphone is installed.
FIG. 3 is a flowchart showing the processing operation performed by the voice processing unit in the speech recognition device according to Embodiment (1).
FIG. 4 is a block diagram schematically showing the main part of a navigation system employing the speech recognition device according to Embodiment (2).
FIG. 5 is an explanatory diagram showing where the microphones are installed.
FIG. 6 is a flowchart showing the processing operation performed by the voice processing unit in the speech recognition device according to Embodiment (2).
FIG. 7 is a block diagram schematically showing the main part of a navigation system employing the speech recognition device according to Embodiment (3).
FIG. 8 is an explanatory diagram showing where the microphone and the CCD camera are installed.
FIG. 9 is a flowchart showing the processing operation performed by the voice processing unit in the speech recognition device according to Embodiment (3).
FIG. 10 is a flowchart showing the processing operation performed by the voice processing unit in a speech recognition device according to another embodiment.

Explanation of symbols

1, 21, 31 Speech recognition device
2, 22, 23, 32 A/D converter
3, 24, 33 Buffer memory
4 EEPROM
5, 25, 34 Voice processing unit
6, 26, 27, 36 Microphone
7 Navigation device
35 Image processing unit
37 CCD camera

Claims

A speech recognition apparatus that recognizes speech uttered by a voice input target person, comprising:
first determination means for determining the presence or absence of an utterance by the voice input target person;
second determination means for determining the presence or absence of an utterance by a person other than the voice input target person; and
condition establishment determination means for determining whether a speech recognition start condition for starting speech recognition is satisfied,
wherein the speech recognition start condition includes that the first determination means determines that there is an utterance by the voice input target person, and
that the second determination means does not determine that there is an utterance by a person other than the voice input target person until a predetermined period has elapsed since the utterance by the voice input target person.

The speech recognition apparatus according to claim 1, further comprising speaker determination means for determining, based on individual characteristic information contained in the voice input through voice input means, whether the speaker is the voice input target person,
wherein the determination by the first determination means and the determination by the second determination means are made based on the determination result of the speaker determination means.

The speech recognition apparatus according to claim 1, further comprising speaker determination means for determining, based on the sound source direction obtained from the voice input through voice input means, whether the speaker is the voice input target person,
wherein the determination by the first determination means and the determination by the second determination means are made based on the determination result of the speaker determination means.

The speech recognition apparatus according to claim 1, further comprising speaker determination means for determining whether the speaker is the voice input target person, based on the appearance, obtained from image input means, of the voice input target person, of a person other than the voice input target person, or of both the voice input target person and a person other than the voice input target person,
wherein the determination by the first determination means and the determination by the second determination means are made based on the determination result of the speaker determination means.

The speech recognition apparatus according to claim 1, further comprising hold period setting means for setting a hold period during which the speech recognition start condition is not satisfied,
the hold period being set when the second determination means determines that a person other than the voice input target person has spoken before the predetermined period has elapsed since the utterance by the voice input target person.