JP2023113171A

JP2023113171A - Voice processing unit, voice processing method, voice processing program and voice processing system

Info

Publication number: JP2023113171A
Application number: JP2022015324A
Authority: JP
Inventors: 智史山梨; Tomohito Yamanashi; 南生也持木; Naoya Mochiki; 裕番場; Yutaka Banba
Original assignee: Panasonic Intellectual Property Management Co Ltd
Current assignee: Panasonic Intellectual Property Management Co Ltd
Priority date: 2022-02-03
Filing date: 2022-02-03
Publication date: 2023-08-16
Also published as: WO2023149015A1

Abstract

To suppress erroneous detection of voice recognition. SOLUTION: A voice processing unit 10 includes: a voice acquisition part 20; a determination part 22; a voice processing part 24; and a switching part 26. The voice acquisition part 20 acquires a voice signal from a microphone MC for collecting voice in space. The determination part 22 determines whether or not a level of a reference signal which is a reproduction signal reproduced from a speaker SP for emitting sound in the space is equal to or higher than a threshold value. The voice processing part 24 outputs a removal signal obtained by removing a voice component of the reference signal from the voice signal, to a voice recognition part 40 as an output signal. When the level of the reference signal is determined to be equal to or higher than the threshold value, the switching part 26 outputs a replacement signal which is at least one of comfort noise and a mute signal, to the voice recognition part 40 as the output signal instead of the removal signal. SELECTED DRAWING: Figure 3

Description

The present disclosure relates to a speech processing device, a speech processing method, a speech processing program, and a speech processing system.

Speech processing systems that process speech recognition commands based on speech uttered by a talker are known. For example, a configuration has been disclosed in which a first speech recognition unit recognizes the sound picked up by a microphone and a second speech recognition unit recognizes the sound output from a loudspeaker, and recognition by the first speech recognition unit is stopped when the speech recognized by the second speech recognition unit contains a speech recognition command (see, for example, Patent Document 1).

Patent Document 1: Japanese Patent No. 6225920

However, in the prior art, erroneous detection in speech recognition could occur when the sound picked up by the microphone contained noise components, such as residual echo components that the echo canceller could not completely remove. That is, in the prior art it was sometimes difficult to suppress erroneous detection in speech recognition.

An object of the present disclosure is to provide a speech processing device, a speech processing method, a speech processing program, and a speech processing system capable of suppressing erroneous detection in speech recognition.

A speech processing device according to one aspect of the present disclosure includes an audio acquisition unit, a determination unit, a speech processing unit, and a switching unit. The audio acquisition unit acquires an audio signal from a microphone that picks up sound in a space. The determination unit determines whether or not the level of a reference signal, which is a reproduction signal reproduced from a speaker that emits sound into the space, is equal to or higher than a threshold. The speech processing unit outputs, to a speech recognition unit as an output signal, a removal signal obtained by removing the audio component of the reference signal from the audio signal. When the level of the reference signal is determined to be equal to or higher than the threshold, the switching unit outputs, to the speech recognition unit as the output signal, a replacement signal that is at least one of comfort noise and a mute signal, in place of the removal signal.

According to the present disclosure, erroneous detection in speech recognition can be suppressed.

FIG. 1 is a diagram showing an example of a schematic configuration of the speech processing system 1 of this embodiment.
FIG. 2 is a hardware configuration diagram of an example of the speech processing device.
FIG. 3 is a block diagram showing an example of the configuration of the speech processing device.
FIG. 4 is a flowchart showing an example of the flow of information processing executed by the speech processing device 10 of this embodiment.

Hereinafter, embodiments of the present disclosure will be described in detail with reference to the drawings as appropriate. Unnecessarily detailed description may, however, be omitted. The accompanying drawings and the following description are provided so that those skilled in the art can fully understand the present disclosure, and are not intended to limit the subject matter recited in the claims.

FIG. 1 is a diagram showing an example of a schematic configuration of the speech processing system 1 of this embodiment.

The speech processing system 1 is a system for recognizing speech in a space. In this embodiment, a case where the space is the interior of the cabin of a vehicle 2 is described as an example, as is a form in which the speech processing system 1 is mounted in the vehicle 2. The space is not limited to the cabin of the vehicle 2.

The speech processing system 1 includes a microphone MC, speakers SP, a speech processing device 10, a sound source device 30, a speech recognition unit 40, an electronic device 50, and a display 60. The microphone MC, the speakers SP, the speech recognition unit 40, and the display 60 are communicably connected to the speech processing device 10. The speech processing system 1 need only include at least the microphone MC, the speaker SP, the speech processing device 10, and the speech recognition unit 40.

The microphone MC picks up sound in the space. In this embodiment, the microphone MC picks up at least the sound in the cabin of the vehicle 2. A form in which the microphone MC is provided near the driver's seat, which is the seat of the driver hm1 of the vehicle 2, is described as an example. In this embodiment, therefore, the microphone MC picks up sound that includes at least the speech component uttered by the driver hm1.

The vehicle 2 may be provided with a plurality of microphones MC. In that case, the microphones MC are preferably arranged at mutually different positions in the cabin of the vehicle 2. Specifically, for example, a microphone MC may be arranged near the seat of each of the driver hm1, the passenger hm2, the passenger hm3, and the passenger hm3 of the vehicle 2. In this embodiment, a form in which the vehicle 2 is provided with one microphone MC is described as an example.

The microphone MC may be either a directional microphone or an omnidirectional microphone. The microphone MC may be either a small MEMS (Micro Electro Mechanical Systems) microphone or an ECM (Electret Condenser Microphone). The microphone MC may also be a microphone capable of beamforming. For example, the microphone MC may be a microphone array that has directivity in a specific direction and can pick up sound from that direction.

The microphone MC outputs an audio signal of the picked-up sound to the speech processing device 10. The speech processing device 10 is provided in association with the microphone MC. Accordingly, when the speech processing system 1 includes a plurality of microphones MC, the speech processing system 1 may include a plurality of speech processing devices 10, one for each of the microphones MC. In this embodiment, a form in which the speech processing system 1 includes one microphone MC and one speech processing device 10 communicably connected to that microphone MC is described as an example.

The speaker SP emits sound into the same space as the space from which the microphone MC picks up sound. In this embodiment, the speaker SP emits sound at least into the cabin of the vehicle 2.

In this embodiment, a form in which four speakers SP (speakers SP1 to SP4) are arranged in the cabin of the vehicle 2 is described as an example. The speech processing system 1 need only include at least one speaker SP, and the number and positions of the speakers SP are not limited. In this embodiment, the speakers SP1, SP2, SP3, and SP4 are arranged near the seats of the driver hm1, the passenger hm2, the passenger hm3, and the passenger hm3, respectively, in the cabin of the vehicle 2. When the speakers SP1 to SP4 are described collectively, they are simply referred to as the speaker SP.

The speaker SP is electrically connected to the sound source device 30 and emits the sound represented by the reproduction signal received from the sound source device 30. The reproduction signal is the signal output from the sound source device 30 to the speaker SP. Specifically, the speaker SP emits sound at a volume corresponding to the level of the reproduction signal received from the sound source device 30. That is, in this embodiment, "level" means the level of a signal, and more specifically the loudness of the sound represented by the signal.

The sound source device 30 is, for example, a radio receiver, a television broadcast device, or an audio device. A radio receiver receives a radio broadcast signal, generates a reproduction signal from it, and outputs the reproduction signal to the speaker SP; in this case, the reproduction signal is, for example, a radio audio signal. A television broadcast device receives a television broadcast signal, generates a reproduction signal from it, and outputs the reproduction signal to the speaker SP; in this case, the reproduction signal is, for example, a television audio signal. An audio device outputs a reproduction signal, such as an audio signal recorded in a memory or the like, to the speaker SP; in this case, the reproduction signal is, for example, an audio signal.

In this embodiment, the sound source device 30 generates four channels of reproduction signals in order to drive the four speakers SP (speakers SP1 to SP4), and outputs them as reference signals to the respective speakers SP. Specifically, the sound source device 30 outputs reference signal 1 to the speaker SP1, reference signal 2 to the speaker SP2, reference signal 3 to the speaker SP3, and reference signal 4 to the speaker SP4, each reference signal being a reproduction signal. Reference signals 1 to 4 are thus the reproduction signals output to the respective speakers SP. When reference signals 1 to 4 are described collectively, they are simply referred to as the reference signal.

The speech processing device 10 outputs, to the speech recognition unit 40, an output signal based on the audio signal received from the microphone MC and on the reference signal, which is the reproduction signal reproduced from the speaker SP. Details of the speech processing device 10 are described later.

The speech recognition unit 40 recognizes the speech represented by the output signal received from the speech processing device 10, and outputs a signal representing the speech recognition result to the electronic device 50. For example, the speech recognition unit 40 recognizes a voice command represented by the output signal and outputs it to the electronic device 50. A voice command is a signal for causing the electronic device 50 to execute various kinds of processing. A voice command is sometimes called a speech recognition command, a keyword, a wake-up word, or the like.

The electronic device 50 executes processing according to the voice command, which is the signal representing the speech recognition result received from the speech recognition unit 40. For example, based on the voice command, the electronic device 50 executes processing for opening and closing a window, processing related to driving the vehicle 2, processing for changing the temperature of the air conditioner, processing for changing the volume of an audio device, and the like. The electronic device 50 is, for example, a car navigation device, an air conditioner, a panel meter, a television, a mobile terminal, or a drive device that drives a part of the vehicle 2.

The display 60 is a display device that displays various kinds of information. The display 60 is, for example, any of the various displays provided in the vehicle 2, a head-up display, the display of a car navigation system, a multi-information display provided within the meter cluster of the vehicle 2, or a center display capable of accepting audio operations and the like. In this embodiment, information is displayed on the display 60 by the speech processing device 10, as described later. The display 60 may function as an example of the electronic device 50.

The speech processing device 10 will now be described in detail. First, an example of the hardware configuration of the speech processing device 10 is described.

FIG. 2 is a hardware configuration diagram of an example of the speech processing device 10.

In the speech processing device 10, a CPU (Central Processing Unit) 11A, a ROM (Read Only Memory) 11B, a RAM 11C, an I/F 11D, and the like are interconnected by a bus 11E, forming a hardware configuration that uses an ordinary computer.

The CPU 11A is an arithmetic device that controls the speech processing device 10 of this embodiment. The ROM 11B stores programs and the like that implement the various kinds of processing performed by the CPU 11A. The RAM 11C stores data required for the various kinds of processing performed by the CPU 11A. The I/F 11D is an interface for transmitting and receiving data.

The program for executing the information processing performed by the speech processing device 10 of this embodiment is provided pre-installed in the ROM 11B or the like. Alternatively, the program executed by the speech processing device 10 of this embodiment may be provided as a file in a format installable on, or executable by, the speech processing device 10, recorded on a computer-readable recording medium such as a CD-ROM, a flexible disk (FD), a CD-R, or a DVD (Digital Versatile Disk).

Next, the configuration of the speech processing device 10 is described in detail.

FIG. 3 is a block diagram showing an example of the configuration of the speech processing device 10. For the sake of explanation, FIG. 3 shows the microphone MC, the sound source device 30, the speech recognition unit 40, the electronic device 50, and the display 60 in addition to the speech processing device 10.

The speech processing device 10 includes an audio acquisition unit 20, a determination unit 22, a speech processing unit 24, a switching unit 26, a generation unit 28, and an output control unit 29.

Some or all of the audio acquisition unit 20, the determination unit 22, the speech processing unit 24, the switching unit 26, the generation unit 28, and the output control unit 29 may be implemented in software, that is, by causing a processing device such as the CPU 11A to execute a program; in hardware such as an IC (Integrated Circuit); or by a combination of software and hardware. At least one of the audio acquisition unit 20, the determination unit 22, the speech processing unit 24, the switching unit 26, the generation unit 28, and the output control unit 29 may instead be installed in an external information processing device communicably connected to the speech processing device 10 via a network or the like.

The audio acquisition unit 20 acquires the audio signal from the microphone MC and outputs the acquired audio signal to the speech processing unit 24.

The determination unit 22 determines whether or not the level of the reference signal, which is the reproduction signal reproduced from the speaker SP, is equal to or higher than a threshold. The level of the reference signal represents the loudness of the sound represented by the reproduction signal serving as the reference signal. As described above, the speaker SP emits sound at a volume corresponding to the level of the reproduction signal received from the sound source device 30. Therefore, the higher the level of the reference signal, the louder the sound emitted from the speaker SP.
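The description does not specify how the level of the reference signal is measured. As a non-limiting sketch, one common choice would be the RMS level of a short frame of samples expressed in dB relative to full scale. The function name `frame_level_dbfs`, the floor value, and the frame length below are assumptions made only for illustration.

```python
import math


def frame_level_dbfs(frame, floor_db=-120.0):
    """Return the RMS level of one frame in dB relative to full scale (1.0).

    Assumption: samples are floats in [-1.0, 1.0]; the patent text does not
    prescribe this (or any) particular level measure.
    """
    if not frame:
        return floor_db
    mean_square = sum(s * s for s in frame) / len(frame)
    if mean_square <= 0.0:
        return floor_db
    return max(floor_db, 10.0 * math.log10(mean_square))


# Two hypothetical 20 ms frames at 16 kHz: a quiet tone and a loud tone.
quiet = [0.01 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(320)]
loud = [0.9 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(320)]
quiet_db = frame_level_dbfs(quiet)
loud_db = frame_level_dbfs(loud)
```

A per-frame measure of this kind would let the determination be refreshed every few milliseconds; a peak measure could be substituted if distortion onset is better predicted by peaks than by RMS.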

The threshold may be set in advance as follows. The level of the reproduction signal is gradually increased until the sound emitted from the speaker SP in response to that signal begins to distort; the threshold is then set to a value that is equal to or lower than, and close to, the reproduction signal level at that point. The threshold may also be set to a value equal to the reproduction signal level at which the distortion begins. Distortion of the sound emitted from the speaker SP is sometimes called clipping.

For example, the determination unit 22 sets a threshold satisfying the above conditions for each of the speakers SP1 to SP4.

The determination unit 22 then determines whether or not at least one of the levels of reference signals 1 to 4, received for the speakers SP1 to SP4, is equal to or higher than the threshold corresponding to the respective speaker SP1 to SP4.
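The per-speaker determination can be sketched as below, as an illustration only. The function name `any_reference_at_or_above` and the threshold values are hypothetical; the levels are assumed to be expressed in the same unit as the thresholds (here dB).

```python
def any_reference_at_or_above(levels, thresholds):
    """Return True if at least one reference level is >= its speaker's threshold."""
    if len(levels) != len(thresholds):
        raise ValueError("one threshold is required per reference signal")
    return any(level >= limit for level, limit in zip(levels, thresholds))


# Hypothetical thresholds for speakers SP1 to SP4, in dB.
thresholds_db = [-6.0, -6.0, -9.0, -9.0]

# Reference signal 3 sits exactly at its threshold, so the result is True.
triggered = any_reference_at_or_above([-20.0, -20.0, -9.0, -30.0], thresholds_db)
# All four reference signals are below their thresholds, so the result is False.
not_triggered = any_reference_at_or_above([-20.0, -20.0, -10.0, -30.0], thresholds_db)
```

The comparison uses `>=` because the text specifies "equal to or higher than" the threshold.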

Alternatively, the determination unit 22 may set the minimum, the average, or the maximum of the thresholds satisfying the above conditions for the individual speakers SP1 to SP4 as a threshold common to the speakers SP1 to SP4. The determination unit 22 may then determine whether or not at least one of the levels of reference signals 1 to 4, received for the speakers SP1 to SP4, is equal to or higher than that common threshold.
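As an illustrative sketch of the common-threshold variant, the helper below derives one threshold from the per-speaker values. The name `common_threshold`, the `mode` parameter, and the numeric values are assumptions.

```python
from statistics import mean


def common_threshold(per_speaker_thresholds, mode="min"):
    """Collapse per-speaker thresholds into one shared value.

    mode is "min", "mean", or "max", matching the minimum, average, and
    maximum options named in the description.
    """
    pick = {"min": min, "mean": mean, "max": max}
    if mode not in pick:
        raise ValueError("mode must be 'min', 'mean', or 'max'")
    return pick[mode](per_speaker_thresholds)


thresholds_db = [-6.0, -6.0, -9.0, -9.0]  # hypothetical values for SP1 to SP4
lowest = common_threshold(thresholds_db, "min")
average = common_threshold(thresholds_db, "mean")
highest = common_threshold(thresholds_db, "max")
```

Choosing the minimum makes the switch to the replacement signal happen soonest, while the maximum tolerates distortion on the more sensitive speakers before switching; which trade-off is appropriate would depend on the installation.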

In this embodiment, a form in which the determination unit 22 determines whether or not at least one of the levels of reference signals 1 to 4, received for the speakers SP1 to SP4, is equal to or higher than the threshold corresponding to the respective speaker SP1 to SP4 is described as an example.

The thresholds corresponding to the speakers SP1 to SP4 may be stored in advance in a memory or the like of the determination unit 22. The thresholds may also be made changeable as appropriate, within the range satisfying the above conditions, by a user operation instruction or the like, according to the type, installation position, and so on of the speakers SP provided in the speech processing system 1.

The speech processing unit 24 generates a removal signal by removing the audio component of the reference signal from the audio signal received from the audio acquisition unit 20.

That is, the speech processing unit 24 removes the audio component of the reference signal, which is the reproduction signal, contained in the audio signal received from the audio acquisition unit 20. The speech processing unit 24 may remove this component using at least one of a known echo canceller scheme and a known crosstalk canceller scheme.

For example, the speech processing unit 24 includes adaptive filters F, an adaptive filter control unit 24A, and a subtraction unit 24B.

The adaptive filter F is a filter having a function of changing the characteristics of the reference signal. In this embodiment, the adaptive filter F includes adaptive filters F1 to F4. The number of adaptive filters F is set as appropriate based on, for example, the number of input reference signals.

The adaptive filter control unit 24A sets the filter coefficients of each of the adaptive filters F1 to F4 by a known method, according to the removal signal output from the subtraction unit 24B. Each of the adaptive filters F1 to F4 outputs, toward the subtraction unit 24B, a pass signal based on the reference signal it receives (reference signals 1 to 4, respectively) and on its set filter coefficients. The subtraction unit 24B therefore receives, as a subtraction signal, the sum of the pass signals output from the adaptive filters F1 to F4.

The subtraction unit 24B executes removal processing that removes the audio component of the reference signal from the audio signal, by subtracting the subtraction signal from the audio signal received from the audio acquisition unit 20. The subtraction unit 24B outputs the removal signal obtained by this processing, that is, the audio signal with the audio component of the reference signal removed, to the adaptive filter control unit 24A and the switching unit 26.
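The structure described above (per-channel adaptive filters, a summed subtraction signal, and coefficient updates driven by the removal signal) can be sketched as follows. The description only says the coefficients are set "by a known method"; the normalized least-mean-squares (NLMS) update used here is an assumption chosen for illustration, as are the class name `MultiChannelNlms`, the tap count, the step size, and the synthetic two-channel echo path.

```python
import random


class MultiChannelNlms:
    """Adaptive filters F1..Fn plus subtraction, with an assumed NLMS update."""

    def __init__(self, channels, taps=4, step=0.5, eps=1e-8):
        self.weights = [[0.0] * taps for _ in range(channels)]  # filter coefficients
        self.history = [[0.0] * taps for _ in range(channels)]  # recent reference samples
        self.step = step
        self.eps = eps

    def process(self, mic_sample, ref_samples):
        # Shift the newest reference sample into each channel's history.
        for hist, x in zip(self.history, ref_samples):
            hist.insert(0, x)
            hist.pop()
        # Subtraction signal: sum of the pass signals of all adaptive filters.
        subtraction = sum(
            w * x
            for weights, hist in zip(self.weights, self.history)
            for w, x in zip(weights, hist)
        )
        # Removal signal: microphone signal minus the subtraction signal.
        removal = mic_sample - subtraction
        # NLMS coefficient update driven by the removal signal.
        power = sum(x * x for hist in self.history for x in hist) + self.eps
        gain = self.step * removal / power
        for weights, hist in zip(self.weights, self.history):
            for i, x in enumerate(hist):
                weights[i] += gain * x
        return removal


# Synthetic check with two reference channels and an echo-only microphone signal.
rng = random.Random(0)
canceller = MultiChannelNlms(channels=2)
prev_ref1 = 0.0
mic_energy = 0.0
residual_energy = 0.0
for n in range(2000):
    ref1, ref2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
    mic = 0.5 * prev_ref1 + 0.3 * ref2  # hypothetical echo path
    out = canceller.process(mic, [ref1, ref2])
    prev_ref1 = ref1
    if n >= 1800:  # measure only after the filters have had time to adapt
        mic_energy += mic * mic
        residual_energy += out * out
residual_ratio = residual_energy / mic_energy
```

In this idealized, noise-free setting the residual becomes negligible. A linear filter of this kind cannot model loudspeaker distortion, which is consistent with the residual echo problem the disclosure addresses when the reference level is high.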

When the level of the reference signal is determined to be equal to or higher than the threshold, the switching unit 26 outputs, to the speech recognition unit 40 as the output signal, a replacement signal that is at least one of comfort noise and a mute signal, in place of the removal signal received from the speech processing unit 24.

Specifically, when the determination unit 22 determines that the level of the reference signal is equal to or higher than the threshold, the switching unit 26 switches its output so that the replacement signal received from the generation unit 28, rather than the removal signal received from the speech processing unit 24, is output to the speech recognition unit 40.

The generation unit 28 generates a replacement signal that is at least one of comfort noise and a mute signal, and outputs it to the switching unit 26. The mute signal is a signal whose sound level is "0". In other words, the mute signal is a signal representing a silent state, a muted state, or no signal (MUTE).

When generating comfort noise as the replacement signal, the generation unit 28 preferably generates comfort noise at a level corresponding to the noise level contained in the audio signal immediately before the determination unit 22 determines that the level is equal to or higher than the threshold. For example, the audio acquisition unit 20 outputs the audio signal acquired from the microphone MC to both the speech processing unit 24 and the generation unit 28. The generation unit 28 identifies, by a known method, the noise level contained in the audio signal received from the audio acquisition unit 20 at the timing immediately before the determination unit 22 determines that the level is equal to or higher than the threshold. The generation unit 28 then generates comfort noise at a level corresponding to the identified noise level, for example comfort noise at the same level, that is, representing the same volume, as the identified noise level.

Because the generation unit 28 generates, as the replacement signal, comfort noise at a level corresponding to the noise level contained in the audio signal immediately before the level is determined to be equal to or higher than the threshold, abrupt changes in the level of the output signal sent to the speech recognition unit 40 are suppressed. For example, when the sound environment of the space changes with the driving environment of the vehicle 2, comfort noise at a level that follows that change is output to the speech recognition unit 40 as the replacement signal. Consequently, when the output signal sent to the speech recognition unit 40 switches from the replacement signal to the removal signal, or from the removal signal to the replacement signal, the level of the output signal does not change abruptly. Degradation of the speech recognition performance of the speech recognition unit 40 caused by abrupt changes in the output signal level can thus be suppressed.
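Level-matched comfort noise can be sketched as follows. The description leaves the noise-level estimator as "a known method"; here the quietest of a few recent frames is used as a crude stand-in, and white Gaussian noise is scaled to that RMS value. The function names, the frame length, and the synthetic cabin-noise frames are all assumptions for illustration.

```python
import math
import random


def rms(frame):
    """Root-mean-square value of a frame of samples (0.0 for an empty frame)."""
    return math.sqrt(sum(s * s for s in frame) / len(frame)) if frame else 0.0


def estimate_noise_level(recent_frames):
    """Crude stand-in for a noise estimator: RMS of the quietest recent frame."""
    return min(rms(frame) for frame in recent_frames)


def comfort_noise(target_rms, length, rng):
    """White Gaussian noise rescaled so that its RMS equals target_rms."""
    raw = [rng.gauss(0.0, 1.0) for _ in range(length)]
    raw_rms = rms(raw)
    scale = target_rms / raw_rms if raw_rms > 0.0 else 0.0
    return [scale * s for s in raw]


rng = random.Random(1)
# Hypothetical microphone frames captured just before the threshold was reached.
recent = [[rng.gauss(0.0, 0.02) for _ in range(160)] for _ in range(5)]
noise_level = estimate_noise_level(recent)
replacement = comfort_noise(noise_level, 160, rng)
```

Matching the RMS in this way keeps the level seen by the recognizer roughly continuous across the switch; a spectrally shaped noise could be substituted if the recognizer is sensitive to the noise colour.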

The generation unit 28 may also generate a replacement signal that contains both comfort noise and a mute signal and output it to the switching unit 26. For example, the generation unit 28 generates a replacement signal in which comfort noise and mute signals are arranged alternately. In this case, the generation unit 28 preferably adjusts the level of the output signal so that it changes gradually at each transition between the comfort noise and the mute signal.
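One way to picture the alternating arrangement with gradual transitions is a gain envelope that alternates between 1.0 (comfort noise) and 0.0 (mute) with a linear ramp at each boundary. The sketch below is an assumption for illustration only: the publication does not specify segment lengths, ramp shape, or how the level is adjusted, and `alternating_envelope` is a hypothetical helper.

```python
import random


def alternating_envelope(segment_len, ramp_len, n_segments):
    """Gain envelope alternating 1.0 / 0.0 per segment, with linear ramps."""
    env = []
    for i in range(n_segments):
        target = 1.0 if i % 2 == 0 else 0.0   # even: comfort noise, odd: mute
        start = env[-1] if env else target
        for k in range(segment_len):
            if k < ramp_len:
                # Move linearly from the previous segment's gain to this one's.
                env.append(start + (target - start) * (k + 1) / ramp_len)
            else:
                env.append(target)
    return env


rng = random.Random(0)
envelope = alternating_envelope(segment_len=4, ramp_len=2, n_segments=3)
noise = [0.01 * rng.gauss(0.0, 1.0) for _ in envelope]
# Replacement signal: comfort noise faded in and out by the envelope.
replacement = [g * s for g, s in zip(envelope, noise)]
```

With a ramp of `ramp_len` samples, the gain never moves by more than `1 / ramp_len` between consecutive samples, which is the gradual level change the paragraph asks for.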

Note that the generation unit 28 may generate the replacement signal at all times, but it preferably generates the replacement signal and outputs it to the switching unit 26 when the determination unit 22 determines that the level of the reference signal is equal to or higher than the threshold. When the determination unit 22 determines that the level of the reference signal is less than the threshold, the generation unit 28 may stop generating the replacement signal.

判定部２２によって参照信号のレベルが閾値未満と判定された場合、生成部２８が置換信号の生成処理を停止することで、音声処理装置１０の処理演算量の削減を図ることができる。 When the determination unit 22 determines that the level of the reference signal is less than the threshold value, the generation unit 28 stops the replacement signal generation process, thereby reducing the amount of processing calculation of the speech processing device 10 .

When the determination unit 22 determines that the level of the reference signal is equal to or higher than the threshold, the switching unit 26 outputs, as the output signal to the speech recognition unit 40, the replacement signal received from the generation unit 28 instead of the removal signal received from the audio processing unit 24. Consequently, whenever the reference signal level is determined to be equal to or higher than the threshold, the speech recognition unit 40 receives the replacement signal instead of the removal signal.

なお、切替部２６は、判定部２２によって参照信号のレベルが閾値以上と判定されている期間、除去信号に換えて置換信号を出力信号として音声認識部４０へ出力してよい。そして、切替部２６は、判定部２２によって参照信号のレベルが閾値未満と判定されている期間には、音声処理部２４から受付けた除去信号を出力信号として音声認識部４０へ出力してよい。 Note that the switching unit 26 may output the replacement signal as an output signal to the speech recognition unit 40 instead of the removal signal during the period in which the determination unit 22 determines that the level of the reference signal is equal to or higher than the threshold. Then, the switching unit 26 may output the removal signal received from the speech processing unit 24 to the speech recognition unit 40 as an output signal during the period when the determination unit 22 determines that the level of the reference signal is less than the threshold value.

この場合、参照信号のレベルが閾値以上である期間は、音声認識部４０には置換信号が出力信号として出力される。また、参照信号のレベルが閾値未満である期間は、音声認識部４０には除去信号が出力信号として出力される。 In this case, while the level of the reference signal is equal to or higher than the threshold, the replacement signal is output to the speech recognition section 40 as an output signal. Also, during a period in which the level of the reference signal is less than the threshold, the removal signal is output to the speech recognition section 40 as an output signal.
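The period-based switching just described amounts to a per-frame selection. The following minimal sketch assumes frame-wise processing and uses the RMS of the reference frame as its "level"; this section does not fix how the level is measured, and `level` and `select_output` are hypothetical names.

```python
import math


def level(frame):
    """Assumed level measure: RMS of the frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame)) if frame else 0.0


def select_output(reference_frame, removal_frame, replacement_frame, threshold):
    """Return the frame to pass to the speech recognizer."""
    if level(reference_frame) >= threshold:
        return replacement_frame   # reference level at or above the threshold
    return removal_frame           # reference level below the threshold


removal = [0.2, -0.1]
replacement = [0.0, 0.0]
loud = select_output([0.5, -0.5], removal, replacement, threshold=0.3)
quiet = select_output([0.1, -0.1], removal, replacement, threshold=0.3)
```

Note that the decision depends only on the reference (playback) signal, never on the microphone signal, so a loud talker alone does not trigger the replacement.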

Alternatively, when the level of the reference signal is determined to be equal to or higher than the threshold, the switching unit 26 may output the replacement signal, instead of the removal signal, as the output signal to the speech recognition unit 40 continuously for a predetermined first time.

The first time may be set in advance. For example, it may be set longer than the continuous output duration of the replacement signal at which the performance of the speech recognition unit 40 degrades because the output signal switches repeatedly between the removal signal and the replacement signal within a short period. As another example, the first time may be set to a value that is equal to or longer than the average utterance period of one voice command and shorter than the average utterance period of two voice commands uttered in succession. The first time may also be changeable as appropriate in response to an operation instruction from the user.

この場合、参照信号のレベルが閾値以上となったタイミングから少なくとも第１の時間継続して、置換信号が出力信号として音声認識部４０へ出力される。そして、該第１の時間経過後に、除去信号が出力信号として音声認識部４０へ出力される。 In this case, the replacement signal is output as the output signal to the speech recognition unit 40 continuously for at least the first time from the timing when the level of the reference signal becomes equal to or higher than the threshold. Then, after the first time has elapsed, the removal signal is output to the speech recognition section 40 as an output signal.
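The hold behaviour can be sketched as a small state machine. The sketch assumes the first time is expressed as a number of frames and that the replacement signal also stays selected for as long as the level remains at or above the threshold (one reading of "at least the first time"); `HoldSwitch` is a hypothetical class, not taken from the publication.

```python
class HoldSwitch:
    """Selects the replacement signal for hold_frames frames after a crossing."""

    def __init__(self, threshold, hold_frames):
        self.threshold = threshold
        self.hold_frames = hold_frames   # assumed: "first time" in frames
        self.remaining = 0

    def use_replacement(self, ref_level):
        above = ref_level >= self.threshold
        if above and self.remaining == 0:
            self.remaining = self.hold_frames   # start the hold window
        if self.remaining > 0:
            self.remaining -= 1
            return True                         # still inside the hold window
        return above


switch = HoldSwitch(threshold=0.5, hold_frames=3)
decisions = [switch.use_replacement(v) for v in [0.1, 0.9, 0.1, 0.1, 0.1, 0.1]]
```

Here a single loud reference frame keeps the replacement signal selected for three frames, which avoids rapid toggling of the recognizer input.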

The switching unit 26 may also output the replacement signal, instead of the removal signal, as the output signal to the speech recognition unit 40 when the determination unit 22 determines that the level of the reference signal has remained equal to or higher than the threshold continuously for a predetermined second time or longer.

The second time may be set in advance. For example, it may be set longer than the continuous output duration of the removal signal or the replacement signal at which the performance of the speech recognition unit 40 degrades because the output signal switches repeatedly between the removal signal and the replacement signal within a short period. As another example, the second time may be set to a value that is equal to or longer than the average utterance period of one voice command and shorter than the average utterance period of two voice commands uttered in succession. The second time may also be changeable as appropriate in response to an operation instruction from the user.

この場合、参照信号のレベルが閾値以上である状態が第２の時間継続した場合に、置換信号が出力信号として音声認識部４０へ出力される。そして、参照信号のレベルが閾値未満または該レベルが閾値以上である状態の継続時間が第２の時間未満である場合、除去信号が出力信号として音声認識部４０へ出力される。 In this case, when the state in which the level of the reference signal is equal to or higher than the threshold continues for the second time, the replacement signal is output to the speech recognition section 40 as the output signal. Then, when the level of the reference signal is less than the threshold or the duration of the state where the level is equal to or greater than the threshold is less than the second time, the removal signal is output to the speech recognition section 40 as an output signal.
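This variant behaves like a debounce on the threshold decision. A minimal sketch, assuming the second time is counted in frames and that the count resets whenever the level drops below the threshold; `SustainedSwitch` is a hypothetical name.

```python
class SustainedSwitch:
    """Selects the replacement signal only after a sustained high level."""

    def __init__(self, threshold, min_frames):
        self.threshold = threshold
        self.min_frames = min_frames   # assumed: "second time" in frames
        self.count = 0                 # consecutive frames at/above threshold

    def use_replacement(self, ref_level):
        if ref_level >= self.threshold:
            self.count += 1
        else:
            self.count = 0             # any quiet frame restarts the count
        return self.count >= self.min_frames


switch = SustainedSwitch(threshold=0.5, min_frames=3)
levels = [0.9, 0.9, 0.1, 0.9, 0.9, 0.9, 0.9, 0.1]
decisions = [switch.use_replacement(v) for v in levels]
```

Short bursts in the reference signal (the first two frames above) leave the removal signal in place; only the later run of three or more loud frames switches to the replacement signal.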

Note that the audio processing unit 24 may perform the removal processing, which removes the audio component of the reference signal from the audio signal, at all times, or it may stop the removal processing when the determination unit 22 determines that the level of the reference signal is equal to or higher than the threshold. For example, when the determination unit 22 determines that the level of the reference signal is equal to or higher than the threshold, it controls the audio processing unit 24 to stop the removal processing.

参照信号のレベルが閾値以上と判定された場合、音声処理部２４が除去処理を停止することで、音声処理装置１０の処理演算量の削減を図ることができる。 When it is determined that the level of the reference signal is equal to or higher than the threshold, the audio processing unit 24 stops the removal processing, thereby reducing the amount of processing computation of the audio processing device 10 .

出力制御部２９は、参照信号のレベルが閾値以上と判定された場合、音声認識停止中であることを表す情報を出力する。出力制御部２９は、例えば、音声認識停止中であることを表す情報をディスプレイ６０に出力する。 When the level of the reference signal is determined to be equal to or higher than the threshold, the output control unit 29 outputs information indicating that speech recognition is stopped. The output control unit 29 outputs, for example, information indicating that speech recognition is stopped to the display 60 .

As described above, when the level of the reference signal is equal to or higher than the threshold, the replacement signal is output to the speech recognition unit 40 as the output signal. Because the replacement signal is at least one of comfort noise and a mute signal, the speech recognition unit 40 does not perform speech recognition while it is receiving the replacement signal. Thus, for example, while the speaker SP is emitting sound into the cabin of the vehicle 2 at a volume corresponding to a reproduction signal whose level is equal to or higher than the threshold, speech recognition by the speech recognition unit 40 is not performed even if the driver hm1 or the like utters a voice command. For this reason, when the level of the reference signal, which is the reproduction signal, is determined to be equal to or higher than the threshold, the output control unit 29 outputs information indicating that speech recognition is stopped, which makes the speech recognition status of the speech recognition unit 40 easy to present to the user.

なお、出力制御部２９による情報の出力対象は、ディスプレイ６０に限定されない。例えば、出力制御部２９は、音声認識停止中であることを表す情報を、予め登録された運転者ｈｍ１によって管理される携帯端末などの情報処理装置へ送信してもよい。また、出力制御部２９は、音声認識停止中であることを表す情報を、スピーカＳＰから出力してもよい。この場合、音声認識停止中であることを表す情報の再生信号のレベルは、上記閾値未満のレベルとすればよい。 Note that the output target of information by the output control unit 29 is not limited to the display 60 . For example, the output control unit 29 may transmit information indicating that the voice recognition is stopped to an information processing device such as a mobile terminal managed by the pre-registered driver hm1. Further, the output control unit 29 may output information indicating that speech recognition is stopped from the speaker SP. In this case, the level of the reproduced signal of the information indicating that the speech recognition is stopped may be set to a level less than the above threshold.

次に、本実施形態の音声処理装置１０で実行される情報処理の流れの一例を説明する。 Next, an example of the flow of information processing executed by the speech processing device 10 of this embodiment will be described.

図４は、本実施形態の音声処理装置１０で実行される情報処理の流れの一例を表すフローチャートである。 FIG. 4 is a flowchart showing an example of the flow of information processing executed by the speech processing device 10 of this embodiment.

音声取得部２０が、マイクＭＣから音声信号を取得する（ステップＳ１００）。 The voice acquisition unit 20 acquires a voice signal from the microphone MC (step S100).

判定部２２は、スピーカＳＰから再生される再生信号である参照信号のレベルが閾値以上であるか否かを判定する（ステップＳ１０２）。参照信号のレベルが閾値以上であると判定された場合（ステップＳ１０２：Ｙｅｓ）、処理がステップＳ１０４へ進む。 The determination unit 22 determines whether or not the level of the reference signal, which is the reproduction signal reproduced from the speaker SP, is equal to or higher than the threshold (step S102). If it is determined that the reference signal level is greater than or equal to the threshold (step S102: Yes), the process proceeds to step S104.

ステップＳ１０４では、判定部２２は、除去処理を停止するように音声処理部２４を制御する。ステップＳ１０４の処理によって、音声処理部２４は除去処理を停止する。 In step S104, the determination unit 22 controls the audio processing unit 24 to stop the removal process. By the process of step S104, the audio processing unit 24 stops the removal process.

生成部２８は、コンフォートノイズおよびミュート信号の少なくとも一方である置換信号を生成し、切替部２６へ出力する（ステップＳ１０６）。 The generation unit 28 generates a replacement signal that is at least one of the comfort noise and the mute signal, and outputs it to the switching unit 26 (step S106).

切替部２６は、生成部２８で生成された置換信号を出力信号として音声認識部４０へ出力する（ステップＳ１０８）。置換信号はコンフォートノイズおよびミュート信号の少なくとも一方であるため、置換信号には音声コマンドが含まれない。このため、置換信号を受付けつけている期間、音声認識部４０は、音声コマンドの認識を行わない状態となる。 The switching unit 26 outputs the replacement signal generated by the generating unit 28 to the speech recognition unit 40 as an output signal (step S108). Since the replacement signal is comfort noise and/or a mute signal, the replacement signal does not include voice commands. Therefore, the voice recognition unit 40 does not recognize voice commands while the replacement signal is being accepted.

出力制御部２９は、音声認識停止中であることを表す情報をディスプレイ６０に出力する（ステップＳ１１０）。 The output control unit 29 outputs information indicating that speech recognition is stopped to the display 60 (step S110).

次に、音声処理装置１０は、処理を終了するか否かを判断する（ステップＳ１１２）。例えば、音声処理装置１０は、ユーザによる操作指示等によって音声処理装置１０への電力供給の遮断が指示されたか否かを判別することで、ステップＳ１１２の判断を行う。ステップＳ１１２で肯定判断すると（ステップＳ１１２：Ｙｅｓ）、音声処理装置１０は本ルーチンを終了する。音声処理装置１０がステップＳ１１２で否定判断すると（ステップＳ１１２：Ｎｏ）、処理が上記ステップＳ１００へ戻る。 Next, the speech processing device 10 determines whether or not to end the processing (step S112). For example, the sound processing device 10 makes the determination in step S112 by determining whether or not an instruction to cut off power supply to the sound processing device 10 has been given by an operation instruction or the like by the user. If an affirmative determination is made in step S112 (step S112: Yes), the speech processing device 10 ends this routine. When the voice processing device 10 makes a negative determination in step S112 (step S112: No), the process returns to step S100.

一方、上記ステップＳ１０２において、スピーカＳＰから再生される再生信号である参照信号のレベルが閾値未満であると判定されると（ステップＳ１０２：Ｎｏ）、処理がステップＳ１１４へ進む。 On the other hand, if it is determined in step S102 that the level of the reference signal, which is the reproduction signal reproduced from the speaker SP, is less than the threshold (step S102: No), the process proceeds to step S114.

In step S114, the audio processing unit 24 executes the removal processing and generates a removal signal by removing the audio component of the reference signal from the audio signal received from the audio acquisition unit 20. If the removal processing by the audio processing unit 24 has been stopped by step S104, the determination unit 22 first controls the audio processing unit 24 to resume the removal processing, and the audio processing unit 24 then executes the removal processing of step S114.

切替部２６は、音声処理部２４で生成された除去信号を出力信号として音声認識部４０へ出力する（ステップＳ１１６）。除去信号は、音声信号から参照信号である再生信号を除去した信号であるため、除去信号には音声コマンドが含まれる場合がある。このため、除去信号を出力信号として受付けつけている期間、音声認識部４０は、音声コマンドの認識を行うことが可能な状態となる。そして、処理が上記ステップＳ１１２へ進む。 The switching unit 26 outputs the removal signal generated by the speech processing unit 24 to the speech recognition unit 40 as an output signal (step S116). Since the removed signal is a signal obtained by removing the reproduced signal, which is the reference signal, from the audio signal, the removed signal may include a voice command. For this reason, the voice recognition unit 40 is in a state in which voice commands can be recognized during a period in which the removal signal is accepted as an output signal. Then, the process proceeds to step S112.
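The flow of steps S100 to S116 for a single frame can be sketched as below. This is an illustrative reading of the flowchart, not the implementation of the embodiment: the level is assumed to be the RMS of the reference frame, `naive_remove` is a deliberately simplistic placeholder for the removal processing (a real system would typically use an adaptive echo canceller), the mute signal is assumed to be all-zero samples, and the notification callback stands in for the output to the display 60. All function names are hypothetical.

```python
import math


def rms(frame):
    """Assumed level measure: RMS of the frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame)) if frame else 0.0


def process_frame(mic_frame, ref_frame, threshold,
                  remove_echo, make_replacement, notify):
    """One pass through S102-S116; mic_frame is the result of S100."""
    if rms(ref_frame) >= threshold:                  # S102: Yes
        # S104: the removal processing is not run for this frame.
        output = make_replacement(len(mic_frame))    # S106, S108
        notify("speech recognition stopped")         # S110
        return output
    return remove_echo(mic_frame, ref_frame)         # S102: No -> S114, S116


def naive_remove(mic_frame, ref_frame):
    """Placeholder removal: sample-wise subtraction of the reference."""
    return [m - r for m, r in zip(mic_frame, ref_frame)]


def mute(n_samples):
    """Assumed mute signal: all-zero samples."""
    return [0.0] * n_samples


messages = []
loud = process_frame([0.3, 0.2], [0.9, 0.9], 0.5, naive_remove, mute, messages.append)
quiet = process_frame([0.3, 0.2], [0.1, 0.1], 0.5, naive_remove, mute, messages.append)
```

The end-of-processing check of step S112 would wrap repeated calls to `process_frame` in a loop that exits when power-off is requested.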

以上説明したように、本実施形態の音声処理装置１０は、音声取得部２０と、判定部２２と、音声処理部２４と、切替部２６と、を備える。音声取得部２０は、空間の音声を収音するマイクＭＣから音声信号を取得する。判定部２２は、空間に出音するスピーカＳＰから再生される再生信号である参照信号のレベルが閾値以上であるか否かを判定する。音声処理部２４は、音声信号から参照信号の音声成分を除去した除去信号を出力信号として音声認識部４０へ出力する。切替部２６は、参照信号のレベルが閾値以上と判定された場合、除去信号に換えて、コンフォートノイズおよびミュート信号の少なくとも一方である置換信号を出力信号として音声認識部４０へ出力する。 As described above, the speech processing device 10 of this embodiment includes the speech acquisition unit 20, the determination unit 22, the speech processing unit 24, and the switching unit 26. The audio acquisition unit 20 acquires an audio signal from a microphone MC that picks up spatial audio. The determination unit 22 determines whether or not the level of the reference signal, which is the reproduction signal reproduced from the speaker SP that emits sound in space, is equal to or higher than a threshold. The speech processing unit 24 outputs a removal signal obtained by removing the speech component of the reference signal from the speech signal to the speech recognition unit 40 as an output signal. When the level of the reference signal is determined to be equal to or higher than the threshold, the switching unit 26 outputs a replacement signal, which is at least one of comfort noise and a mute signal, to the speech recognition unit 40 as an output signal instead of the removal signal.

In the conventional technology, a configuration is disclosed in which a first speech recognition unit recognizes the sound picked up by a microphone, a second speech recognition unit recognizes the sound output from a speaker, and recognition by the first speech recognition unit is stopped when the sound recognized by the second speech recognition unit includes a speech recognition command. With the conventional technology, however, erroneous detection in speech recognition could occur when the sound picked up by the microphone contained noise components, such as residual echo, that an echo canceller or the like could not fully remove. In other words, suppressing erroneous detection in speech recognition was sometimes difficult. In addition, depending on the performance of the second speech recognition unit, erroneous detection could occur in the speech recognition performed by the first speech recognition unit.

In contrast, when the level of the reference signal, which is the reproduction signal, is determined to be equal to or higher than the threshold, the speech processing device 10 of the present embodiment outputs to the speech recognition unit 40, as the output signal, a replacement signal that is at least one of comfort noise and a mute signal, instead of the removal signal obtained by removing the audio component of the reference signal from the audio signal acquired from the microphone MC. Because the replacement signal is at least one of comfort noise and a mute signal, it does not contain a voice command. The speech recognition unit 40 therefore does not recognize voice commands while it is receiving the replacement signal.

Accordingly, the speech processing device 10 of the present embodiment can suppress erroneous detection in speech recognition caused by the reproduction signal, even in a sound environment where, for example, the level of the reproduction signal reproduced from the speaker SP is so high that components the removal processing cannot fully cancel remain in the audio signal picked up by the microphone MC.

従って、本実施形態の音声処理装置１０は、音声認識の誤検出を抑制することができる。 Therefore, the speech processing device 10 of the present embodiment can suppress erroneous detection of speech recognition.

また、本実施形態の音声処理装置１０では、判定部２２は、マイクＭＣから取得した音声信号のレベルではなく、スピーカＳＰから再生される再生信号のレベルが閾値以上であるか否かを判断する。このため、本実施形態の音声処理装置１０では、ユーザによって発話された音声のレベルの大小に拘わらず、再生信号のレベルが閾値未満である場合、マイクＭＣによって収音された該ユーザの音声成分を含む除去信号を音声認識対象として音声認識部４０へ出力することができる。よって、本実施形態の音声処理装置１０は、上記効果に加えて、ユーザによって発話された音声コマンド等を含む音声信号を、効率よく音声認識可能とすることができる。 Further, in the audio processing device 10 of the present embodiment, the determination unit 22 determines whether or not the level of the reproduced signal reproduced from the speaker SP, not the level of the audio signal acquired from the microphone MC, is equal to or higher than the threshold. . Therefore, in the speech processing apparatus 10 of the present embodiment, regardless of the level of the speech uttered by the user, if the level of the reproduced signal is less than the threshold, the speech component of the user picked up by the microphone MC can be output to the speech recognition unit 40 as a speech recognition target. Therefore, in addition to the effects described above, the speech processing apparatus 10 of the present embodiment can efficiently perform speech recognition of speech signals including voice commands uttered by the user.

また、本実施形態の音声処理システム１では、スピーカＳＰの再生信号に対しては音声認識部４０による音声認識が行われないことから、上記効果に加えて、音声処理システム１の処理演算量の削減を図ることができる。また、本実施形態では、再生信号に対しては音声認識が行われないため、音声認識部４０の音声認識精度に拘わらず、音声認識の誤検出を抑制することができる。 In addition, in the speech processing system 1 of the present embodiment, since speech recognition by the speech recognition unit 40 is not performed on the reproduced signal of the speaker SP, in addition to the above effect, the amount of processing computation of the speech processing system 1 is reduced. reduction can be achieved. Further, in the present embodiment, since speech recognition is not performed on the reproduced signal, erroneous detection of speech recognition can be suppressed regardless of the speech recognition accuracy of the speech recognition unit 40 .

なお、本実施形態では、音声処理システム１は、車両２に搭載された形態を一例として説明した。しかし、音声処理システム１は、音声処理対象の任意の空間に配置された構成であればよく、車両２に搭載された形態に限定されない。 In addition, in this embodiment, the voice processing system 1 has been described as being mounted in the vehicle 2 as an example. However, the voice processing system 1 is not limited to being mounted on the vehicle 2 as long as it is arranged in an arbitrary space for voice processing.

なお、上記には実施形態を説明したが、上記実施形態は、例として提示したものであり、発明の範囲を限定することは意図していない。上記新規な実施形態は、その他の様々な形態で実施されることが可能であり、発明の要旨を逸脱しない範囲で、種々の省略、置き換え、変更を行うことができる。上記実施形態は、発明の範囲または要旨に含まれるとともに、特許請求の範囲に記載された発明とその均等の範囲に含まれる。 Although the embodiment has been described above, the embodiment is presented as an example and is not intended to limit the scope of the invention. The novel embodiments described above can be embodied in various other forms, and various omissions, replacements, and modifications can be made without departing from the scope of the invention. The above embodiments are included in the scope or gist of the invention, and are included in the scope of the invention described in the claims and equivalents thereof.

1 Speech processing system
10 Speech processing device
20 Speech acquisition unit
22 Determination unit
24 Speech processing unit
26 Switching unit
28 Generation unit
40 Speech recognition unit
50 Electronic device
60 Display
MC Microphone
SP Speaker

Claims

An audio processing device comprising:
an audio acquisition unit that acquires an audio signal from a microphone that picks up sound in a space;
a determination unit that determines whether a level of a reference signal, which is a reproduction signal reproduced from a speaker that emits sound into the space, is equal to or higher than a threshold;
an audio processing unit that outputs, as an output signal to a speech recognition unit, a removal signal obtained by removing an audio component of the reference signal from the audio signal; and
a switching unit that, when the level of the reference signal is determined to be equal to or higher than the threshold, outputs a replacement signal, which is at least one of comfort noise and a mute signal, as the output signal to the speech recognition unit instead of the removal signal.

The audio processing device according to claim 1, wherein
the determination unit, when the level of the reference signal is determined to be equal to or higher than the threshold, controls the audio processing unit to stop removal processing that removes the audio component of the reference signal from the audio signal.

The audio processing device according to claim 1 or 2, further comprising a generation unit that generates the replacement signal, wherein
the generation unit generates, as the replacement signal, the comfort noise corresponding to a noise level contained in the audio signal immediately before the level of the reference signal is determined to be equal to or higher than the threshold.

The audio processing device according to any one of claims 1 to 3, wherein
the switching unit outputs the replacement signal, instead of the removal signal, as the output signal to the speech recognition unit during a period in which the level of the reference signal is determined to be equal to or higher than the threshold.

The audio processing device according to any one of claims 1 to 3, wherein
the switching unit, when the level of the reference signal is determined to be equal to or higher than the threshold, outputs the replacement signal, instead of the removal signal, as the output signal to the speech recognition unit continuously for a predetermined first time.

The audio processing device according to any one of claims 1 to 3, wherein
the switching unit outputs the replacement signal, instead of the removal signal, as the output signal to the speech recognition unit when the level of the reference signal is determined to be equal to or higher than the threshold continuously for a predetermined second time or longer.

The audio processing device according to any one of claims 1 to 6, further comprising
an output control unit that outputs information indicating that speech recognition is stopped when the level of the reference signal is determined to be equal to or higher than the threshold.

An audio processing method executed by an audio processing device, the method comprising:
acquiring an audio signal from a microphone that picks up sound in a space;
determining whether a level of a reference signal, which is a reproduction signal reproduced from a speaker that emits sound into the space, is equal to or higher than a threshold;
outputting, as an output signal to a speech recognition unit, a removal signal obtained by removing an audio component of the reference signal from the audio signal; and
outputting, when the level of the reference signal is determined to be equal to or higher than the threshold, a replacement signal, which is at least one of comfort noise and a mute signal, as the output signal to the speech recognition unit instead of the removal signal.

An audio processing program that causes a computer to execute:
acquiring an audio signal from a microphone that picks up sound in a space;
determining whether a level of a reference signal, which is a reproduction signal reproduced from a speaker that emits sound into the space, is equal to or higher than a threshold;
outputting, as an output signal to a speech recognition unit, a removal signal obtained by removing an audio component of the reference signal from the audio signal; and
outputting, when the level of the reference signal is determined to be equal to or higher than the threshold, a replacement signal, which is at least one of comfort noise and a mute signal, as the output signal to the speech recognition unit instead of the removal signal.

An audio processing system comprising an audio processing device, a microphone that picks up sound in a space, a speaker that emits sound into the space, and a speech recognition unit that recognizes speech, wherein
the audio processing device comprises:
an audio acquisition unit that acquires an audio signal from the microphone;
a determination unit that determines whether a level of a reference signal, which is a reproduction signal reproduced from the speaker, is equal to or higher than a threshold;
an audio processing unit that outputs, as an output signal to the speech recognition unit, a removal signal obtained by removing an audio component of the reference signal from the audio signal; and
a switching unit that, when the level of the reference signal is determined to be equal to or higher than the threshold, outputs a replacement signal, which is at least one of comfort noise and a mute signal, as the output signal to the speech recognition unit instead of the removal signal.