JP2021135311A

JP2021135311A - Voice processing device and voice processing method

Info

Publication number: JP2021135311A
Application number: JP2020028731A
Authority: JP
Inventors: 正成宮本; Masanari Miyamoto
Original assignee: Panasonic Intellectual Property Management Co Ltd
Current assignee: Panasonic Intellectual Property Management Co Ltd
Priority date: 2020-02-21
Filing date: 2020-02-21
Publication date: 2021-09-13

Abstract

PROBLEM TO BE SOLVED: To adaptively suppress acoustic crosstalk components caused by the utterances of other speakers mixed into the utterance of a main speaker, according to the situation of a plurality of speakers present in a closed space, and to improve the sound quality of the main speaker's utterance. SOLUTION: An acoustic crosstalk suppression device comprises: a single-talk detection unit that detects, on the basis of voice signals collected by a plurality of microphones, a single-talk state in which any one of a plurality of persons, including a main speaker, present in a closed space is speaking; a mixing ratio estimation unit that estimates a mixing ratio indicating the proportion of the main speaker's voice signal contained in the voice signal of another person, on the basis of the sound pressure ratio of the voice signals collected in the single-talk state of the main speaker and the sound pressure ratio of the voice signals collected in the single-talk state of the other person; and a determination unit that determines, on the basis of the estimated mixing ratio, whether crosstalk components caused by the other person's utterances and mixed into the main speaker's voice signal need to be suppressed. SELECTED DRAWING: Figure 1

Description

The present disclosure relates to a voice processing device and a voice processing method.

Patent Document 1 discloses a sound removal device that assumes, in advance, arrangement patterns of occupants as situations in a vehicle cabin, measures the sound transfer characteristics for each arrangement pattern, and uses the transfer characteristics obtained by the measurement and stored in a memory or the like to estimate and remove the sound contained in an audio signal output from a speaker. According to this sound removal device, sound can be removed or suppressed as long as the arrangement of the occupants matches one of the arrangement patterns.

Patent Document 1: JP-A-2009-216835

In the configuration of Patent Document 1, only one microphone, intended to pick up the driver's speech, is placed in front of the driver. The driver's voice can be picked up at high sound pressure, but it may be difficult to pick up the voice of a passenger (that is, another occupant) in the same vehicle at high sound pressure with that same microphone. This is because the microphone is placed close to the driver, so the distance from the driver to the microphone differs from the distance from the passenger to the microphone. Consequently, when the driver and the passenger speak at almost the same time, even if one wishes to suppress, as a crosstalk component, the voice signal of another speaker (for example, the passenger) contained in the voice signal of the main speaker (for example, the driver), the crosstalk suppression has no effect unless the other speaker's voice signal is picked up at high sound pressure, and the sound quality of the main speaker's voice signal may deteriorate. The reason is that the driver's microphone has difficulty picking up the voice of another speaker (for example, the passenger) at high sound pressure, which makes it difficult to learn the filter coefficients of the adaptive filter used to suppress the passenger's voice signal as a crosstalk component.

The present disclosure has been devised in view of the conventional circumstances described above. Its object is to provide a voice processing device and a voice processing method that adaptively suppress acoustic crosstalk components caused by the utterances of other speakers that may be contained in the utterance of a main speaker, according to the situation of a plurality of speakers present in a closed space, and thereby improve the sound quality of the main speaker's utterance.

The present disclosure provides a voice processing device that is connected to a plurality of microphones arranged in a closed space and comprises: a single-talk detection unit that detects, on the basis of the voice signals picked up by each of the plurality of microphones, a single-talk state in which any one of a plurality of persons, including a main speaker, present in the closed space is speaking; a mixing ratio estimation unit that estimates a mixing ratio indicating the proportion of the main speaker's voice signal contained in the voice signal of another person other than the main speaker, on the basis of the sound pressure ratio of the voice signals picked up by each of the plurality of microphones in the single-talk state of the main speaker and the sound pressure ratio of the voice signals picked up by each of the plurality of microphones in the single-talk state of the other person; and a determination unit that determines, on the basis of the estimated mixing ratio, whether a crosstalk component caused by the other person's utterance and contained in the main speaker's voice signal needs to be suppressed.

The present disclosure also provides a voice processing method executed by a voice processing device connected to a plurality of microphones arranged in a closed space, the method comprising: detecting, on the basis of the voice signals picked up by each of the plurality of microphones, a single-talk state in which any one of a plurality of persons, including a main speaker, present in the closed space is speaking; estimating a mixing ratio indicating the proportion of the main speaker's voice signal contained in the voice signal of another person other than the main speaker, on the basis of the sound pressure ratio of the voice signals picked up by each of the plurality of microphones in the single-talk state of the main speaker and the sound pressure ratio of the voice signals picked up by each of the plurality of microphones in the single-talk state of the other person; and determining, on the basis of the estimated mixing ratio, whether a crosstalk component caused by the other person's utterance and contained in the main speaker's voice signal needs to be suppressed.

According to the present disclosure, acoustic crosstalk components caused by the utterances of other speakers that may be contained in the utterance of the main speaker can be adaptively suppressed according to the situation of a plurality of speakers present in a closed space, and the sound quality of the main speaker's utterance can be improved.

FIG. 1 is a block diagram showing a functional configuration example of the acoustic crosstalk suppression device according to Embodiment 1.
FIG. 2 is a flowchart showing an example of the acoustic crosstalk suppression operation procedure according to Embodiment 1.
FIG. 3 is a block diagram showing a functional configuration example of the acoustic crosstalk suppression device according to Embodiment 2.
FIG. 4 is a flowchart showing an example of the acoustic crosstalk suppression operation procedure according to Embodiment 2.
FIG. 5 is a block diagram showing a functional configuration example of the acoustic crosstalk suppression device according to a modification of Embodiment 2.
FIG. 6 is a diagram showing an example of an image captured by an omnidirectional camera with a sound pressure heat map superimposed.
FIG. 7 is a diagram showing an example of a situation in which a microphone array is placed midway between a clerk and a customer.
FIG. 8 is a diagram explaining an example of acoustic crosstalk suppression processing for voice picked up with directivity formed toward the clerk and toward the customer in the situation of FIG. 7.
FIG. 9 is a diagram showing an example of a situation in which the microphone array is placed close to the clerk and away from the customer.
FIG. 10 is a diagram explaining an example of acoustic crosstalk suppression processing for voice picked up with directivity formed toward the clerk and toward the customer in the situation of FIG. 9.

(Background to the present disclosure)
An acoustic crosstalk suppression device is expected to be used, for example, in a situation where two persons are conversing. As disclosed in, for example, Japanese Patent No. 6635394, when the voice uttered by one person contains the voice uttered by the other person as a crosstalk component, the acoustic crosstalk suppression device generates a suppression signal for suppressing (in other words, subtracting) the crosstalk component and subtracts the suppression signal from the voice signal of that one person's utterance, thereby outputting a voice signal in which the crosstalk component has been suppressed. Examples of situations in which two persons converse include a prison officer and an inmate talking face to face in a prison, a clerk and a customer talking across a table in a store, and an employee and a supervisor talking in a meeting at an office, but the situations are not limited to these. The content of the utterances may be recorded as a log, converted into text, and saved, or the voice signal of the utterances may be input to voice recognition processing.

The following takes as an example a situation in which a clerk and a customer talk in a store. The acoustic crosstalk suppression device is connected to each of a plurality of microphones placed, for example, on a round table installed in the store. It treats the voice uttered by whichever of the clerk and the customer is the main speaker as the target sound, and suppresses the voice uttered by the other speaker that is mixed into the main speaker's voice as a disturbing sound.

FIG. 7 is a diagram showing an example of a situation in which the microphone array mA is placed midway between the clerk hm1 and the customer hm2. The microphone array mA has a housing that accommodates a plurality of omnidirectional microphones, each of which picks up ambient sound. The sound picked up by the microphone array mA can be output with directivity formed in the respective directions of the clerk hm1 and the customer hm2 by a known method (for example, beamforming performed by a PC (not shown) connected to the microphone array mA). The microphones are not limited to the microphone array mA and may be one or more omnidirectional microphones.

In FIG. 7, when the distance from the microphone array mA to the clerk hm1 is almost equal to the distance from the microphone array mA to the customer hm2, and the direction d1 from the microphone array mA toward the clerk hm1 and the direction d2 from the microphone array mA toward the customer hm2 form almost the same angle with the surface of the table on which the microphone array mA is placed, the microphone array mA can pick up the voice of the clerk hm1 and the voice of the customer hm2 with a high degree of separation.

FIG. 8 is a diagram explaining an example of acoustic crosstalk suppression processing for voice picked up with directivity formed toward the clerk hm1 and toward the customer hm2 in the situation of FIG. 7. As an example, the microphone array mA has four omnidirectional microphone elements m1 to m4. Although not shown, the microphone array mA, or a PC connected to it, receives the voice signals picked up by the microphone array mA, forms directivity in the respective directions of the clerk hm1 and the customer hm2 (that is, performs beamforming), and outputs the resulting voice. The voice V1 of the clerk hm1 and the voice V2 of the customer hm2 picked up by each of the four microphone elements m1 to m4 are assumed to have a sound pressure ratio of 5:5.

Suppose that, when directivity is formed in the direction d1 of the clerk hm1 by beamforming, the voice V1 of the clerk hm1 and the voice V2 of the customer hm2 have a sound pressure ratio of, for example, 7:3. Similarly, suppose that, when directivity is formed in the direction d2 of the customer hm2 by beamforming, the voice V1 of the clerk hm1 and the voice V2 of the customer hm2 have a sound pressure ratio of, for example, 3:7.

When acoustic crosstalk suppression is performed with the beamformed voice signal of the voice V1 of the clerk hm1 as the main signal and the beamformed voice signal of the voice V2 of the customer hm2 as the reference signal, the voice V1 of the clerk hm1 and the voice V2 of the customer hm2 after crosstalk suppression have a sound pressure ratio of, for example, 9:1. The voice V1 of the clerk hm1 is therefore relatively emphasized. Similarly, when acoustic crosstalk suppression is performed with the beamformed voice signal of the voice V1 of the clerk hm1 as the reference signal and the beamformed voice signal of the voice V2 of the customer hm2 as the main signal, the voice V1 of the clerk hm1 and the voice V2 of the customer hm2 after crosstalk suppression have a sound pressure ratio of, for example, 1:9. The voice V2 of the customer hm2 is therefore emphasized. The voice recognition engine eg can accurately recognize both the voice V1 of the clerk hm1 and the voice V2 of the customer hm2 after acoustic crosstalk suppression.

FIG. 9 is a diagram showing an example of a situation in which the microphone array mA is placed close to the clerk hm1 and away from the customer hm2. In practice, the microphone array mA is more often placed closer to one of the two than exactly midway between the clerk hm1 and the customer hm2. Even when it is physically placed between the clerk hm1 and the customer hm2, the directivity characteristics may vary because of the acoustic characteristics of the space. Taking the former case as an example, the distance from the microphone array mA to the clerk hm1 differs greatly from the distance from the microphone array mA to the customer hm2. As a result, the sound pressure of the voice signal of the clerk hm1 and that of the voice signal of the customer hm2 received (picked up) by the microphone array mA differ (see FIG. 10). For example, as shown in FIG. 10, at each microphone constituting the microphone array mA, the sound pressure ratio of the voice signals of the clerk hm1 and the customer hm2 is 7:3. Therefore, unlike the situation of FIG. 7, the microphone array mA cannot pick up the voice of the clerk hm1 and the voice of the customer hm2 with a high degree of separation. The microphone array mA may also be worn on a person's body or clothing. In that case, the voice of the person wearing the microphone array mA is picked up predominantly, and separation becomes even more difficult.

FIG. 10 is a diagram explaining an example of acoustic crosstalk suppression processing for voice picked up with directivity formed toward the clerk hm1 and toward the customer hm2 in the situation of FIG. 9. The voice V1 of the clerk hm1 and the voice V2 of the customer hm2 picked up by each of the four microphone elements m1 to m4 are assumed to have a sound pressure ratio of 5:5.

When directivity is formed in the direction d1 of the clerk hm1 by beamforming, the microphone array mA can pick up the voice V1 of the clerk hm1 predominantly because it is placed near the clerk hm1. The voice V1 of the clerk hm1 and the voice V2 of the customer hm2 have a sound pressure ratio of, for example, 9:1. On the other hand, when directivity is formed in the direction d2 of the customer hm2 by beamforming, the microphone array mA cannot pick up the voice V2 of the customer hm2 sufficiently because it is placed far from the customer hm2. The voice V1 of the clerk hm1 and the voice V2 of the customer hm2 have a sound pressure ratio of, for example, 4:6.

In such a case, when acoustic crosstalk suppression is performed with the beamformed voice signal of the voice V1 of the clerk hm1 as the reference signal and the beamformed voice signal of the voice V2 of the customer hm2 as the main signal, the crosstalk suppression performance is high because the voice of the clerk hm1 in the reference signal is clear. The voice V2 of the customer hm2 is therefore sufficiently emphasized in relative terms, and the voice recognition engine eg can accurately recognize the voice V2 of the customer hm2.

On the other hand, when acoustic crosstalk suppression is performed with the beamformed voice signal of the voice V1 of the clerk hm1 as the main signal and the beamformed voice signal of the voice V2 of the customer hm2 as the reference signal, the suppression performance is low because the sound pressure ratio of the voice V1 of the clerk hm1 to the voice V2 of the customer hm2 in the reference signal is 4:6, that is, nearly equal. As a result, far from suppressing the voice V2 of the customer hm2, which is the disturbing sound, the processing may instead add the voice V2 of the customer hm2 and make the voice V1 of the clerk hm1 even less clear.
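To compare the example ratios quoted above on a common scale, they can be converted to decibels. The sketch below assumes that each ratio is an amplitude (sound pressure) ratio, so the conversion is 20·log10; the helper name `ratio_to_db` and the labels are illustrative and do not come from the patent.

```python
import math


def ratio_to_db(v1: float, v2: float) -> float:
    """Level of voice V1 relative to voice V2, in dB (amplitude ratio assumed)."""
    return 20.0 * math.log10(v1 / v2)


# Example clerk:customer (V1:V2) sound pressure ratios quoted in the text.
examples = {
    "FIG. 8, beam toward clerk": (7, 3),
    "FIG. 8, after suppression (clerk as main)": (9, 1),
    "FIG. 10, beam toward clerk": (9, 1),
    "FIG. 10, beam toward customer": (4, 6),
}

for label, (v1, v2) in examples.items():
    print(f"{label}: {v1}:{v2} -> {ratio_to_db(v1, v2):+.1f} dB")
```

A negative value means the clerk's voice is weaker than the customer's in that signal. In the 4:6 case the two voices are only a few decibels apart, which is why that beam makes a poor reference.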

Therefore, in the following embodiments, an acoustic crosstalk suppression device, as an example of a voice processing device, outputs the signal as it is, without performing acoustic crosstalk suppression, when the reference signal would result in low crosstalk suppression performance. Embodiment 1 describes the case of using omnidirectional microphones, and Embodiment 2 describes the case of using a microphone array capable of forming directivity.
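The bypass policy described here can be sketched as a simple gate. This is only an illustration under stated assumptions: the function name `choose_output`, the use of a single scalar mixing ratio, and the threshold value 0.3 are all hypothetical, and the patent text in this section does not specify a threshold.

```python
def choose_output(main_signal, suppressed_signal, mixing_ratio, max_mixing_ratio=0.3):
    """Pick which signal to forward to the recognizer.

    mixing_ratio: estimated share of the main speaker's voice in the
        reference signal (0.0 = clean reference, 1.0 = only main speaker).
    max_mixing_ratio: hypothetical threshold above which the reference is
        considered too contaminated for suppression to help.
    """
    if mixing_ratio > max_mixing_ratio:
        # Reference is unreliable: skip suppression and pass the input through.
        return main_signal
    return suppressed_signal


# Usage with placeholder signals (lists stand in for audio frames).
raw = [0.1, 0.2, 0.3]
cleaned = [0.05, 0.15, 0.25]
print(choose_output(raw, cleaned, mixing_ratio=0.4) is raw)
```

The point of the gate is that a contaminated reference can make the output worse than the input, so passing the unprocessed signal through is the safer choice in that case.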

Hereinafter, embodiments that specifically disclose the voice processing device and the voice processing method according to the present disclosure will be described in detail with reference to the drawings as appropriate. Unnecessarily detailed description may be omitted. For example, detailed description of well-known matters and duplicate description of substantially identical configurations may be omitted. This is to keep the following description from becoming unnecessarily redundant and to make it easier for those skilled in the art to understand. The accompanying drawings and the following description are provided so that those skilled in the art can fully understand the present disclosure, and are not intended to limit the subject matter described in the claims.

(Embodiment 1)
FIG. 1 is a block diagram showing a functional configuration example of the acoustic crosstalk suppression device 5 according to Embodiment 1. The acoustic crosstalk suppression device 5, as an example of a voice processing device, suppresses a disturbing sound mixed into a target sound and includes a DSP (Digital Signal Processor) 10 and a memory 50. Two microphones mc1 and mc2 are connected to the acoustic crosstalk suppression device 5 as input devices, and a voice recognition engine (not shown) is connected as an output device.

The microphone mc1, as an example of a sound pickup device, is a single omnidirectional microphone. It is placed, for example, so that it mainly picks up the voice uttered by the main speaker, and it acquires a voice signal (main signal) containing that voice. Similarly, the microphone mc2, as an example of a sound pickup device, is a single omnidirectional microphone. It is placed, for example, so that it mainly picks up the voice uttered by another speaker who is not the main speaker, and it acquires a voice signal (reference signal) containing that voice. The roles may be reversed: the microphone mc1 may pick up the voice uttered by the other speaker to acquire the reference signal, and the microphone mc2 may pick up the voice uttered by the main speaker to acquire the main signal. The microphones mc1 and mc2 are, for example, high-quality compact electret condenser microphones (ECM: Electret Condenser Microphone).

The voice recognition engine performs voice recognition on the voice signal output from the acoustic crosstalk suppression device 5, either after crosstalk suppression or without crosstalk suppression, and generates text data representing the content of the voice signal as the processing result. Instead of the voice recognition engine, a cloud server that performs processing such as voice recognition via a network (not shown), or a speaker capable of outputting voice, may be connected as the output device. The microphones mc1 and mc2 and the voice recognition engine may also be built into the acoustic crosstalk suppression device 5.

When, for example, two speakers (each of a plurality of persons including the main speaker) are conversing, the acoustic crosstalk suppression device 5 treats one of the two simultaneously uttered voices as the target sound and the other as the disturbing sound, suppresses the crosstalk component caused by the disturbing sound, and converts the target sound into clear voice. Specifically, the acoustic crosstalk suppression device 5 applies predetermined signal processing (described later) to a voice signal containing the disturbing sound, used as the reference signal, to generate a pseudo crosstalk signal (an example of a suppression signal) that reproduces the acoustic crosstalk component. The acoustic crosstalk suppression device 5 then removes (specifically, subtracts) the pseudo crosstalk signal from the voice signal of the target sound picked up by the microphone mc1 or the microphone mc2, thereby generating a clear voice signal (that is, one with improved sound quality) after crosstalk suppression.
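The generate-and-subtract structure described above can be sketched with a normalized LMS (NLMS) adaptive filter. This is a minimal sketch under assumptions: the text here only refers to "predetermined signal processing" and elsewhere to an adaptive filter, so NLMS, the function name `nlms_suppress`, and the parameters `taps`, `mu`, and `eps` are illustrative choices, not the patent's stated method.

```python
import numpy as np


def nlms_suppress(main, ref, taps=32, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated pseudo crosstalk signal from `main`.

    main: samples of the target-sound microphone (target + crosstalk).
    ref:  samples of the reference signal (mostly the disturbing sound).
    Returns the crosstalk-suppressed signal, same length as `main`.
    """
    main = np.asarray(main, dtype=float)
    ref = np.asarray(ref, dtype=float)
    w = np.zeros(taps)      # adaptive filter coefficients
    buf = np.zeros(taps)    # most recent reference samples, newest first
    out = np.zeros_like(main)
    for n in range(len(main)):
        buf = np.roll(buf, 1)
        buf[0] = ref[n]
        pseudo_crosstalk = w @ buf          # pseudo crosstalk estimate
        err = main[n] - pseudo_crosstalk    # subtraction step
        w += mu * err * buf / (buf @ buf + eps)  # NLMS coefficient update
        out[n] = err
    return out
```

The filter learns the path from the reference to the main microphone, so its quality depends on how cleanly the reference captures the disturbing sound. That dependence is the motivation for checking the reference before applying suppression.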

When the microphone mc1 picks up the voice uttered by the clerk hm1 (that is, the target sound), the memory 50 stores a clear voice signal of the voice uttered in the past by the customer hm2 (that is, the disturbing sound). Similarly, when the microphone mc1 picks up the voice uttered by the customer hm2 (that is, the target sound), the memory 50 stores a clear voice signal of the voice uttered in the past by the clerk hm1 (that is, the disturbing sound). The voice signal stored in the memory 50 is used as the reference signal for reproducing the acoustic crosstalk (that is, for generating the pseudo crosstalk signal described above).

The DSP 10 is a processor that performs acoustic crosstalk suppression processing on the voice signal picked up by the microphone mc1. The DSP 10 includes a single-talk detection unit 45, a sound pressure comparison unit 46, a disturbing sound mixing ratio estimation unit 41, a signal processing selection unit 42, a switching unit 43, and a suppression unit 20.

The single talk detection unit 45 detects a single talk state, in which only one of the clerk hm1 and the customer hm2 is speaking, based on the voice signals picked up by the microphone mc1 and the microphone mc2. For example, when an utterance occurs and the sound pressure of the sound picked up by one of the microphones mc1 and mc2 is higher than that of the other, the single talk detection unit 45 determines that a single talk state has been detected. The single talk detection unit 45 may also determine that a single talk state has been detected when the sounds picked up by the microphone mc1 and the microphone mc2 have the same timbre. When the microphone mc1 is placed near the clerk hm1 and the microphone mc2 is placed near the customer hm2, the sound pressure at the microphone mc1 is expected to be high and the sound pressure at the microphone mc2 low during single talk by the clerk hm1. In contrast, during double talk, in which both the clerk hm1 and the customer hm2 speak, the sound pressures at both microphones are expected to be high. The single talk detection unit 45 therefore detects the single talk state based on the difference in sound pressure between the sound picked up by the microphone mc1 and the sound picked up by the microphone mc2.
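The sound-pressure-difference test described above can be sketched as follows. This is only an illustration under stated assumptions: the frame-based RMS level, the function names, the `speech_floor` value, and the 6 dB `dominance_db` margin are hypothetical and do not come from the specification.

```python
import math


def rms(frame):
    """Root-mean-square level of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_talk_state(frame_mc1, frame_mc2, speech_floor=0.01, dominance_db=6.0):
    """Classify one pair of frames from mc1 and mc2.

    Returns 'silence', 'single_mc1', 'single_mc2', or 'double'.
    speech_floor and dominance_db are hypothetical tuning values.
    """
    p1, p2 = rms(frame_mc1), rms(frame_mc2)
    if max(p1, p2) < speech_floor:
        return "silence"
    # Level difference between the two microphones in decibels.
    diff_db = 20.0 * math.log10((p1 + 1e-12) / (p2 + 1e-12))
    if diff_db >= dominance_db:
        return "single_mc1"   # only the talker near mc1 is speaking
    if diff_db <= -dominance_db:
        return "single_mc2"   # only the talker near mc2 is speaking
    return "double"           # both levels are comparably high
```

A frame pair in which mc1 is clearly louder is classified as single talk by the talker near mc1, while comparable levels on both microphones are treated as double talk.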

The sound pressure comparison unit 46 compares the sound pressure of the sound picked up by the microphone mc1 with that of the sound picked up by the microphone mc2 in the single talk state, detected by the single talk detection unit 45, in which the clerk hm1 (the main speaker) is speaking. From this comparison, the sound pressure comparison unit 46 obtains a sound pressure ratio (that is, the ratio of the sound pressure at the microphone mc1 to the sound pressure at the microphone mc2). Similarly, the sound pressure comparison unit 46 compares the sound pressure of the sound picked up by the microphone mc1 with that of the sound picked up by the microphone mc2 in the single talk state, detected by the single talk detection unit 45, in which the customer hm2 (the other speaker) is speaking. From this comparison, it obtains a sound pressure ratio (that is, the ratio of the sound pressure at the microphone mc2 to the sound pressure at the microphone mc1). The same applies when the main speaker is the customer hm2 and the other speaker is the clerk hm1.

The disturbing sound mixing ratio estimation unit 41, an example of a mixing ratio estimation unit, estimates the mixing ratio of the disturbing sound contained in the voice signal of the speaker other than the main speaker (in other words, the reference signal) picked up by the microphone mc1 or the microphone mc2, based on the single-talk sound pressure ratios obtained by the sound pressure comparison unit 46. The mixing ratio here is the proportion of the reference signal that consists of the disturbing sound (in other words, the main signal of the main speaker). Specifically, when the main speaker is the clerk hm1, the mixing ratio is the proportion of the voice uttered by the clerk hm1 (disturbing sound) contained in the voice signal of the customer hm2 (reference signal), relative to that reference signal. Similarly, when the main speaker is the customer hm2, the mixing ratio is the proportion of the voice uttered by the customer hm2 (disturbing sound) contained in the voice signal of the clerk hm1 (reference signal), relative to that reference signal.

一例として、音圧比較部４６は、メイン話者である店員ｈｍ１のみが発話している時にマイクｍｃ１とマイクｍｃ２の音圧比率を比較する。このときマイクｍｃ１：マイクｍｃ２＝２：１であったとする。続いて、音圧比較部４６は、メイン話者である顧客ｈｍ２のみが発話している時にマイクｍｃ１とマイクｍｃ２の音圧比率を比較する。このとき、マイクｍｃ１：マイクｍｃ２＝１：１０であったとする。これらの音圧比率を分析すると、次のことが分かる。 As an example, the sound pressure comparison unit 46 compares the sound pressure ratios of the microphone mc1 and the microphone mc2 when only the clerk hm1 who is the main speaker is speaking. At this time, it is assumed that the microphone mc1: microphone mc2 = 2: 1. Subsequently, the sound pressure comparison unit 46 compares the sound pressure ratios of the microphone mc1 and the microphone mc2 when only the customer hm2 who is the main speaker is speaking. At this time, it is assumed that the microphone mc1: microphone mc2 = 1:10. Analysis of these sound pressure ratios reveals the following.

Specifically, when the clerk hm1 speaks, the share of the clerk hm1's voice picked up by the microphone mc2 is relatively large, at 1/3. When considering whether the sound picked up by the microphone mc2 can be used as a reference signal, that sound therefore contains a high proportion of the target sound (main signal) uttered by the clerk hm1, who is the main speaker and hence the disturbing sound from the viewpoint of the reference signal, so the mixing ratio of the clerk hm1's voice is large. The sound picked up by the microphone mc2 is therefore unsuitable as a reference signal.

On the other hand, when the customer hm2 speaks, the share of the customer hm2's voice picked up by the microphone mc1 is small, at 1/11. When considering whether the sound picked up by the microphone mc1 can be used as a reference signal, that sound therefore contains a low proportion of the target sound (main signal) uttered by the customer hm2, who is the main speaker and hence the disturbing sound from the viewpoint of the reference signal, so the mixing ratio of the customer hm2's voice is small. The sound picked up by the microphone mc1 is therefore suitable as a reference signal.
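The values 1/3 and 1/11 quoted in this example are consistent with dividing the level at the far microphone by the sum of the levels at both microphones. The specification does not spell out this formula, so the following sketch treats it as an assumption; the function name `leakage_share` is hypothetical.

```python
from fractions import Fraction


def leakage_share(near_mic_level, far_mic_level):
    """Share of a lone talker's total picked-up level that lands in the far microphone.

    Assumed formula: far / (near + far). It reproduces the 1/3 and 1/11
    figures of the example but is not stated explicitly in the source.
    """
    return Fraction(far_mic_level, near_mic_level + far_mic_level)


# Clerk hm1 speaking alone: mc1 : mc2 = 2 : 1, so mc2 is the far microphone.
clerk_in_mc2 = leakage_share(near_mic_level=2, far_mic_level=1)

# Customer hm2 speaking alone: mc1 : mc2 = 1 : 10, so mc1 is the far microphone.
customer_in_mc1 = leakage_share(near_mic_level=10, far_mic_level=1)
```

Under this reading, the first value is the mixing ratio when mc2 serves as the reference signal, and the second is the mixing ratio when mc1 serves as the reference signal.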

The signal processing selection unit 42, an example of a determination unit, instructs the switching unit 43 to switch based on the mixing ratio estimated by the disturbing sound mixing ratio estimation unit 41. Specifically, based on a comparison between the estimated mixing ratio and a threshold value (see FIG. 2), the signal processing selection unit 42 instructs the switching unit 43 not to suppress the crosstalk component when the reference signal is unsuitable, and to suppress the crosstalk component when the reference signal is suitable.

The switching unit 43 has a first terminal 43a, which passes the input voice signal of the main speaker to the output stage of the acoustic crosstalk suppression device 5 without going through the suppression unit 20, and a second terminal 43b, which passes the input voice signal of the main speaker to the output stage of the acoustic crosstalk suppression device 5 via the suppression unit 20. Following the instruction from the signal processing selection unit 42, the switching unit 43 switches the input of the main speaker's voice signal to the first terminal 43a or the second terminal 43b. The switching unit 43 is, for example, a mechanical, electrical, or magnetic changeover switch.
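The interplay between the signal processing selection unit 42 and the switching unit 43 can be sketched as a threshold comparison followed by a routing step. The function names, the string labels for the terminals, and any concrete threshold value are hypothetical; the specification only says that a threshold is used.

```python
def select_terminal(mixing_ratio, threshold):
    """Pick the terminal of the switching unit 43.

    '43b' routes the main signal through the suppression unit 20 and is chosen
    when the mixing ratio is at or below the threshold; '43a' bypasses it.
    """
    return "43b" if mixing_ratio <= threshold else "43a"


def route(main_sample, terminal, suppress):
    """Pass the main speaker's sample to the output, with or without suppression.

    `suppress` is any callable standing in for the suppression unit 20.
    """
    return suppress(main_sample) if terminal == "43b" else main_sample
```

With a hypothetical threshold of 0.2, a mixing ratio of 1/3 selects the bypass terminal 43a, whereas a mixing ratio of 1/11 selects terminal 43b and enables suppression.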

The suppression unit 20 includes an adder 22, a filter update unit 25, and a delay 29. In the suppression unit 20, the adder 22, an example of a crosstalk suppression unit, subtracts the pseudo crosstalk signal generated by the convolution signal generation unit 23 from the voice signal of the sound picked up by the microphone mc1. The adder 22 thereby suppresses the crosstalk component contained in the sound picked up by the microphone mc1 and outputs the voice signal after the crosstalk component has been suppressed. Strictly speaking, the operation performed by the adder 22 is a subtraction, but it can be realized either by subtracting the pseudo crosstalk signal or by adding an inverted pseudo crosstalk signal. For this reason, this specification describes it as processing performed by the adder 22.

In the following, for ease of explanation, the case is illustrated in which the voice uttered by the clerk hm1 is the target sound (the voice of the main speaker) and the voice uttered by the customer hm2 is the disturbing sound (the voice of a person other than the main speaker). The same applies when the voice uttered by the customer hm2 is the target sound and the voice uttered by the clerk hm1 is the disturbing sound.

The crosstalk component to be suppressed by the suppression unit 20 is the voice previously uttered by the customer hm2 that reaches the microphone mc1 and is superimposed on the voice of the clerk hm1 picked up by the microphone mc1. In other words, the crosstalk component picked up by the microphone mc1 is the customer hm2's voice, mixed in with an offset equal to the time that voice took to reach the clerk hm1. The suppression unit 20 therefore retains the voice previously uttered by the customer hm2 and applies signal processing to it to generate a pseudo crosstalk signal that reproduces this mixed-in voice.

フィルタ更新部２５は、畳み込み信号生成部２３、更新量計算部２６、非線形変換部２７およびノルム算出部２８を有する。 The filter update unit 25 includes a convolution signal generation unit 23, an update amount calculation unit 26, a non-linear conversion unit 27, and a norm calculation unit 28.

The convolution signal generation unit 23, an example of a filter, is an adaptive filter that generates the pseudo crosstalk signal from the reference signal; specifically, it uses an FIR (Finite Impulse Response) filter as described in JP-A-2007-19595 and elsewhere. The convolution signal generation unit 23 reproduces the transmission characteristics between the clerk hm1 and the customer hm2 with respect to the microphone mc1 and processes the reference signal to generate the pseudo crosstalk signal. However, the transmission characteristics of the place where the clerk hm1 and the customer hm2 face each other are not stationary, so the characteristics of the convolution signal generation unit 23 must also be changed as needed. In the first embodiment, the filter update unit 25 therefore controls the coefficients or the number of taps of the FIR filter so that the characteristics of the convolution signal generation unit 23 approach the latest transmission characteristics between the clerk hm1 and the customer hm2 with respect to the microphone mc1. Hereinafter, updating the adaptive filter may also be referred to as learning.

Here, as described above, the voice of the clerk hm1 picked up by the microphone mc1 is offset by the time it takes the voice of the customer hm2 to reach the microphone mc1. When the microphone mc1 picks up the voice of the clerk hm1, the voice of the customer hm2 has been stored in the memory 50 immediately before the clerk hm1 speaks, so the reference signal does not reflect the delay incurred before the customer hm2's voice reaches the microphone mc1. In the first embodiment, the delay 29 therefore absorbs this time difference, and the filter update unit 25 obtains a reference signal aligned with the timing at which the sound was picked up by the microphone mc1. That is, delaying the reference signal in the delay 29 by the time obtained by dividing the distance between the microphone mc1 and the customer hm2 by the speed of sound reproduces the sound at the timing at which it was actually picked up by the microphone mc1. The value of the delay 29 can be obtained by measuring the distance between the microphone mc1 and the customer hm2 and dividing it by the speed of sound.
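The delay calculation described here (distance divided by the speed of sound) can be sketched as below. The 343 m/s speed of sound, the sample rate used in the usage note, and the names `delay_in_samples` and `DelayLine` are assumptions made for illustration; the specification gives no numeric values.

```python
from collections import deque


def delay_in_samples(distance_m, sample_rate_hz, speed_of_sound_mps=343.0):
    """Convert a talker-to-microphone distance into a delay in whole samples."""
    delay_s = distance_m / speed_of_sound_mps
    return round(delay_s * sample_rate_hz)


class DelayLine:
    """Fixed delay of n samples, standing in for the delay 29."""

    def __init__(self, n):
        # Pre-filled with zeros so the first n outputs are silence.
        self.buf = deque([0.0] * n, maxlen=n) if n > 0 else None

    def push(self, sample):
        """Feed one reference sample and return the sample from n steps earlier."""
        if self.buf is None:
            return sample
        delayed = self.buf[0]
        self.buf.append(sample)  # the oldest entry drops out automatically
        return delayed
```

For example, `delay_in_samples(1.0, 16000)` gives the number of samples by which a reference signal would be delayed for a 1 m path at a 16 kHz sample rate, and a `DelayLine` of that length applies the delay sample by sample.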

非線形変換部２７は、音響クロストーク抑圧後の信号に対して非線形変換を行う。この非線形変換は、音響クロストーク抑圧後の信号をフィルタの更新すべき方向（正か負）を指し示す情報へと変換する処理である。非線形変換部２７は、非線形変換した後の信号を更新量計算部２６に出力する。 The non-linear conversion unit 27 performs non-linear conversion on the signal after suppressing the acoustic crosstalk. This non-linear conversion is a process of converting the signal after suppressing the acoustic crosstalk into information indicating the direction (positive or negative) to be updated of the filter. The non-linear conversion unit 27 outputs the signal after the non-linear conversion to the update amount calculation unit 26.

The norm calculation unit 28 calculates the norm of the voice signal previously uttered by the customer hm2. This norm is the sum of the magnitudes of the voice signal uttered by the customer hm2 within a predetermined past period, and it indicates the overall signal level within that period. The update amount calculation unit 26 uses the norm to normalize the influence of the volume of the customer hm2's voice. In general, the louder the sound, the larger the calculated filter update amount, so without normalization the characteristics of the convolution signal generation unit 23 would be excessively influenced by loud sounds. In the first embodiment, the voice signal output from the delay 29 is therefore normalized using the norm calculated by the norm calculation unit 28, which stabilizes the update amount of the convolution signal generation unit 23.

The update amount calculation unit 26 calculates the update amount of the filter characteristics of the convolution signal generation unit 23 (specifically, the update amount of the FIR filter coefficients or number of taps) from the signals received from the nonlinear conversion unit 27, the norm calculation unit 28, and the delay 29. Specifically, it normalizes the voice previously uttered by the customer hm2, received from the delay 29, based on the norm calculated by the norm calculation unit 28. It then determines the update amount by attaching a positive or negative sign to the normalized result based on the information obtained from the nonlinear conversion unit 27. In the first embodiment, the update amount calculation unit 26 calculates the update amount of the filter characteristics using an ICA (Independent Component Analysis) algorithm or an NLMS (Normalized Least Mean Square) algorithm.
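As a point of reference, a textbook NLMS update for an FIR filter looks like the sketch below. It is not the update rule of the specification: the sign-only nonlinear conversion of unit 27 is not modeled, the squared-sum normalization is the standard NLMS choice rather than the norm defined above, and the step size, `eps`, the two-tap `true_path`, and all function names are hypothetical.

```python
import random


def nlms_step(weights, ref_window, mic_sample, step=0.5, eps=1e-8):
    """One NLMS iteration.

    ref_window holds the most recent delayed reference samples, newest first.
    Returns (suppressed_sample, new_weights).
    """
    pseudo_crosstalk = sum(w * x for w, x in zip(weights, ref_window))
    error = mic_sample - pseudo_crosstalk          # output of the subtraction
    norm = sum(x * x for x in ref_window) + eps    # energy of the reference window
    new_weights = [w + step * error * x / norm
                   for w, x in zip(weights, ref_window)]
    return error, new_weights


def demo(num_samples=2000, seed=0):
    """Learn a hypothetical two-tap crosstalk path from crosstalk-only input."""
    rng = random.Random(seed)
    true_path = [0.5, -0.2]   # hypothetical path from the other talker to mc1
    weights = [0.0, 0.0]
    window = [0.0, 0.0]
    for _ in range(num_samples):
        window = [rng.uniform(-1.0, 1.0)] + window[:-1]
        mic = sum(h * x for h, x in zip(true_path, window))
        _, weights = nlms_step(weights, window, mic)
    return weights
```

In this noise-free setting, in which only the other speaker is talking, the learned weights approach the hypothetical path, which mirrors the idea of bringing the filter characteristics closer to the actual transmission characteristics.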

By executing the processing of the update amount calculation unit 26, the nonlinear conversion unit 27, and the norm calculation unit 28 as needed, the filter update unit 25 can bring the characteristics of the convolution signal generation unit 23 closer to the transmission characteristics between the microphone mc1, which picks up the voice of the clerk hm1, and the customer hm2. When the voice uttered by the customer hm2 is the target sound and the voice uttered by the clerk hm1 is the disturbing sound, the filter update unit 25 brings the characteristics of the convolution signal generation unit 23 closer to the transmission characteristics between the microphone mc1, which picks up the voice of the customer hm2, and the clerk hm1.

次に、実施の形態１に係る音響クロストーク抑圧装置５の動作を示す。 Next, the operation of the acoustic crosstalk suppression device 5 according to the first embodiment will be shown.

図２は、実施の形態１に係る音響クロストーク抑圧動作手順例を示すフローチャートである。図２の説明において、この処理は、マイクｍｃ１，ｍｃ２で収音される音声の音声信号に対し、１サンプル毎に実行される。 FIG. 2 is a flowchart showing an example of an acoustic crosstalk suppression operation procedure according to the first embodiment. In the description of FIG. 2, this process is executed for each sample with respect to the audio signal of the audio picked up by the microphones mc1 and mc2.

In FIG. 2, the microphone mc1 picks up the voice uttered by the clerk hm1, who is the main speaker, and acquires the main signal to be subjected to voice recognition (S1). The microphone mc2 picks up the voice uttered by the customer hm2, who is not the main speaker, and acquires the reference signal (S2). The DSP 10 stores this reference signal in the memory 50.

The single talk detection unit 45 detects a single talk state, in which only one of the clerk hm1 and the customer hm2 is speaking, based on the sounds picked up by the microphones mc1 and mc2 (S3). When a single talk state is detected, the sound pressure comparison unit 46 compares, in the single talk state in which the clerk hm1 (the main speaker) is speaking, the sound pressure of the sound picked up by the microphone mc1 with that of the sound picked up by the microphone mc2 to obtain the sound pressure ratio (see above) (S4). Similarly, in the single talk state in which the customer hm2 (the other speaker) is speaking, the sound pressure comparison unit 46 compares the sound pressure of the sound picked up by the microphone mc1 with that of the sound picked up by the microphone mc2 to obtain the sound pressure ratio (see above).

Based on the single-talk sound pressure ratios obtained by the sound pressure comparison unit 46, the disturbing sound mixing ratio estimation unit 41 estimates the mixing ratio (see above) of the disturbing sound contained in the voice signal (reference signal) of the sound picked up by the microphone mc2 (or the microphone mc1) (S5).

The disturbing sound mixing ratio estimation unit 41 determines whether the estimated mixing ratio is equal to or less than the threshold value (S6). The threshold value is set to the proportion of disturbing sound (in other words, the main speaker's voice) in the reference signal at which acoustic crosstalk suppression processing is considered not to degrade the main speaker's voice (that is, not to increase the disturbing sound).

混合率が閾値を超える場合（Ｓ６、ＮＯ）、ＤＳＰ１０は、図２に示す本処理を終了する。つまり、この場合には、クロストーク成分の抑圧が行われないので、メイン話者である店員ｈｍ１の主信号（音声信号）がそのまま音響クロストーク抑圧装置５の出力段に出力される。 When the mixing ratio exceeds the threshold value (S6, NO), DSP10 ends the present process shown in FIG. That is, in this case, since the crosstalk component is not suppressed, the main signal (voice signal) of the clerk hm1 who is the main speaker is output as it is to the output stage of the acoustic crosstalk suppression device 5.

On the other hand, when the mixing ratio is equal to or less than the threshold value (S6, YES), the filter update unit 25 reads the corresponding filter coefficients stored in a memory (not shown) built into the filter update unit 25 and sets them in the convolution signal generation unit 23 (S7). The convolution signal generation unit 23 uses the reference signal picked up by the microphone mc2 and delayed by the delay 29 to generate a crosstalk suppression signal (an example of a suppression signal) corresponding to the pseudo crosstalk signal. That is, using the latest filter coefficients updated by the update amount calculation unit 26, the convolution signal generation unit 23 convolves the reference signal shifted by the delay time and thereby generates the crosstalk suppression signal (see above).

The adder 22 subtracts the crosstalk suppression signal generated by the convolution signal generation unit 23 from the voice signal of the sound picked up by the microphone mc1, thereby suppressing the crosstalk component contained in the sound picked up by the microphone mc1 (S8).

ＤＳＰ１０は、フィルタ学習期間であるか否かを判別する（Ｓ９）。フィルタ学習期間は、メイン話者である店員ｈｍ１に対し、他の話者である顧客ｈｍ２が発話している期間である。また、フィルタ学習期間でない期間は、他の話者である顧客ｈｍ２が発話していない期間である。フィルタ学習期間である場合（Ｓ９、ＹＥＳ）、フィルタ更新部２５は、それぞれ更新量計算部２６で計算されるフィルタ係数で畳み込み信号生成部２３のフィルタ係数を更新し、フィルタ更新部２５に内蔵されるメモリ（図示略）に記憶する（Ｓ１０）。一方、フィルタ学習期間でない場合（Ｓ９、ＮＯ）、ＤＳＰ１０は、図２に示す本処理を終了する。 DSP10 determines whether or not it is the filter learning period (S9). The filter learning period is a period in which the customer hm2, who is another speaker, speaks to the clerk hm1 who is the main speaker. Further, the period other than the filter learning period is a period during which the customer hm2, who is another speaker, does not speak. In the case of the filter learning period (S9, YES), the filter update unit 25 updates the filter coefficient of the convolution signal generation unit 23 with the filter coefficient calculated by the update amount calculation unit 26, and is built in the filter update unit 25. It is stored in a memory (not shown) (S10). On the other hand, when it is not the filter learning period (S9, NO), DSP10 ends the present process shown in FIG.
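The ordering of steps S6 to S10 for a single sample can be summarized in the sketch below. It is a simplified illustration: the function name `process_sample`, the argument layout, and the NLMS-style coefficient update are assumptions, and the single-talk detection and mixing ratio estimation of S3 to S5 are taken as already done.

```python
def process_sample(main, ref_window, weights, mixing_ratio, threshold,
                   learning, step=0.5, eps=1e-8):
    """One pass through S6 to S10 for a single sample.

    main        : sample of the main speaker's signal (microphone mc1)
    ref_window  : recent delayed reference samples, newest first
    learning    : True during the filter learning period
    Returns (output_sample, weights).
    """
    if mixing_ratio > threshold:                 # S6: NO, output the main signal as is
        return main, weights
    pseudo = sum(w * x for w, x in zip(weights, ref_window))   # S7: suppression signal
    out = main - pseudo                          # S8: subtraction in the adder 22
    if learning:                                 # S9: YES
        norm = sum(x * x for x in ref_window) + eps
        weights = [w + step * out * x / norm     # S10: assumed NLMS-style update
                   for w, x in zip(weights, ref_window)]
    return out, weights
```

When the mixing ratio exceeds the threshold, the main signal is returned unchanged and the filter is left untouched; otherwise the sample is suppressed, and the coefficients are updated only during the learning period.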

As described above, the acoustic crosstalk suppression device 5 according to the first embodiment is connected to two microphones mc1 and mc2 arranged in a closed space, for example a store in which the clerk hm1 and the customer hm2 converse. Based on the voice signals picked up by each of the two microphones mc1 and mc2, the acoustic crosstalk suppression device 5 uses the single talk detection unit 45 to detect a single talk state in which the clerk hm1 or the customer hm2 present in the store (an example of any one of a plurality of persons including the main speaker) is speaking. The acoustic crosstalk suppression device 5 uses the disturbing sound mixing ratio estimation unit 41 to estimate a mixing ratio, which indicates the proportion of the main speaker's voice signal contained in the voice signal of the other speaker (an example of a person other than the main speaker). The estimate is based on the sound pressure ratio of the voice signals picked up by the two microphones mc1 and mc2 in the single talk state of the clerk hm1, who is the main speaker, and the sound pressure ratio of the voice signals picked up by the two microphones mc1 and mc2 in the single talk state of the customer hm2, who is the other speaker (an example of a person other than the main speaker). Based on the estimated mixing ratio, the acoustic crosstalk suppression device 5 uses the signal processing selection unit 42 to determine whether the crosstalk component caused by the other speaker's utterance and contained in the main speaker's voice signal needs to be suppressed.

これにより、音響クロストーク抑圧装置５は、店舗などの閉空間に存在する複数の話者（例えば店員ｈｍ１および顧客ｈｍ２）の状況に応じて、メイン話者（例えば店員ｈｍ１）の発話音声に含まれ得る他の話者（例えば顧客ｈｍ２）の発話音声による音響的なクロストーク成分を適応的に抑圧できる。したがって、音響クロストーク抑圧装置５は、メイン話者の発話音声の音質を改善できる。 As a result, the acoustic crosstalk suppression device 5 is included in the utterance voice of the main speaker (for example, clerk hm1) according to the situation of a plurality of speakers (for example, clerk hm1 and customer hm2) existing in a closed space such as a store. The acoustic cross-talk component due to the spoken voice of another possible speaker (for example, customer hm2) can be adaptively suppressed. Therefore, the acoustic crosstalk suppression device 5 can improve the sound quality of the uttered voice of the main speaker.

Further, when the signal processing selection unit 42 determines that the estimated mixing ratio is equal to or less than a predetermined threshold value, it decides to suppress the crosstalk component caused by the other speaker's utterance and contained in the main speaker's voice signal. As a result, the acoustic crosstalk suppression device 5 can effectively suppress the crosstalk component when using the voice signal of the other speaker's voice as the reference signal.

Further, when the signal processing selection unit 42 determines that the estimated mixing ratio is larger than the predetermined threshold value, it decides not to suppress the crosstalk component caused by the other speaker's utterance and contained in the main speaker's voice signal. This prevents the situation in which suppressing the crosstalk component instead increases the amount of the other speaker's voice mixed into the main speaker's voice and makes the main speaker's voice less clear. In addition, omitting the crosstalk suppression processing reduces the processing load on the DSP 10.

Further, the acoustic crosstalk suppression device 5 includes a filter update unit 25 and an adder 22. The filter update unit 25 has a convolution signal generation unit 23 that generates a suppression signal for the crosstalk component caused by the other speaker's utterance and contained in the main speaker's voice signal, updates the parameters of the convolution signal generation unit 23 used to suppress the crosstalk component, and holds the update result in a memory. The adder 22 suppresses the crosstalk component contained in the main speaker's voice signal using the suppression signal generated by the convolution signal generation unit 23. As a result, the acoustic crosstalk suppression device 5 can adaptively suppress, according to the situation of the clerk hm1 and the customer hm2 in the store, the acoustic crosstalk component caused by the customer hm2 that may be contained in the uttered voice of the main speaker (for example, the clerk hm1), and can improve the sound quality of the clerk hm1's uttered voice. Therefore, even if the sound field in the store changes, for example when the clerk hm1 or the customer hm2 stands up and leaves their seat, the suppression performance for the crosstalk component can be gradually improved in accordance with the change in the sound field.

また、畳み込み信号生成部２３は、メモリに保持されている最新の畳み込み信号生成部２３のパラメータの更新結果を用いて、クロストーク成分の抑圧信号を生成する。これにより、音響クロストーク抑圧装置５は、同様の話者状況が継続する場合には、その話者状況に応じて既に算出された適応的なクロストーク成分を継続して求めることができるので、メイン話者の発話音声に含まれるクロストーク成分を効果的に抑圧できる。 Further, the convolution signal generation unit 23 generates a crosstalk component suppression signal by using the latest parameter update result of the convolution signal generation unit 23 held in the memory. As a result, when the same speaker situation continues, the acoustic crosstalk suppression device 5 can continuously obtain the adaptive crosstalk component already calculated according to the speaker situation. The crosstalk component contained in the voice of the main speaker can be effectively suppressed.

Further, the acoustic crosstalk suppression device 5 includes a switching unit 43 that has a first terminal 43a, which passes the input voice signal of the main speaker to the output stage of the acoustic crosstalk suppression device 5 without going through the adder 22, and a second terminal 43b, which passes the input voice signal of the main speaker to the output stage of the acoustic crosstalk suppression device 5 via the adder 22. The switching unit 43 switches the input of the main speaker's voice signal to the first terminal 43a or the second terminal 43b according to the result of the determination, made by the signal processing selection unit 42, of whether the crosstalk component needs to be suppressed. As a result, the acoustic crosstalk suppression device 5 can easily switch between outputting a voice signal with crosstalk suppression and a voice signal without crosstalk suppression by using a mechanical, electrical, or magnetic changeover switch.

(Embodiment 2)
The acoustic crosstalk suppression device 5A according to the second embodiment uses a microphone array capable of forming directivity in an arbitrary direction. FIG. 3 is a block diagram showing a functional configuration example of the acoustic crosstalk suppression device 5A according to the second embodiment. Components identical to those of the first embodiment are given the same reference numerals and their description is omitted; only the differences are described here. Compared with the first embodiment, the acoustic crosstalk suppression device 5A includes a microphone array mA instead of the microphones mc1 and mc2.

The microphone array mA, an example of a sound collecting device, has a plurality of (for example, 16) omnidirectional microphone elements m11, m12, ..., m1n and a microphone array processing unit md. It is a directional microphone capable of forming directivity (beamforming) toward each of the two speakers described in the first embodiment (for example, the clerk hm1 and the customer hm2). The microphone array mA, as an example of a directivity processing unit, can form directivity in a predetermined direction in the microphone array processing unit md using the plurality of omnidirectional microphone elements. The technique for forming this directivity is known, as shown for example in Japanese Patent Application Laid-Open No. 2015-29241. The microphone array processing unit md may be configured to be included in the DSP 10.
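As an illustration of forming directivity from omnidirectional elements, the sketch below uses delay-and-sum beamforming. This disclosure does not specify which beamforming method the microphone array processing unit md uses, so the method, the linear array geometry, the far-field assumption, the sign convention for the angle, and the function names are all assumptions made for the example.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value

def steering_delays(positions_m, angle_deg, fs):
    """Integer sample delays for a far-field source at angle_deg.

    Assumed convention: elements lie on a line at positions_m (meters), and an
    element with a larger position * sin(angle) receives the wavefront later.
    """
    s = math.sin(math.radians(angle_deg))
    raw = [p * s / SPEED_OF_SOUND * fs for p in positions_m]
    base = min(raw)
    return [round(r - base) for r in raw]

def delay_and_sum(channels, delays):
    """Advance each channel by its arrival delay and average the result."""
    n = len(channels[0])
    out = []
    for t in range(n):
        acc = 0.0
        for ch, d in zip(channels, delays):
            idx = t + d
            if 0 <= idx < n:
                acc += ch[idx]
        out.append(acc / len(channels))
    return out
```

Steering toward the actual source direction adds the element signals coherently, while steering elsewhere spreads them in time and lowers the output level, which is the effect that the later sound pressure comparison relies on.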

The memory 50 stores the audio signal of speech previously uttered by the customer hm2, for use when the microphone array mA forms directivity in the direction d1 of the clerk hm1 and picks up sound. Similarly, the memory 50 stores the audio signal of speech previously uttered by the clerk hm1, for use when the microphone array mA forms directivity in the direction d2 of the customer hm2 and picks up sound. These signals are used as reference signals for reproducing the acoustic crosstalk (that is, for generating the pseudo-crosstalk signal described above).

Further, the acoustic crosstalk suppression device 5A includes a single talk detection unit 45A, a sound pressure comparison unit 46A, and a disturbing sound mixing ratio estimation unit 41A, which differ from the single talk detection unit 45, the sound pressure comparison unit 46, and the disturbing sound mixing ratio estimation unit 41 of the first embodiment.

Like the single talk detection unit 45 of the first embodiment, the single talk detection unit 45A detects a single talk state in which only one of the clerk hm1 and the customer hm2 is speaking. It does so based on the sound obtained when the microphone array mA forms the first directivity in the direction d1 of the clerk hm1 and the sound obtained when the microphone array mA forms the second directivity in the direction d2 of the customer hm2.

The sound pressure comparison unit 46A acquires the sound pressure of the audio signal of the clerk hm1 after the first directivity (see above) has been formed from the microphone array mA in the direction d1 of the clerk hm1 (the main speaker) in the single talk state of the clerk hm1. The sound pressure comparison unit 46A may instead acquire the difference in sound pressure of the clerk hm1's audio signal before and after the first directivity is formed in the direction d1 in the single talk state of the clerk hm1.

The sound pressure comparison unit 46A also acquires the sound pressure of the audio signal of the customer hm2 after the second directivity (see above) has been formed from the microphone array mA in the direction d2 of the customer hm2 in the single talk state of the customer hm2. The sound pressure comparison unit 46A may instead acquire the difference in sound pressure of the customer hm2's audio signal before and after the second directivity is formed in the direction d2 in the single talk state of the customer hm2.

The disturbing sound mixing ratio estimation unit 41A, an example of a mixing ratio estimation unit, estimates the mixing ratio (see above) of the disturbing sound contained in the audio signal (reference signal) of the sound picked up by the microphone mc1 or the microphone mc2, based on the single-talk sound pressure or sound pressure difference obtained by the sound pressure comparison unit 46A.
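The passage above does not give a formula for the mixing ratio, so the following is only one plausible sketch. It assumes the ratio is the share of the reference channel's level that comes from the main speaker, computed from RMS levels measured in the two single talk states; the function names and the formula itself are hypothetical.

```python
import math

def rms(samples):
    """Root-mean-square level of a block of samples (0.0 for an empty block)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(v * v for v in samples) / len(samples))

def estimate_mixing_ratio(ref_during_main_talk, ref_during_other_talk):
    """Hypothetical mixing ratio in the range 0..1 (assumed formula).

    ref_during_main_talk:  reference-channel samples while only the main
                           speaker talks (the disturbing sound in the reference)
    ref_during_other_talk: reference-channel samples while only the other
                           speaker talks (the wanted part of the reference)
    """
    leak = rms(ref_during_main_talk)
    wanted = rms(ref_during_other_talk)
    total = leak + wanted
    return leak / total if total > 0 else 0.0
```

Under this assumed definition, a value near 0 means the reference is almost free of the main speaker's voice, and a value near 0.5 or above means the reference is heavily contaminated.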

The signal processing selection unit 42, an example of a determination unit, instructs the switching unit 43 to switch based on the mixing ratio estimated by the disturbing sound mixing ratio estimation unit 41A.

As an example, suppose the microphone array mA is placed closer to the clerk hm1. When the microphone array mA forms directivity in the direction d1 of the clerk hm1 and picks up sound, the proportion of the customer hm2's voice mixed into the clerk hm1's voice is small. Therefore, when the microphone array mA forms directivity in the direction d2 of the customer hm2 as the main speaker and the suppression unit 20 obtains the sound after crosstalk suppression, the sound picked up with directivity formed in the direction d1 of the clerk hm1, the other speaker, is suitable as the reference signal used for suppressing the acoustic crosstalk component. The signal processing selection unit 42 therefore instructs the switching unit 43 to perform crosstalk suppression.

On the other hand, when the microphone array mA forms directivity in the direction d2 of the customer hm2 and picks up sound, the proportion of the clerk hm1's voice mixed into the customer hm2's voice is large. Therefore, when the microphone array mA forms directivity in the direction d1 of the clerk hm1 as the main speaker and the suppression unit 20 obtains the sound after crosstalk suppression, the sound picked up with directivity formed in the direction d2 of the customer hm2, the other speaker, is not suitable as the reference signal used for suppressing the acoustic crosstalk component. The signal processing selection unit 42 therefore instructs the switching unit 43 not to perform crosstalk suppression.

For sound picked up with directivity formed in the direction of the clerk hm1 as the main speaker, acoustic crosstalk suppression is not performed, so the switching unit 43 switches the audio signal input from the microphone array mA to the first terminal 43a. For sound picked up with directivity formed in the direction of the customer hm2 as the main speaker, acoustic crosstalk suppression is performed, so the switching unit 43 switches the audio signal input from the microphone array mA to the second terminal 43b.
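The selection and switching described above can be sketched as a threshold test followed by a routing step. The rule "suppress when the mixing ratio is at or below a threshold" follows the claims of this publication; the threshold value 0.3, the enum, and the function names are illustrative assumptions.

```python
from enum import Enum

class Terminal(Enum):
    BYPASS = "43a"    # first terminal: output without crosstalk suppression
    SUPPRESS = "43b"  # second terminal: output via the suppression path

def select_terminal(mixing_ratio, threshold=0.3):
    """Pick the terminal; the threshold of 0.3 is an arbitrary example value."""
    if mixing_ratio <= threshold:
        return Terminal.SUPPRESS
    return Terminal.BYPASS

def route(main_frame, suppressed_frame, terminal):
    """Return the frame that reaches the output stage for the chosen terminal."""
    return suppressed_frame if terminal is Terminal.SUPPRESS else main_frame
```

In the example of the preceding paragraphs, the customer-directed beam would be routed through `Terminal.SUPPRESS` because its reference (the clerk-directed beam) has a low mixing ratio, while the clerk-directed beam would take `Terminal.BYPASS`.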

FIG. 4 is a flowchart showing an example of the acoustic crosstalk suppression operation procedure according to the second embodiment. In the description of FIG. 4, steps identical to those of the first embodiment are given the same step numbers and their description is omitted.

In FIG. 4, the microphone array mA picks up speech uttered in the store where the clerk hm1 and the customer hm2 are present (S01). The microphone array mA forms the first directivity in the direction d1 of the clerk hm1 for the picked-up audio signal and acquires the audio signal (main signal) of the clerk hm1, the main speaker (S1A). Similarly, the microphone array mA forms the second directivity in the direction d2 of the customer hm2 for the picked-up audio signal and acquires the audio signal (reference signal) of the customer hm2, the other speaker (S2A). The processing from step S3 onward is the same as in the first embodiment.

As described above, the acoustic crosstalk suppression device 5A includes the microphone array processing unit md, which forms different directivities in the directions from the microphone array mA to the main speaker and to the other speaker, based on the audio signals picked up by each of the omnidirectional microphone elements m11 to m1n of the microphone array mA. The disturbing sound mixing ratio estimation unit 41A estimates the mixing ratio based on the sound pressure of the clerk hm1's audio signal after the first directivity is formed from the microphone array mA in the direction d1 of the clerk hm1, the main speaker, in the single talk state of the clerk hm1, and on the sound pressure of the customer hm2's audio signal after the second directivity is formed from the microphone array mA in the direction d2 of the customer hm2, the other speaker, in the single talk state of the customer hm2.

As a result, the acoustic crosstalk suppression device 5A can decide whether to perform acoustic crosstalk suppression while taking the directivity performance of the microphone array mA into account. In addition, picking up sound with directivity formed in the direction d2 of the customer hm2 lowers the proportion (mixing ratio) of the clerk hm1's voice (disturbing sound) mixed into the customer hm2's voice, which is used as the reference signal. This raises the likelihood that crosstalk suppression is applied to the speech uttered by the clerk hm1.

(Modified Example of Embodiment 2)
In the modified example of the second embodiment, the acoustic crosstalk suppression device 5B detects the single talk state based on sound source direction information. FIG. 5 is a block diagram showing a functional configuration example of the acoustic crosstalk suppression device 5B according to the modified example of the second embodiment. Components of the acoustic crosstalk suppression device 5B identical to those of the acoustic crosstalk suppression device 5A according to the second embodiment are given the same reference numerals and their description is omitted; only the differing components are described here.

The acoustic crosstalk suppression device 5B has a single talk detection unit 45B, a sound pressure comparison unit 46B, and a disturbing sound mixing ratio estimation unit 41B, which differ from those of the second embodiment, as well as a memory 53. The single talk detection unit 45B receives the sound source direction information stored in the memory 53 and detects the single talk state. The sound source direction information is, for example, a sound pressure heat map in which each pixel of a 360-degree fisheye image captured by an omnidirectional camera (not shown) is assigned the sound pressure value calculated for the position corresponding to that pixel. The sound pressure heat map is created by an external device (not shown) separate from the acoustic crosstalk suppression device 5B and is stored in the memory 53 in advance. To generate the sound pressure heat map, the external device has, for example, a microphone array with an omnidirectional camera. This microphone array has a plurality of (for example, 16) microphone elements arranged in a ring, mounted coaxially with the omnidirectional camera so that the array surrounds the camera. Sound source direction analysis is a known technique, as disclosed for example in Japanese Patent Application Laid-Open No. 2020-12704. When installed, for example, on the ceiling of a room or on a wall near the ceiling, the microphone array with an omnidirectional camera forms directivity in each direction of the image captured by the omnidirectional camera, picks up sound, and acquires the sound pressure in each direction as a sound pressure heat map.

FIG. 6 is a diagram showing an image GZ1 captured by the omnidirectional camera with the sound pressure heat map superimposed. Once a person in the image captured by the omnidirectional camera is identified, the microphone array can form directivity in that direction and pick up the voice of that person. In FIG. 6, the microphone array with an omnidirectional camera performs beamforming over the range of the captured image that includes the clerk hm1 and the customer hm2, and generates the sound pressure heat map.

The single talk detection unit 45B detects the single talk state when there is exactly one location on the sound pressure heat map where the sound pressure of a speaker's voice is equal to or higher than a predetermined value. In other words, when a sound pressure at or above the predetermined value appears at only one location on the heat map (shown as dense dots in FIG. 6), the single talk state is judged to have been detected.
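The one-location test above can be sketched as counting connected regions of the heat map that reach the threshold. How the actual device groups pixels into a "location" is not stated, so the 4-neighbor connectivity, the nested-list heat map format, and the function names here are assumptions.

```python
def count_loud_regions(heatmap, threshold):
    """Count 4-connected regions whose sound pressure is >= threshold.

    heatmap is assumed to be a rectangular list of rows of numbers.
    """
    if not heatmap or not heatmap[0]:
        return 0
    rows, cols = len(heatmap), len(heatmap[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = 0
    for r in range(rows):
        for c in range(cols):
            if heatmap[r][c] < threshold or seen[r][c]:
                continue
            regions += 1
            seen[r][c] = True
            stack = [(r, c)]
            while stack:  # flood fill over the current loud region
                y, x = stack.pop()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and heatmap[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
    return regions

def is_single_talk(heatmap, threshold):
    """Single talk is assumed to mean exactly one region at or above threshold."""
    return count_loud_regions(heatmap, threshold) == 1
```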

The sound pressure comparison unit 46B acquires the sound pressure ratio of the corresponding audio signal for which the microphone array processing unit md has formed directivity toward the clerk hm1 in the single talk state of the clerk hm1. The sound pressure comparison unit 46B also acquires the sound pressure ratio of the corresponding audio signal for which the microphone array processing unit md has formed directivity toward the customer hm2 in the single talk state of the customer hm2.

The disturbing sound mixing ratio estimation unit 41B estimates the mixing ratio (see above) based on the sound source direction information, the sound pressure ratio of the corresponding audio signal for which the microphone array processing unit md has formed directivity toward the clerk hm1 in the single talk state of the clerk hm1, and the sound pressure ratio of the corresponding audio signal for which the microphone array processing unit md has formed directivity toward the customer hm2 in the single talk state of the customer hm2.

When the single talk state is detected using sound source direction information, camera video may be used as the sound source direction information. In that case, for example, the single talk state is judged to have been detected when only one person in the video captured by the omnidirectional camera is moving their mouth.

As described above, the acoustic crosstalk suppression device 5B includes the microphone array processing unit md, which forms different directivities in the directions from the microphone array mA to the main speaker and to the other speaker, based on the audio signals picked up by each of the omnidirectional microphone elements m11 to m1n of the microphone array mA. The single talk detection unit 45B acquires sound source direction information indicating the directions to the clerk hm1, the main speaker in the store, and to the customer hm2, the other speaker, and detects the single talk state based on this information. The disturbing sound mixing ratio estimation unit 41B estimates the mixing ratio based on the sound source direction information, the sound pressure ratio of the corresponding audio signal for which the microphone array processing unit md has formed directivity toward the clerk hm1 in the single talk state of the clerk hm1, and the sound pressure ratio of the corresponding audio signal for which the microphone array processing unit md has formed directivity toward the customer hm2 in the single talk state of the customer hm2.

As a result, because the single talk detection unit 45B acquires the sound source direction information, the acoustic crosstalk suppression device 5B can quickly detect the single talk state and obtain the mixing ratio. The processing load of single talk detection can also be reduced.

Although various embodiments have been described above with reference to the drawings, it goes without saying that the present disclosure is not limited to these examples. It is clear that a person skilled in the art could conceive of various changes, modifications, substitutions, additions, deletions, and equivalents within the scope of the claims, and these are naturally understood to fall within the technical scope of the present disclosure. The components of the embodiments described above may also be combined in any way without departing from the gist of the invention.

For example, the first embodiment described above provides two microphones, the microphone mc1 for the clerk hm1 and the microphone mc2 for the customer hm2, but at least one of these microphones may be built into a headset. This lowers the sound pressure of the disturbing sound contained in the audio signal used as the reference signal, making it more likely that acoustic crosstalk suppression is performed.

In either of the first and second embodiments described above, when the mixing ratio estimated by the disturbing sound mixing ratio estimation unit 41 is equal to or less than the threshold, the update amount calculation unit 26 may change the algorithm (NLMS algorithm, ICA algorithm, or the like) used to calculate the adaptive filter parameters according to the value of the mixing ratio, so that the parameters can be set to more suitable values.

The acoustic crosstalk suppression device may also be used as a howling canceller. A howling canceller, for example in a karaoke box, suppresses as disturbing sound the speaker's own voice that is reproduced by a loudspeaker and picked up again by the microphone. The acoustic crosstalk suppression device may also be used as an echo canceller. An echo canceller, for example in a vehicle cabin, suppresses as disturbing sound the voice of another speaker that is output from a loudspeaker and picked up by the main speaker's microphone.

The present disclosure is useful as a voice processing device and a voice processing method that adaptively suppress, according to the situation of a plurality of speakers in a closed space, the acoustic crosstalk component caused by other speakers' speech that may be contained in the main speaker's speech, and thereby improve the sound quality of the main speaker's speech.

5, 5A, 5B Acoustic crosstalk suppression device
22 Adder
23 Convolution signal generation unit
25 Filter update unit
26 Update amount calculation unit
27 Non-linear conversion unit
28 Norm calculation unit
29 Delay
41, 41A, 41B Disturbing sound mixing ratio estimation unit
42 Signal processing selection unit
43 Switching unit
43a First terminal
43b Second terminal
45 Single talk detection unit
46 Sound pressure comparison unit
mA Microphone array
mc1, mc2 Microphone

Claims

A voice processing device connected to a plurality of microphones arranged in a closed space, comprising:
a single talk detection unit that detects, based on the audio signals picked up by each of the plurality of microphones, a single talk state in which any one of a plurality of persons, including a main speaker, present in the closed space is speaking;
a mixing ratio estimation unit that estimates a mixing ratio indicating the proportion of the main speaker's audio signal mixed into the audio signal of another person other than the main speaker, based on the sound pressure ratio of the audio signals picked up by each of the plurality of microphones in the single talk state of the main speaker and the sound pressure ratio of the audio signals picked up by each of the plurality of microphones in the single talk state of the other person; and
a determination unit that determines, based on the estimation result of the mixing ratio, whether to suppress a crosstalk component that is caused by the utterance of the other person and is included in the audio signal of the main speaker.

The voice processing device according to claim 1, wherein, when the determination unit determines that the estimation result of the mixing ratio is equal to or less than a predetermined threshold, the determination unit determines that the crosstalk component caused by the utterance of the other person and included in the audio signal of the main speaker is to be suppressed.

The voice processing device according to claim 1, wherein, when the determination unit determines that the estimation result of the mixing ratio is larger than a predetermined threshold, the determination unit determines that the crosstalk component caused by the utterance of the other person and included in the audio signal of the main speaker is not to be suppressed.

The voice processing device according to claim 1, further comprising:
a filter update unit that has a filter generating a suppression signal for the crosstalk component caused by the utterance of the other person and included in the audio signal of the main speaker, updates the parameters of the filter for suppressing the crosstalk component, and holds the update result; and
a crosstalk suppression unit that suppresses the crosstalk component included in the audio signal of the main speaker using the suppression signal generated by the filter.

The voice processing device according to claim 1, further comprising a directivity processing unit that forms different directivities in the directions from a sound collecting device to the main speaker and to the other person, based on the audio signals picked up by the sound collecting device accommodating each of the plurality of microphones,
wherein the mixing ratio estimation unit estimates the mixing ratio based on the sound pressure of the audio signal of the main speaker after a first directivity is formed from the sound collecting device in the direction of the main speaker in the single talk state of the main speaker, and the sound pressure of the audio signal of the other person after a second directivity is formed from the sound collecting device in the direction of the other person in the single talk state of the other person.

The voice processing device according to claim 1, further comprising a directivity processing unit that forms different directivities in the directions from a sound collecting device to the main speaker and to the other person, based on the audio signals picked up by the sound collecting device accommodating each of the plurality of microphones,
wherein the single talk detection unit acquires sound source direction information indicating the directions to the main speaker and to the other person in the closed space, and detects the single talk state based on the sound source direction information, and
the mixing ratio estimation unit estimates the mixing ratio based on the sound pressure ratio between the audio signal for which the directivity processing unit has formed directivity toward the main speaker in the single talk state of the main speaker and the audio signal for which the directivity processing unit has formed directivity toward the other person in the single talk state of the other person.

The voice processing device according to claim 4, wherein the filter generates the suppression signal using the latest update result of the parameters of the filter held in the memory.

The voice processing device according to claim 4, further comprising a switching unit that has a first terminal, which passes the input audio signal of the main speaker to the output stage of the voice processing device without going through the crosstalk suppression unit, and a second terminal, which passes the input audio signal of the main speaker to the output stage of the voice processing device via the crosstalk suppression unit, and that switches the input of the audio signal of the main speaker to the first terminal or the second terminal based on the result of determining whether the crosstalk component needs to be suppressed.

A voice processing method executed by a voice processing device connected to a plurality of microphones arranged in a closed space, the method comprising:
detecting, based on the audio signals picked up by each of the plurality of microphones, a single talk state in which any one of a plurality of persons, including a main speaker, present in the closed space is speaking;
estimating a mixing ratio indicating the proportion of the main speaker's audio signal mixed into the audio signal of another person other than the main speaker, based on the sound pressure ratio of the audio signals picked up by each of the plurality of microphones in the single talk state of the main speaker and the sound pressure ratio of the audio signals picked up by each of the plurality of microphones in the single talk state of the other person; and
determining, based on the estimation result of the mixing ratio, whether to suppress a crosstalk component that is caused by the utterance of the other person and is included in the audio signal of the main speaker.