JP2010026361A - Speech collection method, system and program

Info

Publication number: JP2010026361A
Application number: JP2008189504A
Authority: JP
Inventors: Takashi Fukuda; 隆福田; Osamu Ichikawa; 治市川; Masafumi Nishimura; 雅史西村
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2008-07-23
Filing date: 2008-07-23
Publication date: 2010-02-04
Anticipated expiration: 2028-07-23
Also published as: JP5339501B2

Abstract

PROBLEM TO BE SOLVED: To accurately collect the speech of only a specified speaker, such as a salesperson in over-the-counter sales.

SOLUTION: A speech collection system 10 extracts and collects target speech from among a plurality of speech sounds arriving from different directions. The system includes a microphone array 11 having at least first and second microphones 11a and 11b arranged a predetermined distance apart. The speech signals received by the first and second microphones are each subjected to a discrete Fourier transform, a plurality of cross-power spectrum phase (CSP) coefficients related to the direction of arrival of the speech are calculated, and a plurality of speech signals are detected from the CSP coefficients. A speech direction index, defined according to the angle between the line connecting the first and second microphones and the direction of arrival, is then detected from the calculated CSP coefficients, and the target speech signal is extracted from the detected speech signals on the basis of the detected speech direction index.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a speech collection method, system, and program for collecting specific speech. In particular, it relates to a speech collection method, system, and program for collecting only a salesperson's speech in face-to-face sales.

In recent years, companies have sometimes lost the trust of consumers or business partners through illegal or antisocial acts. Restoring trust once it has been lost not only demands great corporate effort but can also seriously affect the survival of the business. Establishing a so-called compliance system has therefore become an urgent issue for companies. In the financial services industry, for example, salespersons' sales activities are monitored as part of efforts to strengthen compliance. As one example, in telephone sales, salespersons' calls (the call content) are stored on a server or the like and checked at random. There are also attempts to detect inappropriate responses by salespersons automatically by combining speech recognition technology with natural language processing technology.

In so-called face-to-face sales, where products are sold at a counter, there is no mechanism for collecting records of salespersons' dealings with customers as there is in telephone sales, so monitoring arrangements lag behind those for telephone sales. At present, salespersons report their sales activities in writing (reports), but preparing the reports takes time and the reporting is not always done properly.

In the prior art, one measure considered for face-to-face sales is for a salesperson wearing a close-talking microphone to record the conversation with the customer. Although the aim is to record only the salesperson's voice, in practice the customer's voice is recorded as well, so many customers are reluctant to have the conversation recorded and the method is not necessarily appropriate. One could instead install a (uni)directional microphone where the customer cannot see it and collect the salesperson's speech, but a standard microphone has low directivity and would also record the customer's voice. A superdirective shotgun microphone or the like could be used to improve directivity, but such microphones are generally expensive and large, so they are not suited to face-to-face sales.

Accordingly, as an attempt in the prior art to combine audio signal processing technology, a microphone device is known in which two omnidirectional microphones are arranged in a straight line toward the speaker and which has means for switching the output signal depending on the sound pressure level at one of the microphones, thereby achieving strong directivity (see Patent Document 1). A technique is also known that uses a microphone array having a plurality of microphone elements to detect utterance sections and extract the utterance signal (see Patent Document 2).
Patent Document 1: JP-A-9-149490
Patent Document 2: JP 2007-86554 A

However, conventional techniques that form directivity in software using a microphone array or the like, including the method of Patent Document 1 that switches the output according to a sound pressure level determination and the technique of Patent Document 2 that exploits the difference between speech and noise components, also pick up the customer's speech at recording time because of the microphone placement. In face-to-face sales it has therefore been difficult to collect only the salesperson's speech while excluding the customer's speech.

The present invention provides a speech collection method, system, and program that accurately collect only the salesperson's speech in face-to-face sales. It does so by means of a method of placing a microphone array that separates the speech of the salesperson and the customer, a speech enhancement method for improving speech recognition performance on the separated speech, and speaker-direction indexing of conversational speech using these. The present invention further provides a method, system, and program that leave no record of the utterances of a customer who has not given consent, while reliably retaining only the speech of the salesperson, which needs to be recorded.

In view of the above problems, the present invention includes the following solutions.

(Use of the difference in speech arrival times)
The present invention uses a microphone array having two microphone elements arranged a predetermined distance apart, and exploits the difference in the times at which sound from a given source reaches these elements, that is, the time delay. Furthermore, the line segment connecting the two microphone elements of the array is arranged so as to be substantially parallel to the line segment connecting the customer and the salesperson. For example, as seen from above, the microphone array is placed on the straight line connecting the customer and the salesperson. With this arrangement, the difference between the arrival times at the two microphone elements of speech uttered by the customer or the salesperson can approach its maximum. Consequently, where several face-to-face sales booths are lined up, speech from adjacent booths, whose arrival time difference is not necessarily maximal, can be cut effectively, and changes in the posture and position of the salesperson and the customer can be tolerated within the range of arrangements in which the arrival time difference can be exploited. In addition, microphone arrays in general cannot distinguish sounds from directions that arrive with the same phase (the same time delay), which is known as the mirror-image position problem; in the present invention this problem can be avoided through the placement of the microphone elements.

(Use of CSP coefficients)
The present invention can also distinguish the utterances of the customer and the salesperson by detecting the target speaker's utterance sections on the basis of CSP (Cross-power Spectrum Phase, i.e., whitened cross-correlation) coefficients, and can perform speech recognition on each separately. At the same time, by using the speaker direction index obtained by the CSP method together with the timestamps of the speech recognition results, recording of the target speaker's speech can be simplified and the portions to be recorded can be designated selectively. In other words, the present invention is characterized by having an interface for designating the speaker to be recorded and the portions to be recorded from the direction index and the speech recognition results.
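As a minimal sketch of how a per-frame speaker direction index might be combined with recognition timestamps to select the salesperson's utterances: the data layout (frame-indexed `index_track`, utterances as `(start_frame, end_frame, text)` tuples), the function names, and the majority-vote rule are all illustrative assumptions, not details given in the description.

```python
def dominant_index(index_track, start_frame, end_frame):
    """Return the most frequent direction index in frames [start_frame, end_frame)."""
    counts = {}
    for idx in index_track[start_frame:end_frame]:
        counts[idx] = counts.get(idx, 0) + 1
    return max(counts, key=counts.get)


def select_target_utterances(utterances, index_track, target_indices):
    """Keep utterances whose dominant direction index lies in target_indices.

    utterances: list of (start_frame, end_frame, text) from a recognizer (hypothetical layout).
    """
    return [u for u in utterances
            if dominant_index(index_track, u[0], u[1]) in target_indices]


# Hypothetical data: indices near +7 are taken to be the salesperson's side.
index_track = [7, 7, 7, 6, -7, -7, -6, -7, 7, 7]
utterances = [(0, 4, "salesperson A"), (4, 8, "customer"), (8, 10, "salesperson B")]
kept = select_target_utterances(utterances, index_track, target_indices={6, 7})
print([text for _, _, text in kept])
```

Only the kept segments would then be passed on for recording, so the customer's segments never need to be stored.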

(Speech enhancement processing)
Furthermore, the present invention achieves high speech recognition performance by performing gain adjustment, that is, speech enhancement, based on the CSP coefficients. In the present invention, gain adjustment based on the CSP coefficients is tied into a processing procedure that combines spectral subtraction (abbreviated SS) and flooring, both of which are typical noise removal techniques. Specifically, the gain adjustment is performed between the SS processing and the flooring processing. This sequence of processing performs speech enhancement at the same time as speech separation and, being implemented in software, achieves practical speech recognition performance at low cost.

A computer device, digital signal processing device, digital recording device, or the like having audio signal processing functions can be used as the means for implementing the speech collection method according to the present invention. Any such device may be used provided it can carry out the steps of the speech collection method, such as recording speech signals based on the voices of the salesperson and the customer and calculating CSP coefficients for the recorded speech signals.

The present invention can be combined with existing technologies, such as speech collection techniques that collect only voiced sections and audio signal processing techniques that adjust the frequency characteristics or gain of signal processing to improve the clarity and ease of listening of speech, and such combinations also fall within the technical scope of the present invention. Likewise, speech collection equipment incorporating the technique of the present invention, a speech collection function that incorporates the technique and is built into a portable computer device or the like, and a speech collection system in which a plurality of devices incorporating the technique cooperate also fall within the technical scope of the present invention. Furthermore, the technique of the present invention can provide the steps for speech collection in the form of a program that can be stored in an FPGA (field-programmable gate array), an ASIC (application-specific integrated circuit), equivalent hardware logic elements, a programmable integrated circuit, or a combination of these, that is, as a program product. Specifically, the salesperson speech collection device according to the present invention can be provided in the form of a custom LSI (large-scale integrated circuit) having data input/output, a data bus, a memory bus, a system bus, and the like, and a program product stored in an integrated circuit in this way also falls within the technical scope of the present invention.

According to the present invention, a microphone array having at least first and second microphones arranged a predetermined distance apart is used. The speech signals received by the first and second microphones are each subjected to a discrete Fourier transform to obtain a plurality of CSP coefficients related to the direction of arrival of the speech, and a plurality of speech signals are detected from these CSP coefficients. A speech direction index, defined according to the angle between the line segment connecting the first and second microphones and the direction of arrival, is then detected from the obtained speech signals, and the target speech signal is extracted from the detected speech signals using the detected speech direction index. As a result, only the target speech can be reliably extracted and collected. The present invention also has the effect that no record is left of the utterances of a customer who has not given consent, while the speech of the salesperson, which needs to be recorded, is reliably retained. In addition, performing speech enhancement consisting of the sequence of SS processing, gain adjustment using the CSP coefficients, and flooring processing at the same time as speech separation improves the performance of subsequent speech recognition.

Embodiments of the present invention are described below with reference to the drawings.

[Speech collection system]
FIG. 1 is a diagram schematically showing an example of a speech collection system according to an embodiment of the present invention. In FIG. 1, the speech collection system 10 has a microphone array 11, a target speech extraction device 12, and a customer dialogue recording server 13. The microphone array 11 has two microphones 11a and 11b, which may be, for example, a commercially available integrated stereo microphone or a pair of stereo microphones. The target speech extraction device 12 is described in detail later with reference to FIG. 7.

The example in FIG. 1 shows the customer 21, the salesperson 22, the table 14, and so on as viewed from above. The microphone array 11 is placed so that, seen from above, it lies approximately on the straight line connecting the customer 21 and the salesperson 22. That is, the microphone array 11 is placed so that the line segment connecting the microphones 11a and 11b is approximately parallel to the line segment connecting the customer 21 and the salesperson 22. As a result, the difference between the arrival times at the two microphone elements of speech uttered by the customer or the salesperson can be at its maximum. With this placement, where several face-to-face sales booths stand side by side, speech from adjacent booths, whose arrival time difference at the microphone elements is not necessarily maximal, can be cut effectively.

In the illustrated example, the utterances of the customer and the salesperson are distinguished by detecting the target speaker's utterance sections on the basis of the CSP coefficients. Specifically, CSP coefficients are calculated for the speech signals received by the two microphones, and the sections in which the CSP coefficient becomes large are regarded as the target speaker's utterance sections, from which the target speech signal is extracted.
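The section detection described above can be sketched as thresholding a per-frame track of the CSP coefficient for the target direction. The threshold value, the function name, and the sample track below are illustrative assumptions; the description only states that sections with a large CSP coefficient are treated as the target speaker's utterance sections.

```python
def detect_sections(csp_track, threshold):
    """Return (start, end) frame ranges, end exclusive, where csp_track >= threshold."""
    sections = []
    start = None
    for frame, value in enumerate(csp_track):
        if value >= threshold and start is None:
            start = frame                      # a section opens
        elif value < threshold and start is not None:
            sections.append((start, frame))    # the section closes
            start = None
    if start is not None:                      # section still open at the end
        sections.append((start, len(csp_track)))
    return sections


# Hypothetical CSP values for the target direction, one per frame.
track = [0.02, 0.05, 0.31, 0.42, 0.38, 0.04, 0.03, 0.29, 0.33]
print(detect_sections(track, threshold=0.2))
```

A practical detector would probably also smooth the track or require a minimum section length, which this sketch omits.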

Furthermore, the extracted speech signal undergoes speech enhancement in which gain adjustment using the CSP coefficients is performed between the SS processing and the flooring processing. This speech enhancement is intended to improve speech recognition performance; the target-speaker speech extraction using the CSP coefficients and the speech enhancement are together referred to as AFE (ASR Front-end for speech Enhancement, where ASR stands for Automatic Speech Recognition). In this embodiment, speech recognition is performed separately on each speech signal separated and enhanced by the AFE and, as described later, the speaker direction index obtained by the CSP method and the timestamps of the speech recognition results are used to simplify recording of the target speaker's speech signal and to designate the portions to be recorded selectively.

As shown in FIG. 1, the microphone array 11 need only be placed so that, seen from above, the microphones 11a and 11b lie approximately on the straight line connecting the customer 21 and the salesperson 22. The microphone array 11 may be placed at, or embedded in, the approximate center of the table 14.

FIG. 2 is a diagram showing the direction of speech arrival relative to the microphones. In FIG. 2, assuming that the microphones 11a and 11b are placed a distance d apart, the angle θ between the straight line connecting the microphones 11a and 11b and the direction of speech arrival is given by Equation 1.

Here, c is the speed of sound and τ is the difference between the times at which the speech arrives at the microphones 11a and 11b (the arrival time difference). Preferably, the straight line connecting the microphones 11a and 11b is treated as a direction vector from microphone 11a to microphone 11b, so that θ = 0° and θ = 180° in the above equation can be distinguished as the states in which the direction of speech arrival is parallel and antiparallel to that vector, respectively.
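Equation 1 itself is not reproduced in this text. A form consistent with the symbol definitions above, and with the worked values quoted later in the description (θ = 90° at zero delay and θ = 82.9° at a one-sample delay with f = 22050 Hz, d = 12.5 cm, c = 340 m/s), follows from plane-wave geometry. The far-field assumption is ours, so the following is a plausible reconstruction rather than the patent's verbatim formula:

```latex
% Far-field (plane-wave) assumption: the extra path length to the farther
% microphone is d cos(theta), and sound covers it in the delay tau.
c\,\tau = d\cos\theta
\quad\Longrightarrow\quad
\theta = \cos^{-1}\!\left(\frac{c\,\tau}{d}\right)

% Check against the one-sample case: tau = 1/22050 s, d = 0.125 m, c = 340 m/s
\frac{c\,\tau}{d} = \frac{340}{22050 \times 0.125} \approx 0.1234,
\qquad
\theta \approx \cos^{-1}(0.1234) \approx 82.9^{\circ}
```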

[Speech enhancement processing (AFE)]
Next, the speech collection system according to the present invention can calculate CSP coefficients and use them to perform speech enhancement. Specifically, the speech enhancement performs gain adjustment using the CSP coefficients within the sequence of SS processing and flooring processing, which can improve both the ability to identify the salesperson's speech and the performance of speech recognition. The components of a specific speech processing means and their relationships are illustrated below.

FIG. 3 is a diagram showing the configuration of the target speech extraction device 12 according to an embodiment of the present invention. The target speech extraction device 12 takes as input the speech signals received by the microphones 11a and 11b of the microphone array 11 and includes, as appropriate, discrete Fourier transform processing units 105 and 106, a CSP coefficient calculation unit 110, a group delay array processing unit 120, a noise estimation unit 130, an SS processing unit 140, a gain adjustment processing unit 150, a flooring processing unit 160, and so on. The processing in the discrete Fourier transform processing units 105 and 106 includes techniques well known in digital audio signal processing, such as amplifying the signals from the two microphones 11a and 11b as appropriate, dividing them into frames of a predetermined duration, and limiting the frequency band as appropriate, and can output a complex discrete spectrum from the input signal.

The CSP coefficient calculation unit 110 shown in FIG. 3 calculates the CSP coefficients from the complex discrete spectra. Here, the CSP coefficient is a cross-correlation coefficient between the signals of two channels computed in the frequency domain, and is calculated by Equation 2.

In the equation, φ(i, T) is the CSP coefficient obtained from the speech signals received by the first and second microphones 11a and 11b, i is the direction of speech arrival (the speaker direction index), T is the frame number, and s_1(t) and s_2(t) are the signals received at time t by the first and second microphones 11a and 11b, respectively. DFT denotes the discrete Fourier transform, IDFT denotes the inverse discrete Fourier transform, and * denotes the complex conjugate.
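Equation 2 is not reproduced in this text. The sketch below assumes the standard whitened cross-correlation form, φ = IDFT[ DFT[s_1] · DFT[s_2]* / (|DFT[s_1]| · |DFT[s_2]|) ], which matches the symbol definitions above but is not confirmed as the patent's exact expression. The naive DFT, the `eps` guard, the function names, and the sign convention relating the peak lag to the speaker direction index are illustrative choices.

```python
import cmath
import random


def dft(x):
    """Naive O(N^2) discrete Fourier transform (adequate for short frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def csp_coefficients(s1, s2, eps=1e-12):
    """Whitened cross-correlation of one frame; one value per circular lag."""
    spec1, spec2 = dft(s1), dft(s2)
    cross = [a * b.conjugate() for a, b in zip(spec1, spec2)]
    # |DFT[s1]| * |DFT[s2]| equals |cross|, so dividing by |cross| whitens the spectrum.
    whitened = [c / max(abs(c), eps) for c in cross]
    return [v.real for v in idft(whitened)]


def signed_lag(lag, n):
    """Map a circular lag in [0, n) to a signed lag in [-n/2, n/2)."""
    return lag if lag < n // 2 else lag - n


# Hypothetical frame: channel 1 is channel 2 circularly delayed by 3 samples.
rng = random.Random(0)
n = 64
s2 = [rng.gauss(0.0, 1.0) for _ in range(n)]
s1 = [s2[(t - 3) % n] for t in range(n)]
phi = csp_coefficients(s1, s2)
peak = max(range(n), key=lambda i: phi[i])
print(signed_lag(peak, n))
```

Because the magnitude is normalized away, the peak depends only on the phase difference between the channels, which is why the lag of the peak can serve as a direction index.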

Next, the group delay array processing unit 120 receives a signal arriving from the θ direction with at least two microphones, brings the received signals into phase, and adds them, thereby enhancing the signal arriving from the θ direction. Signals arriving from directions other than θ are not brought into phase and so are not enhanced. This forms a directivity with high sensitivity in the θ direction and low sensitivity in other directions.

Instead of the group delay array processing unit 120, adaptive array processing can be used to form a null in the direction of noise or reverberation. Other kinds of array processing may also be substituted. Alternatively, the array processing can be omitted, that is, bypassed, and either one of the two speech signals received by the two microphones can be used as it is.
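The in-phase addition performed by the group delay array can be illustrated with a time-domain delay-and-sum sketch. This is a simplification under stated assumptions: an integer steering delay in samples, zero padding at the frame edges, and equal weighting of the two channels. The description does not specify how unit 120 is implemented, and the function name is hypothetical.

```python
def delay_and_sum(s1, s2, delay_samples):
    """Average s1 with s2 advanced by delay_samples (integer, may be negative)."""
    n = len(s1)
    out = []
    for t in range(n):
        u = t + delay_samples
        aligned = s2[u] if 0 <= u < n else 0.0   # zero-pad outside the frame
        out.append(0.5 * (s1[t] + aligned))
    return out


# Hypothetical pulse that reaches microphone 2 three samples after microphone 1.
n = 16
s1 = [1.0 if t == 5 else 0.0 for t in range(n)]
s2 = [1.0 if t == 8 else 0.0 for t in range(n)]

steered = delay_and_sum(s1, s2, 3)      # steered toward the source: pulses align
missteered = delay_and_sum(s1, s2, -3)  # steered elsewhere: pulses stay apart
print(max(steered), max(missteered))
```

When the steering delay matches the true delay the two pulses add coherently, and otherwise each contributes only half amplitude, which is the directivity effect described above.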

Next, speech enhancement is performed using the CSP coefficients calculated as described above. Specifically, the speech enhancement performs gain adjustment using the CSP coefficients within the sequence of SS processing and flooring processing. Typically, the SS processing is the subtraction expressed by the following equation.

Here, Xω(T) is the power spectrum before SS processing, Yω(T) is the power spectrum after SS processing, that is, the subtracted power spectrum, and Uω is the power spectrum of the noise. Uω is estimated in noise sections, that is, in sections where the target speaker is not speaking. It may be estimated in advance and used as a fixed value, estimated (updated) sequentially in parallel with the input speech signal, or estimated (updated) at fixed time intervals.

That is, Xω(T), which is for example a signal obtained by combining both input signals received by the microphones 11a and 11b through array processing, or either one of the two input signals, is input to the noise estimation unit 130, where the noise power spectrum Uω is estimated as appropriate. α is a subtraction constant and can take any value, for example a value close to 1 (e.g., 0.90).
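A minimal sketch of noise estimation and spectral subtraction, assuming the subtraction has the usual form Yω(T) = Xω(T) − α·Uω (the equation is not reproduced in this text) and that Uω is a simple average over frames flagged as non-speech. The averaging rule, the function names, and the sample numbers are illustrative assumptions.

```python
def estimate_noise(power_frames, is_speech):
    """Average the power spectra of the frames flagged as non-speech."""
    noise_frames = [f for f, speech in zip(power_frames, is_speech) if not speech]
    if not noise_frames:
        raise ValueError("no non-speech frames available for noise estimation")
    n_bins = len(noise_frames[0])
    return [sum(f[w] for f in noise_frames) / len(noise_frames) for w in range(n_bins)]


def spectral_subtract(x, u, alpha=0.90):
    """Assumed form: Y = X - alpha * U, per frequency bin (may go negative)."""
    return [xw - alpha * uw for xw, uw in zip(x, u)]


# Hypothetical two-bin power spectra: two noise-only frames, then a speech frame.
frames = [[1.0, 2.0], [3.0, 2.0], [10.0, 8.0]]
flags = [False, False, True]
u = estimate_noise(frames, flags)
y = spectral_subtract(frames[2], u)
print(u, y)
```

The subtraction can produce negative values in bins where the noise estimate exceeds the observed power, which is one reason a flooring step follows later in the chain.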

Next, gain adjustment can be performed as appropriate according to the following equation. That is, the gain adjustment is performed by multiplying the subtracted spectrum Yω(T) obtained from the SS processing by the CSP coefficient.

In the equation, Dω(T) is the power spectrum after gain adjustment. When the target speaker is not speaking, the CSP coefficient is small, so this processing suppresses the power spectrum of speech signals from directions other than the target direction of arrival. As this equation suggests, provided that "gain adjustment" can be performed, the technical idea of the present invention is not limited to the use of CSP coefficients.

Furthermore, flooring processing is performed according to the following equation. Flooring processing means replacing small values in the actual data with suitable values rather than using them as they are.

In the equation, Zω(T) is the power spectrum after flooring processing and Uω is the power spectrum of the noise. For Uω, the same spectrum as used in Equation 3, the output of the noise estimation unit 130, or the like can be used as appropriate, but a different spectrum estimated by another method may also be used. As Equation 5 shows, Uω may be used only for the condition test. The flooring coefficient β is a constant and can take any value convenient in the art, for example a value close to 0 (e.g., 0.10).
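The three steps can be chained per frame as sketched below. None of the three equations is reproduced in this text, so every form here is an assumption: Y = X − α·U for SS, D = φ·Y for the gain adjustment (both follow the prose), and, for flooring, the common rule that replaces values below β·U with β·U, which may differ from the patent's actual equation. The function name and the sample values are hypothetical.

```python
def enhance_frame(x, u, csp, alpha=0.90, beta=0.10):
    """SS -> CSP gain adjustment -> flooring for one frame of power spectra.

    x: observed power spectrum, u: noise power spectrum (same length),
    csp: CSP coefficient for the target direction in this frame
         (assumed here to lie between 0 and 1).
    """
    z = []
    for xw, uw in zip(x, u):
        y = xw - alpha * uw                    # spectral subtraction (assumed form)
        d = csp * y                            # gain adjustment by the CSP coefficient
        floor = beta * uw                      # assumed flooring level
        z.append(d if d >= floor else floor)   # flooring (assumed common form)
    return z


# Hypothetical two-bin frame.
x = [10.0, 1.0]
u = [2.0, 2.0]
print(enhance_frame(x, u, csp=0.5))   # target speaker active
print(enhance_frame(x, u, csp=0.0))   # target speaker silent
```

Placing the gain adjustment before flooring means that frames with a small CSP coefficient are pushed down to the floor level instead of keeping residual energy from other directions.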

Normally, SS processing and flooring processing are applied in this order; one key point of the present invention is that gain adjustment using the CSP coefficients is introduced between the two. The output Zω(T) obtained in this way can be used as the salesperson's speech signal to be stored on a server device or the like, or as input to speech recognition means. FIG. 3 shows an example in which one of the speech signals observable with the two microphones 11a and 11b is used for output, but the invention is not limited to this. As described later with reference to FIG. 8, the speech collection method according to the present invention can obtain, for two speech sounds reaching the microphone array 11 from different directions, an output for recording, speech recognition, or the like for each received speech signal. Such outputs can be used for speech recognition and the like, as described later with reference to FIG. 7.

[Speaker direction index]
FIG. 4 is a diagram showing an example of speaker direction indices relative to the microphone positions. Assuming a direction vector connecting the microphones 11a and 11b of the microphone array 11, the direction from which a speaker's speech arrives can be distinguished as a range of azimuth angles relative to that vector, centered on the microphone array 11. For example, speech arriving along the direction from microphone 11a to microphone 11b is substantially parallel to the direction vector, and the cosine of the azimuth angle is close to +1 (the region with speaker direction index +7 in FIG. 4). Likewise, speech arriving along the direction from microphone 11b to microphone 11a is nearly antiparallel to the direction vector, and the cosine of the azimuth angle is close to −1 (the region with speaker direction index −7 in FIG. 4). As Equation 1 shows, given the microphone spacing d and the speed of sound c, the arrival time difference τ depends on the angle θ, so the speaker direction index shown in FIG. 4 carries information on the arrival time difference τ.

マイクロホンアレイ１１に対して直角の方向からマイクロホン１１ａ及び１１ｂに到来する音声には到来時間差はなく、ここでは、この方向の話者方向インデックスは０と表される。つまり、前述のように、角度θは数１で表され、到来サンプル数をｘ、サンプリング周波数をｆとすると、τ＝ｘ／ｆで表されるから、いまサンプリング周波数を２２０５０Ｈｚ、マイクロホン間の距離ｄ＝１２．５ｃｍとすると、ｘ＝０、つまり、話者方向インデックス＝０であると、音速を３４０ｍ／ｓとすれば、角度θ＝９０°となる。 A voice arriving at the microphones 11a and 11b from the direction perpendicular to the microphone array 11 has no arrival time difference, and the speaker direction index of this direction is defined here as 0. As described above, the angle θ is given by Equation 1, and with x denoting the arrival delay in samples and f the sampling frequency, τ = x / f. With a sampling frequency of 22050 Hz, a microphone spacing of d = 12.5 cm, and a speed of sound of 340 m/s, x = 0 (that is, speaker direction index 0) corresponds to an angle of θ = 90°.

また、図４において，話者方向インデックス＋１（又は−１）は、マイクロホン１１ａ及び１１ｂに到達する音声が１サンプルだけずれている範囲を表しており（つまり、ｘ＝１であり）、この場合には、角度θ＝８２．９°となる。 In FIG. 4, the speaker direction index +1 (or −1) represents the range in which the voices reaching the microphones 11a and 11b are offset by one sample (that is, x = 1), in which case the angle is θ = 82.9°.

同様にして、話者方向インデックス＋２〜＋７（又は−２〜−７）は、それぞれマイクロホン１１ａ及び１１ｂに到達する音声が２〜７サンプルだけずれている範囲を表している。そして、ＡＦＥにおいては、マイクロホン１１ａ及び１１ｂに入力される音声の到来時間差を考慮したＣＳＰ係数を用いて目的音を抽出する。ここで、ｘ＝＋７においては角度θ＝３０．３°となり、ｘ＝−７においては角度θ＝１４９．７°となる。従って、マイクロホン１１ａ及び１１ｂを結ぶ直線方向には約３０°の範囲を同一の音声到達方向として許容し得る。このように、本発明においては、到達時間差を利用し得る配置の範囲内において販売員や顧客の姿勢や位置の変化を許容し得るという特徴がある。 Similarly, the speaker direction indexes +2 to +7 (or −2 to −7) represent the ranges in which the voices reaching the microphones 11a and 11b are offset by 2 to 7 samples, respectively. In the AFE, the target sound is extracted using CSP coefficients that take into account the arrival time difference of the voices input to the microphones 11a and 11b. Here, θ = 30.3° at x = +7 and θ = 149.7° at x = −7. Accordingly, along the straight line connecting the microphones 11a and 11b, a range of about 30° can be accepted as the same arrival direction. The present invention is thus characterized in that changes in the posture and position of the salesperson and the customer can be tolerated within the range of arrangements in which the arrival time difference can be exploited.
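
The angles quoted for the speaker direction indexes can be checked numerically. The following is a minimal Python sketch, assuming that Equation 1 has the usual far-field form θ = arccos(c·τ/d) with τ = x/f; the function name `index_to_angle_deg` and its default arguments are illustrative and are not taken from the specification.

```python
import math

def index_to_angle_deg(x, fs=22050.0, d=0.125, c=340.0):
    """Arrival angle in degrees for an inter-microphone delay of x samples.

    fs: sampling frequency [Hz], d: microphone spacing [m], c: speed of sound [m/s].
    Assumes Equation 1 is theta = arccos(c * tau / d) with tau = x / fs.
    """
    tau = x / fs                                  # arrival time difference [s]
    cos_theta = c * tau / d
    cos_theta = max(-1.0, min(1.0, cos_theta))    # guard against rounding past +/-1
    return math.degrees(math.acos(cos_theta))

angles = {x: round(index_to_angle_deg(x), 1) for x in (0, 1, 7, -7)}
print(angles)   # {0: 90.0, 1: 82.9, 7: 30.3, -7: 149.7}
```

With the values used in the text (22050 Hz, 12.5 cm, 340 m/s), the sketch gives the 90°, 82.9°, 30.3°, and 149.7° figures stated for x = 0, 1, +7, and −7.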

いま、話者方向インデックス＝０（例えば、右側）の方向に目的話者がいるとすると、話者方向インデックス＝０にいる目的話者が発話した場合に、前述のように、マイクロホン１１ａ及び１１ｂで受けた音声信号には時間遅れがなく、両音声信号の相関が高くなる。このため、ＣＳＰ係数φ（０，Ｔ）は大きくなる。 Suppose now that the target speaker is in the direction of speaker direction index = 0 (for example, on the right side). When this target speaker speaks, the voice signals received by the microphones 11a and 11b have no time delay between them, as described above, and the two signals are highly correlated. The CSP coefficient φ(0, T) therefore becomes large.

一方、例えば、話者方向インデックス＝＋４（例えば、図中右側）の方向から音声が到来する場合、マイクロホン１１ａから４サンプル分遅れてマイクロホン１１ｂに音声が到達することになる。このため、φ（０，Ｔ）は小さくなる（この際、φ（４，Ｔ）が大きくなる）。 On the other hand, for example, when the voice comes from the direction of the speaker direction index = + 4 (for example, the right side in the figure), the voice arrives at the microphone 11b with a delay of 4 samples from the microphone 11a. For this reason, φ (0, T) becomes small (in this case, φ (4, T) becomes large).

従って、話者方向インデックス＝０の方向から到来する音声のみを抽出したい場合には、φ（０，Ｔ）の値をトラッキングして、φ（０，Ｔ）が大きくなる区間を抽出すればよいことになる。但し、ＡＦＥでは、マイクロホン１１ａ及び１１ｂに同一の時間差で到来する方向、つまり、マイクロホン１１ａ及び１１ｂを結ぶ軸に対して対称の方向から到来する音声も受信することになる。 Therefore, to extract only the voice arriving from the direction of speaker direction index = 0, it suffices to track the value of φ(0, T) and extract the intervals in which φ(0, T) becomes large. Note, however, that the AFE also picks up voices from any direction that produces the same time difference at the microphones 11a and 11b, that is, from the direction that is mirror-symmetric with respect to the axis connecting the microphones 11a and 11b.
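
How φ(i, T) responds to an inter-microphone delay can be illustrated as follows. This is only a sketch: it assumes that the CSP coefficient is the inverse DFT of the phase-normalized cross spectrum of the two windowed frames (the common CSP / GCC-PHAT form). The function name `csp_coefficients`, the sign convention (a positive lag means the sound reaches the second channel later), and the synthetic white-noise input are assumptions made for illustration.

```python
import numpy as np

def csp_coefficients(frame_a, frame_b, max_lag=7):
    """Return {lag: CSP coefficient} for lags -max_lag..+max_lag (in samples)."""
    n = len(frame_a)
    window = np.hanning(n)
    spec_a = np.fft.rfft(frame_a * window)
    spec_b = np.fft.rfft(frame_b * window)
    cross = np.conj(spec_a) * spec_b
    cross /= np.maximum(np.abs(cross), 1e-12)    # keep only the phase
    csp = np.fft.irfft(cross, n)                 # back to the lag domain
    return {lag: float(csp[lag % n]) for lag in range(-max_lag, max_lag + 1)}

# Synthetic check: the same noise reaches channel B four samples after channel A.
rng = np.random.default_rng(0)
noise = rng.standard_normal(516)
mic_a = noise[4:]        # 512 samples
mic_b = noise[:512]      # mic_b[n] == mic_a[n - 4]
phi = csp_coefficients(mic_a, mic_b)
best_lag = max(phi, key=phi.get)
print(best_lag, phi[0] < phi[best_lag])
```

Tracking `phi[0]` frame by frame, as the text describes, would then flag the frames in which the index-0 direction is active; in this synthetic case the peak sits at the lag of the simulated delay instead.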

例えば、話者方向インデックス＝＋４に着目すると、図中右側の話者方向インデックス＝＋４から到来する音声と図中左側の話者方向インデックス＝＋４から到来する音声を区別することができないことになる。よって、鏡像位置の問題を受けないようにマイクロホン１１ａ及び１１ｂを配置することが必要となる。 For example, for speaker direction index = +4, a voice arriving from index +4 on the right side of the figure cannot be distinguished from a voice arriving from index +4 on the left side of the figure. The microphones 11a and 11b therefore need to be placed so that this mirror-image ambiguity does not cause problems.

ところで、話者（つまり、ここでは顧客２１と販売員２２）は、テーブル１４を挟んで向かい合って着座した際、横方向にずれて（つまり、横方向において広い範囲に）座る可能性があり、さらに対話中においても着座位置や姿勢が変化することが多い。このため、目的話者方向に対してある程度の範囲の音声を収音できる必要がある。 By the way, when a speaker (that is, customer 21 and salesperson 22 here) sits facing each other across the table 14, there is a possibility that the speaker will be displaced laterally (that is, in a wide range in the lateral direction), In addition, the sitting position and posture often change during the conversation. For this reason, it is necessary to be able to pick up a certain range of sounds with respect to the target speaker direction.

超指向性マイクロホンは、目的話者の音声信号のみを録音するという観点からは高い効果が得られるが、一般に高価格であり、さらに、話者位置の変動に対処することが難しく、着座位置によって収音性能が極端に変化してしまう。加えて、超指向性マイクロホンはそのサイズが大きく、目標方向とは逆方向にも鋭い指向性を有する。このため、ブースのレイアウトとマイクロホンとの配置関係が極めて難しくなってしまう。 Superdirective microphones are highly effective from the viewpoint of recording only the target speaker's voice signal, but are generally expensive and difficult to deal with variations in speaker position, depending on the seating position. Sound collection performance will change drastically. In addition, the super-directional microphone is large in size and has a sharp directivity in the direction opposite to the target direction. For this reason, the layout relationship between the booth layout and the microphone becomes extremely difficult.

一方、単一指向性マイクロホンを用いた場合には、指向性の精度がそれほど高くないため、周囲の環境音や隣のブースの会話をも録音してしまうことになる。なお、単一指向性マイクロホンも比較的高価格である。 On the other hand, when a unidirectional microphone is used, the directivity accuracy is not so high, so that ambient ambient sounds and conversations in the adjacent booth are also recorded. Unidirectional microphones are also relatively expensive.

図５はマイクロホンの指向性による分類を示す図であり、図５（ａ）に示す無指向性マイクロホンは３６０度全ての方向に対して同感度を有し、図５（ｂ）に示す双指向性マイクロホンは正面とその反対側に対して感度がよい。また、図５（ｃ）に示す単一指向性マイクロホンは正面方向のみの音声に対して感度がよい。図５（ｄ）に示す鋭指向性マイクロホン及び図５（ｅ）に示す超指向性マイクロホンはそれぞれ単一指向性よりも指向特性を鋭くしたものである。 FIG. 5 is a diagram showing classification by microphone directivity. The omnidirectional microphone shown in FIG. 5 (a) has the same sensitivity in all directions of 360 degrees, and the bidirectional design shown in FIG. 5 (b). The sensitive microphone is sensitive to the front and the opposite side. In addition, the unidirectional microphone shown in FIG. 5C is sensitive to sound only in the front direction. The sharp directional microphone shown in FIG. 5D and the super directional microphone shown in FIG. 5E each have sharper directional characteristics than unidirectional.

ＡＦＥを用いた場合には、図４に示すように、マイクロホンアレイ１１の軸方向（＋７，−７）に関して比較的広いローブが形成され、例えば、話者方向インデックス＝＋７に販売員２２、話者方向インデックス＝−７に顧客２１が位置すると、軸方向（＋７，−７）においてはそのローブが広いから、顧客２１及び販売員２２の姿勢や位置が多少ずれてもよく、そして、当該ローブの範囲以外から到達する音声を効果的にカットすることができる。 When AFE is used, as shown in FIG. 4, a relatively wide lobe is formed with respect to the axial direction (+7, -7) of the microphone array 11. For example, the salesperson 22 has a speaker direction index = + 7. When the customer 21 is positioned at the customer direction index = −7, the lobes are wide in the axial direction (+7, −7), so that the postures and positions of the customer 21 and the salesperson 22 may be slightly shifted. It is possible to effectively cut the voice that reaches from outside the range.

そして、ＡＦＥを用いれば、マイクロホンの指向性／無指向性が関係なくなり、どのような指向性のマイクロホンも用いることができる結果、マイクロホンに要するコストも低く抑えることができる。 Moreover, when the AFE is used, it no longer matters whether the microphones are directional or omnidirectional; microphones of any directivity can be used, so the cost of the microphones can be kept low.

［マイクロホンアレイの配置］
図６に、本発明の一実施形態に係る、マイクロホンアレイの配置の例を示す。前述のように、ＡＦＥを用いた際には鏡像位置の問題があるので、マイクロホンの位置に配慮する必要があり、例えば、図６に符号Ａで示す位置（隣のブース１６との敷居１７等）にマイクロホンアレイ１１を配置した場合には、隣のブース１６の音声まで同じように抽出してしまうことがある。 [Arrangement of microphone array]
FIG. 6 shows an example of the arrangement of the microphone array according to an embodiment of the present invention. As described above, the AFE is subject to the mirror-image problem, so the position of the microphones must be chosen with care. For example, if the microphone array 11 is placed at the position indicated by symbol A in FIG. 6 (such as on the partition 17 separating the adjacent booth 16), voices from the adjacent booth 16 may be extracted in the same way.

このため、本実施の形態では、図６において符号Ｂで示す位置（例えば、テーブル１５上）にマイクロホンアレイ１１を設置して、上述の問題を回避する。本実施の形態におけるマイクロホンアレイ１１の設置については、発声者の方向を細かい単位で正確に検出しづらくなるけれども、販売員２２の音声のみを収集するという点からは、何ら問題はない。もちろん、隣接ブースからの到来音声がない環境においては、例えば、図６に符号Ａで示す位置にマイクを配置し、本発明のＡＦＥによる音声強調に関わる部分のみを適用する実施形態も想定し得る。 For this reason, in the present embodiment, the microphone array 11 is installed at the position indicated by symbol B in FIG. 6 (for example, on the table 15) to avoid the above problem. With the microphone array 11 installed in this way, it becomes harder to detect the speaker's direction accurately in fine steps, but this poses no problem for the purpose of collecting only the voice of the salesperson 22. Of course, in an environment where no voices arrive from adjacent booths, an embodiment is also conceivable in which the microphones are placed, for example, at the position indicated by symbol A in FIG. 6 and only the part of the present invention relating to speech enhancement by the AFE is applied.

［目的音声抽出装置］
図７は、図１に示す目的音声抽出装置１２を詳細に示すブロック図である。図７において、いま販売員２２と顧客２１が１対１で対話しているものとする。目的音声抽出装置１２は、発話区間インデックス検出処理部３１、第１の音声認識部３２、第２の音声認識部３３、統合選択部３４、及び録音範囲抽出部３５を有しており、発話区間インデックス検出処理部３１にはマイクロホン１１ａ及び１１ｂから受けたそれぞれの音声信号が入力される。 [Target voice extraction device]
FIG. 7 is a block diagram showing in detail the target speech extraction device 12 shown in FIG. In FIG. 7, it is assumed that the salesperson 22 and the customer 21 are having a one-to-one dialogue. The target speech extraction device 12 includes a speech segment index detection processing unit 31, a first speech recognition unit 32, a second speech recognition unit 33, an integration selection unit 34, and a recording range extraction unit 35. Each of the audio signals received from the microphones 11a and 11b is input to the index detection processing unit 31.

図７においては、マイクロホン１１ａは販売員２２側に位置し、マイクロホン１１ｂは顧客２１側に位置しているものとし、マイクロホン１１ａ（Ｌ−ｃｈ）で受けた音声信号Ｓ_１（ｔ）、及びマイクロホン１１ｂ（Ｒ−ｃｈ）で受けた音声信号Ｓ_２（ｔ）が入力されるものとする。なお、ここでは、いずれのマイクロホンからの入力も、図示しないＡ／Ｄ変換部によって所定のサンプリング周波数でサンプリングされて、デジタル信号として発話区間インデックス検出処理部３１に与えられる。発話区間インデックス検出処理部３１の動作の詳細は、図８を用いて後述する。 In FIG. 7, it is assumed that the microphone 11a is located on the salesperson 22 side and the microphone 11b on the customer 21 side, and that the voice signal S₁(t) received by the microphone 11a (L-ch) and the voice signal S₂(t) received by the microphone 11b (R-ch) are input. Here, the input from each microphone is sampled at a predetermined sampling frequency by an A/D conversion unit (not shown) and supplied as a digital signal to the speech segment index detection processing unit 31. Details of the operation of the speech segment index detection processing unit 31 will be described later with reference to FIG. 8.

次いで、本発明に係る目的音声抽出装置１２は、音声認識部３２、３３を用い、発話区間インデックス検出処理部３１から出力される、分離された音声信号である販売員の音声信号及び顧客の音声信号のそれぞれに対して、適宜音声認識の動作を実施し、認識結果及びタイムスタンプを得る。ここで、タイムスタンプとは音声認識部３２、３３が出力する時間情報等である。タイムスタンプは後続の段階において認識結果を統合する際の時系列情報となり得る。 Next, the target speech extraction device 12 according to the present invention uses the speech recognition units 32 and 33 to perform speech recognition, as appropriate, on each of the separated voice signals output from the speech segment index detection processing unit 31, namely the salesperson's voice signal and the customer's voice signal, and obtains recognition results and time stamps. Here, a time stamp is time information or the like output by the speech recognition units 32 and 33. The time stamps can serve as time-series information when the recognition results are integrated in a subsequent stage.

次いで、本発明に係る目的音声抽出装置１２は、統合選択部３４を用い、音声認識の結果を統合し得る。具体的には、話者の区別、音声認識の結果、タイムスタンプ等が相互に関連付けられたデータが生成され得る。 Next, the target speech extraction apparatus 12 according to the present invention can integrate the results of speech recognition using the integration selection unit 34. Specifically, data in which speaker discrimination, speech recognition results, time stamps, and the like are associated with each other can be generated.

次いで、本発明に係る目的音声抽出装置１２は、録音範囲抽出部３５により、話者方向インデックス、音声認識結果、タイムスタンプ等の情報を元に、所定の又は指定の時間領域に含まれる音声信号を切り出して適宜サーバ装置等に保存し得る。本発明においては、販売員又は顧客のそれぞれについて個別に音声認識を実施することにより、録音部分を指定する際には、両者の対話内容を確認し得る。また、不必要な部分の録音を避けることも可能であり、サーバ装置等の資源を効率的に利用し得る。 Next, the target speech extraction device 12 according to the present invention uses the recording range extraction unit 35 to generate a speech signal included in a predetermined or specified time region based on information such as a speaker direction index, a speech recognition result, and a time stamp. Can be cut out and stored in a server device or the like as appropriate. In the present invention, by performing voice recognition individually for each salesperson or customer, when the recording portion is designated, the contents of the dialogue between the two can be confirmed. It is also possible to avoid recording unnecessary parts, and resources such as server devices can be used efficiently.

［発話区間インデックス検出処理部３１の処理］
図８は発話区間インデックス検出処理部３１における処理を説明するためのフロー図である。発話区間インデックス検出処理部３１では、音声信号を取得して（ステップＳ１）、当該音声信号がマイクロホン１１ａからの入力であるか否かを判定する（ステップＳ２）。マイクロホン１１ａ（第１のマイクロホン）からの入力であれば、販売員デジタル音声入力信号について、例えば、ハニング窓又はハミング窓による窓掛け処理が行われ、販売員窓掛け処理済信号とされる（ステップＳ３）。続いて、販売員窓掛け処理済信号は、離散フーリエ変換処理によって周波数領域に変換されて販売員周波数領域信号とされ（ステップＳ４）、図中破線の囲みで示す処理に移行する。同様に、ステップＳ２において、マイクロホン１１ｂ（第２のマイクロホン）からの入力であると判定されると、顧客デジタル音声入力信号について、同様にして、窓掛け処理（ステップＳ５）、離散フーリエ変換処理（ステップＳ６）が行われて、顧客周波数領域信号とされる。 [Processing of Speech Segment Index Detection Processing Unit 31]
FIG. 8 is a flowchart for explaining the processing in the speech segment index detection processing unit 31. The speech segment index detection processing unit 31 acquires a voice signal (step S1) and determines whether the voice signal is an input from the microphone 11a (step S2). If it is an input from the microphone 11a (the first microphone), the salesperson digital voice input signal is windowed, for example with a Hanning or Hamming window, to give a salesperson windowed signal (step S3). The salesperson windowed signal is then transformed into the frequency domain by a discrete Fourier transform to give a salesperson frequency domain signal (step S4), and processing moves on to the part enclosed by the broken line in the figure. Similarly, if it is determined in step S2 that the input is from the microphone 11b (the second microphone), the customer digital voice input signal undergoes the same windowing (step S5) and discrete Fourier transform (step S6) to give a customer frequency domain signal.

発話区間インデックス検出処理部３１では、前述したように、話者方向インデックスを検出し、販売員周波数領域信号、顧客周波数領域信号、及び話者方向インデックスに基づいて、つまり、数１に基づいてＣＳＰ係数を算出する（ステップＳ７）。 As described above, the speech segment index detection processing unit 31 detects the speaker direction index and calculates the CSP coefficient from the salesperson frequency domain signal, the customer frequency domain signal, and the speaker direction index, that is, according to Equation 1 (step S7).

続いて、販売員周波数領域信号と顧客周波数領域信号について、販売員側遅延和アレイ処理を行って（ステップＳ８）、販売員の音声信号を強調して、販売員強調信号とする。同様にして、販売員周波数領域信号と顧客周波数領域信号について、顧客側遅延和アレイ処理を行って（ステップＳ９）、顧客の音声信号を強調して、顧客強調信号とする。 Subsequently, the salesperson side delay sum array processing is performed on the salesperson frequency domain signal and the customer frequency domain signal (step S8), and the salesperson's voice signal is emphasized to obtain a salesperson enhancement signal. Similarly, the customer side delay sum array processing is performed on the salesperson frequency domain signal and the customer frequency domain signal (step S9), and the customer's voice signal is emphasized to obtain a customer emphasized signal.
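
A two-channel delay-and-sum step in the frequency domain might look like the sketch below. This section of the specification does not spell out the formula, so the steering term, the function name `delay_and_sum`, and the circularly shifted test signal are all assumptions made for illustration.

```python
import numpy as np

def delay_and_sum(spec_a, spec_b, lag, n_fft):
    """Steer a two-microphone array toward the direction whose delay is `lag` samples.

    spec_a, spec_b: one-sided spectra (np.fft.rfft) of the two channels.
    lag: delay of channel B relative to channel A for the wanted direction.
    """
    k = np.arange(len(spec_a))
    # Advance channel B by `lag` samples so both channels line up, then average.
    steer = np.exp(2j * np.pi * k * lag / n_fft)
    return 0.5 * (spec_a + spec_b * steer)

# Synthetic check with a circular 4-sample delay between the channels.
rng = np.random.default_rng(1)
a = rng.standard_normal(512)
b = np.roll(a, 4)                                # b[n] == a[n - 4] (circular)
spec_a, spec_b = np.fft.rfft(a), np.fft.rfft(b)

toward_source = delay_and_sum(spec_a, spec_b, 4, 512)    # steered at the source
away_from_source = delay_and_sum(spec_a, spec_b, -4, 512)
print(np.sum(np.abs(toward_source) ** 2) > np.sum(np.abs(away_from_source) ** 2))
```

Steering with the salesperson-side lag would correspond to step S8 and steering with the customer-side lag to step S9; the actual lags depend on where the two speakers sit.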

次に、販売員強調信号は、スペクトルサブトラクション処理（ステップＳ１０）において雑音が取り除かれて、さらに、ＣＳＰ係数を用いて利得調整処理（ステップＳ１１）を行った後、適宜フロアリング処理（ステップＳ１２）を実施し、販売員側の音声信号を得る。 Next, noise is removed from the salesperson enhancement signal by spectral subtraction (step S10), gain adjustment using the CSP coefficient is applied (step S11), and flooring is then performed as appropriate (step S12) to obtain the salesperson-side voice signal.

同様にして、顧客強調信号は、スペクトルサブトラクション処理（ステップＳ１３）において雑音が取り除かれて、さらに、ＣＳＰ係数を用いて利得調整処理（ステップＳ１４）を行った後、適宜フロアリング処理（ステップＳ１５）を実施し、顧客側の音声信号を得る。 Similarly, the customer-enhanced signal is subjected to a flooring process (step S15) after the noise is removed in the spectral subtraction process (step S13) and the gain adjustment process (step S14) is further performed using the CSP coefficient. To obtain the customer's voice signal.
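
The three steps applied to each enhanced signal (S10 to S12, or S13 to S15) can be sketched per frame as follows. The exact formulas are given elsewhere in the specification, so the forms used here, Y = X − α·N, D = φ·Y, and Z = max(D, β·N), are assumptions, as are the names `enhance_frame`, `alpha`, and `beta` and the clipping of negative CSP values to zero.

```python
import numpy as np

def enhance_frame(power, noise_est, csp, alpha=1.0, beta=0.1):
    """SS -> CSP gain adjustment -> flooring for one power-spectrum frame.

    power:     power spectrum of the delay-and-sum output for this frame
    noise_est: estimated noise power spectrum
    csp:       CSP coefficient of the target direction for this frame
    """
    subtracted = power - alpha * noise_est          # spectral subtraction
    gained = max(csp, 0.0) * subtracted             # gain adjustment by CSP coefficient
    return np.maximum(gained, beta * noise_est)     # flooring

frame = np.array([4.0, 1.0, 0.5])
noise = np.array([1.0, 1.0, 1.0])
print(enhance_frame(frame, noise, csp=0.5))   # [1.5 0.1 0.1]
```

When the CSP coefficient is near zero, as in a frame where only the other speaker is talking, every bin falls back to the floor β·N.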

さらに、発話区間インデックス検出処理部３１では、前述の数１に示すＣＳＰ係数に基づいた発話区間検出処理を行って、前述のようにして得られた販売員側の音声信号と顧客側の音声信号をそれぞれ独立のチャネルとして一時保存する（発話区間検出処理に当たっては、前述の目的音抽出手法によるアルゴリズムが用いられることになる）。ここでは、前述したように、目的音の分離とともに話者方向インデックスも検出し、分離した音声信号と話者方向インデックスとを関連付けておく。 Further, the utterance section index detection processing unit 31 performs the utterance section detection processing based on the CSP coefficient expressed by the above-described formula 1, and the salesperson side audio signal and the customer side audio signal obtained as described above are used. Are temporarily stored as independent channels (in the speech section detection process, the algorithm based on the target sound extraction method described above is used). Here, as described above, the speaker direction index is detected together with the separation of the target sound, and the separated speech signal and the speaker direction index are associated with each other.

発話区間インデックス検出処理部３１は、販売員側の音声信号及び当該音声信号の話者方向インデックスを第１の音声認識部３２に与えるとともに、録音範囲抽出部３５に与える。また、発話区間インデックス検出処理部３１は、顧客側の音声信号及び当該音声信号の話者方向インデックスを第２の音声認識部３３に与えるとともに、録音範囲抽出部３５に与える。 The utterance section index detection processing unit 31 provides the sales person side speech signal and the speaker direction index of the speech signal to the first speech recognition unit 32 and also to the recording range extraction unit 35. Further, the utterance section index detection processing unit 31 gives the customer-side voice signal and the speaker direction index of the voice signal to the second voice recognition unit 33 and to the recording range extraction unit 35.

第１の音声認識部３２では、販売員側の音声信号について音声認識を行って、認識結果とタイムスタンプを得る（販売員音声認識結果及び販売員タイムスタンプを得る）。また、第２の音声認識部３３では、顧客側の音声信号について音声認識を行って、認識結果とタイムスタンプを得る（顧客音声認識結果及び顧客タイムスタンプを得る）。ここで、タイムスタンプとは、第１の音声認識部３２及び第２の音声認識部３３において出力される時間情報であり、認識結果を統合する際の時系列情報として用いられる。 The first voice recognition unit 32 performs voice recognition on the salesperson side voice signal to obtain a recognition result and a time stamp (a salesperson voice recognition result and a salesperson time stamp are obtained). The second voice recognition unit 33 performs voice recognition on the customer-side voice signal to obtain a recognition result and a time stamp (a customer voice recognition result and a customer time stamp are obtained). Here, the time stamp is time information output from the first speech recognition unit 32 and the second speech recognition unit 33, and is used as time-series information when integrating the recognition results.

前述の販売員音声認識結果及び販売員タイムスタンプと顧客音声認識結果及び顧客タイムスタンプとは、統合選択部３４に与えられ、ここで、これら音声認識結果を統合して、表１に示す対話表を得る（なお、この対話表は、例えば、ＨＴＭＬ形式でユーザに提示するようにしてもよい）。 The salesperson voice recognition result, salesperson time stamp, customer voice recognition result, and customer time stamp described above are provided to the integration selection unit 34, where these voice recognition results are integrated into the dialogue table shown in Table 1. (Note that this dialog table may be presented to the user in HTML format, for example).

この対話表から所望の音声信号の部分を録音部として選択すると、統合選択部３４は目的話者録音範囲（つまり、タイムスタンプで区切られた範囲）を生成し、録音範囲抽出部３５に送る。録音範囲抽出部３５では、話者方向インデックスと目的話者録音範囲に基づいて該当する区間（範囲）の音声信号を抽出し、顧客対話記録サーバ１３に販売員音声として保存する。 When a desired voice signal portion is selected as a recording unit from the dialogue table, the integrated selection unit 34 generates a target speaker recording range (that is, a range delimited by a time stamp) and sends it to the recording range extraction unit 35. The recording range extraction unit 35 extracts a voice signal of a corresponding section (range) based on the speaker direction index and the target speaker recording range, and stores it in the customer dialogue recording server 13 as salesperson voice.
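
The integration and range-extraction steps can be pictured with a small sketch. The data layout below is hypothetical: the specification describes a dialogue table and a target speaker recording range delimited by time stamps, but the `Utterance` fields and the helper names `build_dialogue_table` and `extract_recording` are illustrative choices, and the speaker label stands in for the speaker direction index.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    speaker: str     # "salesperson" or "customer"
    start: float     # time stamp: start of the utterance [s]
    end: float       # time stamp: end of the utterance [s]
    text: str        # speech recognition result

def build_dialogue_table(salesperson, customer):
    """Merge both speakers' recognition results into one time-ordered table."""
    return sorted(salesperson + customer, key=lambda u: u.start)

def extract_recording(samples, fs, table, selected_rows):
    """Cut the salesperson's samples for the selected rows of the dialogue table."""
    out = []
    for row in selected_rows:
        u = table[row]
        if u.speaker == "salesperson":           # keep only the target speaker
            out.extend(samples[int(u.start * fs):int(u.end * fs)])
    return out

sales = [Utterance("salesperson", 0.0, 2.0, "greeting"),
         Utterance("salesperson", 5.0, 6.0, "product explanation")]
cust = [Utterance("customer", 2.5, 4.5, "question")]
table = build_dialogue_table(sales, cust)

channel = list(range(70))                        # stand-in for the salesperson channel, fs = 10 Hz
clip = extract_recording(channel, 10, table, selected_rows=[0, 1, 2])
print([u.speaker for u in table], len(clip))
```

Selecting all three rows still stores only the two salesperson ranges (0 to 2 s and 5 to 6 s), which mirrors the stated goal of keeping the salesperson's voice while leaving the customer's out.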

本実施の形態では、上述のようにして、話者方向インデックス、音声認識結果、及びタイムスタンプを用いて、録音区間を決定するようにしており、各話者について個別に音声認識を行うことによって録音部分を指定する際には、両者の対話内容を確認しながら録音部分の指定を行うことができる。 In the present embodiment, as described above, the recording interval is determined using the speaker direction index, the speech recognition result, and the time stamp, and by performing speech recognition for each speaker individually, When specifying the recording part, the recording part can be specified while confirming the content of the dialogue between the two.

また、本実施の形態においては、不必要な部分の録音を避けることができる結果、顧客対話記録サーバ１３におけるディスク容量を低減することができ、効率的である。 Further, in the present embodiment, recording of unnecessary portions can be avoided. As a result, the disk capacity in the customer interaction recording server 13 can be reduced, which is efficient.

ここで、マイクロホンの種類とＡＦＥについて、顧客の音声信号の削減という観点から比較を行った（評価試験を行った）。評価実験には、模擬対面販売形式で収集した音声信号を用いた。評価試験では、縦（販売員と顧客間の方向）１００ｃｍのテーブルの両側に、販売員役と顧客役の話者がそれぞれ１名ずつ着席して、投資信託に関する内容を話しているものとする。 Here, the microphone types and the AFE were compared in an evaluation test from the viewpoint of how much of the customer's voice signal is removed. The test used voice signals collected in a simulated face-to-face sales setting. A speaker playing the salesperson and a speaker playing the customer were seated on opposite sides of a table measuring 100 cm in the direction between the salesperson and the customer, and talked about investment trusts.

対話は、販売員、顧客、そして、販売員の順番で発話した内容を１セットとし、予め定めた標準位置、標準位置から左右に少しずれた位置、テーブルに極端に接近した位置の３ケースで各３セットずつ音声を収録した。マイクロホンはＳｏｎｙ（登録商標）の無指向性マイクロホン（ＳｏｎｙＥＣＭ−５５Ｂ）を２つ用いてマイクロホンアレイを構成し、販売員役と顧客役の中央に配置した。 One set of dialogue consisted of utterances by the salesperson, the customer, and the salesperson again, in that order. Three sets were recorded in each of three cases: a predetermined standard position, a position shifted slightly to the left or right of the standard position, and a position extremely close to the table. The microphone array was built from two Sony (registered trademark) omnidirectional microphones (Sony ECM-55B) and placed midway between the salesperson and the customer.

比較のため，単一指向性マイク（ＡＫＧ４００）をそれぞれの話者の方向に向けて設置して、両話者の音声を収集した。マイクロホン間の距離は、指向性及び無指向性ともに共に１２．５ｃｍとした。この評価試験では、無指向性マイクロホンで受けた音声信号でＡＦＥを行った。 For comparison, a unidirectional microphone (AKG400) was installed in the direction of each speaker, and the voices of both speakers were collected. The distance between the microphones was 12.5 cm for both directivity and non-directivity. In this evaluation test, AFE was performed with an audio signal received by an omnidirectional microphone.

ここでは、販売員の音声信号のみを抽出して、顧客の音声信号を記録として残さないようにするため、顧客の音声信号を雑音とみなして、雑音削減率（ＮＲＲ：ＮｏｉｓｅＲｅｄｕｃｔｉｏｎＲａｔｅ）によって評価を行った。この際、販売員側に近い無指向性マイクフォンで収音された顧客の発声音圧レベルを基準として、当該基準からの顧客の音声信号の削減度合いにより効果を比較した。 Here, in order to extract only the sales person's voice signal and not leave the customer's voice signal as a record, the customer's voice signal is regarded as noise and is evaluated by a noise reduction rate (NRR). Went. At this time, the effect was compared based on the degree of reduction of the voice signal of the customer from the reference, based on the voice pressure level of the customer collected by the omnidirectional microphone near the salesperson.

ただし、収録デバイスの相違に起因する録音レベルの差を吸収するため、販売員の音声信号のパワーが各ケースで同程度になるようにコンピュータ上で正規化を行った。本評価実験で用いるＮＲＲの定義は以下の通りである。 However, in order to absorb the difference in recording level due to the difference in recording devices, normalization was performed on the computer so that the power of the salesperson's audio signal was the same in each case. The definition of NRR used in this evaluation experiment is as follows.

ＮｏｉｓｅＲｅｄｕｃｔｉｏｎＲａｔｅ（ＮＲＲ：％）＝無指向性マイクロホン（基準マイクロホン）による顧客発声音圧レベル［ｄＢ］−指向性マイクロホン（又はＡＦＥ後）の顧客発声音圧レベル［ｄＢ］ Noise Reduction Rate (NRR: %) = customer utterance sound pressure level [dB] with the omnidirectional microphone (reference microphone) − customer utterance sound pressure level [dB] with the directional microphone (or after AFE)

通常、ＮＲＲは入出力のＳＮＲに基づいて算出されるが、本評価実験においては音声信号のパワーは正規化しているので、上記の定義のように雑音のみの差として定式化している。表２に実験結果を示す。 Normally, the NRR is calculated based on the input / output SNR, but in this evaluation experiment, the power of the audio signal is normalized, so it is formulated as a noise-only difference as defined above. Table 2 shows the experimental results.
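
The NRR defined above is a difference of two levels in dB, which can be computed as in the following sketch. The specification does not state how the sound pressure level is measured, so the mean-power level in `level_db`, the function names, and the synthetic signals are assumptions; because only a difference is taken, the choice of dB reference does not affect the result.

```python
import numpy as np

def level_db(samples):
    """Level of a signal segment in dB (relative to unit mean power)."""
    power = np.mean(np.square(samples))
    return 10.0 * np.log10(power + 1e-12)        # small offset avoids log(0)

def noise_reduction_rate(reference, processed):
    """Customer-speech level at the reference microphone minus the level after processing."""
    return level_db(reference) - level_db(processed)

reference = np.ones(1000)            # stand-in: customer speech at the omnidirectional microphone
processed = 0.1 * np.ones(1000)      # stand-in: the same segment after attenuation to 1/10 amplitude
print(round(noise_reduction_rate(reference, processed), 1))   # 20.0
```

In the evaluation described here, both inputs would be customer-speech segments taken after the salesperson's speech power has been normalized across recording devices.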

実験結果において、無指向性マイクロホンでは、音声到来方向に関係なく全ての音声を収音するため、顧客の音声についても高い音圧レベルを示すことが分かる。また、単一指向性マイクロホンでは、正面方向に対して指向性を有しているけれども、指向特性が鈍いので、顧客の音声をあまり遮断できていないことが分かる。これは、販売員の音声のみをサーバに録音するという目的においては、まったく役に立たないことを意味する。 From the experimental results, it can be seen that the omnidirectional microphone picks up all of the voice regardless of the voice arrival direction, and therefore shows a high sound pressure level for the voice of the customer. In addition, it can be seen that the unidirectional microphone has directivity with respect to the front direction, but the directivity is dull so that the customer's voice cannot be cut off so much. This means that it is completely useless for the purpose of recording only the salesperson's voice on the server.

一方、本実施の形態による音声収集システム（無指向性マイクロホンの使用）では、顧客の音声が顕著に削減されており、顧客音声が効果的に抑圧されていることが分かる。なお、本実施の形態による音声収集システムでは１９．６ｄＢの音圧レベルを示しているが、これはＡＦＥが音声認識のために数５に示すフロアリング処理を行うことによって微量なノイズを加えているためであって、この音声が音韻（何をしゃべっているか）を識別できる情報を持っていないことに注意されたい。なお、本実施の形態による音声収集システムでは販売員の音声がもれなく検出されている。 On the other hand, in the voice collection system according to the present embodiment (use of an omnidirectional microphone), it can be seen that the customer voice is remarkably reduced and the customer voice is effectively suppressed. Note that the sound collection system according to the present embodiment shows a sound pressure level of 19.6 dB. This is because the AFE performs a flooring process shown in Formula 5 for voice recognition, and adds a small amount of noise. Note that this voice has no information that can identify the phoneme (what it is talking about). In the voice collecting system according to the present embodiment, the salesperson's voice is completely detected.

上述の実施の形態では、マイクロホンから音声を収集して、マイクロホンアレイ目的音声抽出装置によって販売員の音声のみを顧客対話記録サーバに保存しているが、必要に応じて顧客の音声をサーバに保存することも可能である。また、必要に応じて、図４に示す話者方向インデックスに応じて３つ以上のマイクロホンを配置して、所望の話者のみの音声を抽出するようにしてもよい。 In the above-described embodiment, the voice is collected from the microphone, and only the salesperson's voice is stored in the customer interaction recording server by the microphone array target voice extraction device, but the customer's voice is stored in the server as necessary. It is also possible to do. In addition, if necessary, three or more microphones may be arranged according to the speaker direction index shown in FIG. 4 to extract the voice of only a desired speaker.

また、上述の実施の形態では、相互相関係数を用いたが、相関係数を求める他の方法を用いるようにしてもよい。そして、上述の音声収集システムの動作を実現するプログラムをコンピュータ上で動作させても同様に所望の話者のみの音声を抽出することができる。 In the above-described embodiment, the cross-correlation coefficient is used, but other methods for obtaining the correlation coefficient may be used. Then, even if a program that realizes the operation of the above-described voice collection system is operated on a computer, the voice of only a desired speaker can be similarly extracted.

［音声処理の諸段階の順序による音声強調の性能の例］
本発明に係る音声収集においては、前述の図８を用いて音声処理の諸段階及びそれらの順序を示したように、ＳＳ処理→ＣＳＰによる利得調整→Ｆｌｏｏｒｉｎｇ処理の順で、目的音声を収集するための音声強調処理を行う。この順序は、本発明に係る音声収集方法のための音声強調において重要なポイントであり、以下に処理順番の違いによる音声強調の性能の差を例示する。 [Example of speech enhancement performance based on the sequence of speech processing steps]
In the voice collection according to the present invention, the target voices are collected in the order of SS processing → gain adjustment by CSP → flooring processing, as shown in the steps of voice processing and their order using FIG. 8 described above. Voice enhancement processing is performed. This order is an important point in speech enhancement for the speech collection method according to the present invention, and the difference in speech enhancement performance due to the difference in processing order will be exemplified below.

音声強調の性能の差を試験するための音声は、マイクロホンアレイ１１を介して収集し、サンプリング周波数２２ｋＨｚ、フレームサイズ２３ｍｓ、フレームシフト１５ｍｓ、ＦＦＴサイズ５１２点の条件で処理した後、音声強調に用い、目的音声強調信号とした。得られた目的音声強調信号に対して、さらに適宜音声認識処理を実施した。 Speech for testing the difference in performance of speech enhancement is collected through the microphone array 11 and processed under the conditions of a sampling frequency of 22 kHz, a frame size of 23 ms, a frame shift of 15 ms, and an FFT size of 512 points, and then used for speech enhancement. The target speech enhancement signal was used. A speech recognition process was further appropriately performed on the obtained target speech enhancement signal.
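
The framing conditions listed above translate into sample counts as in this sketch. It assumes that "22 kHz" means the 22050 Hz rate used earlier in the description and that a Hamming window is applied; the constant and function names are illustrative.

```python
import numpy as np

FS = 22050                            # "22 kHz" taken as 22050 Hz (assumption)
FRAME_LEN = round(0.023 * FS)         # 23 ms frame  -> 507 samples
FRAME_SHIFT = round(0.015 * FS)       # 15 ms shift  -> 331 samples
N_FFT = 512                           # FFT size; frames are zero-padded from 507 to 512

def power_spectra(signal):
    """Split a signal into windowed frames and return their power spectra."""
    n_frames = 1 + (len(signal) - FRAME_LEN) // FRAME_SHIFT
    window = np.hamming(FRAME_LEN)
    spectra = np.empty((n_frames, N_FFT // 2 + 1))
    for t in range(n_frames):
        start = t * FRAME_SHIFT
        frame = signal[start:start + FRAME_LEN] * window
        spectra[t] = np.abs(np.fft.rfft(frame, N_FFT)) ** 2
    return spectra

one_second = np.zeros(FS)
print(FRAME_LEN, FRAME_SHIFT, power_spectra(one_second).shape)   # 507 331 (66, 257)
```

A 23 ms frame at this rate is 507 samples, which is why a 512-point FFT is a natural fit.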

まず、本発明に係る音声強調を用いることにより、音声認識率が向上する例を示す。表３に、４名の話者による５０種類の音声コマンドの発話収録における、音声強調を従来技術に係るＳＳ処理のみとして音声認識処理を実施した場合のコマンド認識率と、本発明に係る所定の順序に基づく音声強調、すなわち、ＳＳ処理→ＣＳＰによる利得調整→Ｆｌｏｏｒｉｎｇ処理を実施した場合のコマンド認識率の比較を示す。コマンド認識率は音声認識率として扱い得る。従って、表３に示すように、本発明に係る音声強調により、音声認識率を高めることが可能である。

First, an example in which the speech recognition rate is improved by using speech enhancement according to the present invention will be described. Table 3 shows the command recognition rate when the speech recognition processing is performed only with the SS processing according to the prior art in the speech recording of 50 types of speech commands by four speakers, and the predetermined number according to the present invention. A comparison of command recognition rates when performing speech enhancement based on order, that is, SS processing → gain adjustment by CSP → flooring processing is shown. The command recognition rate can be treated as a speech recognition rate. Therefore, as shown in Table 3, the speech recognition rate can be increased by the speech enhancement according to the present invention.

次いで、本発明に係る音声強調の諸段階の順序が、音声認識率の結果に影響する例を示す。表４に、音声強調の処理手順を入れ替えた場合のコマンド認識率を比較した結果を、表３に追記した表として示す。話者及び音声収集条件等は、前述の表３に示した例と同様であり、「処理手順入れ替え１」としてＳＳ処理→Ｆｌｏｏｒｉｎｇ処理→ＣＳＰによる利得調整の手順で音声強調を実施し、及び「処理手順入れ替え２」としてＣＳＰによる利得調整→ＳＳ処理→Ｆｌｏｏｒｉｎｇ処理とした音声強調を実施した。表４にコマンド認識率として示す音声認識率を比較すると、本発明に係る音声強調の手順として、ＳＳ処理→ＣＳＰによる利得調整→Ｆｌｏｏｒｉｎｇ処理の順で処理したときに顕著に高い性能が得られた。従って、この順番に処理するという手順が重要であることがわかる。

Next, an example is shown in which the order of the steps of speech enhancement according to the present invention affects the result of speech recognition rate. Table 4 shows the result of comparing the command recognition rates when the speech enhancement processing procedure is changed as a table added to Table 3. The speaker, voice collection conditions, and the like are the same as those in the example shown in Table 3 above. As “Processing Procedure Replacement 1”, voice enhancement is performed in the procedure of SS processing → flooring processing → gain adjustment by CSP, and “ As “Processing Procedure Change 2”, speech enhancement was performed by gain adjustment by CSP → SS processing → Flooring processing. Comparing the speech recognition rates shown as the command recognition rates in Table 4, a significantly high performance was obtained when processing in the order of SS processing → gain adjustment by CSP → flooring processing as a speech enhancement procedure according to the present invention. . Therefore, it is understood that the procedure of processing in this order is important.
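
The three processing orders compared in Table 4 are not interchangeable, which a toy example makes visible. This sketch does not reproduce the recognition rates; it only shows that the orders produce different spectra on one noise-only frame. The formulas (subtraction factor `ALPHA`, flooring factor `BETA`, multiplicative CSP gain) and all names are assumptions for illustration.

```python
import numpy as np

ALPHA, BETA = 1.0, 0.1    # assumed subtraction and flooring factors

def ss(p, n):
    return p - ALPHA * n                 # spectral subtraction

def flooring(p, n):
    return np.maximum(p, BETA * n)       # replace small values by the floor

def csp_gain(p, csp):
    return csp * p                       # gain adjustment by CSP coefficient

def proposed(p, n, csp):                 # SS -> CSP gain -> flooring
    return flooring(csp_gain(ss(p, n), csp), n)

def swapped_1(p, n, csp):                # SS -> flooring -> CSP gain
    return csp_gain(flooring(ss(p, n), n), csp)

def swapped_2(p, n, csp):                # CSP gain -> SS -> flooring
    return flooring(ss(csp_gain(p, csp), n), n)

noise = np.ones(4)
frame = np.array([1.2, 0.8, 3.0, 1.0])   # noise-only frame with one residual peak
csp = 0.2                                # low CSP: target speaker is silent

for order in (proposed, swapped_1, swapped_2):
    print(order.__name__, order(frame, noise, csp))
```

In this toy frame, the first swapped order scales the floor itself by the CSP coefficient, so its output drops below β·N, while the second subtracts the full noise estimate from an already attenuated spectrum. How these differences affect recognition is what Table 4 reports.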

図９に、本発明に係る音声強調処置の諸段階における雑音区間の音声信号の例を示す。本発明に係る音声強調の処理手順が飛びぬけて高い性能を示す理由として、図９の（ａ）（ｂ）（ｃ）（ｄ）で示すような模式図による説明が考えられる。雑音区間（目的話者の非発話区間）の例（２００）は、いずれも振幅の周波数特性として表す。図９（ａ）は、スペクトルサブトラクション（ＳＳ）処理を行う前のパワースペクトルＸω（Ｔ）を示す模式図である。図９（ｂ）はＳＳ処理を実施した減算後パワースペクトルＹω（Ｔ）を示す模式図であり、ＳＳ処理によって雑音が減少している。図９（ｃ）はＣＳＰ係数による利得調整後のパワースペクトルＤω（Ｔ）を示す模式図であり、ＣＳＰ係数による利得調整によって、さらに雑音が減少している。図９（ｄ）は、Ｆｌｏｏｒｉｎｇ処理を行った後の認識用パワースペクトルＺω（Ｔ）を示す模式図であり、でこぼこしていた雑音のスペクトルが、なだらかなものになる。 FIG. 9 shows an example of the voice signal in a noise section at the successive stages of the speech enhancement processing according to the present invention. The schematic diagrams in FIGS. 9(a) to 9(d) suggest why the speech enhancement procedure according to the present invention performs so much better than the alternatives. Each example (200) of a noise section (a section in which the target speaker is not speaking) is drawn as amplitude versus frequency. FIG. 9(a) is a schematic diagram of the power spectrum Xω(T) before spectral subtraction (SS). FIG. 9(b) is a schematic diagram of the subtracted power spectrum Yω(T) after SS, in which the noise has been reduced. FIG. 9(c) is a schematic diagram of the power spectrum Dω(T) after gain adjustment by the CSP coefficient, in which the noise has been reduced further. FIG. 9(d) is a schematic diagram of the recognition power spectrum Zω(T) after flooring, in which the previously jagged noise spectrum has become smooth.

The effects of the CSP gain adjustment and the flooring appear in the noise section (the section in which the target speaker is not speaking). The spectrum of the noise section is flattened by the SS processing, the peaks that still protrude here and there are further suppressed by multiplying by the CSP coefficient, and the valleys are then filled by the flooring, giving a smoothed, gentle spectral envelope (figuratively, like a landscape covered in snow). As a result, noise is no longer mistaken for the voice of the target speaker. A known problem of conventional speech recognition methods is that, even though the target speaker is not speaking, ambient noise is mistaken for the target speaker's voice and causes recognition errors. It is considered that such errors are reduced by processing in the order SS processing → gain adjustment (by the CSP coefficient) → flooring processing.
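The processing order described above can be illustrated with a minimal Python sketch for a single frame. The formulas used here are common textbook forms and are assumptions for illustration, not formulas quoted from this section: subtraction Yω = Xω − α·Uω, gain adjustment Dω = φ·Yω with a single CSP coefficient φ for the frame, and flooring Zω = max(Dω, β·Uω). The function name `enhance_frame`, the clipping of φ to non-negative values, and all numeric values are hypothetical.

```python
def enhance_frame(x_pow, u_pow, csp_coeff, alpha=1.0, beta=0.1):
    """Apply SS -> CSP gain adjustment -> flooring to one frame.

    x_pow     : observed power spectrum X (one value per frequency bin)
    u_pow     : estimated noise power spectrum U
    csp_coeff : CSP coefficient for this frame at the target direction
    alpha     : subtraction constant (assumed value)
    beta      : flooring coefficient (assumed value)
    """
    # Assumption: a negative CSP coefficient is treated as zero gain.
    gain = max(csp_coeff, 0.0)
    # 1) Spectral subtraction: remove the scaled noise estimate.
    y = [x - alpha * u for x, u in zip(x_pow, u_pow)]
    # 2) Gain adjustment: scale the subtracted spectrum by the CSP coefficient.
    d = [gain * v for v in y]
    # 3) Flooring: fill valleys up to beta times the noise estimate.
    return [max(v, beta * u) for v, u in zip(d, u_pow)]


# Hypothetical noise-section frame: the target speaker is silent,
# so the CSP coefficient at the target direction is small.
z = enhance_frame([1.0, 3.0, 0.5], [1.0, 1.0, 1.0], csp_coeff=0.1)
```

In this toy frame the residual peak left by the subtraction (the second bin) is scaled down by the small CSP coefficient, and the two bins that fall below the floor are raised to it, which corresponds to the flattened envelope sketched in FIG. 9(d).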

[Example of operation status of portable salesperson voice collection device]
FIG. 10 illustrates an operation state of the portable salesperson voice collection device 60 according to an embodiment of the present invention. The portable salesperson voice collection device 60 includes microphones 60a and 60b, which constitute the microphone array of the apparatus for implementing the voice collection method according to the present invention described above with reference to FIGS. 1 to 3 and FIG. 6. The portable salesperson voice collection device 60 further includes digital signal processing means capable of performing the steps of the voice collection method according to the present invention, and includes storage means, voice reproduction means, and the like as appropriate.

Typically, the portable salesperson voice collection device 60 is fixed to, for example, the chest of the salesperson 22. It is arranged so that, when the salesperson 22 faces the customer 21, voice arrival direction 1 (70), from the mouth of the salesperson 22 toward the portable salesperson voice collection device 60, and voice arrival direction 2 (72), from the mouth of the customer 21 toward the portable salesperson voice collection device 60, form different angles with the direction vector connecting the microphone 60a and the microphone 60b. For example, the direction vector may point from the top of the head of the salesperson 22 toward the feet, substantially parallel to the body axis (seen from the customer 21, the two microphones 60a and 60b appear to be arranged one above the other), so that voice arrival direction 1 (70) is substantially parallel to the direction vector and voice arrival direction 2 (71) is substantially perpendicular to it. The arrangement is not limited to this: it suffices that the portable salesperson voice collection device 60 is arranged so that the direction vector connecting the microphone 60a and the microphone 60b forms different angles with voice arrival direction 1 (70) and voice arrival direction 2 (71), and the size, shape, and the like of the portable salesperson voice collection device 60 can be designed as appropriate.
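To illustrate why different angles to the microphone axis lead to different arrival-time differences, the following Python sketch converts an arrival angle into an expected inter-microphone delay in samples. It assumes a far-field plane-wave model, τ = d·cos θ / c, which is only an approximation for a mouth close to a chest-mounted device. The function name `delay_in_samples`, the microphone spacing, the sampling rate, and the speed of sound are all hypothetical values chosen for illustration.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at roughly 20 degrees C


def delay_in_samples(mic_distance_m, angle_deg, sample_rate_hz):
    """Expected inter-microphone delay (in samples) for a far-field source.

    angle_deg is the angle between the microphone axis and the arrival
    direction: 0 means arrival along the axis, 90 means broadside arrival.
    """
    delay_s = mic_distance_m * math.cos(math.radians(angle_deg)) / SPEED_OF_SOUND_M_S
    return round(delay_s * sample_rate_hz)


# Hypothetical geometry: microphones 0.1 m apart, sampled at 16 kHz.
salesperson_index = delay_in_samples(0.1, 0.0, 16000)   # roughly parallel arrival
customer_index = delay_in_samples(0.1, 90.0, 16000)     # roughly perpendicular arrival
```

Under these assumed values the two directions map to clearly different delays (several samples versus about zero), which is what allows a delay-based index to tell the two speakers apart.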

By arranging the portable salesperson voice collection device 60 in this way, using the microphone 60a and the microphone 60b as the microphone array in the voice collection method according to the present invention, and carrying out the above-described method for extracting the target voice so as to extract the voice that reaches the microphone array with a specific time difference, the voice of the salesperson 22 can be collected selectively. In the present invention, means for selectively collecting a salesperson's voice can thus be realized using a portable salesperson voice collection device 60 having a form similar to that of a commercially available voice recorder or the like.
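The idea of picking out sound that arrives with a specific time difference can be sketched with a CSP-style computation in Python. The sketch below whitens the cross spectrum of the two microphone signals and transforms it back to the lag domain, so that the peak position indicates the arrival-time difference. This is the commonly used form of the technique and is an assumption here, not the exact formula of this document. The helper names `dft`, `csp_coefficients`, and `peak_lag`, the sign convention (a positive lag means the sound reaches the second microphone later), and the synthetic test signal are all hypothetical. A naive DFT is used only to keep the example self-contained.

```python
import cmath
import random


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (illustration only)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out


def csp_coefficients(s1, s2, eps=1e-12):
    """CSP coefficients per lag for two equal-length frames."""
    x1, x2 = dft(s1), dft(s2)
    # Cross spectrum; conj(X1) * X2 makes a delay of s2 appear at a positive lag.
    cross = [a.conjugate() * b for a, b in zip(x1, x2)]
    # Keep only the phase (magnitude normalization).
    whitened = [c / (abs(c) + eps) for c in cross]
    return [v.real for v in dft(whitened, inverse=True)]


def peak_lag(csp):
    """Lag (in samples) of the largest CSP coefficient, mapped to a signed value."""
    n = len(csp)
    i = max(range(n), key=lambda k: csp[k])
    return i if i <= n // 2 else i - n


# Synthetic check: microphone 2 receives the same frame 3 samples later
# (circular shift used for simplicity).
rng = random.Random(0)
mic1 = [rng.uniform(-1.0, 1.0) for _ in range(32)]
mic2 = mic1[-3:] + mic1[:-3]
lag = peak_lag(csp_coefficients(mic1, mic2))
```

A frame whose peak lag matches the delay expected for the salesperson's direction would be kept, and frames peaking elsewhere would be attributed to other sources. How the lag is mapped to a direction index in practice depends on the microphone spacing and sampling rate.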

[Hardware configuration of salesperson voice collection device]
FIG. 11 is a diagram showing a hardware configuration of a salesperson voice collection device according to an embodiment of the present invention. In FIG. 11, the salesperson voice collection device is shown as an information processing apparatus 1000, and its hardware configuration is illustrated. The following describes the general configuration of an information processing apparatus, of which a computer is a typical example; needless to say, the minimum necessary configuration can be selected according to the environment.

The information processing apparatus 1000 includes a CPU (Central Processing Unit) 1010, a bus line 1005, a communication I/F 1040, a main memory 1050, a BIOS (Basic Input Output System) 1060, a parallel port 1080, a USB port 1090, a graphic controller 1020, a VRAM 1024, an audio processor 1030, an I/O controller 1070, and input means such as a keyboard and mouse adapter 1100. Storage means such as a flexible disk (FD) drive 1072, a hard disk 1074, an optical disk drive 1076, and a semiconductor memory 1078 can be connected to the I/O controller 1070.

Microphones 1036 and 1037, an amplifier circuit 1032, and a speaker 1034 are connected to the audio processor 1030. A display device 1022 is connected to the graphic controller 1020.

The BIOS 1060 stores a boot program executed by the CPU 1010 when the information processing apparatus 1000 starts up, programs that depend on the hardware of the information processing apparatus 1000, and the like. The FD (flexible disk) drive 1072 reads a program or data from a flexible disk 1071 and provides it to the main memory 1050 or the hard disk 1074 via the I/O controller 1070.
FIG. 5 shows an example in which the hard disk 1074 is included inside the information processing apparatus 1000; however, an interface for connecting external devices (not shown) may be connected to the bus line 1005 or the I/O controller 1070, and a hard disk may be connected to or added outside the information processing apparatus 1000.

As the optical disk drive 1076, for example, a DVD-ROM drive, a CD-ROM drive, a DVD-RAM drive, or a CD-RAM drive can be used. In each case, an optical disk 1077 compatible with the drive must be used. The optical disk drive 1076 can also read a program or data from the optical disk 1077 and provide it to the main memory 1050 or the hard disk 1074 via the I/O controller 1070.

The computer program provided to the information processing apparatus 1000 is stored on a recording medium such as the flexible disk 1071, the optical disk 1077, or a memory card, and is provided by the user. The computer program is installed on and executed by the information processing apparatus 1000 after being read from the recording medium via the I/O controller 1070 or downloaded via the communication I/F 1040. The operations that the computer program causes the information processing apparatus to perform are the same as the operations of the apparatus already described, so their description is omitted.

The aforementioned computer program may be stored on an external storage medium. As the storage medium, in addition to the flexible disk 1071, the optical disk 1077, or a memory card, a magneto-optical recording medium such as an MD, or a tape medium, can be used. Alternatively, a storage device such as a hard disk or an optical disk library provided in a server system connected to a dedicated communication line or the Internet may be used as the recording medium, and the computer program may be provided to the information processing apparatus 1000 via the communication line.

The above example has mainly described the information processing apparatus 1000; however, functions similar to those of the information processing apparatus described above can be realized by installing, on a computer, a program having the functions described for the information processing apparatus and causing that computer to operate as the information processing apparatus.

This apparatus can be realized as hardware, software, or a combination of hardware and software. A typical example of implementation by a combination of hardware and software is implementation on a computer system having a predetermined program. In such a case, the predetermined program is loaded into the computer system and executed, whereby the program causes the computer system to execute the processing according to the present invention. This program consists of a group of instructions that can be expressed in any language, code, or notation. Such a group of instructions enables the system to execute a specific function either directly or after one or both of (1) conversion to another language, code, or notation and (2) reproduction on another medium. Of course, the scope of the present invention includes not only such a program itself but also a program product including a medium on which the program is recorded. The program for executing the functions of the present invention can be stored on any computer-readable medium such as a flexible disk, MO, CD-ROM, DVD, hard disk device, ROM, MRAM, or RAM. For storage on a computer-readable medium, such a program can be downloaded from another computer system connected via a communication line or copied from another medium. Such a program can also be compressed, or divided into a plurality of parts, and stored on a single recording medium or on a plurality of recording media.

FIG. 1 is a block diagram schematically showing an example of a voice collection system according to an embodiment of the present invention.
FIG. 2 is a diagram showing the voice arrival direction with respect to the microphones.
FIG. 3 is a diagram showing the configuration of the target voice extraction device 12 according to an embodiment of the present invention.
FIG. 4 is a diagram showing an example of the speaker direction index with respect to the positions of the microphones.
FIG. 5 is a diagram showing the classification of microphones by directivity.
FIG. 6 is a diagram showing an example of a place where the microphone array according to an embodiment of the present invention is arranged.
FIG. 7 is a block diagram showing in detail the target voice extraction device 12 shown in FIG. 1.
FIG. 8 is a flowchart for explaining the processing in the utterance section index detection processing unit 31 shown in FIG. 7.
FIG. 9 is a diagram showing examples of the speech signal in a noise section at the successive stages of the speech enhancement processing according to the present invention.
FIG. 10 is a diagram illustrating an operation state of the portable salesperson voice collection device 60 according to an embodiment of the present invention.
FIG. 11 is a diagram showing the hardware configuration of the salesperson voice collection device according to an embodiment of the present invention.

Explanation of symbols

10 Voice collection system
11 Microphone array
12 Target voice extraction device
13 Customer dialogue recording server
31 Utterance section index detection processing unit
32, 33 Speech recognition unit
34 Integrated selection unit
35 Recording range extraction unit
60 Portable salesperson voice collection device
105, 106 Discrete Fourier transform processing unit
110 CSP coefficient calculation unit
120 Group delay array processing unit
130 Noise estimation unit
140 SS processing unit
150 Gain adjustment processing unit
160 Flooring processing unit

Claims

A voice collection method using a microphone array in which at least a first microphone and a second microphone are arranged a predetermined distance apart, for extracting and collecting a target voice from among a plurality of voices having mutually different directions of arrival, the method comprising:
performing a discrete Fourier transform on each of the voice signals received by the first microphone and the second microphone, obtaining a plurality of CSP coefficients related to the directions of arrival of the voices, and detecting a plurality of voice signals from the plurality of CSP coefficients;
detecting, from the obtained plurality of CSP coefficients, a voice direction index defined according to the angle formed by the line segment connecting the first microphone and the second microphone and the direction of arrival; and
extracting the signal of the target voice from the detected plurality of voice signals according to the detected voice direction index.

The voice collection method according to claim 1, wherein the plurality of voices are a first voice and a second voice, and the source of the first voice and the source of the second voice are each located within a predetermined angle range about a central axis defined by the line segment connecting the first microphone and the second microphone.

The voice collection method according to claim 1, wherein the plurality of voices are a first voice and a second voice, and the microphone array is arranged such that the line connecting the source of the first voice and the source of the second voice is substantially parallel to the line connecting the first microphone and the second microphone included in the microphone array.

The voice collection method according to claim 1, wherein the step of detecting the voice direction index compares the magnitudes of the plurality of CSP coefficients, and the voice direction index is derived from the difference between the times at which one voice reaches the first microphone and the second microphone.

The speech collection method according to claim 1, further comprising a step of performing an array process to emphasize the target speech based on the result of the discrete Fourier transform.

The voice collection method according to claim 1, further comprising:
performing SS (spectral subtraction) processing using an estimated noise power spectrum (Uω) and a subtraction constant (α), based on the results of the respective discrete Fourier transforms;
performing gain adjustment using the output of the SS processing step and the CSP coefficient; and
performing flooring processing on the output of the gain adjustment step using a flooring coefficient (β).

The voice collection method according to claim 1, wherein the plurality of voices are a first voice and a second voice, and the step of detecting the plurality of voice signals further detects, based on the CSP coefficients, an utterance section for at least one of the signal of the first voice and the signal of the second voice.

The voice collection method according to claim 7, wherein the step of detecting the plurality of voice signals further separates at least one of the signal of the first voice and the signal of the second voice from the detected utterance section.

The voice collection method according to claim 7, wherein the step of detecting the plurality of voice signals further associates the voice direction indexes corresponding to the signal of the first voice and the signal of the second voice with those signals as a first voice direction index and a second voice direction index, respectively.

The voice collection method according to claim 9, wherein the step of extracting the signal of the target voice further comprises:
a voice recognition step of performing voice recognition processing on the signal of the first voice and the signal of the second voice, using the signal of the first voice, the signal of the second voice, the first voice direction index, and the second voice direction index, to obtain a first voice recognition result and a second voice recognition result, and obtaining first time information and second time information indicating the times at which the first voice and the second voice were uttered;
an integration step of integrating the first voice recognition result and the second voice recognition result together with the first time information and the second time information; and
a cutout step of, when a location to be extracted is selected from the result of the integration, cutting out the voice signal of the utterance section corresponding to that location.

The voice collection method according to claim 10, wherein the integration step further comprises associating the first voice recognition result and the second voice recognition result, and the first time information and the second time information, with the first voice direction index and the second voice direction index.

The voice collection method according to claim 10, wherein the cutout step includes a step of cutting out a voice signal of an utterance section according to a voice direction index and time information corresponding to the selected location from the integrated information.

The voice collecting method according to claim 10, comprising a step of recording the cut out voice signal as a voice to be recorded.

A computer program for causing a computer to execute each step of the method according to any one of claims 1 to 13.

A voice collection system using a microphone array in which at least a first microphone and a second microphone are arranged a predetermined distance apart, for extracting and collecting a target voice from among a plurality of voices having mutually different directions of arrival, the system comprising:
voice detection means for performing a discrete Fourier transform on each of the voice signals received by the first microphone and the second microphone, obtaining a plurality of CSP coefficients related to the directions of arrival of the voices, and detecting a plurality of voice signals from the plurality of CSP coefficients;
voice direction index detection means for detecting, from the obtained plurality of CSP coefficients, a voice direction index defined according to the angle formed by the line segment connecting the first microphone and the second microphone and the direction of arrival; and
target voice extraction means for extracting the signal of the target voice from the detected plurality of voice signals according to the detected voice direction index.

A voice collection system using a microphone array in which at least a first microphone and a second microphone are arranged a predetermined distance apart, for extracting and collecting a first voice from among the first voice and a second voice having mutually different directions of arrival, wherein the source of the first voice and the source of the second voice are each located within a predetermined angle range about a central axis defined by the line segment connecting the first microphone and the second microphone, the system comprising:
means for performing array processing to emphasize the target voice based on the results of discrete Fourier transforms of the voice signals received by the first microphone and the second microphone;
means for performing SS (spectral subtraction) processing using an estimated noise power spectrum (Uω) and a subtraction constant (α), based on the results of the respective discrete Fourier transforms;
means for obtaining a CSP coefficient from the results of the respective discrete Fourier transforms and performing gain adjustment using the output of the means for performing the SS processing and the CSP coefficient;
means for performing flooring processing on the output of the means for performing the gain adjustment using a flooring coefficient (β);
voice detection means for detecting the signal of the first voice and the signal of the second voice from the voice signal subjected to the flooring processing;
voice direction index detection means for comparing the magnitudes of the obtained CSP coefficients independently for each of the first voice and the second voice, and obtaining a voice direction index derived from the difference between the times at which one voice reaches the first microphone and the second microphone;
target voice extraction means for extracting the signal of the first voice according to the voice direction index;
utterance section detection means for detecting an utterance section of the signal of the first voice from the CSP coefficient; and
target voice separation means for separating the signal of the first voice from the detected utterance section.