JP2005266797A

JP2005266797A - Method and apparatus for separating sound-source signal and method and device for detecting pitch

Info

Publication number: JP2005266797A
Application number: JP2005041169A
Authority: JP
Inventors: Tetsujiro Kondo; 哲二郎近藤; Tetsuhiko Arimitsu; 哲彦有光; Hiroshi Ichiki; 洋一木; Junichi Shima; 淳一嶋
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2004-02-20
Filing date: 2005-02-17
Publication date: 2005-09-29

Abstract

PROBLEM TO BE SOLVED: To pick up signals from a plurality of sound sources with stereo microphones and to separate a desired sound source signal.

SOLUTION: Stereo speech from a terminal 11 is input to a pitch detection part 12, which performs pitch detection using two wavelengths of the pitch as the detection unit, thereby detecting a pitch and a stationary part where the same pitch continues. A delay correcting addition part 13 applies delay correction to the stereo speech so that the speech from a desired sound source is in phase between the channels, and adds the channels together to output a signal in which the desired sound source signal is emphasized. A separation coefficient generation part 14 in a sound source signal separation part 19 generates filter coefficients for a filter arithmetic circuit 15 in the sound source signal separation part 19 according to the pitch detected by the pitch detection part 12. The filter coefficients generated by the separation coefficient generation part 14 are sent to the filter arithmetic circuit 15, which filters the output signal of the delay correcting addition part 13 with filter coefficients updated for each detected pitch to obtain a waveform output in which the desired sound source signal is separated.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to a sound source signal separation device and method and to a pitch detection device and method; for example, to a sound source signal separation device and method for satisfactorily separating speech signals from a plurality of sound sources picked up by stereo microphones, and to a pitch detection device and method for performing pitch detection suited to sound source signal separation.

Separating a desired sound source signal from an acoustic signal in which several kinds of sound source signals are mixed is a known technique. For example, as shown in FIG. 26, speech uttered by a plurality of persons, for example three persons SPA, SPB, and SPC, is picked up by acoustic-to-electric conversion means, for example left and right stereo microphones MCL and MCR, and the speech signal of one desired person is separated from the resulting acoustic signal.

As prior art for such sound source signal separation, Patent Document 1 discloses an acoustic signal separation circuit and a microphone device using it. There, a plurality of mixed signals, each a linear sum of mutually linearly independent sound source signals, are divided into frames. For each frame, the mixed signals are multiplied by the inverse of the mixing matrix that minimizes the zero-lag correlation between the signals separated by the separation circuit, so that each original speech signal is separated from the mixed signals.

Patent Document 2 discloses a sound source signal estimation device that estimates a desired sound source and is used to extract a desired speech signal in an environment with much ambient noise.

Furthermore, obtaining the pitch of the target speech has been considered as an aid to sound source signal separation. As a pitch detection technique, Patent Document 3 discloses an acoustic signal analysis method and device and a speech signal processing method and device. In these, the input signal is cut into frames of a predetermined time length and frequency analysis is performed on each frame. Harmonicity is evaluated within each frame from that frame's frequency analysis result, and harmonicity is also evaluated on the inter-frame difference of the amplitudes of the frequency analysis results. The pitch of the input signal is then detected using the results of these harmonicity evaluations.

Patent Document 1: JP 2001-222289 A
Patent Document 2: Japanese Patent Laid-Open No. H07-28492
Patent Document 3: JP 2000-181499 A

In general, separating a plurality of sound sources requires at least as many microphones as there are sound sources, and studies using such multiple microphones have been carried out. For example, Patent Document 1 discloses that two microphones can separate at most two sound sources. Patent Document 2 discloses a technique for extracting the speech signal of a target sound source using a plurality of microphones (a microphone array). In these techniques, separating a desired sound source signal from a mixture of several sound source signals requires at least as many microphones (multi-microphones) as there are sound sources.

Consequently, with such prior art it is difficult to separate the signals of three or more sound sources when using stereo microphones of the kind found in portable AV equipment such as a camera-integrated VTR (a so-called video camera).

In addition, when the pitch of the target speech is obtained before the sound source signal is separated, pitch detection suited to sound source signal separation is desirable.

The present invention has been proposed in view of these circumstances. Its object is to provide a sound source signal separation device and method, and a pitch detection device and method, that pick up speech signals (more generally, acoustic signals) from a plurality of sound sources using a small number of sound collecting means such as stereo microphones and can effectively separate the speech signal of a desired target sound source.

To solve the above problem, a sound source signal separation device according to the present invention comprises: sound source signal emphasizing means for emphasizing a desired sound source signal in an input acoustic signal in which acoustic signals from a plurality of sound sources are mixed and picked up by a plurality of sound collecting means; pitch detecting means for detecting the pitch of the desired sound source signal in the input acoustic signal; and sound source signal separating means for separating the desired sound source signal from the input acoustic signal based on the detected pitch and the sound source signal emphasized by the sound source signal emphasizing means.

As one example, the sound source signal separating means comprises filter means for separating the desired sound source signal from the output signal of the sound source signal emphasizing means, and filter coefficient output means for outputting filter coefficients for the filter means based on detection information from the pitch detecting means.

Here, the filter coefficient output means preferably outputs filter coefficients that give the filter means a frequency characteristic passing frequency components at integer multiples of the pitch frequency detected by the pitch detecting means. The filter coefficient output means also preferably includes storage means in which filter coefficients for several pitches are stored in advance, and reads out and outputs from the storage means the filter coefficients corresponding to the pitch detected by the pitch detecting means.
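As a rough illustration of a filter that passes integer multiples of the pitch frequency, the following Python sketch uses a two-tap feedforward comb filter whose delay equals one pitch period, together with a table of coefficients prepared in advance for several pitch periods. The comb design, the function names, the table range, and the numeric values are all assumptions made for illustration; the text above does not specify how the coefficients are designed.

```python
import math

def comb_coefficients(pitch_period_samples):
    """FIR coefficients with h[0] = h[T] = 0.5 (hypothetical design).

    The magnitude response |cos(pi * f * T / fs)| is 1 at every integer
    multiple of fs / T (the pitch frequency) and 0 halfway between them.
    """
    h = [0.0] * (pitch_period_samples + 1)
    h[0] = 0.5
    h[pitch_period_samples] = 0.5
    return h

def fir_filter(x, h):
    """Direct-form FIR filtering, treating samples before x[0] as zero."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

def gain_at(h, freq_hz, fs):
    """Magnitude of the frequency response of h at freq_hz."""
    re = sum(c * math.cos(2 * math.pi * freq_hz * k / fs) for k, c in enumerate(h))
    im = sum(c * math.sin(2 * math.pi * freq_hz * k / fs) for k, c in enumerate(h))
    return math.hypot(re, im)

fs = 8000      # sampling rate in Hz (assumed)
period = 40    # detected pitch period in samples, i.e. a 200 Hz pitch (assumed)

# Coefficients stored in advance for a range of pitch periods (assumed range),
# then read out for the detected pitch.
coefficient_table = {p: comb_coefficients(p) for p in range(20, 81)}
h = coefficient_table[period]

# Gain at the pitch frequency, at its second harmonic, and halfway between them.
print(round(gain_at(h, 200, fs), 3),
      round(gain_at(h, 400, fs), 3),
      round(gain_at(h, 300, fs), 3))
```

A practical separation filter would likely need narrower pass bands than this two-tap comb provides; the sketch only shows the pass-at-harmonics property and the table lookup.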

It is also preferable to further provide high-band processing means for processing the consonant band of the output signal of the sound source signal emphasizing means, and filter bank means that extracts the consonant band of that output signal and sends it to the high-band processing means, extracts the bands other than the consonant band and sends them to the filter means, and extracts the vowel band and sends it to the pitch detecting means.

The plurality of sound collecting means may be, for example, left and right stereo microphones. The sound source signal emphasizing means preferably emphasizes only the acoustic signal from the desired sound source by correcting, in the acoustic signals from the plurality of sound collecting means, the difference in sound propagation delay from the desired sound source to each sound collecting means and then adding the signals. Furthermore, the pitch detecting means preferably performs pitch detection using two wavelengths of the pitch of the desired sound source signal as the detection unit.

As another example, the sound source signal separating means comprises: basic waveform creating means for creating a basic waveform, based on detection information from the pitch detecting means, from a stationary part of the output signal of the sound source signal emphasizing means in which the same or substantially the same pitch continues; and basic waveform replacing means for replacing at least part of a signal based on the input acoustic signal with a repetition of the basic waveform created by the basic waveform creating means and outputting the result.

Here, the pitch detecting means preferably performs pitch detection using two wavelengths of the pitch of the desired sound source signal as the detection unit. The plurality of sound collecting means may be, for example, left and right stereo microphones. The sound source signal emphasizing means preferably emphasizes only the acoustic signal from the desired sound source by correcting, in the acoustic signals from the plurality of sound collecting means, the difference in sound propagation delay from the desired sound source to each sound collecting means and then adding the signals. Furthermore, the basic waveform creating means preferably creates the basic waveform by adding and averaging, in units of two wavelengths of the pitch, the stationary part in which the pitch of the desired sound source signal continues.

Next, to achieve the above object, a speech signal separation method according to the present invention comprises the steps of: emphasizing a desired sound source signal in an input acoustic signal in which acoustic signals from a plurality of sound sources are mixed and picked up by a plurality of sound collecting means; detecting the pitch of the desired sound source signal in the input acoustic signal; and separating the desired sound source signal from the input acoustic signal based on the detected pitch and the sound source signal emphasized in the emphasizing step.

Next, to achieve the above object, a pitch detection device according to the present invention comprises: sound source signal emphasizing means for emphasizing a desired sound source signal in an input acoustic signal in which acoustic signals from a plurality of sound sources are mixed and picked up by a plurality of sound collecting means; period detecting means for detecting a two-wavelength period, using two wavelengths of the pitch in the output signal of the sound source signal emphasizing means as the detection unit; and continuity determining means for determining, based on changes in the two-wavelength period detected by the period detecting means, whether the same or substantially the same pitch continues, and outputting pitch information according to the result of the determination.

Here, the plurality of sound collecting means may be, for example, left and right stereo microphones. The sound source signal emphasizing means preferably emphasizes only the acoustic signal from the desired sound source by correcting, in the acoustic signals from the plurality of sound collecting means, the difference in sound propagation delay from the desired sound source to each sound collecting means and then adding the signals.

To achieve the above object, a pitch detection method according to the present invention comprises: a sound source signal emphasizing step of emphasizing a desired sound source signal in an input acoustic signal in which acoustic signals from a plurality of sound sources are mixed and picked up by a plurality of sound collecting means; a period detecting step of detecting a two-wavelength period, using two wavelengths of the pitch in the output signal obtained in the sound source signal emphasizing step as the detection unit; and a continuity determining step of determining, based on changes in the two-wavelength period detected in the period detecting step, whether the same or substantially the same pitch continues, and outputting pitch information according to the result of the determination.

Next, to achieve the above object, a sound source signal separation device according to the present invention comprises: pitch detecting means for performing pitch detection on an input acoustic signal in which acoustic signals from a plurality of sound sources are mixed, using as the detection unit a number of wavelengths of the pitch of the desired sound source signal that is a multiple of two; and sound source signal separating means for separating the desired sound source signal based on the detected pitch.

Furthermore, to achieve the above object, a sound source signal separation method according to the present invention comprises the steps of: performing pitch detection on an input acoustic signal in which acoustic signals from a plurality of sound sources are mixed, using as the detection unit a number of wavelengths of the pitch of the desired sound source signal that is a multiple of two; and separating the desired sound source signal based on the detected pitch.

According to the present invention, for the filter that separates a desired sound source signal from an input acoustic signal in which acoustic signals from a plurality of sound sources are mixed and picked up by a plurality of sound collecting means, the pitch of the input acoustic signal is detected and the filter coefficients are updated according to the detected pitch, so the sound from the target sound source can be separated well.

Also according to the present invention, a basic waveform is created from the stationary part of the pitch of the input signal and substituted into the signal, so a good sound source signal that closely approximates the sound from the target sound source can be separated.

Furthermore, performing pitch detection with two wavelengths of the pitch as the detection unit gives reliable and stable pitch detection.

Specific embodiments to which the present invention is applied are described in detail below with reference to the drawings.

FIG. 1 shows the schematic configuration of a specific example of the sound source signal separation device used in an embodiment of the present invention.

In FIG. 1, an acoustic signal picked up by microphones, specifically a stereo speech signal picked up by stereo microphones, for example, is supplied to an input terminal 11 and sent to a pitch detection unit 12 and to a delay correction adding unit 13, which serves as sound source signal emphasizing means for emphasizing the desired sound source signal. The output of the pitch detection unit 12 is sent to a separation coefficient creation unit 14 in a sound source signal separation unit 19. The output of the delay correction adding unit 13 is sent to a filter arithmetic circuit 15 in the sound source signal separation unit 19, if necessary via a filter (low-pass filter) 20A that outputs the mid-range and lower frequency bands. The filter arithmetic circuit 15 is a filter that separates the desired target speech. Each time the pitch detected by the pitch detection unit 12 is updated, the separation coefficient creation unit 14, which serves as separation coefficient output means, creates filter coefficients corresponding to the detected pitch and sends them to the filter arithmetic circuit 15. The output of the delay correction adding unit 13 is also sent, if necessary via a filter (high-pass filter) 20B that passes the high frequency band, to a high-band processing unit 17, which processes non-stationary waveforms such as consonants. The output of the filter arithmetic circuit 15 and the output of the high-band processing unit 17 are added by an adder 16, and the sum is taken out from an output terminal 18 as the separated waveform output signal.

In this specific example of the sound source signal separation device, the pitch detection unit 12 detects the pitch (the height of the sound) of a stationary part of the speech signal, that is, a part such as a vowel in which the same or substantially the same pitch continues. The pitch detection unit 12 outputs the detected pitch and, if necessary, information indicating the stationary part (for example, coordinate information on the time axis indicating the continuous section). The delay correction adding unit 13 is used as one example of sound source signal emphasizing means for emphasizing a desired sound source signal. It gives the signal from each microphone a time delay according to the difference in propagation delay that results from the distances from the sound source to the plural microphones (two in the stereo case) and adds the delayed signals, thereby strengthening the signal from the desired sound source and weakening the other signals; details are given later. The separation coefficient creation unit 14 creates filter coefficients for separating the signal from the desired sound source according to the pitch of the stationary part detected by the pitch detection unit 12; details are given later. The filter arithmetic circuit 15 uses the filter coefficients from the separation coefficient creation unit 14 to filter the output of the delay correction adding unit 13 (if necessary, the output via the filter (low-pass filter) 20A) and separates the signal from the desired sound source. The high-band processing unit 17 receives the output of the delay correction adding unit 13, if necessary via the filter (high-pass filter) 20B that passes high frequencies, applies predetermined processing to non-stationary waveforms such as consonants, and outputs the result to the adder 16. The adder 16 adds the output of the filter arithmetic circuit 15 and the output of the high-band processing unit 17 and sends the sum to the output terminal 18 as the separated waveform output signal of the target speech.

Next, FIG. 2 shows the schematic configuration of a specific example of the pitch detection unit 12. The input terminal 21 in FIG. 2 corresponds to the input terminal 11 in FIG. 1 and receives, for example, a stereo acoustic signal picked up by stereo microphones. The input signal is sent, via a low-pass filter (LPF) 22 that passes, for example, the vowel band in which the pitch appears steadily, to a delay correction adding unit 23, where directivity control processing that emphasizes the signal from the desired sound source is applied, as described later. The output of the delay correction adding unit 23 is sent through a local maximum detection unit 24 and a detection unit 25, which finds the largest of the local maxima between zero crossings, to an inter-maximum pitch detection unit 26. The output of the inter-maximum pitch detection unit 26 is sent to a continuity determination unit 27, from which a representative pitch output is taken out at a terminal 28 and a coordinate (time) output indicating the section of the stationary part is taken out at a terminal 29.
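The chain of blocks 24 to 27 can be pictured with the following Python sketch. It is only a guess at the data flow under simplifying assumptions: it measures plain peak-to-peak intervals rather than the two-wavelength detection unit mentioned elsewhere in this document, and the function names, the tolerance, the minimum run length, and the test waveform are all hypothetical.

```python
def local_maxima(x):
    """Indices of samples larger than the previous sample and not smaller
    than the next one (rough stand-in for unit 24)."""
    return [n for n in range(1, len(x) - 1) if x[n - 1] < x[n] >= x[n + 1]]

def largest_peak_per_lobe(x, maxima):
    """For each run of positive samples between zero crossings, keep the
    index of the largest local maximum (rough stand-in for unit 25)."""
    maxima = set(maxima)
    peaks, best = [], None
    for n, value in enumerate(x):
        if value > 0:
            if n in maxima and (best is None or value > x[best]):
                best = n
        elif best is not None:
            peaks.append(best)
            best = None
    if best is not None:
        peaks.append(best)
    return peaks

def longest_steady_run(intervals, tolerance=1, min_run=3):
    """Find the longest run of consecutive intervals that each differ from
    the previous one by at most `tolerance` (rough stand-in for unit 27).

    Returns (mean interval, (first, last)) where the run spans
    peaks[first] .. peaks[last], or None if no run has min_run intervals.
    """
    best, start = (0, 0), 0
    for i in range(1, len(intervals) + 1):
        if i == len(intervals) or abs(intervals[i] - intervals[i - 1]) > tolerance:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = i
    if best[1] - best[0] < min_run:
        return None
    run = intervals[best[0]:best[1]]
    return sum(run) / len(run), best

fs = 1000                                          # sampling rate in Hz (assumed)
one_period = [0, 3, 5, 2, 4, 1, -2, -5, -3, -1]    # made-up waveform, two bumps per lobe
signal = one_period * 6

peaks = largest_peak_per_lobe(signal, local_maxima(signal))
# Spacing between successive retained peaks (rough stand-in for unit 26).
intervals = [b - a for a, b in zip(peaks, peaks[1:])]
result = longest_steady_run(intervals)
if result is not None:
    period, (first, last) = result
    print("pitch:", fs / period, "Hz; stationary from sample",
          peaks[first], "to sample", peaks[last])
```

Keeping only the largest local maximum in each positive lobe is what prevents the smaller second bump of the test waveform from being counted as a separate pitch peak.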

The basic configuration of the delay correction adding unit 13 of FIG. 1 and the delay correction adding unit 23 of FIG. 2 is now described with reference to FIG. 3. In FIG. 3, the signals from the left and right stereo microphones MCL and MCR are sent to delay circuits 32L and 32R, which use buffer memories or the like to delay the left and right stereo signals, respectively. For the delay correction adding unit 23 of FIG. 2, the left and right stereo signals may first be passed through the low-pass filter (LPF) 22, which passes the band of vowels and the like in the speech signal, before being sent to the delay circuits 32L and 32R, in order to improve the quality of pitch detection. The delayed signals from the delay circuits 32L and 32R are added by an adder 34, and the sum is taken out from an output terminal 35 as the delay-corrected sum signal. If necessary, the delayed signals from the delay circuits 32L and 32R may also be subtracted by a subtractor 36, and the difference taken out from an output terminal 37 as the delay-corrected difference signal.

The delay correction adding unit with the basic configuration shown in FIG. 3 performs directivity control processing that strengthens only the speech signal from the target sound source to be separated and attenuates the other signal components. In the example of FIG. 3, a sound source SL is located to the left of the stereo microphones MCL and MCR, a sound source SC at the centre, and a sound source SR to the right. Suppose the right sound source SR is the target. Because sound takes time to propagate through air, sound emitted by SR is picked up by the microphone MCL, which is farther from the source, later by a time (physical delay) τ than by the microphone MCR, which is nearer. If the delay of the delay circuit 32L is then set longer than that of the delay circuit 32R by the time τ, the delay-corrected output signals of the delay circuits 32L and 32R behave as shown in FIG. 4: for the target speech from the target sound source SR the correlation coefficient between the left and right signals becomes high (the phases match more closely), and for the other speech it becomes low (the phases match less closely). When the central sound source SC is the target, sound emitted by SC is picked up by the stereo microphones MCL and MCR simultaneously (with no delay difference), so setting the delays of the delay circuits 32L and 32R equal raises the correlation of the target speech from SC and lowers the correlation of the other speech. In this way, adjusting the delays of the delay circuits 32L and 32R raises the correlation only for the speech from the target sound source.
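The effect of delay correction followed by addition or subtraction can be tried out with the Python sketch below. It is written in terms of the channel on which the target arrives earlier and the channel on which it arrives later, and it delays the earlier channel by τ samples so that the two copies of the target line up. The function names, the integer-sample delay, and the made-up test signals are assumptions for illustration.

```python
def delay(signal, n):
    """Delay a signal by n samples (0 <= n <= len(signal)), zero-padding
    the start and keeping the original length."""
    return [0.0] * n + list(signal[:len(signal) - n])

def delay_correct(earlier, later, tau):
    """Align the two channels on the target and return (sum, difference).

    `earlier` is the channel on which the target arrives first, `later`
    the channel on which it arrives tau samples afterwards. The sum plays
    the role of the adder output and the difference that of the
    subtractor output.
    """
    aligned = delay(earlier, tau)
    added = [a + b for a, b in zip(aligned, later)]
    subtracted = [a - b for a, b in zip(aligned, later)]
    return added, subtracted

tau = 3                                        # inter-microphone delay in samples (assumed)
target = [0, 1, 2, 1, 0, -1, -2, -1] * 4       # target source, arrives tau samples apart
other = [1, -1] * 16                           # another source, arrives at both mics at once

earlier = [t + o for t, o in zip(target, other)]
later = [t + o for t, o in zip(delay(target, tau), other)]

added, subtracted = delay_correct(earlier, later, tau)
print(added[tau:tau + 8])        # target doubled, other source cancelled
print(subtracted[tau:tau + 8])   # target cancelled, other source remains
```

The other source cancels completely in the sum only because this toy signal happens to be in antiphase after a three-sample shift; with real speech the non-target components are merely attenuated.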

Adding the delayed output signals of the delay circuits 32L and 32R in the adder 34 therefore strengthens only the highly correlated speech. In repetitive waveform portions such as vowels in particular, summing waveforms whose phases are aligned emphasizes the aligned portions, while portions whose phases are not aligned are attenuated. A signal in which only the target speech is strengthened or emphasized is taken out from the output terminal 35. When the delayed output signals of the delay circuits 32L and 32R are instead subtracted in the subtractor 36, the phase-aligned portions cancel, so only the speech from the target sound source is attenuated, and a signal in which only the target speech is attenuated is taken out from the output terminal 37.

To explain the correlation coefficient: for speech input to the two microphones, waveforms whose delay has been corrected as described above match closely, whereas waveforms whose phases are offset, as with the other speech, match poorly. The correlation coefficient cor expressing this degree of match can be obtained from equation (1). In equation (1), m1 and m2 denote the time samples of the stereo microphones MCL and MCR, respectively, and cor is computed over n pairs of sample values (m11, m21), (m12, m22), ..., (m1n, m2n). S1 and S2 are the standard deviations.
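Equation (1) itself is not reproduced in this text. Assuming it is the usual sample correlation coefficient, that is, the mean product of the deviations of m1 and m2 from their means divided by the product of the standard deviations S1 and S2, it can be computed as in the following Python sketch. The function name and the sine test signals are illustrative only.

```python
import math

def correlation(m1, m2):
    """Correlation coefficient of n sample pairs (m1[i], m2[i]).

    Assumed form: covariance of m1 and m2 divided by S1 * S2, with S1 and
    S2 the (population) standard deviations. Both inputs must have the
    same length and non-zero variance.
    """
    n = len(m1)
    mean1 = sum(m1) / n
    mean2 = sum(m2) / n
    s1 = math.sqrt(sum((a - mean1) ** 2 for a in m1) / n)
    s2 = math.sqrt(sum((b - mean2) ** 2 for b in m2) / n)
    covariance = sum((a - mean1) * (b - mean2) for a, b in zip(m1, m2)) / n
    return covariance / (s1 * s2)

# Two full periods of a sine with a 20-sample period.
in_phase = [math.sin(2 * math.pi * k / 20) for k in range(40)]
# The same sine shifted by a quarter period (phases no longer aligned).
shifted = [math.sin(2 * math.pi * (k + 5) / 20) for k in range(40)]

print(correlation(in_phase, in_phase))   # aligned waveforms: close to 1
print(correlation(in_phase, shifted))    # quarter-period offset: close to 0
```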

Next, the pitch detection operation of the pitch detection unit 12 is described. A specific configuration example of the pitch detection unit 12 is as shown in FIG. 2. First, the signal from the microphones is a mixture of the target sound and other sounds, as shown for example in FIG. 5. In FIG. 5, the solid line shows the signal waveform actually obtained, and the broken line shows the signal waveform of the target sound. Even when the target sound is emphasized by the directivity control processing based on delay-corrected addition described above, the other sounds remain, so the result is a mixed signal waveform. The target sound waveform shown by the broken line in FIG. 5 is regular, with little fluctuation in the amplitude (level) direction, whereas the mixed signal waveform shown by the solid line also fluctuates in the level direction. Nevertheless, comparing the mixed signal waveform with the target sound waveform confirms that, although there is no correlation in the level direction, the peak intervals are preserved in the time direction.

The spectrum of a signal waveform such as that of FIG. 5 is, for example, as shown in FIG. 6, and it has a harmonic structure built on multiples of a certain fundamental frequency Fx. This fundamental frequency Fx generally corresponds to the pitch, which represents the height of a sound, and is also called the pitch frequency. It is the reciprocal of the period (pitch period) when the interval between adjacent peaks in the signal waveform of FIG. 5 is taken as one period Tx (one wavelength λx); that is, Fx = 1/Tx. In the example of FIG. 6, a peak also appears at the frequency 2Fx, twice the pitch frequency Fx, and in general peaks appear at integer multiples of Fx.

In addition to the pitch period Tx (pitch wavelength λx) corresponding to the interval between adjacent peaks, the actual waveform signal also contains components with longer wavelengths. In particular, the component with twice the pitch period, Ty (= 2Tx), that is, the component at frequency Fy (= Fx/2), half the pitch frequency Fx, appears relatively strongly in the spectrum of FIG. 6. This relatively large half-pitch-frequency component Fy (= Fx/2) is generally found in ordinary speech signals. For example, the component at half the pitch frequency can likewise be clearly seen in the speech signal of FIGS. 7 and 8, whose pitch frequency Fx is about 650 Hz, and in the speech signal of FIGS. 9 and 10, whose pitch frequency Fx is about 580 Hz. FIGS. 7 and 9 show the speech signal waveforms on the time axis, and FIGS. 8 and 10 show the spectra on the frequency axis.

FIG. 11 is an explanatory diagram showing an example of combining the pitch frequency Fx component described above with the component at half that frequency, Fy. FIG. 11(a) shows the basic waveform (for example, a sine wave) at the pitch frequency Fx, and FIG. 11(b) shows the basic waveform at twice the pitch wavelength, that is, at half the frequency, Fy (= Fx/2). When these components are combined as in FIG. 11(c), the same fluctuation occurs on every other wavelength, so that, as shown for example in FIG. 11(d), the waveform shapes tend to resemble each other on every other wavelength. Consequently, if the period is measured between adjacent peaks, the deviation alternates from one wavelength to the next, and stable pitch detection is not possible.
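
The alternating peak spacing described above can be reproduced numerically. The sketch below is an illustration under assumed values (48 kHz sampling, Fx = 600 Hz, a half-frequency component at half the amplitude with zero phase); none of these numbers come from the specification.

```python
import numpy as np

fs = 48000                      # sampling frequency in Hz (assumed)
fx = 600.0                      # pitch frequency Fx in Hz (assumed)
n = np.arange(1600)             # ten periods of Fy = Fx/2

# Component at Fx plus a weaker component at Fy = Fx/2.
x = np.sin(2 * np.pi * fx * n / fs) + 0.5 * np.sin(2 * np.pi * (fx / 2) * n / fs)

# Positive local maxima of the combined waveform.
peaks = [k for k in range(1, len(x) - 1)
         if x[k] > x[k - 1] and x[k] >= x[k + 1] and x[k] > 0]

adjacent = np.diff(peaks)                                  # one-wavelength spacing
two_wave = np.array(peaks[2:]) - np.array(peaks[:-2])      # two-wavelength spacing
```

With these values the adjacent-peak spacing alternates between a shorter and a longer interval around the nominal 80 samples, while the spacing between every other peak stays at 160 samples, which is the motivation for the two-wavelength detection unit introduced next.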

Therefore, in the embodiment of the present invention, pitch detection is performed in units of the period Ty (= 2Tx), twice the peak-to-peak period Tx (pitch wavelength λx). Detecting peaks every two wavelengths in this way means that each detected peak falls where the waveform shape is similar, so the error tends to be smaller. As for the timing at which detection starts, statistically the same result is obtained even if the phase is shifted by one wavelength. Besides two wavelengths, the peak detection interval can in principle be any even multiple of the wavelength, such as 4, 6, or 8 wavelengths. However, detecting peaks every four wavelengths, for example, reduces the error further but has the drawback of requiring more samples.

Next, a specific example of the pitch detection operation is described with reference to FIG. 12. In FIG. 12, a stereo audio signal is input in the first step S41, low-pass filtered in step S42, and subjected in step S43 to directivity processing by the delay-corrected addition described above. These steps correspond, respectively, to the input from the input terminal 21 (11), the processing in the LPF (low-pass filter) 22, and the processing in the delay correction addition unit 23 of FIG. 2.

In the next step S44, the local maximum detection unit 24 of FIG. 2 performs local maximum calculation. This finds local peaks such as those marked with x in the waveform of FIG. 13. There are positive-side peaks (local maxima) and negative-side peaks (local minima); this embodiment uses the positive-side local peaks (local maxima), which can be found by detecting the points at which the sample values of the signal waveform along the time axis change from increasing to decreasing. Specifically, when the coordinate (position) of each sample point on the time axis is expressed by its sample number, let d(n) be the sample value at position n (that is, sample number n) and th be the threshold for the difference between adjacent sample values. Then a point n satisfying
d(n) - d(n-1) > th, and d(n+1) - d(n) < -th ... (2)
is taken as a local maximum point, and the sample value at that point as the local maximum value.
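
Condition (2) translates directly into code. The following is a minimal sketch; the function name `local_maxima`, the sample values, and the threshold 0.05 are hypothetical.

```python
import numpy as np

def local_maxima(d, th=0.0):
    """Sample numbers n satisfying condition (2):
    d(n) - d(n-1) > th  and  d(n+1) - d(n) < -th."""
    d = np.asarray(d, dtype=float)
    return [n for n in range(1, len(d) - 1)
            if d[n] - d[n - 1] > th and d[n + 1] - d[n] < -th]

d = [0.0, 0.4, 1.0, 0.3, -0.5, -0.1, -0.6, 0.2, 0.9, 0.1]
maxima = local_maxima(d, th=0.05)   # [2, 5, 8]
```

Condition (2) on its own also accepts a local peak with a negative value, such as sample 5 here. Such peaks are excluded by the next step, which only considers the range between a rising and a falling zero crossing.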

In the next step S45, the maximum-between-zero-crossings detection unit 25 of FIG. 2 detects, among the local maxima obtained in step S44, the one that is largest between the zero crossings bounding a range of positive values. That is, among the local maxima lying between a zero crossing at which the sample value changes from negative to positive and the next zero crossing at which it changes from positive to negative, the one with the largest value is detected. The time-axis coordinate (sample point position, that is, sample number) of this largest local maximum between zero crossings is recorded.
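
Step S45 can be sketched as a single pass over the samples. The function name `max_between_zero_crossings` and the data are hypothetical, and the list of local maxima is written out by hand so that the example stands alone.

```python
def max_between_zero_crossings(d, maxima):
    """For each positive run (rising zero crossing to the next falling one),
    return the sample number of the largest local maximum inside the run."""
    result = []
    start = None
    for n in range(1, len(d)):
        if d[n - 1] <= 0 < d[n]:                         # negative-to-positive crossing
            start = n
        elif d[n - 1] > 0 >= d[n] and start is not None:  # positive-to-negative crossing
            inside = [m for m in maxima if start <= m < n]
            if inside:
                result.append(max(inside, key=lambda m: d[m]))
            start = None
    return result

d = [-0.2, 0.5, 0.3, 0.8, 0.4, -0.3, -0.6, 0.2, 0.6, 0.1, -0.4]
maxima = [1, 3, 8]          # local maxima of d, as found in step S44
positions = max_between_zero_crossings(d, maxima)   # [3, 8]
```

The first positive run contains two local maxima (samples 1 and 3); only the larger one, at sample 3, is recorded.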

In the next step S46, the pitch-between-maxima detection unit 26 of FIG. 2 detects the pitch from the interval between the first and third of the maxima obtained in step S45, that is, from the interval between every other maximum (two wavelengths). In other words, pitch detection uses two wavelengths as the detection unit. Pitch detection here corresponds to detecting the two-wavelength period Ty (= 2Tx), and the detected period Ty (or frequency Fy = 1/Ty) is used in place of the original pitch period Tx (or pitch frequency Fx). When the time-axis coordinate of each sample point is expressed by its sample number, the period Ty obtained by this pitch detection can be expressed as a number of samples (a difference of sample numbers). Letting max1 be the time-axis coordinate (sample number) of the first maximum and max3 that of the third maximum,
Ty = max3 - max1 ... (3)
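
Equation (3) can be applied to a list of recorded peak positions as follows. This sketch assumes that successive detection units do not overlap (each unit starts at the peak where the previous one ended), which the text does not state explicitly; the function name and the peak positions are hypothetical.

```python
def two_wavelength_periods(peak_positions):
    """Ty = max3 - max1 for successive two-wavelength detection units (in samples)."""
    return [peak_positions[k + 2] - peak_positions[k]
            for k in range(0, len(peak_positions) - 2, 2)]

peaks = [22, 98, 182, 258, 342, 418, 502]   # hypothetical sample numbers
ty = two_wavelength_periods(peaks)          # [160, 160, 160]

fs = 48000                                  # sampling frequency in Hz (assumed)
fy = [fs / t for t in ty]                   # Fy = 1/Ty, in Hz
fx = [2 * f for f in fy]                    # Fx = 2 * Fy, since Ty = 2 * Tx
```

Although adjacent peaks here are alternately 76 and 84 samples apart, every two-wavelength period is 160 samples, giving Fy = 300 Hz and Fx = 600 Hz.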

Step S47 and the subsequent steps correspond to the processing in the continuity determination unit 27 of FIG. 2. First, in step S47, the pitches of consecutive pitch detection unit sections are compared. The pitch used here may be the pitch period Tx obtained as Ty/2, or the two-wavelength period Ty detected during pitch detection may be used as it is. The ratio r of the pitch (or period Ty) between adjacent pitch detection units is then obtained. For example, when the two-wavelength period Ty is used and Ty(n) denotes the two-wavelength period of the current pitch detection unit n, the pitch ratio r (in this embodiment, the ratio of the periods Ty) is
r(n) = Ty(n) / Ty(n-1) ... (4)

FIG. 14 shows an example of specific numerical values of the pitch detection result for the signal waveform of FIG. 5. In FIG. 14, the two-wavelength periods are detected in sequence starting from the first pitch detection unit and are denoted Ty(1), Ty(2), Ty(3), and so on. For each pitch detection unit, the figure lists the detected two-wavelength period Ty expressed as a number of samples, the ratio r, and the continuity determination flag described below.

In the next step S48, to detect a section in which the pitch ratio (ratio of the periods Ty) r obtained in step S47 is nearly stable (the stationary part), it is determined whether the absolute value |Δr| (= |1 - r|) of the change Δr (= 1 - r) of the ratio r is smaller than a predetermined threshold th_r. If it is smaller than th_r (YES), the process proceeds to step S49, where the continuity determination flag is set (to 1) or a counter that measures the section of continuous pitch (stationary part) is incremented. If |Δr| is equal to or greater than th_r (NO), the process proceeds to step S50, where the continuity determination flag is reset (to 0). The threshold th_r may be, for example, 0.05. In the example of FIG. 14, in the unit section where Ty(2) is detected, r is 1.00 and |Δr| is 0, so the flag is 1; in the unit section where Ty(3) is detected, r is 0.97 and |Δr| is 0.03, so the flag is again 1; and so on, until in the unit section where Ty(n) is detected, r is 0.7 and |Δr| is 0.3, so the flag is 0.
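
Steps S47 to S50 amount to computing the ratio of equation (4) and thresholding its deviation from 1. A minimal sketch follows; the function name `continuity_flags` and the period values are hypothetical and are not the data of FIG. 14, although they were chosen to give similar ratios.

```python
def continuity_flags(ty, th_r=0.05):
    """Flag for each unit after the first: 1 if |1 - r| < th_r, else 0,
    where r = Ty(n) / Ty(n-1) as in equation (4)."""
    flags = []
    for n in range(1, len(ty)):
        r = ty[n] / ty[n - 1]
        flags.append(1 if abs(1 - r) < th_r else 0)
    return flags

ty = [160, 160, 155, 156, 158, 157, 110]   # two-wavelength periods in samples
flags = continuity_flags(ty)               # [1, 1, 1, 1, 1, 0]
```

The second ratio is 155/160, about 0.97, which stays within the threshold, whereas the last ratio is 110/157, about 0.70, which resets the flag.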

In the next step S51, it is determined whether the detected pitch (or period Ty) is continuous. For example, when the continuity determination flag set in step S49 has been counted five or more times in succession, the pitch is judged to be continuous, and the detected pitch (or period Ty) is judged to be valid. In the example of FIG. 14, where the flag remains 1 continuously from the period Ty(2) through Ty(6), the pitch is valid, and a representative pitch, for example the average of Ty(2) to Ty(6), is output.

That is, when continuity is found in step S51 (YES), the process proceeds to step S52, which outputs the time-axis coordinates (times), expressed as sample numbers, of the section in which substantially the same pitch continues (the stationary part). The next step S53 outputs the representative pitch (for example, the average of the periods Ty over the continuous section), and the process ends. When no continuity is found in step S51 (NO), the process ends without further action. By repeating the processing of FIG. 12, pitch detection is carried out continuously on the input signal waveform.
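
The continuity decision of steps S51 to S53 can be sketched as a search for a run of at least five consecutive flags. The function below is an illustration only: it returns detection-unit indices rather than the sample numbers that step S52 outputs, and its name, return format, and data are hypothetical.

```python
def representative_pitch(ty, flags, min_run=5):
    """flags[k] belongs to unit ty[k + 1]. Return (first_unit, last_unit, mean Ty)
    for the first run of at least min_run consecutive 1-flags, else None."""
    run_start = None
    for k, f in enumerate(list(flags) + [0]):      # trailing 0 closes a final run
        if f == 1 and run_start is None:
            run_start = k
        elif f != 1 and run_start is not None:
            if k - run_start >= min_run:
                units = ty[run_start + 1:k + 1]
                return run_start + 1, k, sum(units) / len(units)
            run_start = None
    return None

ty = [160, 160, 155, 156, 158, 157, 110]   # Ty(1) .. Ty(7), zero-based in code
flags = [1, 1, 1, 1, 1, 0]                 # flags for Ty(2) .. Ty(7)
result = representative_pitch(ty, flags)   # (1, 5, 157.2)
```

With zero-based indexing, units 1 to 5 correspond to Ty(2) to Ty(6) in the notation of the text, and the representative period is their average, 157.2 samples.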

To summarize the pitch detection operation of the above embodiment: the embodiment targets two or more sound sources picked up by a stereo microphone and, in order to separate the voice of a target person, detects the pitch of stationary parts of the mixed waveform, such as vowels. This works regardless of whether the voice is high or low, male or female. For a pure waveform, nothing is mixed in and the level direction is preserved, so the period can be found by autocorrelation or similar methods. For a mixed waveform, however, the level direction is not preserved, and such methods are difficult to use. On the other hand, the pitch in the time direction is confirmed to be preserved. Therefore, based on the characteristics of speech waveforms, the embodiment of the present invention does not obtain the pitch between adjacent peaks but performs pitch detection over two wavelengths. This yields reliable and accurate pitch detection and makes the subsequent speech separation processing easier.

Next, a specific example of the operation of the sound source signal separation device of FIG. 1 is described.

As the pitch detection unit 12 of FIG. 1, a unit that detects pitch from a two-wavelength period as in the embodiment described above can be used. The unit is not limited to this, however, and one that detects a one-wavelength period, or the period of an even number of wavelengths of four or more, may also be used.

The pitch detection unit 12 obtains the pitch for each pitch detection unit and the coordinates (sample numbers) of the continuous section, or stationary part, over which that pitch continues. The speech signal separation device of FIG. 1, which uses a stereo microphone, separates the signal waveforms of two or more sound sources on the basis of this information.

The pitch obtained by the pitch detection unit 12 is sent to the separation coefficient creation unit 14, which creates the filter coefficients (separation coefficients) of the separation filter (filter arithmetic circuit 15) used to separate the desired target sound. Taking the representative pitch obtained by the pitch detection unit 12 as the fundamental frequency, the separation coefficient creation unit 14 creates the filter coefficients (separation coefficients) of the separation filter using the band-pass filter coefficient formula of equation (5). In equation (5), h[i] is the filter coefficient at tap position i, FIRLEN is the number of filter taps, HLFLEN is (FIRLEN-1)/2, Pi is the circle constant π, m is the number of harmonics, and FS is the sampling frequency, for example 48000 for 48 kHz. Lo[n] and Hi[n] define the bandwidth at the frequency of each harmonic order: Lo[n] is the lower frequency and Hi[n] is the higher frequency. The bandwidth is arbitrary and is chosen to suit the required separation performance. The number of harmonics m may simply be a fixed number; alternatively, with max_freq as the maximum frequency and f[1] as the fundamental frequency, it may be the integer value m = max_freq/f[1]. When m = 0, f[0] = f[1]/2 is applied. The fundamental frequency may also be taken as f[0].
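
Equation (5) is not reproduced in this text, so the following is only a sketch of one common way to build such coefficients, not the formula of the specification. It assumes a windowed-sinc design in which each harmonic band is the difference of two ideal low-pass responses, that Lo[n] and Hi[n] are placed symmetrically around the n-th harmonic (n * f1 - half_bw and n * f1 + half_bw), and that a Hamming window is applied. The names `comb_bandpass`, `f1`, and `half_bw`, the tap count 4095, and all numeric values are assumptions.

```python
import numpy as np

def comb_bandpass(f1, m, half_bw, firlen=4095, fs=48000):
    """FIR coefficients h[i] passing bands around the harmonics 1..m of f1 (Hz)."""
    hlflen = (firlen - 1) // 2
    k = np.arange(firlen) - hlflen            # tap index relative to the centre tap
    h = np.zeros(firlen)
    for n in range(1, m + 1):
        lo = n * f1 - half_bw                 # assumed Lo[n]
        hi = n * f1 + half_bw                 # assumed Hi[n]
        # Band-pass = low-pass(hi) - low-pass(lo); np.sinc(x) = sin(pi x)/(pi x).
        h += (2 * hi / fs) * np.sinc(2 * hi * k / fs) \
           - (2 * lo / fs) * np.sinc(2 * lo * k / fs)
    return h * np.hamming(firlen)             # window choice is an assumption

f1 = 300.0                       # representative pitch frequency in Hz (assumed)
max_freq = 1200.0                # highest frequency to cover (assumed)
m = int(max_freq // f1)          # integer number of harmonics, here 4
h = comb_bandpass(f1, m, half_bw=30.0)

# Magnitude response on a 1 Hz grid: bin index equals frequency in Hz.
H = np.abs(np.fft.rfft(h, 48000))
```

In this sketch the response is close to 1 at the harmonics (300 Hz, 600 Hz, and so on) and small midway between them (for example at 450 Hz), which is the comb-shaped characteristic the text goes on to describe. The filtered output would then be obtained by an ordinary FIR product-sum, for example `np.convolve(signal, h, mode="same")`.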

FIG. 15 shows a specific example of the frequency characteristic of the separation filter (filter arithmetic circuit 15) using the filter coefficients created by the separation coefficient creation unit 14. The filter with the frequency characteristic of FIG. 15 is a so-called comb-shaped band-pass filter. The more taps this band-pass filter has, the steeper its peaks and valleys, and the narrower the bandwidth, the larger the valley regions, so the probability of successful separation increases. The band-pass filter coefficients created by equation (5) are, in practice, distributed over the tap positions on the tap axis as shown in FIG. 16. To further increase the separation capability, an appropriate window function also needs to be chosen.

The filter arithmetic circuit 15 operates on the mid band and below. Using the filter coefficients created by the separation coefficient creation unit 14, it filters the signal with an FIR filter, a typical product-sum operation, and thereby separates the target sound containing the detected pitch and its harmonic components.

Non-stationary waveforms such as consonants are input to the high-band processing unit 17. The signal is divided into the high band and the mid band and below because, as explained next, the mechanisms of speech production differ: vowel portions are concentrated in the mid band and below, and consonant portions in the high band, so processing each band differently makes stationarity easier to judge.

In speech production, vowel portions are generated with the periodic motion of the vocal cords as the vibration source, so they are stationary signals. Some consonants, however, such as fricatives and plosives, do not involve vocal cord vibration, and consonant waveforms tend to be random. If a random waveform is mixed into a vowel portion, it acts as a noise component and adversely affects pitch detection. Moreover, when sampled with the same number of samples, high-frequency components are reproduced less faithfully than low-frequency ones, which distorts the waveform and can cause erroneous pitch detection.

Therefore, dividing the signal into the high band and the mid band and below, and judging stationarity in the mid band and below, improves the accuracy of the judgment.

In the stationary part of the target sound, that is, the vowel portion, the high-band processing unit 17 removes, for example, random high-frequency waveforms caused by consonants such as fricatives and plosives, which do not normally appear there.

In speech, a high-level consonant does not normally occur within a vowel portion. Consequently, even if the target sound can be separated from the vowel portion of a speech signal made up of multiple sound sources, adding a random high-frequency waveform to that vowel portion may make it sound different from the actual target sound. The high-band processing unit 17 therefore lowers the gain of the high-frequency waveform in the stationary part, which is the vowel portion, so that as little of it as possible is added in the adder 16. This yields an output closer to the target sound.
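
The gain reduction described above can be sketched as scaling the high-band signal inside the stationary sections reported by the pitch detector. The specification does not give a gain value or a section format; the function name `attenuate_high_band`, the (start, end) sample-number pairs with exclusive end, and the gain 0.1 are all assumptions.

```python
import numpy as np

def attenuate_high_band(high, stationary_sections, gain=0.1):
    """Scale the high-band signal by `gain` inside each stationary section."""
    out = np.array(high, dtype=float)          # copy, leave the input untouched
    for start, end in stationary_sections:     # sample numbers, end exclusive
        out[start:end] *= gain
    return out

high = np.ones(10)                             # placeholder high-band signal
out = attenuate_high_band(high, [(2, 5)], gain=0.1)
```

A practical version would probably ramp the gain at the section edges to avoid audible steps, but the text does not describe such a detail.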

The output of the filter arithmetic circuit 15 and the output of the high-band processing unit 17 are added in the adder 16 and taken from the output terminal 18 as the separated waveform output signal of the target sound.

The relationship between the stereo microphone and the sound sources (persons and the like) is now described. The spacing of the stereo microphones is not specified, but for a typical portable device it is between a few centimeters and a few tens of centimeters. For example, when sound is collected with a stereo microphone mounted on a portable device such as a camera-integrated VTR (a so-called video camera), and the positions of the persons acting as sound sources are divided into three sectors (center, left, and right) of a few tens of degrees each, the target sound source can be separated wherever in a sector the person is located. Regarding microphone spacing, considering the arrival time difference between the two microphones, a wider spacing allows division into more regions and thus more separation sectors, but makes the device less convenient to carry. Conversely, a narrower spacing reduces the number of sectors, for example to three, but makes the device more convenient to carry.

In the embodiment of the present invention described above, the low-pass filter (LPF) 22 of the pitch detection unit 12 and the filters 20A and 20B of FIG. 1 may be combined into a single filter bank. In this case, the delay correction addition unit 23 of FIG. 2 is shared with the delay correction addition unit 13 of FIG. 1, and the output of the delay correction addition unit 13 is sent to the filter bank, which separates it into a low band for pitch detection, a mid-and-below band for the separation filter, and a high band for high-band processing.

FIG. 17 is a block diagram showing a specific example of a sound source signal separation device using such a filter bank unit 73.

In FIG. 17, a stereo audio signal collected by a stereo microphone is input to the input terminal 71 and sent to the delay correction addition unit 72, which serves as sound source signal enhancing means for emphasizing the desired target sound source signal. The configuration described with reference to FIG. 3 can be used for the delay correction addition unit 72. The output of the delay correction addition unit 72 is sent to the filter bank unit 73. The filter bank unit 73 performs band division and comprises a high-pass filter that outputs the high band, a low-pass filter that outputs the mid band, and a low-pass filter that outputs the low band. Here, for example, the high band is a band that passes the consonant band, the mid band and below is the band other than the consonant band, and the low band is a frequency band lower than the mid band. Of the band signals divided by the filter bank unit 73, the low-band signal is sent to the pitch detector 75 via the stationarity determination unit 74, the mid-and-below signal is sent to the filter arithmetic circuit 77, and the high-band signal is sent to the high-band processing unit 79.

The pitch detection unit described with reference to FIG. 2 corresponds here to the low-pass filter that outputs the low band in the filter bank unit 73 of FIG. 17, the stationarity determination unit 74, and the pitch detector 75. The delay correction addition unit 23 of FIG. 2 is moved ahead of the low-pass filter (LPF) 22 and corresponds to the delay correction addition unit 72 of FIG. 17. As described above, the stationarity determination unit 74 of FIG. 17 identifies a portion in which successive pitches continue within an error of, for example, a few percent (the stationary part). When this stationary part continues for a predetermined time or longer (for example, when the continuity determination flag for the two-wavelength detection unit is set five or more times in succession), the pitch is judged to be valid, and the representative pitch at that time is output from the pitch detector 75.

The separation coefficient creation unit 76 in the sound source signal separation unit 191 creates the filter coefficients (separation coefficients) of the separation filter (filter arithmetic circuit 77) for separating the desired target sound, for example according to equation (5), in the same way as the separation coefficient creation unit 14 of FIG. 1. The created filter coefficients are sent to the filter arithmetic circuit 77 in the sound source signal separation unit 191. The filter arithmetic circuit 77 receives the mid-and-below components from the filter bank unit 73 and, like the filter arithmetic circuit 15 of FIG. 1, separates the audio signal from the desired target sound source. The high-band processing unit 79 processes non-stationary waveforms such as consonants and is the same as the high-band processing unit 17 of FIG. 1. The output of the filter arithmetic circuit 77 and the output of the high-band processing unit 79 are added in the adder 78 and taken from the output terminal 80 as the separated waveform output.

In this embodiment the pitch is detected in the stationary part. However, owing to the characteristics of actual speech from a single speaker, the speech extends along the time axis beyond the portion judged stationary in the mixed waveform. In the embodiment described above, separation filter coefficients are created each time a pitch is detected, but applying the filter only to the portion judged stationary is insufficient as processing. It is therefore preferable to reuse the coefficients in the neighborhood of the stationary portion as well, which improves the separation capability in the time direction.

For example, FIG. 18 shows, with time on the horizontal axis, two stationary parts detected in vowel portions. Let the first stationarity-determined part be RA and the second be RB; the filter coefficients obtained for each are different. The filter coefficients of stationary part RA are applied before and after RA on the time axis, and the coefficients of stationary part RB are applied before and after RB. The extent of the regions before and after can be decided in advance from statistical data. For example, the duration may be lengthened (or shortened) when a high frequency is detected as the pitch, and shortened (or lengthened) when a low frequency is detected.
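
The widening of RA and RB can be sketched as follows. This is a minimal illustration, not the patented procedure: the function name `extend_regions`, the single fixed `margin`, and the rule of splitting the gap between two regions at its midpoint are all assumptions, since the text only says the extent is chosen from statistical data.

```python
def extend_regions(regions, margin, total_len):
    """Widen each stationary region by `margin` samples on both sides.

    regions: sorted, non-overlapping (start, end) sample ranges, end exclusive.
    Where two widened regions would collide, the gap between them is split at
    its midpoint (an assumption; the source does not say how to resolve this).
    """
    widened = []
    for i, (start, end) in enumerate(regions):
        lo = max(0, start - margin)
        hi = min(total_len, end + margin)
        if i > 0:
            prev_end = regions[i - 1][1]
            lo = max(lo, (prev_end + start) // 2)
        if i + 1 < len(regions):
            next_start = regions[i + 1][0]
            hi = min(hi, (end + next_start) // 2)
        widened.append((lo, hi))
    return widened


# RA and RB each keep their own coefficients over a widened span.
ra_rb = [(100, 200), (260, 400)]
print(extend_regions(ra_rb, margin=50, total_len=500))
```

A pitch-dependent margin could be substituted by computing `margin` per region before the call; the direction of that dependence is left open in the text.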

FIG. 19 shows a specific example of an actual signal waveform on the time axis. FIG. 19(A) shows the waveform before filtering; in the range Rp indicated by the arrow, the stationarity-determined part and a representative pitch, that is, the fundamental frequency, are detected. FIG. 19(B) shows the waveform after it has passed through a band-pass filter created on the basis of that pitch; in the range Rq indicated by the arrow, the same coefficients are used over an enlarged region.

The separation characteristics for the target sound can be improved further. If the bands of all harmonic components of the pitch frequency are passed, there are cases in which sounds other than the target are not attenuated; by using statistical data prepared in advance, the band of a particular harmonic order can be left out of the pass characteristic.
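
As a rough sketch of leaving out a harmonic order, the pass bands can be listed from the detected pitch while skipping selected orders. The function `harmonic_bands`, the fixed `half_width_hz`, and the `skip_orders` argument are hypothetical names introduced here; how the skipped orders are chosen from statistical data is not described in the text.

```python
def harmonic_bands(pitch_hz, max_hz, half_width_hz, skip_orders=()):
    """List (low, high) pass bands centred on harmonics of the pitch.

    Harmonic orders listed in `skip_orders` are not added to the pass
    characteristic, so energy in those bands is attenuated like any other
    non-target band.
    """
    bands = []
    order = 1
    while order * pitch_hz <= max_hz:
        if order not in skip_orders:
            centre = order * pitch_hz
            bands.append((centre - half_width_hz, centre + half_width_hz))
        order += 1
    return bands


# Pitch of 200 Hz, harmonics up to 1 kHz, third harmonic left out.
print(harmonic_bands(200.0, 1000.0, 20.0, skip_orders={3}))
```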

Next, still another specific example of the embodiment of the present invention will be described with reference to FIG. 20. The sound source signal separation device shown in FIG. 20 adds a configuration for speaker determination and region designation to the configuration of the device described with FIG. 17. In addition, as the separation coefficient output means, a coefficient memory / coefficient selection unit 86 in the sound source signal separation unit 192 is used in place of the separation coefficient creation unit 76 in the sound source signal separation unit 191 of FIG. 17.

The coefficient memory / coefficient selection unit 86, which serves as the separation coefficient output means in FIG. 20, stores in memory separation filter coefficients created in advance for several kinds of pitch and reads out the separation filter coefficients corresponding to the detected pitch. For example, the range of pitch values is divided into a plurality of sections, separation filter coefficients are created in advance for a representative pitch in each section and stored in memory, and the coefficients of the section into which the pitch obtained by pitch detection falls are read from memory. The sound source signal separation device then no longer needs to compute separation filter coefficients for every detected pitch; the coefficients are obtained quickly by memory access, which speeds up processing.
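
The table lookup described above can be sketched with a sorted list of section boundaries. This is only an illustration: the class name `CoefficientMemory`, the boundary values, and the placeholder coefficient lists are assumptions, and real entries would be the precomputed separation filter taps for each section's representative pitch.

```python
import bisect


class CoefficientMemory:
    """Precomputed separation filter coefficients, one set per pitch section."""

    def __init__(self, edges_hz, coeff_sets):
        # edges_hz: ascending section boundaries; section i covers
        # edges_hz[i] <= pitch < edges_hz[i + 1].
        if len(coeff_sets) != len(edges_hz) - 1:
            raise ValueError("need exactly one coefficient set per section")
        self.edges_hz = list(edges_hz)
        self.coeff_sets = list(coeff_sets)

    def select(self, pitch_hz):
        """Return the stored coefficients for the section containing the pitch,
        or None when the pitch lies outside every section."""
        if not (self.edges_hz[0] <= pitch_hz < self.edges_hz[-1]):
            return None
        index = bisect.bisect_right(self.edges_hz, pitch_hz) - 1
        return self.coeff_sets[index]


# Three sections with placeholder tap lists (not real filter designs).
memory = CoefficientMemory([100, 150, 200, 300],
                           [[0.1, 0.8, 0.1], [0.2, 0.6, 0.2], [0.3, 0.4, 0.3]])
print(memory.select(172.0))
```

The lookup costs a binary search over the section boundaries instead of a filter design per detected pitch, which is the speed-up the paragraph refers to.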

Speaker determination means determining whether or not a sound is the voice of the target person (the target voice) among a plurality of sound sources (a plurality of people). The speaker determination unit 82 in this embodiment basically uses the signal waveform passed through an LPF (low-pass filter) 81. The low-band signal from the LPF 81 may be in the same band as the low band taken from the filter bank unit 73 for pitch detection. In the speaker determination of this embodiment, the output of the delay correction addition of FIGS. 1, 3 and so on is used, and whether the target person is speaking is determined by examining the degree of agreement through the value of the correlation coefficient cor described with equation (1) above. Specific examples of the determination method include: judging by a threshold on the correlation value of the entire stationarity determination region (the stationary part described above), as shown in FIG. 21(a); dividing the stationarity determination region finely and judging by the rate at which the correlation value is at or above a predetermined threshold, as shown in FIG. 21(b); and dividing the stationarity determination region into multiple sections that are allowed to overlap and judging by the rate at which their correlation values are at or above the threshold, as shown in FIG. 21(c). The determination may also take into account the correlation of data that characterizes the waveform. By adjusting the delay amounts in the delay correction addition, the method can be applied to the direction of each of the sound sources (people), so it is also possible to determine who is speaking.
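
The three decision rules of FIG. 21 can be sketched in one function. Several things here are assumptions: equation (1) is not reproduced in this section, so a plain Pearson correlation stands in for cor; the correlation is taken between the two delay-corrected channels; and the names `is_target`, `size`, `hop`, and `min_fraction` are hypothetical.

```python
import math


def cor(x, y):
    """Pearson correlation of two equal-length sequences (stand-in for eq. (1))."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def is_target(left, right, cor_thresh, size=None, hop=None, min_fraction=0.5):
    """Decide whether the target is speaking within one stationary region.

    size=None          -> rule (a): one correlation over the whole region.
    hop == size        -> rule (b): non-overlapping sub-sections.
    hop < size         -> rule (c): overlapping sub-sections.
    For (b) and (c) the region counts as target speech when the fraction of
    sub-sections whose correlation reaches cor_thresh is at least min_fraction.
    """
    if size is None:
        return cor(left, right) >= cor_thresh
    hop = hop or size
    starts = range(0, len(left) - size + 1, hop)
    if len(starts) == 0:
        return False
    hits = sum(1 for s in starts
               if cor(left[s:s + size], right[s:s + size]) >= cor_thresh)
    return hits / len(starts) >= min_fraction


# In-phase channels (target direction) versus quadrature channels (off-target).
sine = [math.sin(2 * math.pi * i / 20) for i in range(200)]
cosine = [math.cos(2 * math.pi * i / 20) for i in range(200)]
print(is_target(sine, sine, 0.9), is_target(sine, cosine, 0.9, size=40, hop=20))
```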

The output from the speaker determination unit 82 is sent to the stationarity determination unit 74 and the region designation unit 83. When the stationarity determination unit 74 determines a stationary part, time-axis coordinate data is obtained and sent to the region designation unit 83. When the speaker has been determined, the region designation unit 83 widens the region by a fixed interval beyond the stationarity-determined region and adjusts the region by notifying the buffers 84 and 85 of the timing. The buffer 84 is inserted between the filter bank unit 73 and the filter arithmetic circuit 77 in the sound source signal separation unit 192, and the buffer 85 is inserted between the filter bank unit 73 and the high-frequency processing unit 79. For times (sections) that the region designation unit 83 judges to be outside the region, it suffices simply to lower the gain. To adjust the gain, for example, taps like those of the filter arithmetic circuit 77 are prepared, the taps other than the center are set to zero, and only the center tap is given a coefficient other than 1. To reduce the level to one tenth, only the center tap is given a coefficient of 0.1.
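
The center-tap gain adjustment amounts to running the same FIR structure with a trivial coefficient set. The sketch below assumes an odd tap count and a zero-padded, centred convolution; the helper names `gain_taps` and `fir` are illustrative and not taken from the source.

```python
def gain_taps(num_taps, gain):
    """Tap set that only scales the signal: zeros except the centre tap."""
    if num_taps % 2 == 0:
        raise ValueError("an odd tap count is assumed so a centre tap exists")
    taps = [0.0] * num_taps
    taps[num_taps // 2] = gain
    return taps


def fir(signal, taps):
    """Centred FIR filter with zero padding; output has the input's length."""
    half = len(taps) // 2
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            i = n + half - k
            if 0 <= i < len(signal):
                acc += h * signal[i]
        out.append(acc)
    return out


# Outside the designated region: same filter structure, level cut to one tenth.
print(fir([1.0, -2.0, 4.0], gain_taps(5, 0.1)))
```

Because only the coefficients change, the in-region separation filter and the out-of-region attenuation can share one filter circuit.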

The rest of the configuration of FIG. 20 is the same as that of FIG. 17 described above, so corresponding parts are given the same reference numerals and their description is omitted.

To summarize the operation of the embodiment of the sound source signal separation device described above: the device targets two or more sound sources picked up by a stereo microphone and, in order to separate the target person's voice, detects the pitch of stationary parts, such as vowels, of the mixed waveform. This works regardless of whether the voice is high or low, male or female. By obtaining band-pass coefficients (separation filter coefficients) that give the pass characteristic for the target sound with this pitch as the reference, sounds other than the target are attenuated in the bands other than the peaks on the frequency axis associated with the target sound. In addition, preparing a coefficient memory in advance saves the work of computing coefficients and raises the processing speed.

Next, FIG. 22 shows the schematic configuration of a specific example of a sound source signal separation device used in another embodiment of the present invention.

In FIG. 22, an acoustic signal collected by microphones or the like, specifically for example a stereo audio signal collected by a stereo microphone, is input to an input terminal 110 and sent to the pitch detection unit 12 and to the delay correction addition unit 13, which serves as sound source signal emphasizing means for emphasizing the desired sound source signal. The output of the delay correction addition unit 13 is sent to a basic waveform creation unit 140 and a basic waveform replacement unit 150 in the sound source signal separation unit 190. The basic waveform creation unit 140 creates a basic waveform based on the pitch detected by the pitch detection unit 12. The basic waveform from the basic waveform creation unit 140 is sent to the basic waveform replacement unit 150, where at least part of the audio signal from the delay correction addition unit 13 (for example, the stationary part described later) is replaced with the basic waveform, and the result is taken out from an output terminal 160 as the separated waveform output signal.

In this specific example of the sound source signal separation device, the pitch detection unit 12 and the delay correction addition unit 13 are the same as in the configuration of FIG. 1 described above, so corresponding parts are given the same reference numerals and their description is omitted.

The pitch detection unit 12 of FIG. 22 may detect the pitch from a period of two wavelengths, as in the embodiment described above, but is not limited to this; a unit that detects a period of one wavelength, or a period of an even number of wavelengths of four or more, may also be used. Using more wavelengths for pitch detection increases the number of samples to be processed but has the advantage of reducing the error. Such a pitch detection unit can be used widely, not only in the sound source signal separation device shown in FIG. 22 but in various sound source signal separation devices that separate sound source signals by detecting the pitch.

The basic waveform creation unit 140 creates a basic waveform based on the pitch of the stationary part detected by the pitch detection unit 12. A waveform whose length is an integer multiple of the pitch wavelength is generally used as the basic waveform; in this embodiment, as described later, a waveform twice the pitch wavelength is used. The basic waveform replacement unit 150 then replaces, for example, the stationary part of the audio signal from the delay correction addition unit 13 (or from the input terminal 110) with a repetition of the basic waveform created by the basic waveform creation unit 140, and sends the result to the output terminal 160 as a separated waveform output signal in which only the audio signal from the desired sound source is emphasized.

Next, a specific example of the operation of the sound source signal separation device of FIG. 22 will be described.

The pitch detection unit 12 obtains a pitch for each detection unit of the pitch and obtains the coordinates (sample numbers) of the continuous section, or stationary part, over which that pitch continues. The audio signal separation device of FIG. 22, which uses a stereo microphone, separates the signal waveforms of two or more sound sources from this information.

As described earlier, applying a delay correction for the target sound to each microphone signal so that the phases match, and then adding the signals, emphasizes the target sound and relatively attenuates the other sounds. On the same principle, adding the signal waveform of the stationary part to itself with the pitch detection unit as the period produces the basic waveform of that stationary part.

That is, as described with FIG. 3, the delay correction addition unit 13 of FIG. 22 corrects the delay amounts so as to remove the difference in sound propagation delay from the target sound source to each microphone, adds the corrected signals, and outputs the sum. The basic waveform creation unit 140 creates the basic waveform by processing the output signal waveform of the delay correction addition unit 13 according to the information from the pitch detection unit 12; specifically, it adds the signal waveform of the pitch-continuous section, or stationary part, to itself with the pitch detection unit as the period. The solid waveform a in FIG. 23 is an example of a basic waveform created in this way: six two-wavelength waveforms like those shown in FIG. 5 (corresponding, for example, to periods Ty(1) to Ty(6)) added together and averaged. The broken waveform b in FIG. 23 shows the original target voice waveform for reference. As FIG. 23 makes clear, the basic waveform a, created by adding the signal waveform of the pitch-continuous section or stationary part with the two-wavelength pitch detection unit as the period, closely approximates the original target voice waveform b. In the basic waveform the target sound is added without phase shift and is therefore preserved or emphasized, whereas other sounds are added with shifted phase and are therefore attenuated. The reason it is preferable to perform both pitch detection and basic waveform creation in units of two wavelengths is that the created basic waveform then also preserves the component of period Ty, which is longer than the pitch period Tx.
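
The add-and-average step can be sketched as below. It assumes the pitch detection stage supplies the start and end sample index of each two-wavelength unit; truncating every unit to the shortest one is an assumption made here so that units of slightly different length can be summed, and the name `make_basic_waveform` is hypothetical.

```python
def make_basic_waveform(signal, units):
    """Average the two-wavelength units of a stationary part.

    units: (start, end) sample ranges, end exclusive, one per detection unit.
    Components that repeat in phase from unit to unit survive the average;
    components that do not repeat at this period tend to cancel.
    """
    length = min(end - start for start, end in units)
    acc = [0.0] * length
    for start, _ in units:
        for i in range(length):
            acc[i] += signal[start + i]
    return [value / len(units) for value in acc]


# Six units of a periodic target plus an interferer that flips sign per unit.
target = [0, 1, 2, 1, 0, -1, -2, -1]
mixed = []
for k in range(6):
    offset = 1 if k % 2 == 0 else -1
    mixed += [value + offset for value in target]
units = [(8 * k, 8 * k + 8) for k in range(6)]
print(make_basic_waveform(mixed, units))
```

In this toy input the interferer cancels exactly, so the result equals the target; with real speech the cancellation is only partial and improves as more units are averaged.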

The basic waveform replacement unit 150 that follows replaces the pitch-continuous section, or stationary part, of the output signal waveform of the delay correction addition unit 13 with a repetition of the basic waveform created by the basic waveform creation unit 140. The solid waveform a in FIG. 24 is an example of the repeated basic waveform substituted by the basic waveform replacement unit 150, and the broken waveform b in FIG. 24 shows the original target voice waveform for reference.
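
The replacement step can be sketched as a simple overwrite of the stationary span with the basic waveform repeated end to end. The function name `replace_with_basic` is hypothetical, and starting the repetition at phase zero at the beginning of the span is an assumption; the text does not describe boundary handling.

```python
def replace_with_basic(signal, start, end, basic):
    """Return a copy of `signal` whose samples in [start, end) are replaced by
    the basic waveform repeated cyclically from the start of the span."""
    out = list(signal)
    for i in range(start, end):
        out[i] = basic[(i - start) % len(basic)]
    return out


# Samples outside the stationary span are left untouched.
print(replace_with_basic([9] * 10, 2, 9, [1, 2, 3]))
```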

The output waveform signal from the basic waveform replacement unit 150, in which the pitch-continuous section or stationary part has been replaced with the basic waveform in this way, is taken out from the output terminal 160 as the separated output waveform signal of the target voice.

FIG. 25 is a flowchart outlining the operation of this audio signal separation device. In the first step S61, pitch detection is performed, for example with two wavelengths as the detection unit as described above. In the next step S62, whether there is continuity is determined; if NO, the process returns to the pitch detection of step S61, and if YES, it proceeds to step S63. In step S63, the start and end coordinates of each pitch detection unit obtained by the pitch detection are input. In step S64, the signal waveforms of these pitch detection units are added and averaged to create the basic waveform. In step S65, the basic waveform replacement described above is performed.

The relationship between the stereo microphone and the sound sources (people and so on) is the same as described earlier, so its description is omitted.

To summarize the operation of the embodiment of the sound source signal separation device described above: the device targets two or more sound sources picked up by a stereo microphone and, in order to separate the target person's voice, detects the pitch of stationary parts, such as vowels, of the mixed waveform. This works regardless of whether the voice is high or low, male or female. When the error from the preceding pitch is small, the pitch is judged to be continuous; the continuous parts are added and averaged, and the resulting waveform is taken as the basic waveform and substituted for the original waveform. The more units are added into the replacement waveform, the more the other components of the mixed waveform are attenuated, so only the target sound is emphasized and separation is achieved.

The present invention is not limited to the embodiments described above. For example, the pitch detection described above may use as its period not only two wavelengths but any multiple of two wavelengths, such as four wavelengths. Using four or more wavelengths reduces the error further but increases the number of samples to be processed, so the pitch detection period should be chosen with that trade-off in mind. Such a pitch detection configuration can also be used widely, not only in the sound source signal separation devices of the above embodiments but in various devices that separate sound source signals by detecting the pitch. Various other modifications are of course possible without departing from the gist of the present invention.

FIG. 1 is a block diagram showing a schematic configuration of a sound source signal separation device according to an embodiment of the present invention.
FIG. 2 is a block diagram showing a configuration example of a pitch detection device used in the embodiment of the present invention.
FIG. 3 is a block diagram showing a configuration example of a delay correction addition unit used in the embodiment of the present invention.
FIG. 4 is a diagram showing audio signal waveforms for explaining the operation of the delay correction addition unit used in the embodiment of the present invention.
FIG. 5 is a waveform diagram showing the waveform on the time axis of an audio signal used in the embodiment of the present invention.
FIG. 6 is a diagram showing the spectrum on the frequency axis of the audio signal shown in FIG. 5.
FIG. 7 is a waveform diagram showing the waveform on the time axis of an audio signal with a pitch frequency of about 650 Hz.
FIG. 8 is a diagram showing the spectrum on the frequency axis of the audio signal shown in FIG. 7.
FIG. 9 is a waveform diagram showing the waveform on the time axis of an audio signal with a pitch frequency of about 580 Hz.
FIG. 10 is a diagram showing the spectrum on the frequency axis of the audio signal shown in FIG. 9.
FIG. 11 is a diagram showing audio signal waveforms for explaining why pitch detection uses two wavelengths as the detection unit in the embodiment of the present invention.
FIG. 12 is a flowchart for explaining an example of the pitch detection processing in the embodiment of the present invention.
FIG. 13 is a waveform diagram for explaining the local maxima and local minima of an audio signal waveform.
FIG. 14 is a diagram showing a specific example of the information detected for each two-wavelength pitch detection unit.
FIG. 15 is a diagram showing a specific example of the frequency characteristic of a separation filter using filter coefficients created by the separation coefficient creation unit.
FIG. 16 is a diagram showing a specific example of filter coefficients created by the separation coefficient creation unit.
FIG. 17 is a block diagram showing another specific example of the sound source signal separation device in the embodiment of the present invention.
FIG. 18 is a diagram for explaining the extension on the time axis of the filter coefficients of a stationary part.
FIG. 19 is a waveform diagram showing a specific example of a signal waveform on the time axis.
FIG. 20 is a block diagram showing still another specific example of the sound source signal separation device in the embodiment of the present invention.
FIG. 21 is a diagram for explaining the relationship between the stationarity determination region and speaker determination.
FIG. 22 is a block diagram showing a schematic configuration of a sound source signal separation device according to an embodiment of the present invention.
FIG. 23 is a waveform diagram showing an example of a basic waveform created by the basic waveform creation unit.
FIG. 24 is a waveform diagram showing an example of the repeated basic waveform substituted by the basic waveform replacement unit.
FIG. 25 is a flowchart for explaining an example of the sound source signal separation processing in the embodiment of the present invention.
FIG. 26 is a diagram showing a specific example of sound collection by a stereo microphone when three people are the sound sources.

Explanation of symbols

12: pitch detection unit
13, 23, 72: delay correction addition unit
14, 76: separation coefficient creation unit
15, 77: filter arithmetic circuit
17, 79: high-frequency processing unit
19, 190, 191, 192: sound source signal separation unit
24: local maximum detection unit
25: detection unit for the largest local maximum between zero crossings
26: maximum-to-maximum pitch detection unit
27: continuity determination unit
73: filter bank unit
74: stationarity determination unit
86: coefficient memory / coefficient selection unit
140: basic waveform creation unit
150: basic waveform replacement unit

Claims

A sound source signal separation device comprising:
sound source signal emphasizing means for emphasizing a desired sound source signal in an input acoustic signal in which acoustic signals from a plurality of sound sources are mixed and which is collected by a plurality of sound collecting means;
pitch detection means for detecting the pitch of the desired sound source signal in the input acoustic signal; and
sound source signal separation means for separating the desired sound source signal from the input acoustic signal based on the detected pitch and the sound source signal emphasized by the sound source signal emphasizing means.

The sound source signal separation device according to claim 1, wherein the sound source signal separation means comprises:
filter means for separating the desired sound source signal from the output signal of the sound source signal emphasizing means; and
filter coefficient output means for outputting filter coefficients of the filter means based on detection information from the pitch detection means.

The sound source signal separation device according to claim 2, wherein the filter coefficient output means outputs filter coefficients that give the filter means a frequency characteristic passing frequency components at integer multiples of the frequency of the pitch detected by the pitch detection means.

The sound source signal separation device according to claim 3, wherein the filter coefficient output means includes storage means in which filter coefficients corresponding to several kinds of pitch are stored in advance, and reads out from the storage means the filter coefficients corresponding to the pitch detected by the pitch detection means.

The sound source signal separation device according to claim 2, further comprising:
high-frequency processing means for processing the consonant band of the output signal from the sound source signal emphasizing means; and
filter bank means for extracting the consonant band of the output signal from the sound source signal emphasizing means and sending it to the high-frequency processing means, extracting the band other than consonants of that output signal and sending it to the filter means, and extracting the vowel band of that output signal and sending it to the pitch detection means.

The sound source signal separation device according to claim 2, wherein the plurality of sound collecting means are left and right stereo microphones.

The sound source signal separation device according to claim 2, wherein the sound source signal emphasizing means corrects, in the acoustic signals from the plurality of sound collecting means, the differences in sound propagation delay time from the desired sound source to the respective sound collecting means and adds the corrected signals, thereby emphasizing only the acoustic signal from the desired sound source.

The sound source signal separation device according to claim 2, wherein the pitch detection means performs pitch detection using two wavelengths of the pitch of the desired sound source signal as the detection unit.

The sound source signal separation device according to claim 1, wherein the sound source signal separation means comprises:
basic waveform creation means for creating a basic waveform, based on detection information from the pitch detection means, using a stationary part of the output signal from the sound source signal emphasizing means in which at least substantially the same pitch continues; and
basic waveform replacement means for replacing at least part of a signal based on the input acoustic signal with a repetitive waveform of the basic waveform created by the basic waveform creation means.

The sound source signal separation device according to claim 9, wherein the pitch detection means performs pitch detection using two wavelengths of the pitch of the desired sound source signal as the detection unit.

The sound source signal separation device according to claim 9, wherein the plurality of sound collecting means are left and right stereo microphones.

The sound source signal separation device according to claim 9, wherein the sound source signal emphasizing means corrects, in the acoustic signals from the plurality of sound collecting means, the differences in sound propagation delay time from the desired sound source to the respective sound collecting means and adds the corrected signals, thereby emphasizing only the acoustic signal from the desired sound source.

10. The basic waveform generating means generates a basic waveform by adding and averaging the two portions of the pitch in units of the stationary portion where the pitch of the desired sound source signal is continuous. The sound source signal separation device described.

A sound source signal separation method comprising:
a step of emphasizing a desired sound source signal in an input acoustic signal in which acoustic signals from a plurality of sound sources are mixed and which is collected by a plurality of sound collecting means;
a step of detecting the pitch of the desired sound source signal in the input acoustic signal; and
a step of separating the desired sound source signal from the input acoustic signal based on the detected pitch and the sound source signal emphasized in the emphasizing step.

A pitch detection apparatus comprising:
sound source signal emphasizing means for emphasizing a desired sound source signal in an input acoustic signal in which acoustic signals from a plurality of sound sources are mixed and which is collected by a plurality of sound collecting means;
period detecting means for detecting a two-wavelength period in the output signal from the sound source signal emphasizing means, using two wavelengths of the pitch as the detection unit; and
continuity determination means for determining, based on changes in the two-wavelength period detected by the period detecting means, whether at least substantially the same pitch continues, and for outputting pitch information in accordance with the determination result.
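
The continuity determination recited in this claim can be illustrated with a short sketch. It assumes that a period detector has already produced one two-wavelength period (in samples) per detection unit; the function name `find_stationary_runs`, the relative `tolerance`, and the `min_run` threshold are hypothetical parameters chosen for the example, since the claim only requires that "substantially the same pitch" be recognized from changes in the two-wavelength period.

```python
def find_stationary_runs(periods, tolerance=0.05, min_run=2):
    """Group consecutive detection units with nearly equal periods.

    `periods` holds one two-wavelength period (positive, in samples) per
    detection unit. A run is broken when a period differs from the previous
    one by more than `tolerance` (relative). Returns a list of
    (first_index, last_index, mean_period) for runs of at least `min_run`.
    """
    runs = []
    start = 0
    for i in range(1, len(periods) + 1):
        broken = (
            i == len(periods)
            or abs(periods[i] - periods[i - 1]) > tolerance * periods[i - 1]
        )
        if broken:
            if i - start >= min_run:
                run = periods[start:i]
                runs.append((start, i - 1, sum(run) / len(run)))
            start = i
    return runs


# Example: two stationary parts followed by an isolated unit.
detected = [200, 202, 201, 150, 151, 300]
stationary = find_stationary_runs(detected)
```

Here `stationary` contains one entry per stationary part. Under the two-wavelength convention, half of each mean period would correspond to one pitch wavelength, which is one possible form for the pitch information output by the determination.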

The pitch detection apparatus according to claim 15, wherein the plurality of sound collecting means are left and right stereo microphones.

The pitch detection apparatus according to claim 15, wherein the sound source signal emphasizing means corrects the acoustic signals from the plurality of sound collecting means for the difference in sound propagation delay time from the desired sound source to each of the sound collecting means and adds the corrected signals, thereby emphasizing only the sound signal from the desired sound source.

A pitch detection method comprising:
a sound source signal emphasizing step of emphasizing a desired sound source signal in an input acoustic signal in which acoustic signals from a plurality of sound sources are mixed and which is collected by a plurality of sound collecting means;
a period detection step of detecting a two-wavelength period in the output signal obtained by the sound source signal emphasizing step, using two wavelengths of the pitch as the detection unit; and
a continuity determination step of determining, based on changes in the two-wavelength period detected in the period detection step, whether at least substantially the same pitch continues, and outputting pitch information in accordance with the determination result.

A sound source signal separation device comprising:
pitch detection means for performing pitch detection on an input acoustic signal in which acoustic signals from a plurality of sound sources are mixed, using as the detection unit a number of wavelengths of the pitch of a desired sound source signal that is a multiple of two; and
sound source signal separation means for separating the desired sound source signal based on the detected pitch.

A sound source signal separation method comprising:
a step of performing pitch detection on an input acoustic signal in which acoustic signals from a plurality of sound sources are mixed, using as the detection unit a number of wavelengths of the pitch of a desired sound source signal that is a multiple of two; and
a step of separating the desired sound source signal based on the detected pitch.