JP4471780B2 - Audio signal processing apparatus and method

Info

Publication number: JP4471780B2
Application number: JP2004243882A
Authority: JP
Inventors: 孝之稗方; 哲也高橋; 陽平池田; 敏章下田
Original assignee: Kobe Steel Ltd
Current assignee: Kobe Steel Ltd
Priority date: 2004-08-24
Filing date: 2004-08-24
Publication date: 2010-06-02
Anticipated expiration: 2024-08-24
Also published as: JP2006064755A

Description

The present invention relates to an audio signal processing apparatus and method that detect a pitch period from an input audio signal and compress or expand the time axis of the input audio signal based on that pitch period.

When changing the tempo (speed) of karaoke or the playback speed of video, time-axis compression/expansion processing (an example of audio signal processing) is required that speeds up or slows down the playback of an audio signal without changing its pitch.
Conventionally, Non-Patent Document 1 and Non-Patent Document 2 disclose a technique that finds portions of an audio signal with strong periodicity and performs time-axis compression/expansion (based on the pitch period) by omitting or repeating (inserting) the audio signal in units of that period (the pitch period). This technique adopts a method called PICOLA (Pointer Interval Control OverLap and Add, an overlap-add method based on pointer movement control): the signal of a pitch period to be omitted is overlap-added, with cross-fade weighting, onto the signal of the following pitch period, or the signal of a pitch period to be inserted is formed by overlap-adding, with cross-fade weighting, the signals of the pitch periods before and after it.

FIG. 5 schematically shows the waveform of an audio signal when time-axis compression is performed by the PICOLA method.
First, as shown in FIG. 5(a), a pointer is set at the start position Po1 of the range of the audio signal to be subjected to time-axis compression (omission of the audio signal), and the pitch period P (a period having strong periodicity) of the audio signal from this pointer position Po1 is detected. An example of a method for detecting the pitch period P is described later.
Next, as shown in FIG. 5(b), a signal a' is generated by overlap-adding, with cross-fade weighting, the two signals a and b that follow the pointer position Po1, each one pitch period P long. That is, when the two signals a and b are combined (added), the cross-fade weighting is applied so that, as indicated by the broken lines W1 and W2 in FIG. 5(a), the weight for signal a fades out (gradually decreases) and the weight for signal b fades in (gradually increases) as time advances.
Next, signal a is deleted (omitted) and signal b is replaced with signal a'. This completes time-axis compression by one pitch period P. Because the signal a' placed at the omitted portion is a signal overlap-added with cross-fade weighting, it connects smoothly to the audio signal before and after it, enabling time-axis compression with little sense of unnaturalness.
Next, given a target compression ratio Rx (0 < Rx < 1), the pointer is reset to a position Po2 advanced by C (= P × Rx / (1 − Rx)) from the position Po1, the compressed audio signal from position Po1 to position Po2 is output, and the same time-axis compression processing is repeated from this pointer position Po2. As a result, a compressed audio signal of length C is generated (output) from an original audio signal of length P + C, achieving time-axis compression at the target compression ratio Rx (= C / (P + C)).
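The compression step described above can be sketched in Python as follows. This is a minimal illustration, not the implementation of the cited documents: the function names, the linear shape of the cross-fade weights, and the restriction 0.5 <= Rx < 1 (so that C >= P) are assumptions made here to keep the sketch short. Because signal a is removed, advancing the pointer by C in the compressed signal corresponds to advancing by P + C in the original buffer, which is the pointer the sketch returns.

```python
def crossfade(fade_out, fade_in):
    """Overlap-add two equal-length segments with linear cross-fade weights."""
    n = len(fade_out)
    return [fade_out[i] * (1 - i / n) + fade_in[i] * (i / n) for i in range(n)]


def picola_compress_step(x, po1, p, rx):
    """One compression step from pointer po1 with pitch period p (in samples).

    Returns (output, next_pointer); next_pointer indexes the original x.
    Assumes 0.5 <= rx < 1 so that C >= P (a simplification of this sketch).
    """
    if not 0.5 <= rx < 1:
        raise ValueError("this sketch assumes 0.5 <= rx < 1")
    a = x[po1:po1 + p]
    b = x[po1 + p:po1 + 2 * p]
    a_prime = crossfade(a, b)        # a fades out while b fades in
    c = round(p * rx / (1 - rx))     # C = P * Rx / (1 - Rx)
    # a is dropped and b is replaced by a'; the samples after b are unchanged.
    output = a_prime + x[po1 + 2 * p:po1 + p + c]
    return output, po1 + p + c
```

Each call consumes P + C input samples and emits C output samples, so repeated calls approach the ratio Rx = C / (P + C).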

FIG. 6, on the other hand, schematically shows the waveform of an audio signal when time-axis expansion is performed by the PICOLA method.
First, as shown in FIG. 6(a), a pointer is set at the start position Po3 of the range of the audio signal to be subjected to time-axis expansion (insertion of the audio signal), and the pitch period P (a period having strong periodicity) of the audio signal from this pointer position Po3 is detected.
Next, as shown in FIG. 6(b), a signal a' is generated by overlap-adding, with cross-fade weighting, the two signals a and b that follow the pointer position Po3, each one pitch period P long. For time-axis expansion, the cross-fade weighting is applied so that, as indicated by the broken lines W3 and W4 in FIG. 6(a), the weight for signal a fades in (gradually increases) and the weight for signal b fades out (gradually decreases) as time advances.
Next, the signal a' is inserted between signals a and b. This completes time-axis expansion by one pitch period P. Because the inserted signal a' is a signal overlap-added with cross-fade weighting, it connects smoothly to the audio signal before and after it, enabling time-axis expansion with little sense of unnaturalness.
Next, given a target expansion ratio Ry (Ry > 1), the pointer is reset to a position Po4 advanced by P + S (S = P × 1 / (Ry − 1)) from the position Po3, the expanded audio signal from position Po3 to position Po4 is output, and the same time-axis expansion processing is repeated from this pointer position Po4. As a result, an expanded audio signal of length P + S is generated (output) from an original audio signal of length S, achieving time-axis expansion at the target expansion ratio Ry (= (P + S) / S).
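The expansion step can be sketched in the same way. Again this is only an illustration under stated assumptions: the function name and the linear cross-fade weights are hypothetical, and the sketch takes 1 < Ry <= 2 so that S >= P. Note that S = P / (Ry − 1) is positive only for Ry > 1. Advancing the pointer by P + S in the expanded signal corresponds to advancing by S in the original buffer.

```python
def picola_expand_step(x, po3, p, ry):
    """One expansion step from pointer po3 with pitch period p (in samples).

    Returns (output, next_pointer); next_pointer indexes the original x.
    Assumes 1 < ry <= 2 so that S >= P (a simplification of this sketch).
    """
    if not 1 < ry <= 2:
        raise ValueError("this sketch assumes 1 < ry <= 2")
    a = x[po3:po3 + p]
    b = x[po3 + p:po3 + 2 * p]
    # a': b fades out while a fades in (linear weights are an assumption).
    a_prime = [b[i] * (1 - i / p) + a[i] * (i / p) for i in range(p)]
    s = round(p / (ry - 1))          # S = P / (Ry - 1)
    # a' is inserted between a and b; the original signal resumes at b.
    output = a + a_prime + x[po3 + p:po3 + s]
    return output, po3 + s
```

Each call consumes S input samples and emits P + S output samples, matching Ry = (P + S) / S.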

When the audio signal to be processed has multiple channels, such as a stereo audio signal, applying PICOLA to each channel separately requires the computationally heavy pitch period calculation to be performed for every channel, so the computational load becomes very high. In addition, because the pitch period can differ from channel to channel, the compressed or expanded audio signal acquires inter-channel phase differences that were not present in the original, which sounds unnatural to the listener.
To solve this problem, it is effective to unify (share) the pitch period used for compression and expansion across all channels.
For example, Patent Document 1 proposes a technique that detects the pitch period of the signal (L + R) obtained by adding the L and R channels of a stereo audio signal and performs compression/expansion processing (PICOLA) on the audio signals of both channels based on that pitch period.
Further, Patent Document 2 proposes a technique that detects the pitch period of a signal obtained by adding a plurality of channel signals, or of the channel signal having the largest amplitude, and performs compression/expansion processing on all channel signals based on that pitch period.
With these techniques, the computationally heavy pitch period calculation needs to be performed on only one audio signal, which prevents an increase in computational load and also prevents inter-channel phase differences that would sound unnatural to the listener from arising in the compressed or expanded audio signal.
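The idea of sharing one pitch period across channels can be sketched as below. The cited documents are not quoted here, so the detector is an assumption: the sketch uses a sum of absolute differences between samples j apart, and the names `amdf` and `shared_pitch_period` as well as the search range n0..n are hypothetical.

```python
def amdf(x, j, n):
    """Sum of |x[i] - x[i + j]| for i = 0..n; smaller means more periodic at lag j."""
    return sum(abs(x[i] - x[i + j]) for i in range(n + 1))


def shared_pitch_period(left, right, n0, n):
    """Detect one pitch period on the L + R sum, to be used for both channels."""
    mono = [l + r for l, r in zip(left, right)]
    # Only one search over the lags n0..n is needed, instead of one per channel.
    return min(range(n0, n + 1), key=lambda j: amdf(mono, j, n))
```

The returned period would then be passed to the compression or expansion step of every channel, so that all channels are cut or repeated at the same positions.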
Patent Document 1: JP 2001-5500 A
Patent Document 2: JP 2002-297200 A
Non-Patent Document 1: Morita and Itakura, "Expansion and contraction of speech on the time axis using the autocorrelation function", Proceedings of the Acoustical Society of Japan, March 1986, pp. 199-200
Non-Patent Document 2: Morita and Itakura, "Expansion and compression of speech on the time axis using the overlap-add method with pointer movement control (PICOLA) and its evaluation", Proceedings of the Acoustical Society of Japan, October 1986, pp. 149-150
Non-Patent Document 3: Hiroshi Saruwatari, "Basics of blind source separation using array signal processing", IEICE Technical Report, April 2001, vol. EA2001-7, pp. 49-56
Non-Patent Document 4: Tomoya Takatani et al., "High-fidelity blind source separation using ICA based on the SIMO model", IEICE Technical Report, January 2003, vol. US2002-87, EA2002-108

Here, when audio signals from a plurality of different sound sources are mixed in the audio signal whose pitch period is to be detected (a one-channel (monaural) input audio signal, or a synthesized audio signal formed from multi-channel input audio signals), the techniques of Patent Document 1 and Patent Document 2 detect the pitch period corresponding to the audio signal of the representative sound source with the strongest periodicity.
Consequently, with the techniques of Patent Document 1 and Patent Document 2, when signals from a plurality of sound sources are mixed, the audio signal of the one representative sound source remains clear after time-axis compression or expansion, but the audio signals from the other sound sources lose clarity, degrading the quality of the audio signal as a whole.
For example, when a human singing voice and an instrumental performance are mixed in the input audio signal, sound quality degrades in such a way that the performance is clear but the singing becomes unclear.
The present invention has been made in view of the above circumstances. Its object is to provide an audio signal processing apparatus and method that, when detecting a pitch period from an input audio signal and compressing or expanding the time axis of the input audio signal based on that pitch period, can prevent sound quality degradation by keeping the clarity of each sound source's audio signal well balanced in the compressed or expanded signal, even when audio signals from a plurality of sound sources are mixed in the input audio signal.

To achieve the above object, the present invention may be configured as an audio signal processing apparatus comprising: first signal processing means for generating an audio signal by separating an audio signal of a first predetermined frequency band from one input audio signal in which sounds from a plurality of sound sources are mixed, or from a synthesized audio signal of multi-channel input audio signals; second signal processing means for generating an audio signal by separating, from the one input audio signal or the synthesized audio signal, an audio signal of a second predetermined frequency band different from the first predetermined frequency band; first pitch period candidate detecting means for detecting candidates for the pitch period, which is a frequency component constituting a specific audio signal contained in the first predetermined frequency band separated by the first signal processing means, by sampling the input audio signal or the synthesized audio signal at a predetermined first pitch-period sampling period and detecting one or more first pitch period candidates in ascending order of the signal-intensity difference between the sampled signal of one period and the sampled signal of another period; second pitch period candidate detecting means for sampling a specific audio signal contained in the second predetermined frequency band separated by the second signal processing means at a predetermined second pitch-period sampling period different from the first pitch-period sampling period, and detecting one or more second pitch period candidates in ascending order of the signal-intensity difference between the sampled signal of one period and the sampled signal of another period; pitch period selecting means for selecting, from among the candidates common to the one or more first pitch period candidates and the one or more second pitch period candidates, the one pitch period with the strongest periodicity; and time axis adjusting means for compressing and/or expanding the time axis of the input audio signal based on the one pitch period selected by the pitch period selecting means.
That is, in detecting one pitch period from one input audio signal or from a synthesized audio signal of multi-channel input audio signals (hereinafter, the pitch period detection signal), a plurality of candidates for the pitch period are first detected, as a first stage, based on the signal obtained by applying predetermined signal processing (first signal processing) to the pitch period detection signal.
Here, the first signal processing is processing such as extracting or removing part of the audio signals from the plurality of sound sources mixed in the input audio signal, or emphasizing or attenuating the periodicity (pitch period) of part of the audio signals. The plurality of pitch period candidates may be, for example, a predetermined number of candidates taken in descending order of their evaluation value as a pitch period.
As a result, when audio signals from a plurality of sound sources are mixed in the input audio signal, the signal processing can extract the audio signal of a sound source that is not necessarily the representative one (the one with the strongest periodicity), for example a singing voice signal mixed with instrumental sounds, and a plurality of pitch period candidates corresponding to that signal can be detected.
Further, as a second stage, one pitch period is selected from the plurality of pitch period candidates based on the pitch period detection signal itself, or on a signal obtained by applying to it other signal processing (second signal processing) different from the first signal processing.
The one pitch period selected in this way corresponds both to the signal used for detecting the plurality of candidates, that is, the audio signal of the sound source extracted or emphasized by the signal processing from among the audio signals mixed in the pitch period detection signal, and to the audio signals of the other sound sources.
Therefore, if the time axis of the input audio signal is compressed or expanded based on the one pitch period detected (selected) in this way, the clarity of each of the audio signals from the plurality of sound sources mixed in the input audio signal can be kept well balanced in the processed audio signal.
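The two-stage selection described above can be sketched as follows. This is an illustration under assumptions rather than the claimed implementation: periodicity is scored with a sum of absolute differences between samples j apart, the candidate count is a free parameter, and `first_difference` is a hypothetical stand-in for the first signal processing (a crude high-pass filter). All function names are invented for the sketch.

```python
def amdf(x, j, n):
    """Sum of |x[i] - x[i + j]| for i = 0..n; smaller means more periodic at lag j."""
    return sum(abs(x[i] - x[i + j]) for i in range(n + 1))


def detect_candidates(processed, n0, n, count):
    """Stage 1: the `count` lags in n0..n most periodic in the processed signal."""
    return sorted(range(n0, n + 1), key=lambda j: amdf(processed, j, n))[:count]


def select_period(reference, candidates, n):
    """Stage 2: among the candidates, the lag most periodic in the reference signal."""
    return min(candidates, key=lambda j: amdf(reference, j, n))


def first_difference(x):
    """Hypothetical first signal processing: a crude high-pass filter."""
    return [x[i] - x[i - 1] for i in range(1, len(x))]
```

A call such as `select_period(x, detect_candidates(first_difference(x), n0, n, 3), n)` first restricts the search to lags that suit the processed signal and then picks the one that also suits the unprocessed signal, which is the balance the text aims for.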

Furthermore, the audio signal processing apparatus according to the present invention may comprise: third signal processing means for separating, from one input audio signal from a plurality of sound sources or from a synthesized audio signal of multi-channel input audio signals, audio signals of one or more third predetermined frequency bands different from the first predetermined frequency band and the second predetermined frequency band; and pitch period narrowing means that samples the audio signal of the third predetermined frequency band separated by the third signal processing means at a fourth predetermined pitch-period sampling period, detects, as a plurality of third candidates for the pitch period detection signal, periods of strong periodicity in ascending order of the signal-intensity difference between the sampled signal of one period and the sampled signal of another period, and narrows down the plurality of pitch period candidates by selecting, as third pitch period candidates, a plurality of strongly periodic periods that correspond to both the plurality of first candidates and the plurality of third candidates; with the pitch period selecting means selecting one pitch period from among the candidates narrowed down by the pitch period narrowing means.
That is, one or more kinds of signal processing (third signal processing) different from the signal processing described above (first signal processing) may be applied to the pitch period detection signal, the plurality of pitch period candidates detected in the first stage may be narrowed down based on each of the resulting signals, and the one pitch period may be selected from the narrowed-down candidates by the second-stage processing described above.
As a result, in an intermediate stage between the first stage and the second stage, the audio signals of further sound sources, different from those targeted by pitch period detection or selection in the first and second stages, can be extracted or emphasized (third signal processing), and a pitch period that also corresponds to these audio signals is selected (narrowed down). Consequently, with the first and second stages and one or more intermediate stages between them, a pitch period can be obtained that corresponds in a well-balanced manner to the audio signals of three or more different sound sources.
Here, the first signal processing may separate the audio signal of the first predetermined frequency band by means of a band-limiting filter.
Likewise, the second signal processing and/or the third signal processing may separate an audio signal of a frequency band different from the first predetermined frequency band by means of a band-limiting filter.
That is, the signal processing at each stage (first, second, and intermediate) may be, for example, band-limiting filter processing for a different frequency band at each stage.
Other possibilities include signal processing that amplifies or attenuates the signal in a specific frequency band by frequency emphasis through equalizing, and signal processing that separates the audio signals of the plurality of sound sources contained in the input audio signal by a blind source separation (BSS) method. Details of the BSS (Blind Source Separation) method are given, for example, in Non-Patent Document 3 and Non-Patent Document 4.
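The intermediate narrowing stage can be sketched as an intersection of candidate lists. Everything named here is hypothetical: `moving_average` stands in for one possible band-limiting filter (a crude low-pass), `top_candidates` scores lags with a sum of absolute differences, and the candidate count is arbitrary. The source does not say what happens when no candidate is common to both lists, so the sketch simply returns an empty list in that case.

```python
def amdf(x, j, n):
    """Sum of |x[i] - x[i + j]| for i = 0..n; smaller means more periodic at lag j."""
    return sum(abs(x[i] - x[i + j]) for i in range(n + 1))


def top_candidates(x, n0, n, count):
    """The `count` lags in n0..n with the strongest periodicity in x."""
    return sorted(range(n0, n + 1), key=lambda j: amdf(x, j, n))[:count]


def moving_average(x, width):
    """Hypothetical band-limiting filter: mean of each run of `width` samples."""
    return [sum(x[i:i + width]) / width for i in range(len(x) - width + 1)]


def narrow_candidates(first_candidates, third_candidates):
    """Keep the first-stage candidates that also appear among the third candidates."""
    allowed = set(third_candidates)
    return [j for j in first_candidates if j in allowed]
```

The narrowed list would then be handed to the second-stage selection, so the final pitch period is one that was rated as strongly periodic in every band examined.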

According to the present invention, in a first stage a plurality of pitch period candidates are detected that correspond to the audio signal of a desired sound source, extracted or emphasized by signal processing from an input audio signal or from a synthesized audio signal of a plurality of input audio signals. In a subsequent second stage, one pitch period corresponding to the audio signal of another sound source is selected from those candidates, and in an intermediate stage the candidates may additionally be narrowed down to those corresponding to the audio signals of further sound sources. In this way a pitch period can be selected that corresponds in a well-balanced manner to each of the audio signals of the plurality of sound sources. By compressing or expanding the time axis of the input audio signal based on the pitch period selected in this way, the clarity of each of the audio signals from the plurality of sound sources mixed in the input audio signal is kept well balanced in the processed audio signal, and sound quality degradation can be prevented.

Embodiments of the present invention are described below with reference to the accompanying drawings to aid understanding of the invention. The following embodiments are examples embodying the present invention and do not limit its technical scope.
FIG. 1 is a block diagram showing the schematic configuration of an audio signal processing device Z1 according to a first embodiment of the present invention.
FIG. 2 is a block diagram showing the schematic configuration of an audio signal processing device Z2 according to a second embodiment of the present invention.
FIG. 3 is a block diagram showing the schematic configuration of an audio signal processing device Z3 according to a third embodiment of the present invention.
FIG. 4 is a block diagram showing the schematic configuration of an audio signal processing device Z4 according to a fourth embodiment of the present invention.
FIG. 5 schematically shows the waveform of an audio signal when time-axis compression is performed by the PICOLA method.
FIG. 6 schematically shows the waveform of an audio signal when time-axis expansion is performed by the PICOLA method.
FIG. 7 shows an example of the waveform of an audio signal (original sound) used for time-axis compression/expansion processing.
FIGS. 8 and 9 show examples of the signal waveform after time-axis expansion of the audio signal (original sound) of FIG. 7 by conventional methods.
FIG. 10 shows an example of the signal waveform after time-axis expansion of the audio signal (original sound) of FIG. 7 by the method of the present invention.
FIG. 11 is a graph showing an example of the waveform of a music audio signal and the change over time of the pitch period detected for that signal by the conventional method and by the method of the present invention.

<First Embodiment>
The audio signal processing device Z1 according to the first embodiment of the present invention is described below with reference to the block diagram of FIG. 1.
As shown in FIG. 1, the audio signal processing device Z1 includes a first filter (1), a pitch period detection unit 2, a pitch period selection unit 3, and a signal compression/expansion unit 4.
The first filter (1) applies band-limiting filter processing (an example of the first signal processing) to an input monaural signal M supplied from outside (an example of the one input audio signal and of the pitch period detection signal).
The pitch period detection unit 2 receives the signal filtered by the first filter (1) and detects a plurality of candidates for its pitch period (an example of the pitch period candidate detecting means).
The pitch period selection unit 3 receives the input monaural signal M (an example of the pitch period detection signal) and, based on that signal, selects from the plurality of pitch period candidates detected by the pitch period detection unit 2 the one pitch period to be used for signal compression or expansion (an example of the pitch period selecting means).
The signal compression/expansion unit 4 receives the one pitch period selected by the pitch period selection unit 3 (pitch period selecting means) and uses it to compress and expand the time axis of the input monaural signal M (an example of the input audio signal), for example by the PICOLA method described above (see FIGS. 5 and 6) (an example of the time axis adjusting means).
The components of the audio signal processing device Z1 shown in FIG. 1, and of the audio signal processing devices Z2 to Z4 of the other embodiments described later, may each be configured as a processing circuit comprising a CPU, memory and the like, or as a DSP (Digital Signal Processor). Alternatively, a processing program that implements the processing (steps) performed by each component may be executed by a predetermined computer.
The audio signal processing device Z1 is characterized in that pitch period detection is performed in two stages: first by the first filter (1) and the pitch period detection unit 2, and then by the pitch period selection unit 3. This is described in detail below.

<< First Stage >>
First, the first filter (1) applies band-limiting filter processing, such as band-pass, low-pass, or high-pass filtering, to the input monaural signal M.
When audio signals from a plurality of sound sources are mixed in the input monaural signal M, the first filter (1) passes only the band (for a human voice, for example, 200 Hz to 8 kHz) of the audio signal of a sound source that is not necessarily the representative one (the one with the strongest periodicity) among them but whose clarity should be preserved after compression/expansion (for example, a singing voice mixed with instrumental sounds).
The pitch period detection unit 2 then detects a plurality of pitch period candidates for the filtered signal. One example of how the pitch period detection unit 2 detects (calculates) these candidates is as follows.
A predetermined range j = N₀ to N is set in advance as the set of all candidates j for the pitch period P considered appropriate for the input monaural signal M (for example, an audio signal in which singing voice and instrument sounds are mixed). Here j is a number of samples of the digital audio signal, so the pitch period in units of time is j × (sampling period). The digital audio signal after the signal processing (filter processing) by the first filter (1) is taken as the pitch period evaluation target signal X_i, and for its (2N + 1) samples X_i (i = 0 to 2N, i ≥ 1) the strength of periodicity is evaluated for each candidate j (N₀ to N). A plurality of pitch period candidates is then obtained from the periodicity evaluation results: for example, a predetermined number (two or more) of candidates taken in order from the one evaluated as most strongly periodic, all candidates evaluated as more strongly periodic than a preset evaluation value, or a combination of these.
In this case, with the time range i (in samples) of the signal X_i evaluated for periodicity set to 0 to N (so that the maximum time range of the evaluation target signal referred to is 0 to 2N), the evaluation function for the strength of periodicity can be given by the following equation (1) or equation (2).
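Equations (1) and (2) do not survive in this text. Judging from the description that follows (differences, as absolute or squared values, between signal values j samples apart, with i running from 0 to N), they presumably take a form like the one below. The symbols E_1(j) and E_2(j) for the evaluation values are assumptions introduced here, not notation from the source.

```latex
E_1(j) = \sum_{i=0}^{N} \left| X_i - X_{i+j} \right| \qquad (1)

E_2(j) = \sum_{i=0}^{N} \left( X_i - X_{i+j} \right)^2 \qquad (2)
```

With i ≤ N and j ≤ N, the largest index referenced is i + j = 2N, which agrees with the stated maximum time range of 0 to 2N.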

These functions compute the difference (as an absolute value or a squared value) between signal values j samples apart; the smaller the difference, the stronger the periodicity at period j (that is, a similar waveform recurs every j samples). Accordingly, the evaluation value of equation (1) or (2) is calculated for each j = N₀ to N, and a plurality of values of j, taken according to a predetermined rule starting from the smallest evaluation value (the highest periodicity evaluation), are detected (calculated) as the plurality of pitch period candidates.
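The candidate detection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the absolute-difference form of the evaluation value and the "fixed number of best candidates" rule, and the function names `periodicity_score` and `top_candidates` are hypothetical.

```python
def periodicity_score(x, j, n):
    """Evaluation value for lag j: sum over i = 0..n of |x[i] - x[i + j]|.

    Smaller values mean stronger periodicity at period j.
    """
    return sum(abs(x[i] - x[i + j]) for i in range(n + 1))


def top_candidates(x, n0, n, count):
    """Return the `count` lags in [n0, n] with the smallest evaluation values."""
    if len(x) < 2 * n + 1:
        raise ValueError("need at least 2N + 1 samples")
    # Sorting (score, lag) pairs breaks ties in favor of the shorter lag.
    # That tie rule is an assumption; the text leaves the selection rule open.
    ranked = sorted((periodicity_score(x, j, n), j) for j in range(n0, n + 1))
    return [j for _, j in ranked[:count]]
```

For a signal that repeats exactly every 20 samples, lags 20 and 40 both give an evaluation value of zero, so both appear as candidates; this is why a later stage is needed to choose among multiples of the same underlying period.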
Alternatively, the range of j may be divided into a plurality of sections, and the candidate with the highest periodicity evaluation (the smallest evaluation value) selected in each section. That is, the range of j is divided into N₀ to N₁, N₁ to N₂, ..., N_k to N (where N₀ < N₁ < N₂ < ... < N_k < N), and in each section the j that maximizes the periodicity evaluation (for example, minimizes the evaluation value of equation (1) or (2)) is taken as one of the plurality of pitch period candidates.
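The section-wise variant might look like the sketch below. The names are hypothetical, and the treatment of section boundaries (each boundary belongs to the section it opens, and the final section also includes N) is an assumption, since the text does not say which section a shared boundary belongs to.

```python
def periodicity_score(x, j, n):
    """Sum over i = 0..n of |x[i] - x[i + j]|; smaller means more periodic."""
    return sum(abs(x[i] - x[i + j]) for i in range(n + 1))


def sectioned_candidates(x, bounds, n):
    """Pick the best lag in each section of `bounds` = [N0, N1, ..., N]."""
    candidates = []
    last = len(bounds) - 2
    for k in range(len(bounds) - 1):
        lo = bounds[k]
        # Assumed convention: sections are half-open, except the last one,
        # which includes the upper limit N.
        hi = bounds[k + 1] if k == last else bounds[k + 1] - 1
        best = min(range(lo, hi + 1), key=lambda j: periodicity_score(x, j, n))
        candidates.append(best)
    return candidates
```

Compared with taking the overall best few lags, this guarantees that the candidates are spread across the whole search range rather than clustered around one strong minimum.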

<< Second Stage >>
Next, based on the input monaural signal M, the pitch period selection unit 3 selects the one pitch period to be used for compression/expansion from among the plurality of pitch period candidates obtained in the first stage.
Specifically, the input monaural signal M (a digital audio signal) is taken as the pitch period evaluation target signal X_i, and for its (2N + 1) samples X_i (i = 0 to 2N, i ≥ 1) the strength of periodicity is evaluated for each of the pitch period candidates obtained in the first stage; the candidate with the strongest periodicity is then taken as the one pitch period used for compression/expansion. The pitch period is evaluated in the same way as in the first stage.
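A minimal sketch of this second-stage selection follows, again assuming the absolute-difference evaluation value; `select_pitch_period` is a hypothetical name. The only difference from the first stage is that the evaluation runs on the unfiltered signal and only over the candidates handed in.

```python
def periodicity_score(x, j, n):
    """Sum over i = 0..n of |x[i] - x[i + j]|; smaller means more periodic."""
    return sum(abs(x[i] - x[i + j]) for i in range(n + 1))


def select_pitch_period(unfiltered, candidates, n):
    """Return the candidate lag that is most periodic in the unfiltered signal."""
    if len(unfiltered) < n + max(candidates) + 1:
        raise ValueError("signal too short for the largest candidate")
    return min(candidates, key=lambda j: periodicity_score(unfiltered, j, n))
```

As an illustration, take a mixture of one component repeating every 20 samples and another repeating every 30 samples. If the first stage, working on the filtered 20-sample component, proposes lags 20, 40, and 60, the second stage picks 60, the only candidate at which the whole mixture repeats.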
The one pitch period selected in this way corresponds both to the audio signal of the sound source extracted by the first filter (1), that is, the signal used to detect the plurality of pitch period candidates, and to the audio signals of the other sound sources mixed in the input monaural signal M (the pitch period detection signal).
The signal compression/expansion unit 4 then uses the one pitch period detected from the input monaural signal M by the two-stage processing to compress (expand) the time axis of the input monaural signal M at the desired compression (expansion) ratio, and outputs the compressed (expanded) audio signal M'. The PICOLA method described above is used for the compression/expansion.
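For orientation, a single PICOLA-style step can be sketched as below, following the description of PICOLA given earlier in this document: compression overlap-adds the omitted pitch period onto the next one with crossfade weights, and expansion inserts a period formed by crossfading the periods before and after it. The linear weight shape and all function names are assumptions, and the pointer control that sets the overall compression/expansion ratio is not shown.

```python
def crossfade(fade_out, fade_in):
    """Overlap-add two equal-length segments with linear weights (assumed shape)."""
    n = len(fade_out)
    return [fade_out[k] * (1 - k / n) + fade_in[k] * (k / n) for k in range(n)]


def _two_periods(x, start, period):
    """Return the two consecutive pitch periods A and B beginning at `start`."""
    a = x[start:start + period]
    b = x[start + period:start + 2 * period]
    if len(a) < period or len(b) < period:
        raise ValueError("need two full pitch periods from `start`")
    return a, b


def compress_step(x, start, period):
    """Replace periods A and B by one crossfaded period (drops `period` samples)."""
    a, b = _two_periods(x, start, period)
    return x[:start] + crossfade(a, b) + x[start + 2 * period:]


def expand_step(x, start, period):
    """Insert one crossfaded period between A and B (adds `period` samples)."""
    a, b = _two_periods(x, start, period)
    # The inserted period starts like B (what follows A) and ends like A
    # (what precedes B), so both joins stay continuous.
    return x[:start + period] + crossfade(b, a) + x[start + period:]
```

When the signal really does repeat every `period` samples, A and B are identical and both steps leave the waveform shape unchanged. A component whose period does not divide `period` is blended with a shifted copy of itself, which is the kind of distortion or attenuation the later figures illustrate.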
In the audio signal M' output after compression/expansion in this way, the clarity of each of the audio signals from the plurality of sound sources mixed in the input audio signal is preserved in a well-balanced manner, and the sound quality improves.

<Second Embodiment>
Next, an audio signal processing device Z2 according to a second embodiment of the present invention is described with reference to the block diagram of FIG. 2.
As shown in FIG. 2, the audio signal processing device Z2 is the audio signal processing device Z1 with a composite signal generation unit 5 added as a new component.
When the input audio signal has a plurality of channels, as with a stereo audio signal, detecting the pitch period and performing compression/expansion separately for each channel and then combining the results can produce a different pitch period for each channel. The compressed/expanded audio signal then has an inter-channel phase difference that differs from that of the original audio signal, which sounds unnatural to the listener.
An effective way to solve this problem is to use the same (common) pitch period for the compression/expansion of the audio signal in all channels.
In the audio signal processing device Z2, therefore, the composite signal generation unit 5 generates a composite audio signal (an example of a pitch period detection signal) from the multi-channel input stereo signal (an example of an input audio signal), and one pitch period is obtained from that composite audio signal through the same two-stage processing as in the audio signal processing device Z1.
The composite signal generation unit 5 may, for example, add the channel signals (L + R for two-channel stereo), or generate both the sum (L + R) and the difference (L − R) of the channel signals and use whichever has the greater power (amplitude) as the composite audio signal.
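The second option (choosing between L + R and L − R) could be sketched as follows. Measuring power as the sum of squared samples over the block, and preferring the sum on a tie, are assumptions; `composite_signal` is a hypothetical name.

```python
def composite_signal(left, right):
    """Return L + R or L - R, whichever has the greater power."""
    if len(left) != len(right):
        raise ValueError("channels must have the same length")
    total = [l + r for l, r in zip(left, right)]
    diff = [l - r for l, r in zip(left, right)]

    def power(signal):
        # Assumed power measure: sum of squared samples over the block.
        return sum(v * v for v in signal)

    # Assumed tie rule: prefer the sum when both have equal power.
    return total if power(total) >= power(diff) else diff
```

The difference signal matters when the two channels are close to anti-phase: L + R then nearly cancels and would leave little signal for pitch period detection, while L − R keeps it.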
The signal compression/expansion unit 4 then uses the one pitch period detected from the composite audio signal by the two-stage processing to compress (expand) the time axis of each of the two channel signals of the stereo signal (L, R) at the desired compression (expansion) ratio, and outputs the compressed (expanded) audio signals L' and R'. The PICOLA method described above is used for the compression/expansion.
Because all channel signals are compressed/expanded based on a single pitch period P obtained from the multi-channel input audio signal, both an increase in computational load and an inter-channel phase difference after compression/expansion that would sound unnatural to the listener are prevented. This configuration is also an example of an embodiment of the present invention.

<Third Embodiment>
Next, an audio signal processing device Z3 according to a third embodiment of the present invention is described with reference to the block diagram of FIG. 3.
As shown in FIG. 3, the audio signal processing device Z3 is the audio signal processing device Z1 with a second filter (6) added as a new component; the second filter (6) applies band-limiting filter processing (an example of second signal processing) to the input signal of the pitch period selection unit 3. The filter characteristic of the second filter (6) differs from that of the first filter (1).
Thus, a configuration is also possible in which the pitch period selection unit 3 (an example of pitch period selection means) in the second stage selects the one pitch period from among the plurality of pitch period candidates based on a signal obtained by applying to the input monaural signal M (an example of a pitch period detection signal) filter processing (an example of second signal processing) different from the signal processing of the first filter (1).
This second-stage filter processing (signal processing) makes it possible to choose freely which sound source signal is used for pitch period detection, for example by removing the audio signal of the sound source with the strongest periodicity or by extracting only the audio signal of a desired sound source, so that the sound quality of the compressed/expanded signal (M') can be adjusted as desired.
Of course, this configuration with signal processing in the second stage may also be applied to the audio signal processing device Z2 (the device for processing multi-channel input audio signals such as stereo audio signals).

<Fourth Embodiment>
Next, an audio signal processing device Z4 according to a fourth embodiment of the present invention is described with reference to the block diagram of FIG. 4.
As shown in FIG. 4, the audio signal processing device Z4 is the audio signal processing device Z1 with a pitch period candidate intermediate selection unit 20 added as a new component. At an intermediate stage between the detection of the plurality of pitch period candidates in the first stage and the selection of the one pitch period in the second stage, this unit further narrows down the plurality of pitch period candidates detected in the first stage.
The pitch period candidate intermediate selection unit 20 includes a plurality of third filters 11, 12, ..., 1N, each of which applies to the input monaural signal M (an example of a pitch period detection signal) filter processing (an example of third signal processing) different from the processing of the first filter (1) (the first signal processing), and a plurality of pitch period intermediate selection units 21, 22, ..., 2N, which successively narrow down the plurality of pitch period candidates detected by the pitch period detection unit 2 (pitch period candidate detection means) based on the respective filtered signals (an example of pitch period narrowing means).
Each of the third filters (11 to 1N) has a filter characteristic that extracts the audio signal of a sound source other than those targeted by pitch period detection (selection) in the first and second stages.
The pitch period intermediate selection units (21 to 2N) are connected in series. Each takes as its pitch period evaluation target signal X_i a signal obtained by applying to the input monaural signal M (an example of a pitch period detection signal) filter processing different from that of the first filter (1), and evaluates the strength of periodicity for each of the pitch period candidates output by the preceding pitch period intermediate selection unit (21 to 2(N−1)); for the first pitch period intermediate selection unit 21, these are the plurality of pitch period candidates detected by the pitch period detection unit 2. Based on the evaluation results, the candidates are successively narrowed down to a smaller number (still a plurality) of candidates. The narrowing (selection of a plurality of pitch periods) is done in the same way as the pitch period detection unit 2 selects the plurality of pitch period candidates from all pitch period candidates.
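The series-connected narrowing could be sketched as below. It assumes each intermediate unit keeps a fixed number of best candidates (one of the rules allowed in the first stage) and takes the already filtered signals as input; `narrow` and `cascade` are hypothetical names.

```python
def periodicity_score(x, j, n):
    """Sum over i = 0..n of |x[i] - x[i + j]|; smaller means more periodic."""
    return sum(abs(x[i] - x[i + j]) for i in range(n + 1))


def narrow(signal, candidates, keep, n):
    """Keep the `keep` candidates that are most periodic in `signal`."""
    # sorted() is stable, so tied candidates keep their incoming order.
    ranked = sorted(candidates, key=lambda j: periodicity_score(signal, j, n))
    return ranked[:keep]


def cascade(filtered_signals, candidates, keep_counts, n):
    """Pass the candidate list through each intermediate unit in turn."""
    for signal, keep in zip(filtered_signals, keep_counts):
        candidates = narrow(signal, candidates, keep, n)
    return candidates
```

Each unit sees a differently filtered version of the same input, so a candidate survives the chain only if it is reasonably periodic for every source the filters isolate.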
With this configuration, the intermediate stage between the first and second stages can select pitch periods that also suit the audio signal of a sound source different from those of the first and second stages, so that one pitch period can be obtained that corresponds in a well-balanced manner to each of the audio signals of three or more different sound sources.
The pitch period selection unit 3 (pitch period selection means) then selects the one pitch period from among the plurality of pitch period candidates narrowed down by the pitch period intermediate selection units (21 to 2N, an example of pitch period narrowing means).

Next, the effects of the present invention are described with reference to the audio waveforms shown in FIGS. 7 to 10. In all of the waveforms shown in FIGS. 7 to 10, the horizontal (time) axis spans 0.2 seconds and the sampling rate of each audio signal is 44100 Hz; the full range of pitch periods evaluated for periodicity during pitch period detection is 350 to 1400 samples (corresponding to all pitch period candidates N₀ to N described above).
FIG. 7(a) shows an example of the waveform of a simulated signal used for pitch period detection (corresponding to the input monaural signal M or the composite audio signal; an example of a pitch period detection signal; hereinafter, the original sound). FIGS. 7(b) and 7(c) show the waveforms of the two different audio signals contained in the original sound (hereinafter, original sound component 1 and original sound component 2). Original sound component 1 (b) is a 50 Hz sine wave, and original sound component 2 (c) is a 533 Hz sine wave.
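It may help to relate these frequencies to the 350 to 1400 sample search range. The short calculation below uses only the sampling rate, range, and frequencies given in the text; the helper names are hypothetical.

```python
import math

SAMPLE_RATE = 44100          # Hz, from the text
SEARCH_RANGE = (350, 1400)   # pitch period search range in samples, from the text


def period_in_samples(freq_hz):
    """Length of one cycle of a tone at `freq_hz`, in samples."""
    return SAMPLE_RATE / freq_hz


def whole_cycles_in_range(period, search_range=SEARCH_RANGE):
    """Cycle counts k for which k * period falls inside the search range."""
    lo, hi = search_range
    return list(range(math.ceil(lo / period), math.floor(hi / period) + 1))
```

One cycle of the 50 Hz component is 882 samples, which lies inside the range. One cycle of the 533 Hz component is about 82.7 samples, which is below the lower limit of 350, so any lag in the range that matches this component spans several whole cycles (between 5 and 16 of them).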
FIG. 8(a) shows the waveform of the signal obtained by applying 1.41× time-axis expansion to the original sound by the PICOLA method, using the pitch period obtained from original sound component 2 (equivalent to 533 Hz). FIGS. 8(b) and 8(c) show the expanded signals corresponding to original sound component 1 and original sound component 2, respectively, contained in the expanded signal of FIG. 8(a).
From the standpoint of preserving sound quality, the waveform of the expanded original sound in FIG. 8(a) should be close to the waveform of the original sound (FIG. 7(a)), apart from the stretched time axis. As FIG. 8(a) shows, however, expanding the time axis using the pitch period best suited to only one component, original sound component 2, yields a waveform that differs greatly from that of the original sound. This is because, as shown in FIG. 8(b), the expanded waveform corresponding to the other component, original sound component 1 (the low-frequency signal that was not considered at all in selecting the pitch period used for time-axis expansion), is heavily distorted.

FIG. 9(a), on the other hand, shows the waveform of the signal obtained by applying 1.41× time-axis expansion to the original sound by the PICOLA method, using the pitch period obtained from original sound component 1 (equivalent to 50 Hz). FIGS. 9(b) and 9(c) show the waveforms of the expanded signals corresponding to original sound component 1 and original sound component 2, respectively, contained in the expanded signal of FIG. 9(a).
In this case too, as in FIG. 8, the waveform of the expanded original sound in FIG. 9(a) differs greatly from that of the original sound. This is because, as shown in FIG. 9(c), the power (amplitude) of the expanded signal of original sound component 2, the high-frequency component that was not considered at all in selecting the pitch period used for time-axis expansion, is attenuated.
Such waveform differences (between the waveform of FIG. 7(a) and those of FIGS. 8(a) and 9(a)) are also perceived as a marked degradation in sound quality.

Next, an example in which pitch period detection and time-axis expansion were performed with the configuration of the audio signal processing device Z3 (FIG. 3) is described with reference to FIG. 10.
Here, the first filter (1) extracts only original sound component 2, and the second filter (6) extracts only original sound component 1.
For the detection of the plurality of pitch period candidates by the pitch period detection unit 2 in the first stage, the full range of pitch periods (all candidates, 350 to 1400 samples) was divided equally into four sections, and the candidate with the highest periodicity evaluation in each section (the smallest evaluation value of equation (1) above) was detected, giving four candidates.
For the selection of the one pitch period by the pitch period selection unit 3 in the second stage, the candidate with the highest periodicity evaluation (the smallest evaluation value of equation (1) above) among the four pitch period candidates was selected as the one pitch period.
FIG. 10(a) shows the waveform of the signal obtained by applying 1.41× time-axis expansion to the original sound (FIG. 7(a)) using the pitch period obtained by the audio signal processing device Z3 through the two-stage pitch period detection (selection) with signal processing (filter processing).
FIGS. 10(b) and 10(c) show the waveforms of the expanded signals corresponding to original sound component 1 and original sound component 2, respectively, contained in the expanded signal of FIG. 10(a).
As FIGS. 10(a) to 10(c) show, applying the present invention yields an output waveform without major degradation relative to the waveform of the original sound. This is because, in the first stage, the first filter (1) extracts original sound component 2 from the original sound and a plurality of pitch period candidates suited to original sound component 2 are detected, and from among those candidates the one pitch period best suited to original sound component 1 is selected; the selected pitch period therefore corresponds in a well-balanced manner to both original sound component 1 and original sound component 2. Furthermore, providing intermediate-stage processing between the first and second stages, as in the audio signal processing device Z4 (FIG. 4), makes it possible to select a pitch period that corresponds in a well-balanced manner to the audio signals of a larger number of sound sources.
An audio signal obtained by compressing/expanding the original sound with such a pitch period (for example, FIG. 10(a)) also shows less perceptible sound quality degradation than the compressed/expanded audio signals obtained with the results of conventional pitch period detection, such as those shown in FIGS. 8(a) and 9(a).

FIG. 11 shows (a) an example of the waveform of a music audio signal in which singing voice and instrument sounds are mixed, (b) a graph of the change over time of the pitch period detected for that signal by the conventional method, and (c) a graph of the change over time of the pitch period detected by the method of the present invention. In the graphs of FIGS. 11(b) and 11(c), the vertical axis is the detected pitch period (in samples) and the horizontal axis is time (expressed as a sample count at 44100 samples per second).
The sampling rate of the audio signal is 44100 Hz, and the full range of pitch periods evaluated for periodicity during pitch period detection is 350 to 1400 samples (corresponding to all pitch period candidates N₀ to N described above) for both the conventional method and the method of the present invention.
The method of the present invention (FIG. 11(c)) used the audio signal processing device Z1 (FIG. 1), with an IIR filter having cutoff frequencies of 200 Hz and 8 kHz and a slope of −12 dB/oct as the first filter (1).
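One way to obtain an IIR filter with these cutoffs and a −12 dB/oct slope is to cascade a second-order high-pass at 200 Hz with a second-order low-pass at 8 kHz. The sketch below does this with the widely used audio biquad formulas; the Butterworth Q of 1/√2, the cascade structure, and all function names are assumptions, since the text gives only the cutoff frequencies, the slope, and the filter type.

```python
import math


def biquad_coeffs(kind, cutoff_hz, sample_rate, q=1 / math.sqrt(2)):
    """Normalized second-order coefficients (b, a) for a low-pass or high-pass."""
    w0 = 2 * math.pi * cutoff_hz / sample_rate
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / (2 * q)
    if kind == "highpass":
        b = [(1 + cos_w0) / 2, -(1 + cos_w0), (1 + cos_w0) / 2]
    elif kind == "lowpass":
        b = [(1 - cos_w0) / 2, 1 - cos_w0, (1 - cos_w0) / 2]
    else:
        raise ValueError("kind must be 'highpass' or 'lowpass'")
    a = [1 + alpha, -2 * cos_w0, 1 - alpha]
    return [v / a[0] for v in b], [v / a[0] for v in a]


def biquad(x, b, a):
    """Run one second-order section over x (direct form I)."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for v in x:
        out = b[0] * v + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, v
        y2, y1 = y1, out
        y.append(out)
    return y


def first_filter(x, sample_rate=44100):
    """Assumed realization: 200 Hz high-pass followed by 8 kHz low-pass."""
    hp = biquad(x, *biquad_coeffs("highpass", 200.0, sample_rate))
    return biquad(hp, *biquad_coeffs("lowpass", 8000.0, sample_rate))
```

A second-order section rolls off at 12 dB per octave, which is why one section per band edge matches the stated slope. Tones well inside the band pass almost unchanged, while a 50 Hz tone is strongly attenuated.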
For the detection of the plurality of pitch period candidates by the pitch period detection unit 2 in the first stage, the full range of pitch periods (all candidates, 350 to 1400 samples) was divided equally into four sections, and the candidate with the highest periodicity evaluation in each section (the smallest evaluation value of equation (1) above) was detected, giving four candidates.
For the selection of the one pitch period by the pitch period selection unit 3 in the second stage, the candidate with the highest periodicity evaluation (the smallest evaluation value of equation (1) above) among the four pitch period candidates was selected as the one pitch period.
With the conventional method (FIG. 11(b)), many of the detected pitch periods are scattered near the upper and lower limits of the range and the variation is large, whereas with the method of the present invention (FIG. 11(c)) the variation is smaller and pitch periods are extracted relatively smoothly and continuously around a pitch period of about 700 samples.
This shows that the filter (audio band limitation) leads to the detection (selection) of a pitch period that suits the singing voice and is also natural for the music audio signal as a whole, and it indicates objectively that the method of the present invention gives better perceived sound quality than the conventional method.

In the embodiments described above, band-limiting filter processing, which can separate the audio signals of a plurality of sound sources with a relatively light computational load, was applied as the signal processing in the first stage.
Other possible signal processing for the first stage includes signal processing that amplifies or attenuates the signal in a specific frequency band by frequency emphasis through equalizing, and signal processing that separates the audio signals of the plurality of sound sources contained in the input audio signal by blind source separation (the BSS method).
In frequency emphasis by equalizing, for example, an amplification gain is set for the frequency band of a specific sound source, and the amplitude of the audio signal in that band is amplified (emphasized) by applying FIR filter processing or the like. A pitch period suited to the audio signal of the specific sound source can thereby be obtained.
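As a rough illustration of band emphasis with an FIR filter, the sketch below adds a scaled band-pass copy of the signal to the signal itself, so that the chosen band is amplified by roughly `gain` while other frequencies pass at roughly unity. The windowed-sinc design, the Hamming window, the tap count, and the function names are all assumptions; the text says only that FIR filter processing or the like is applied.

```python
import math


def bandpass_kernel(low_hz, high_hz, sample_rate, taps=255):
    """Hamming-windowed sinc band-pass kernel (odd `taps`, linear phase)."""
    center = (taps - 1) // 2

    def lowpass_tap(fc, n):
        t = n - center
        ideal = 2 * fc if t == 0 else math.sin(2 * math.pi * fc * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (taps - 1))
        return ideal * window

    lo, hi = low_hz / sample_rate, high_hz / sample_rate
    # Band-pass = low-pass at the upper edge minus low-pass at the lower edge.
    return [lowpass_tap(hi, n) - lowpass_tap(lo, n) for n in range(taps)]


def emphasize_band(x, low_hz, high_hz, gain, sample_rate=44100, taps=255):
    """Amplify the band [low_hz, high_hz] by about `gain`; leave the rest near 1."""
    center = (taps - 1) // 2
    kernel = [(gain - 1.0) * h
              for h in bandpass_kernel(low_hz, high_hz, sample_rate, taps)]
    kernel[center] += 1.0  # identity path, so out-of-band content is kept
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, h in enumerate(kernel):
            idx = n + center - k  # centered convolution: no added delay
            if 0 <= idx < len(x):
                acc += h * x[idx]
        y.append(acc)
    return y
```

Unlike the band-limiting filter, this keeps the other sources in the signal and only shifts the balance toward the chosen band, which biases the periodicity evaluation toward that source without discarding the rest.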
Sound source separation based on the BSS method is effective in that the sound sources are separated automatically, and the audio signal of each sound source is obtained, without the frequency band of each sound source having to be specified in advance. The computational load, however, is larger.

The present invention can be used for audio signal processing that performs time-axis compression/expansion of audio signals.

FIG. 1: Block diagram showing the schematic configuration of the audio signal processing apparatus Z1 according to the first embodiment of the present invention.
FIG. 2: Block diagram showing the schematic configuration of the audio signal processing apparatus Z2 according to the second embodiment of the present invention.
FIG. 3: Block diagram showing the schematic configuration of the audio signal processing apparatus Z3 according to the third embodiment of the present invention.
FIG. 4: Block diagram showing the schematic configuration of the audio signal processing apparatus Z4 according to the fourth embodiment of the present invention.
FIG. 5: Diagram schematically showing the waveform of an audio signal when time-axis compression is performed by the PICOLA method.
FIG. 6: Diagram schematically showing the waveform of an audio signal when time-axis expansion is performed by the PICOLA method.
FIG. 7: Diagram showing an example of the waveform of an audio signal (original sound) used for time-axis compression and expansion processing.
FIG. 8: Diagram showing an example of the signal waveform after time-axis expansion of the audio signal (original sound) shown in FIG. 7 by a conventional method.
FIG. 9: Diagram showing an example of the signal waveform after time-axis expansion of the audio signal (original sound) shown in FIG. 7 by a conventional method.
FIG. 10: Diagram showing an example of the signal waveform after time-axis expansion of the audio signal (original sound) shown in FIG. 7 by the method of the present invention.
FIG. 11: Graph showing an example of the waveform of a music audio signal and the change over time of the pitch period detected for that signal by the conventional method and by the method of the present invention.

Explanation of symbols

Z1 to Z4 ... audio signal processing apparatus
1, 11 to 1N, 6 ... filter
2 ... pitch period detection unit
3 ... pitch period selection unit
4 ... signal compression/expansion unit
5 ... synthesized signal generation unit
20 ... pitch period candidate intermediate selection unit
21 to 2N ... pitch period intermediate selection unit

Claims

1. An audio signal processing apparatus comprising:
first signal processing means for generating an audio signal by separating an audio signal in a first predetermined frequency band from one input audio signal in which sounds from a plurality of sound sources are mixed, or from a synthesized audio signal of input audio signals of a plurality of channels;
second signal processing means for generating an audio signal by separating an audio signal in a second predetermined frequency band, different from the first predetermined frequency band, from the one input audio signal in which sounds from a plurality of sound sources are mixed, or from the synthesized audio signal of the input audio signals of the plurality of channels;
first pitch period candidate detecting means for detecting pitch period candidates, each being a frequency component constituting a specific audio signal included in the first predetermined frequency band separated by the first signal processing means, by sampling the input audio signal or the synthesized audio signal at a predetermined first pitch-period sampling period and detecting, as one or a plurality of first pitch period candidates, periods in ascending order of the difference in signal intensity between the sampled signal of one period and the sampled signal of another period;
second pitch period candidate detecting means for sampling a specific audio signal included in the second predetermined frequency band separated by the second signal processing means at a predetermined second pitch-period sampling period different from the first pitch-period sampling period, and detecting, as one or a plurality of second pitch period candidates, periods in ascending order of the difference in signal intensity between the sampled signal of one period and the sampled signal of another period;
pitch period selecting means for selecting the one pitch period having the strongest periodicity from among the candidates common to the one or more first pitch period candidates and the one or more second pitch period candidates; and
time axis adjusting means for compressing and/or expanding the time axis of the input audio signal based on the one pitch period selected by the pitch period selecting means.

2. The audio signal processing apparatus according to claim 1, further comprising:
third signal processing means for separating audio signals in one or more third predetermined frequency bands, different from the first predetermined frequency band and the second predetermined frequency band, from one input audio signal from a plurality of sound sources or from a synthesized audio signal of input audio signals of a plurality of channels; and
pitch period narrowing means for sampling the audio signal in the third predetermined frequency band separated by the third signal processing means at a predetermined fourth pitch-period sampling period, detecting, as a plurality of third pitch period candidates, periods having strong periodicity in ascending order of the difference in signal intensity between the sampled signal of one period and the sampled signal of another period, and narrowing down the pitch period candidates by selecting a plurality of periods having strong periodicity that correspond to both the plurality of first candidates and the plurality of third candidates,
wherein the pitch period selecting means selects one pitch period from the candidates narrowed down by the pitch period narrowing means.

3. The audio signal processing apparatus according to claim 1 or 2, wherein the first signal processing separates the audio signal in the first predetermined frequency band by a band-limiting filter.

4. The audio signal processing apparatus wherein the second signal processing and/or the third signal processing separates, by a band-limiting filter, an audio signal in a frequency band different from the first predetermined frequency band.

5. An audio signal processing method comprising:
a first signal processing step of generating an audio signal by separating an audio signal in a first predetermined frequency band from one input audio signal in which sounds from a plurality of sound sources are mixed, or from a synthesized audio signal of input audio signals of a plurality of channels;
a second signal processing step of generating an audio signal by separating an audio signal in a second predetermined frequency band, different from the first predetermined frequency band, from the one input audio signal in which sounds from a plurality of sound sources are mixed, or from the synthesized audio signal of the input audio signals of the plurality of channels;
a first pitch period candidate detecting step of detecting pitch period candidates, each being a frequency component constituting a specific audio signal included in the first predetermined frequency band separated by the first signal processing step, by sampling the input audio signal or the synthesized audio signal at a predetermined first pitch-period sampling period and detecting, as one or a plurality of first pitch period candidates, periods in ascending order of the difference in signal intensity between the sampled signal of one period and the sampled signal of another period;
a second pitch period candidate detecting step of sampling a specific audio signal included in the second predetermined frequency band separated by the second signal processing step at a predetermined second pitch-period sampling period different from the first pitch-period sampling period, and detecting, as one or a plurality of second pitch period candidates, periods in ascending order of the difference in signal intensity between the sampled signal of one period and the sampled signal of another period;
a pitch period selecting step of selecting the one pitch period having the strongest periodicity from among the candidates common to the one or more first pitch period candidates and the one or more second pitch period candidates; and
a time axis adjusting step of compressing and/or expanding the time axis of the input audio signal based on the one pitch period selected by the pitch period selecting step.