JP5874341B2

JP5874341B2 - Audio signal processing apparatus and program

Info

Publication number: JP5874341B2
Application number: JP2011252652A
Authority: JP
Inventors: 一浩片桐
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2011-11-18
Filing date: 2011-11-18
Publication date: 2016-03-02
Anticipated expiration: 2031-11-18
Also published as: JP2013109078A

Description

The present invention relates to an audio signal processing apparatus and program, and can be applied, for example, to a speech rate conversion apparatus that changes the speed of speech while preserving the speaker's voice quality.

Among audio signal processing techniques that change the speed of speech while preserving the speaker's voice quality (speech rate conversion processing), there is a conventional technique called PICOLA (Pointer Interval Control Overlap and Add), which exploits the periodicity contained in speech to expand or compress it on the time axis (see Non-Patent Document 1). PICOLA takes a pitch (a waveform of the speech signal spanning one fundamental period) extracted from the audio data, generates from it a speech waveform that joins smoothly with the pitch and with the waveform that follows it, inserts the generated waveform, and then outputs a predetermined length of data, thereby compressing or expanding the original speech. The conventional PICOLA algorithm is described below for the case where the speech rate is slowed down (the speech is expanded).

PICOLA expands speech to an arbitrary length by using the pitch, a periodic waveform that appears repeatedly in speech. The pitch is extracted, using an autocorrelation function or the like, from the audio data of a period over which the pitch is searched for (hereinafter referred to as the “pitch search range Tp”). Using the extracted pitch, as shown in FIG. 12, waveform A of the pitch W is faded in and the following waveform B, which has the same length as waveform A, is faded out, giving waveforms A' and B'. Adding (synthesizing) waveforms A' and B' yields an expansion waveform C that connects smoothly with waveforms A and B. Inserting waveform C between waveform A and waveform B expands the speech waveform. Because the inserted waveform C has the same length as the pitch W (hereinafter also referred to as “Tw”), if the length of the original audio signal is L, the length after expansion is L + Tw. If the desired expansion rate is Rs, the length after expansion must be Rs times the original length, so Rs = (L + Tw) / L. The length of original signal needed to satisfy the expansion rate Rs is therefore L = Tw / (Rs − 1). For example, if the expansion rate Rs is 1.25 and the pitch length Tw is 100 samples, then L = 400 samples. Outputting the 400 original samples together with the 100 generated samples gives (500 output samples) / (400 original samples), which achieves the expansion rate of 1.25.
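The waveform insertion and the length relation L = Tw / (Rs − 1) described above can be sketched as follows. This is a minimal illustration, not the implementation of the conventional apparatus: the function names are hypothetical, and the linear fade ramps are an assumption (the text only says that A is faded in and B is faded out).

```python
def required_source_length(tw: int, rs: float) -> int:
    """Original-signal length L (in samples) such that (L + Tw) / L == Rs."""
    if rs <= 1.0:
        raise ValueError("the expansion rate Rs must be greater than 1")
    return round(tw / (rs - 1.0))


def make_insert_waveform(a, b):
    """Build waveform C by adding a faded-in A to a faded-out B.

    Linear ramps are an assumption made for this sketch.
    """
    if len(a) != len(b):
        raise ValueError("A and B must have the same length")
    n = len(a)
    c = []
    for i in range(n):
        fade_in = i / n           # weight of A, rising from 0 towards 1
        fade_out = 1.0 - fade_in  # weight of B, falling from 1 towards 0
        c.append(fade_in * a[i] + fade_out * b[i])
    return c


def expand_once(signal, tw):
    """Insert C between A (the first Tw samples) and B (the next Tw samples)."""
    if len(signal) < 2 * tw:
        raise ValueError("need at least 2 * Tw samples")
    a = signal[:tw]
    b = signal[tw:2 * tw]
    c = make_insert_waveform(a, b)
    # Output order: A, then the inserted C, then B and everything after it.
    return a + c + signal[tw:]
```

With Tw = 100 and Rs = 1.25, `required_source_length` gives the 400 samples of the worked example, and `expand_once` lengthens any input by exactly Tw samples.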

FIG. 13 shows the cycle of expansion processing by PICOLA. First, data for the pitch search range Tp is acquired, the pitch W_1 is extracted, a speech waveform is generated from the pitch, and the amount of audio data needed for the desired expansion rate is output. The last point P_1 up to which audio data was output becomes the starting point of the next round of processing. Once audio data amounting to at least the pitch search range Tp, counted from P_1, has been acquired, the pitch is extracted again and the expansion processing is executed. By repeating this processing, the speech rate can be converted to an arbitrary speed.

Usually, when a preset number of samples n is treated as one frame and processing and output are performed frame by frame, executing speech rate conversion in real time requires that the processing of one frame of data be complete every time one frame's worth of time elapses. Hereinafter, a frame that the speech rate conversion apparatus outputs after converting audio data in real time is referred to as an “output frame”. A speech rate conversion apparatus that processes in real time on a frame basis must hold one output frame of processed audio data in a buffer (hereinafter referred to as the “output buffer”) every time one output frame's worth of time elapses. If the output buffer does not hold one output frame of processed audio data when the time to output an output frame arrives, the apparatus cannot output the frame, which interrupts the audio signal on the receiving side of the output frame and can cause abnormalities in subsequent processing.

In a speech rate conversion apparatus that uses PICOLA as its speech rate conversion algorithm (when slowing the speech rate), the situations in which the output buffer cannot hold one output frame of processed audio data by the time an output frame must be output fall broadly into the following two conditions.

The first condition arises when the conventional speech rate conversion apparatus performs expansion processing after extracting the pitch from unprocessed audio data. Specifically, it is the case where the time to output the next output frame arrives while the unprocessed audio data buffered so far is shorter than the pitch search range Tp and, in addition, the audio data remaining in the output buffer (audio data after expansion processing) amounts to less than one output frame.

The second condition arises when the conventional speech rate conversion apparatus extracts a pitch candidate from unprocessed audio data but, based on its pitch determination criterion, does not perform expansion processing. An input audio signal does not always contain a speech pitch, and natural-sounding speech rate conversion requires expanding vowels and voiced consonants while avoiding unvoiced consonants and non-speech sections. The conventional apparatus therefore has to determine whether the pitch candidate obtained from the pitch search is suitable as a pitch for expansion processing. In the conventional apparatus, for example, when a predetermined range of audio data from the head is extracted as a pitch candidate and that candidate is judged unsuitable for expansion processing, the audio data of the candidate is output as is (held in the output buffer) without being expanded. In a speech rate conversion apparatus that outputs audio data frame by frame, the audio data of a pitch candidate may be shorter than an output frame. In that case, if the time to output the next frame arrives while the output buffer holds only the audio data of a pitch candidate that was not used for expansion processing, the output frame cannot be output normally.
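Both conditions come down to the output buffer holding less than one output frame when the output time arrives. The sketch below only illustrates that arithmetic; the function name and all sample counts are hypothetical and do not come from the source.

```python
def output_shortfall(output_buffered: int, newly_output: int, frame_len: int) -> int:
    """Samples missing from the next output frame (0 means the frame can be sent)."""
    return max(0, frame_len - (output_buffered + newly_output))


# Condition 1: the input is still shorter than Tp, so nothing new was processed,
# and only 100 processed samples remain for a 160-sample output frame.
print(output_shortfall(output_buffered=100, newly_output=0, frame_len=160))

# Condition 2: a 50-sample pitch candidate was rejected and passed through
# unexpanded into an otherwise empty output buffer.
print(output_shortfall(output_buffered=0, newly_output=50, frame_len=160))
```

A non-zero result in either case corresponds to an output frame that cannot be sent on time.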

Patent Document 1 describes a solution to the above problem of the conventional speech rate conversion apparatus. In the apparatus described in Patent Document 1, after expansion processing by PICOLA has finished, the pitch information from the previous processing is reused and expansion processing is executed once more within the same frame, even if the buffer does not hold data for the pitch search range Tp. Patent Document 1 states that this avoids the occurrence of frames with no output data.

Patent Document 1: JP 2006-3517 A

Non-Patent Document 1: Naotaka Morita and Fumitada Itakura, “Time-scale expansion and compression of speech using the pointer interval controlled overlap and add method (PICOLA) and its evaluation”, Proceedings of the Acoustical Society of Japan, pp. 149-150, October 1986

However, the technique described in Patent Document 1 has the problem that speech rate conversion cannot be made to operate stably in every environment.

Specifically, the technique described in Patent Document 1 has the following three main drawbacks.

First, because the technique described in Patent Document 1 requires at least one round of expansion processing, it cannot deal at all with the case where no expansion processing is performed after pitch candidate extraction.

Second, when expansion processing is performed again within the same frame, there is no guarantee that enough data for expansion processing remains in the output buffer. That is, the technique described in Patent Document 1 cannot be applied unless the output buffer still holds at least as much data as was output by the first round of expansion processing.

Third, the second expansion within the same frame uses pitch information from a portion that is not directly related to the data being expanded, so a portion that is not a pitch is expanded and the sound quality deteriorates.

In view of the above problems, there is a need for an audio signal processing apparatus and program that can stably output audio data after speech rate conversion processing when performing speech rate conversion processing on the audio data of an audio signal in real time.

The audio signal processing apparatus of the first aspect of the present invention comprises: (1) input buffer means for accumulating audio data of an input audio signal; (2) speech rate conversion means for extracting a fundamental period from a search period's worth of the audio signal waveform based on the audio data accumulated in the input buffer means, and performing speech rate conversion processing on the audio data accumulated in the input buffer means using the audio signal waveform of the extracted fundamental period; (3) output buffer means for accumulating the audio data after speech rate conversion processing by the speech rate conversion means; (4) audio data output means for outputting, at each output interval, an output audio data frame containing an output interval's worth of the audio data accumulated in the output buffer means; and (5) conversion processing control means for causing the speech rate conversion means to start speech rate conversion processing after the input buffer means has accumulated audio data for at least a minimum accumulation period, the minimum accumulation period being obtained by adding the search period and the output interval and subtracting one sampling period of the audio data of the output buffer means.

The audio signal processing program of the second aspect of the present invention causes a computer to function as: (1) input buffer means for accumulating audio data of an input audio signal; (2) speech rate conversion means for extracting a fundamental period from a search period's worth of the audio signal waveform based on the audio data accumulated in the input buffer means, and performing speech rate conversion processing on the audio data accumulated in the input buffer means using the audio signal waveform of the extracted fundamental period; (3) output buffer means for accumulating the audio data after speech rate conversion processing by the speech rate conversion means; (4) audio data output means for outputting, at each output interval, an output audio data frame containing an output interval's worth of the audio data accumulated in the output buffer means; and (5) conversion processing control means for causing the speech rate conversion means to start speech rate conversion processing after the input buffer means has accumulated audio data for at least a minimum accumulation period, the minimum accumulation period being obtained by adding the search period and the output interval and subtracting one sampling period of the audio data of the output buffer means.

According to the present invention, it is possible to provide an audio signal processing apparatus capable of stably outputting audio data after speech rate conversion processing when performing speech rate conversion processing on the audio data of an audio signal in real time.

FIG. 1 is a block diagram showing the functional configuration of the speech rate conversion apparatus according to the first embodiment.
FIG. 2 is an explanatory diagram showing an example of pitch extraction processing in the speech rate conversion apparatus according to the first embodiment.
FIG. 3 is a flowchart showing the processing for outputting an output frame in the speech rate conversion apparatus according to the first embodiment.
FIG. 4 is a flowchart (part 1) showing the operation from acquisition of an input frame to supply of audio data to the output buffer in the speech rate conversion apparatus according to the first embodiment.
FIG. 5 is a flowchart (part 2) showing the operation from acquisition of an input frame to supply of audio data to the output buffer in the speech rate conversion apparatus according to the first embodiment.
FIG. 6 is a timing chart showing a first example of operation when the minimum delay amount is set equal to the pitch search range in the speech rate conversion apparatus according to the first embodiment.
FIG. 7 is a timing chart showing a second example of operation when the minimum delay amount is set equal to the pitch search range in the speech rate conversion apparatus according to the first embodiment.
FIG. 8 is a timing chart showing a first example of operation when the minimum delay amount is set longer than the pitch search range in the speech rate conversion apparatus according to the first embodiment.
FIG. 9 is a timing chart showing a second example of operation when the minimum delay amount is set longer than the pitch search range in the speech rate conversion apparatus according to the first embodiment.
FIG. 10 is a block diagram showing the functional configuration of the speech rate conversion apparatus according to the second embodiment.
FIG. 11 is a flowchart showing the operation from acquisition of an input frame to supply of audio data to the output buffer in the speech rate conversion apparatus according to the second embodiment.
FIG. 12 is an explanatory diagram showing waveform expansion by PICOLA as performed in a conventional speech rate conversion apparatus.
FIG. 13 is a diagram for explaining the cycle of expansion processing by PICOLA as performed in a conventional speech rate conversion apparatus.

(A) First Embodiment
Hereinafter, a first embodiment of the audio signal processing apparatus and program according to the present invention will be described in detail with reference to the drawings. The audio signal processing apparatus of the first embodiment is a speech rate conversion apparatus.

(A-1) Configuration of the First Embodiment
FIG. 1 is a block diagram showing the functional configuration of the speech rate conversion apparatus 100 of this embodiment.

The speech rate conversion apparatus 100 includes a data input unit 101, a speech rate conversion processing buffer 102, a delay control unit 103, a speech rate control unit 104, a pitch extraction unit 105, a PICOLA processing unit 106, an output buffer 107, and a data output unit 108.

The speech rate conversion apparatus 100 may be constructed, for example, by installing the audio signal processing program and the like of the embodiment on an information processing apparatus having a program execution configuration such as a CPU, ROM, RAM, EEPROM, and hard disk (not limited to a single apparatus; a plurality of apparatuses capable of distributed processing may be used). Even in that case, its functions can be represented as shown in FIG. 1.

The data input unit 101 generates, from the input audio signal (acoustic signal), audio data sampled at a predetermined sample interval Ts (for example, audio data in PCM (Pulse Code Modulation) format), divides it into frames of n samples each, and supplies the frames to the speech rate conversion processing buffer 102. Hereinafter, a frame generated by the data input unit 101 is referred to as an “input frame”.

In this embodiment, the data input unit 101 is described as encoding the input audio signal to generate input frames and supplying them to the speech rate conversion processing buffer 102, as stated above. However, the way the data input unit 101 holds and supplies the audio data to the speech rate conversion processing buffer 102 is not limited to this. For example, the data input unit 101 may supply audio data to the speech rate conversion processing buffer 102 one sample at a time rather than one input frame at a time. The data input unit 101 may also, for example, obtain already-encoded audio data from an external device and hold it (the amount of data held at one time is not limited).

The speech rate conversion processing buffer 102 holds the input frames supplied from the data input unit 101 and, under the control of the delay control unit 103, the speech rate control unit 104, the pitch extraction unit 105, and so on, performs processing such as supplying the audio data of the held input frames to the subsequent processing stage (the PICOLA processing unit 106).

The delay control unit 103 controls the operation of the other processing components (the speech rate conversion processing buffer 102, the pitch extraction unit 105, and so on) according to the amount of audio data held in the speech rate conversion processing buffer 102. Immediately after the speech rate conversion apparatus 100 starts processing, for example, the delay control unit 103 checks whether the speech rate conversion processing buffer 102 holds audio data for at least the minimum delay amount Td (minimum accumulation period). When it confirms that the buffer holds audio data for a period equal to or longer than the minimum delay amount Td, the delay control unit 103 controls the other processing components so that expansion processing by the PICOLA processing unit 106 and the like is started.

The minimum delay amount Td must be at least as long as the pitch search range Tp; in this embodiment, it is calculated by equation (1) below. Separately, as described in detail later, the speech rate control unit 104 calculates, from the expansion rate Rs set at the start, the data length (the period corresponding to the audio data) needed to achieve that expansion rate.

Minimum delay amount Td = pitch search range Tp + period of one output frame − period of one sample … (1)

The pitch extraction unit 105 searches the audio data held in the speech rate conversion processing buffer 102 for a pitch. When the speech rate conversion processing buffer 102 holds audio data for the pitch search range Tp, the pitch extraction unit 105 searches for (extracts) a pitch candidate in that audio data.
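Equation (1) and the start condition checked by the delay control unit 103 can be expressed in sample counts as in the sketch below. The function names and the example values (a 120-sample pitch search range and a 160-sample output frame) are assumptions made for illustration only.

```python
def minimum_delay_samples(tp: int, frame_len: int) -> int:
    """Equation (1) in samples: Td = Tp + one output frame - one sample."""
    return tp + frame_len - 1


def may_start_conversion(buffered: int, tp: int, frame_len: int) -> bool:
    """True once the buffer holds at least Td samples of audio data."""
    return buffered >= minimum_delay_samples(tp, frame_len)


# Hypothetical values: Tp = 120 samples, output frame = 160 samples.
print(minimum_delay_samples(120, 160))      # 279
print(may_start_conversion(278, 120, 160))  # False
print(may_start_conversion(279, 120, 160))  # True
```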

For example, the pitch extraction unit 105 takes, as the pitch candidate, the section with the strongest periodicity within the pitch search range Tp of the audio data held in the speech rate conversion processing buffer 102, and also determines whether that pitch candidate is suitable for use in speech rate conversion processing by the PICOLA processing unit 106. To simplify the description, the pitch candidate extracted by the pitch extraction unit 105 is assumed here to be audio data starting from the head (the head in time series) of the audio data held in the speech rate conversion processing buffer 102.

The pitch extraction unit 105 then determines whether the audio data of the extracted pitch candidate (a predetermined range of audio data from the time-series head of the speech rate conversion processing buffer 102) is suitable for use in speech rate conversion processing. If it is judged suitable, the pitch extraction unit 105 controls the speech rate conversion processing buffer 102 so that it supplies the PICOLA processing unit 106 with the audio data of the extracted pitch and with the audio data that follows it, of the length needed for speech rate conversion processing (expansion processing) (hereinafter referred to as the “expansion processing data period Tg”).

The length of the expansion processing data period Tg is determined by the speech rate control unit 104 according to the expansion rate Rs.

If, on the other hand, the audio data of the extracted pitch candidate is judged unsuitable for use in speech rate conversion processing, the pitch extraction unit 105 controls the speech rate conversion processing buffer 102 so that it supplies the audio data of that pitch candidate to the output buffer 107.

The pitch candidate extraction processing by the pitch extraction unit 105 may use, for example, a degree of dissimilarity or an autocorrelation coefficient. An example in which the pitch extraction unit 105 extracts a pitch candidate using the degree of dissimilarity is described below.

For example, as shown in FIG. 2, the pitch extraction unit 105 takes the audio data at the time-series head of the speech rate conversion processing buffer 102 (the “pitch candidate section” in FIG. 2), calculates its degree of dissimilarity with the adjacent data section (the “comparison target section” in FIG. 2) while shifting one sample at a time, and takes the section with the smallest dissimilarity as the pitch candidate. The pitch extraction unit 105 may, for example, use equation (2) to calculate the dissimilarity between a provisional pitch candidate section and the comparison target section. In equation (2), f denotes the provisional pitch candidate section shown in FIG. 2, and f_i denotes the i-th sample value from the head of f. Likewise, g denotes the comparison target section shown in FIG. 2, and g_i denotes the i-th sample value from the head of g.

The number of times the pitch extraction unit 105 performs the pitch search (the number of times the dissimilarity is calculated) is not limited. For example, the dissimilarity may be calculated up to a predetermined maximum number of times, and the section with the smallest dissimilarity extracted as the pitch candidate section. The pitch extraction unit 105 may then judge the pitch candidate suitable for use in expansion processing when its dissimilarity is less than a predetermined threshold.
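The search and the threshold test described above can be sketched as follows. Equation (2) is not reproduced in this text, so the mean squared difference used here is only an assumed stand-in for the dissimilarity measure; the function names, the length limits, and the threshold are likewise hypothetical.

```python
def dissimilarity(f, g):
    """Mean squared difference between two equally long sections.

    Assumed measure for this sketch; it is not claimed to be equation (2).
    """
    return sum((fi - gi) ** 2 for fi, gi in zip(f, g)) / len(f)


def find_pitch_candidate(buf, min_len, max_len, threshold):
    """Return (length, score, usable) for the best candidate at the buffer head."""
    best_len, best_score = None, float("inf")
    for length in range(min_len, max_len + 1):
        if 2 * length > len(buf):
            break  # not enough data for the candidate and its neighbour
        f = buf[:length]            # provisional pitch candidate section
        g = buf[length:2 * length]  # adjacent comparison target section
        score = dissimilarity(f, g)
        if score < best_score:      # strict '<' keeps the shortest best match
            best_len, best_score = length, score
    usable = best_len is not None and best_score < threshold
    return best_len, best_score, usable


# A signal that repeats every 5 samples yields a 5-sample candidate.
periodic = [i % 5 for i in range(40)]
print(find_pitch_candidate(periodic, 3, 12, threshold=0.1))  # (5, 0.0, True)
```

When `usable` is false, the candidate would be passed to the output buffer unexpanded, as described for unsuitable pitch candidates.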

The pitch search range Tp is preferably set to a range sufficient for finding the pitch of a human voice; for example, at a sampling frequency of 8 kHz it may be 20 to 120 samples.
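As a rough check of this range, and assuming the 20 to 120 samples are read as candidate pitch period lengths, the corresponding fundamental frequencies follow from dividing the sampling frequency by the period length. The function name below is hypothetical.

```python
def period_range_to_f0(fs_hz: float, min_period: int, max_period: int):
    """Convert a pitch period range in samples to (lowest, highest) frequency in Hz."""
    return fs_hz / max_period, fs_hz / min_period


low, high = period_range_to_f0(8000, 20, 120)
print(round(low, 1), high)  # 66.7 400.0
```

Under that reading, the example range spans roughly 67 Hz to 400 Hz.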

The PICOLA processing unit 106 generates, based on the audio data (speech waveform) of the pitch W extracted by the pitch extraction unit 105, the audio data used to expand the audio data held in the speech rate conversion processing buffer 102. The PICOLA processing unit 106 may, for example, generate the audio data used for expansion processing by cross-fading, as in the conventional technique.

The PICOLA processing unit 106 then queries the speech rate control unit 104 and obtains the expansion processing data period Tg corresponding to the length of the waveform of the pitch W (its length on the time axis, which equals the length of the waveform of the generated audio data).

そして、ＰＩＣＯＬＡ処理部１０６は、伸長処理用データ期間Ｔｇを取得すると、話速変換処理バッファ１０２で保持された音声データのうち、ピッチＷに続く伸長処理用データ期間Ｔｇ分の音声データを取得する。そして、ＰＩＣＯＬＡ処理部１０６は、取得したピッチＷの音声データと、伸長処理用データ期間Ｔｇの音声データとの間に、生成した音声データを挿入した音声データを、出力バッファ１０７に供給する。 Then, when the PICOLA processing unit 106 acquires the expansion processing data period Tg, the PICOLA processing unit 106 acquires audio data for the expansion processing data period Tg following the pitch W among the audio data held in the speech speed conversion processing buffer 102. . Then, the PICOLA processing unit 106 supplies the output buffer 107 with audio data in which the generated audio data is inserted between the acquired audio data of the pitch W and the audio data of the decompression processing data period Tg.

When the speech speed conversion processing buffer 102 does not hold enough audio data to cover the expansion processing data period Tg, the PICOLA processing unit 106 reports the length of the shortfall (hereinafter referred to as the "unprocessed data period Tu") to the speech speed control unit 104.

話速制御部１０４は、基準となる伸長率Ｒｓのパラメータを保持し、保持した伸長率Ｒｓにもとづいて、ＰＩＣＯＬＡ処理部１０６からの問い合わせに応じて、伸長率Ｒｓを満たすための伸長処理用データ期間Ｔｇの長さを算出して、返答する。話速制御部１０４が、伸長率Ｒｓを保持する方法は限定されないものであるが、例えば、予め設定しておくようにしても良いし、ユーザの操作に応じて変更可能な構成としても良い。 The speech speed control unit 104 holds the parameter of the reference expansion rate Rs, and the expansion processing data for satisfying the expansion rate Rs according to the inquiry from the PICOLA processing unit 106 based on the stored expansion rate Rs. The length of the period Tg is calculated and returned. The method by which the speech speed control unit 104 holds the expansion rate Rs is not limited. However, for example, it may be set in advance or may be configured to change according to the user's operation.

When the PICOLA processing unit 106 notifies the speech speed control unit 104 of the length Tw of the pitch W, the speech speed control unit 104 calculates the expansion processing data period Tg that corresponds to that length Tw and satisfies the expansion rate Rs. The speech speed control unit 104 then returns the calculated expansion processing data period Tg to the PICOLA processing unit 106.

話速制御部１０４では、例えば、以下の（３）式を用いて、伸長処理用データ期間Ｔｇを算出するようにしてもよい。 The speech speed control unit 104 may calculate the decompression processing data period Tg using, for example, the following equation (3).

Tg = (Tw / (Rs - 1)) - Tw … (3)

When the PICOLA processing unit 106 reports an unprocessed data period Tu, the speech speed control unit 104 controls the speech speed conversion processing buffer 102 so that supplying the audio data for the unprocessed data period Tu to the data output unit 108 takes priority. That is, when an unprocessed data period Tu has occurred, the speech speed control unit 104 controls the speech speed conversion processing buffer 102 so that, once audio data for the unprocessed data period Tu has next accumulated there, that audio data is supplied to the output buffer 107 in preference to the next expansion process.
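Equation (3) can be checked numerically with a short sketch. The function name `expansion_data_period` is hypothetical, and exact rational arithmetic is used here only to avoid floating-point rounding; how a non-integer result would be rounded to whole samples is not stated in the description.

```python
from fractions import Fraction


def expansion_data_period(tw, rs):
    """Equation (3): Tg = Tw / (Rs - 1) - Tw, with lengths in samples."""
    rs = Fraction(str(rs))
    if rs <= 1:
        raise ValueError("expansion rate Rs must be greater than 1")
    return Fraction(tw) / (rs - 1) - tw


tw = 45
tg = expansion_data_period(tw, 1.25)  # 45 / 0.25 - 45 = 135 samples
consumed = tw + tg                    # original audio used: 180 samples
produced = tw + tw + tg               # pitch + inserted waveform + Tg: 225 samples
ratio = produced / consumed           # 225 / 180 = 1.25
```

For Tw = 45 samples and Rs = 1.25 this gives Tg = 135 samples, and the ratio of produced to consumed samples equals the expansion rate.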

Each time the timing for sending an output frame arrives, the output buffer 107 supplies the audio data for one output frame to the data output unit 108. When an output frame consists of n samples of audio data, this timing arrives once every period of Ts × n, where Ts is the sample interval.

When the audio data held in the output buffer 107 amounts to less than one output frame, for example immediately after the speech speed conversion apparatus 100 starts operating, the output buffer 107 may supply the data output unit 108 with an output frame containing dummy audio data, such as silence data or noise prepared in advance.

The data output unit 108 outputs the audio data of the output frames supplied from the output buffer 107 by a predetermined method. The output method is not limited. For example, the data output unit 108 may include an audio output device such as a speaker and output the audio as sound, may store the audio data on a predetermined data storage medium (for example, a hard disk drive), or may insert the audio data of the output frame, either as is or after encoding, into packets of a predetermined format and send them to a destination communication device.

(A-2) Operation of the First Embodiment

Next, the operation of the speech speed conversion apparatus 100 of the first embodiment having the above configuration will be described.

First, the output processing of output frames by the data output unit 108 in the speech speed conversion apparatus 100 will be described with reference to the flowchart of FIG. 3.

話速変換装置１００で話速変換処理（話速を遅くする処理）が開始されると、まず、出力バッファ１０７は、次の出力フレーム送出タイミングが到来するまで待機し（Ｓ１０１）、出力フレーム送出タイミングが到来すると、１つの出力フレーム分の音声データをデータ出力部１０８に供給する（Ｓ１０２）。上述の通り、出力バッファ１０７で、出力フレームを出力すべきタイミングは、サンプル間隔Ｔｓ×ｎの期間ごとに到来する。 When speech speed conversion processing (processing for slowing down the speech speed) is started in the speech speed conversion apparatus 100, the output buffer 107 first waits until the next output frame transmission timing arrives (S101), and outputs the output frame. When the timing arrives, audio data for one output frame is supplied to the data output unit 108 (S102). As described above, the timing at which the output buffer 107 should output the output frame arrives every sample interval Ts × n.

Immediately after the speech speed conversion process starts in the speech speed conversion apparatus 100, the output buffer 107 holds no audio data for output frames. Until a predetermined amount of audio data (for example, one output frame or more) has accumulated, the output buffer 107 may, for example, suspend the output of output frames or supply the data output unit 108 with output frames containing dummy audio data (for example, silence data or audio data such as noise prepared in advance). Even after the supply of audio data has started, when the audio data to be output falls below one output frame, the output buffer 107 may likewise supply the data output unit 108 with an output frame containing dummy audio data (this may be an output frame in which the shortfall in the audio data remaining in the output buffer 107 is filled with dummy audio data).
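The padding behavior of the output buffer 107 can be illustrated with a small sketch. The function name `next_output_frame`, the use of a Python list as the buffer, and the choice of zero-valued samples as silence are assumptions for illustration only.

```python
def next_output_frame(buffer, n=80, dummy=0):
    """Remove up to n samples from buffer and pad any shortfall with dummy samples.

    buffer is a list of samples held by the output buffer; it is modified in place.
    """
    available = min(len(buffer), n)
    frame = buffer[:available]
    del buffer[:available]
    frame.extend([dummy] * (n - available))  # fill the shortfall with dummy data
    return frame


held = list(range(1, 66))        # only 65 samples are left in the buffer
frame = next_output_frame(held)  # 65 real samples followed by 15 dummy samples
```

Suspending output until a full frame is available, the other option mentioned above, would instead return no frame when fewer than n samples are held.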

When the speech speed conversion process is still continuing in the speech speed conversion apparatus 100, the output buffer 107 returns to step S101 described above and waits until the next output frame transmission timing. When the speech speed conversion process is no longer continuing, the output buffer 107 ends the processing.

Next, the processing in the speech speed conversion apparatus 100 from applying the expansion process (the process of slowing down the speech speed) to the input speech signal (audio data) up to supplying the result to the output buffer 107 will be described with reference to FIGS. 4 and 5.

話速変換装置１００で話速変換処理（話速を遅くする処理）が開始されると、データ入力部１０１では、入力される音声信号から音声データ（例えば、ＰＣＭ形式のデータ）が生成され、生成された音声データについて入力フレームごとに分割して取得される（Ｓ２０１）。 When the speech speed conversion processing (processing for slowing down the speech speed) is started in the speech speed conversion apparatus 100, the data input unit 101 generates speech data (for example, PCM format data) from the input speech signal. The generated audio data is divided and acquired for each input frame (S201).

そして、データ入力部１０１では、入力フレームが取得されると、その入力フレームが話速変換処理バッファ１０２に供給される（Ｓ２０２）。 In the data input unit 101, when an input frame is acquired, the input frame is supplied to the speech speed conversion processing buffer 102 (S202).

In this embodiment, to simplify the description, the input frames and output frames are assumed to have the same sampling interval and the same number of samples. In the following, the sampling interval of the input and output frames is Ts and the number of samples per frame is n.

そして、遅延制御部１０３では、今回の話速変換処理が開始されてから、過去に一度でも話速変換処理バッファ１０２に最低遅延量Ｔｄ以上のデータを保持したことがあるかどうかが確認される（Ｓ２０３）。遅延制御部１０３で、話速変換処理バッファ１０２に最低遅延量Ｔｄ以上のデータを保持したことがないと確認された場合には、話速変換装置１００は、上述のステップＳ２０１に戻って動作する。一方、遅延制御部１０３で、話速変換処理バッファ１０２に最低遅延量Ｔｄ以上のデータを保持したことがあると確認された場合には、後述するステップＳ２０４の処理に進む。 Then, the delay control unit 103 checks whether or not data having a minimum delay amount Td or more has been held in the speech speed conversion processing buffer 102 even once in the past since the start of the current speech speed conversion process. (S203). When it is confirmed by the delay control unit 103 that no data of the minimum delay amount Td or more has been held in the speech speed conversion processing buffer 102, the speech speed conversion device 100 returns to the above-described step S201 to operate. . On the other hand, if it is confirmed by the delay control unit 103 that data of the minimum delay amount Td or more has been held in the speech speed conversion processing buffer 102, the process proceeds to step S204 described later.
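The check in step S203 behaves like a one-way latch: once the buffer has held at least Td of data, processing is allowed for the rest of the session. The sketch below assumes the value Td = Tp + n - 1 used later in this description (240 + 80 - 1 = 319 samples); the class name `DelayGate` and its interface are hypothetical.

```python
class DelayGate:
    """Latch that opens once the buffer has held at least Td samples."""

    def __init__(self, tp=240, n=80):
        self.td = tp + n - 1   # minimum delay amount Td, here 319 samples
        self.reached = False

    def update(self, buffered_samples):
        """Report whether Td has been reached now or at any earlier call."""
        if buffered_samples >= self.td:
            self.reached = True
        return self.reached


gate = DelayGate()
# Buffer levels after each 80-sample input frame, then after data was consumed.
states = [gate.update(level) for level in (80, 160, 240, 320, 95)]
```

The gate stays open after the buffer level later drops to 95 samples, which matches the "even once in the past" wording of step S203.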

Then, the speech speed control unit 104 checks whether the speech speed setting currently applied to the speech speed conversion process has been changed, for example by a user operation (S204). When a change is confirmed, it updates the expansion rate Rs applied to the speech speed conversion process to the value corresponding to the new setting (S205).

Then, the speech speed control unit 104 checks whether, for an unprocessed data period Tu previously reported by the PICOLA processing unit 106, the supply to the output buffer 107 of the audio data for that unprocessed data period Tu, which is needed to satisfy the expansion rate Rs, has been completed (that is, whether the expansion rate Rs is currently satisfied) (S206).

When the speech speed control unit 104 confirms that the expansion rate Rs is not satisfied, it controls the speech speed conversion processing buffer 102 to supply the audio data for the unprocessed data period Tu to the output buffer 107 (if less than the unprocessed data period Tu is available, the audio data that can be supplied at that point is supplied to the output buffer 107) (S207). The speech speed conversion apparatus 100 then proceeds to step S214, described later.

On the other hand, when the speech speed control unit 104 confirms that the expansion rate Rs is satisfied, the pitch extraction unit 105 checks whether audio data for the pitch search range Tp (the amount of audio data needed for a pitch search) is held in the speech speed conversion processing buffer 102 (S208).

When the pitch extraction unit 105 determines that audio data for the pitch search range Tp is not held in the speech speed conversion processing buffer 102, the speech speed conversion apparatus 100 returns to step S201 described above and resumes from the acquisition of the next input frame.

一方、ピッチ抽出部１０５により、ピッチ探索範囲Ｔｐ分の音声データが、話速変換処理バッファ１０２に保持されていると判断された場合には、ピッチ抽出部１０５によりピッチ候補が抽出され（Ｓ２０９）、当該ピッチ候補が伸長処理に用いるものとして適当であるか否かが判定される（Ｓ２１０）。 On the other hand, if the pitch extraction unit 105 determines that the speech data for the pitch search range Tp is held in the speech speed conversion processing buffer 102, the pitch extraction unit 105 extracts pitch candidates (S209). Then, it is determined whether or not the pitch candidate is suitable for use in the extension process (S210).

ステップＳ２１０で、当該ピッチ候補が伸長処理に用いるものとして適当ででないと判定された場合には、ピッチ抽出部１０５は、話速変換処理バッファ１０２を制御して、当該ピッチ候補の音声データを、直接出力バッファ１０７に供給させる（Ｓ２１１）。そして、話速変換装置１００は、後述するステップＳ２１４の処理から動作する。 If it is determined in step S210 that the pitch candidate is not suitable for use in the decompression process, the pitch extraction unit 105 controls the speech speed conversion processing buffer 102 to obtain the speech data of the pitch candidate. The data is directly supplied to the output buffer 107 (S211). Then, the speech speed conversion apparatus 100 operates from the process of step S214 described later.

一方、ステップＳ２１０で、当該ピッチ候補が伸長処理に用いるものとして適当であると判定された場合には、ピッチ抽出部１０５は、話速変換処理バッファ１０２を制御して、当該ピッチ候補の音声データを、ピッチＷの音声データとして、ＰＩＣＯＬＡ処理部１０６に供給させる。そして、話速変換装置１００は、後述するステップＳ２１２の処理から動作する。 On the other hand, if it is determined in step S210 that the pitch candidate is suitable for use in the decompression process, the pitch extraction unit 105 controls the speech speed conversion processing buffer 102 to output the voice data of the pitch candidate. Are supplied to the PICOLA processing unit 106 as audio data of pitch W. Then, the speech speed conversion apparatus 100 operates from the process of step S212 described later.

When the audio data of the pitch W is supplied, the PICOLA processing unit 106 generates, based on that audio data (speech waveform), the audio data used to expand the audio data held in the speech speed conversion processing buffer 102 (audio data of the same length as the pitch W) (S212).

Then, the PICOLA processing unit 106 queries the speech speed control unit 104 to obtain the expansion processing data period Tg corresponding to the waveform length Tw of the pitch W. Having obtained the expansion processing data period Tg, the PICOLA processing unit 106 acquires, from the audio data held in the speech speed conversion processing buffer 102, the audio data for the expansion processing data period Tg that follows the pitch W. The PICOLA processing unit 106 then supplies the output buffer 107 with audio data in which the generated audio data is inserted between the acquired audio data of the pitch W and the audio data of the expansion processing data period Tg (S213). When the speech speed conversion processing buffer 102 does not hold enough audio data to cover the expansion processing data period Tg, the PICOLA processing unit 106 reports the length of the shortfall (the unprocessed data period Tu) to the speech speed control unit 104.

When the processing in step S211 or S213 described above is completed, the pitch extraction unit 105 checks whether audio data for the pitch search range Tp is held in the speech speed conversion processing buffer 102 (S214).

When the pitch extraction unit 105 confirms in step S214 that audio data for the pitch search range Tp is held in the speech speed conversion processing buffer 102, the speech speed conversion apparatus 100 resumes from step S209 described above and extracts a pitch candidate again.

On the other hand, when it is confirmed that the speech speed conversion processing buffer 102 holds less audio data than the pitch search range Tp, the speech speed conversion apparatus 100 returns to step S201 described above, as long as the speech speed conversion process is continuing.

Next, the transitions of the audio data held in the speech speed conversion processing buffer 102 and the output buffer 107 when the speech speed conversion apparatus 100 operates according to the flowcharts of FIGS. 3 to 5 will be described with reference to the timing charts of FIGS. 6 to 9.

Here, the description assumes that the number of samples in each output frame and input frame is n = 80, the expansion rate is Rs = 1.25, and the pitch search range Tp is a period of 240 samples (a period of Ts × 240).

In the timing charts of FIGS. 6 to 9, the speech speed conversion apparatus 100 starts the speech speed conversion process at timing T0. The timings T1, T2, T3, T4, T5, and so on each indicate a timing at which an output frame should be output from the output buffer 107. For example, timing T1 is the point at which the time for 80 samples (Ts × 80) has elapsed from timing T0, and timing T2 is the point at which the time for 160 samples (Ts × 160) has elapsed from timing T0.
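As a small worked example of these timings, the sketch below converts a timing index k into elapsed time. It assumes the 8 kHz sampling frequency mentioned earlier as an example, so Ts = 1/8000 s; the function name `frame_deadline_seconds` is hypothetical.

```python
def frame_deadline_seconds(k, n=80, sampling_rate_hz=8000):
    """Elapsed time from T0 to timing Tk, that is, k * n * Ts with Ts = 1 / rate."""
    return k * n / sampling_rate_hz


t1 = frame_deadline_seconds(1)  # 80 samples at 8 kHz
t2 = frame_deadline_seconds(2)  # 160 samples at 8 kHz
```

Under this assumption one frame of 80 samples corresponds to 10 ms, so T1 falls 10 ms and T2 falls 20 ms after T0.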

To make the explanation easier, the transitions of the audio data held in the speech speed conversion processing buffer 102 and the output buffer 107 when the minimum delay amount Td is set equal to the pitch search range Tp will first be described with reference to FIGS. 6 and 7.

図６では、最低遅延量Ｔｄ＝ピッチ探索範囲Ｔｐとし、ＰＩＣＯＬＡ処理部１０６による伸長処理が行われる場合の話速変換処理バッファ１０２及び出力バッファ１０７に保持される音声データの遷移について示している。 FIG. 6 shows transition of audio data held in the speech rate conversion processing buffer 102 and the output buffer 107 when the minimum delay amount Td = pitch search range Tp and the expansion processing by the PICOLA processing unit 106 is performed.

In FIG. 6, audio data corresponding to the minimum delay amount Td (240 samples) has accumulated at timing T3, so it is assumed that the pitch extraction unit 105 extracts a pitch W_1 of length Tw_1. It is further assumed that the pitch extraction unit 105 determines that the pitch W_1 is suitable for the expansion process, and that the PICOLA processing unit 106 generates the audio data of a speech waveform using the pitch W_1. Here, Tw_1 is a period of 45 samples. The speech speed control unit 104 then calculates the expansion processing data period as Tg_1 = (Tw_1 / (Rs - 1)) - Tw_1 = (45 / (1.25 - 1)) - 45 = 135 samples.

Therefore, as shown in FIG. 6, at timing T3 the PICOLA processing unit 106 holds the audio data of the pitch W_1 (Tw_1), the audio data generated based on the pitch W_1, and the audio data of the expansion processing data period Tg (135 samples), and supplies them to the data output unit 108. That is, at this point, the data output unit 108 holds 45 + 45 + 135 = 225 samples of audio data. At timing T3, only 240 - 225 = 15 samples of audio data remain in the speech speed conversion processing buffer 102, so audio data for the pitch search range Tp (240 samples) does not accumulate there until at least timing T6, and no expansion process is performed before then. Consequently, no new audio data is supplied to the data output unit 108 during the period from timing T3 to T6.

Then, the output buffer 107 outputs the output frame F1 (80 samples of audio data) at timing T3 and the output frame F2 (80 samples of audio data) at timing T4. As a result, only 225 - 80 - 80 = 65 samples of audio data remain in the data output unit 108 at timing T4, so a normal output frame cannot be generated and output when timing T5 arrives.

As described above, when the minimum delay amount Td is set equal to the pitch search range Tp, the output buffer 107 may run short of one output frame of data after the first expansion process and before the data for the minimum delay amount Td needed for the second pitch search has accumulated.

図７では、最低遅延量Ｔｄ＝ピッチ探索範囲Ｔｐとし、ＰＩＣＯＬＡ処理部１０６による伸長処理が行われなかった場合（抽出したピッチ候補が伸長処理に用いることに不適当だった場合）の話速変換処理バッファ１０２及び出力バッファ１０７に保持される音声データの遷移について示している。 In FIG. 7, the minimum delay amount Td = pitch search range Tp, and the speech speed conversion when the PICOLA processing unit 106 does not perform the expansion process (when the extracted pitch candidate is inappropriate for use in the expansion process). The transition of audio data held in the processing buffer 102 and the output buffer 107 is shown.

In FIG. 7, audio data corresponding to the minimum delay amount Td (240 samples) has accumulated at timing T3, so it is assumed that the pitch extraction unit 105 extracts a pitch W_1 (pitch candidate) of length Tw_1. The pitch extraction unit 105 determines that the pitch W_1 is unsuitable for the expansion process, and the pitch W_1 is output from the speech speed conversion processing buffer 102 to the output buffer 107. Here, Tw_1 is a period of 45 samples. Therefore, in FIG. 7, the output buffer 107 holds 45 samples of audio data at timing T3, but this is less than the audio data for one output frame (80 samples), so the state in which an output frame containing normal audio data cannot be output continues.

次に、上述の実施形態と同様に、ピッチ探索範囲Ｔｐ＝２４０、最低遅延量Ｔｄ＝２４０＋８０−１＝３１９サンプル（上記の（１）式を用いて算出した長さ）とした場合の話速変換処理バッファ１０２及び出力バッファ１０７に保持される音声データの遷移について、図８、図９を用いて説明する。 Next, as in the above-described embodiment, the speech speed when the pitch search range Tp = 240 and the minimum delay amount Td = 240 + 80-1 = 319 samples (the length calculated using the above equation (1)). Transition of audio data held in the conversion processing buffer 102 and the output buffer 107 will be described with reference to FIGS.

図８では、ピッチ探索範囲Ｔｐ＝２４０サンプル、最低遅延量Ｔｄ＝３１９サンプルとし、ＰＩＣＯＬＡ処理部１０６による伸長処理が行われる場合の話速変換処理バッファ１０２及び出力バッファ１０７に保持される音声データの遷移について示している。すなわち、図８のタイミングチャートは、最低遅延量Ｔｄ以外については、上述の図６と同様の条件となった場合について示している。 In FIG. 8, the pitch search range Tp = 240 samples and the minimum delay amount Td = 319 samples, and the speech data stored in the speech speed conversion processing buffer 102 and the output buffer 107 when the expansion processing by the PICOLA processing unit 106 is performed are shown. Shows the transition. That is, the timing chart of FIG. 8 shows a case where the conditions are the same as those of FIG. 6 except for the minimum delay amount Td.

In FIG. 8, audio data equal to or greater than the minimum delay amount Td (320 samples) has accumulated at timing T4, so it is assumed that the pitch extraction unit 105 extracts a pitch W_1 of length Tw_1. It is further assumed that the pitch extraction unit 105 determines that the pitch W_1 is suitable for the expansion process, and that the PICOLA processing unit 106 generates the audio data of a speech waveform using the pitch W_1. Here, Tw_1 is a period of 45 samples. The speech speed control unit 104 then calculates the expansion processing data period Tg_1 = 135 samples, as in the case of FIG. 6.

As shown in FIG. 8, at timing T4 the PICOLA processing unit 106 holds the audio data of the pitch W_1 (Tw_1), the audio data generated based on the pitch W_1, and the audio data of the expansion processing data period Tg (135 samples), and supplies them to the data output unit 108. That is, at this point, the data output unit 108 holds 225 samples of audio data. At timing T4, the speech speed conversion processing buffer 102 holds 320 - 225 = 95 samples of audio data. At timing T6, audio data equal to or greater than the pitch search range Tp has accumulated in the speech speed conversion processing buffer 102, and new audio data is supplied to the data output unit 108.

一方、出力バッファ１０７では、タイミングＴ４の時点で、出力フレームＦ１（８０サンプル分の音声データ）を出力し、さらに、タイミングＴ５の時点で、出力フレームＦ２（８０サンプル分の音声データ）を出力することになる。そして、図８の例では、タイミングＴ６の時点では、出力バッファ１０７に新たな音声データが供給されることになる。 On the other hand, the output buffer 107 outputs an output frame F1 (80 samples of audio data) at timing T4, and further outputs an output frame F2 (80 samples of audio data) at timing T5. It will be. In the example of FIG. 8, new audio data is supplied to the output buffer 107 at the timing T6.

Therefore, in the example of FIG. 8, setting the minimum delay amount Td according to equation (1) above means that, unlike the example of FIG. 6, the output buffer 107 can hold one output frame or more of data each time the timing for outputting an output frame arrives.
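The difference between FIG. 6 and FIG. 8 comes down to whether the 225 samples supplied in one expansion can cover every frame deadline until the next supply. The following sketch checks that arithmetic; the function name `can_bridge` and the representation of timings as integer indices are assumptions for illustration.

```python
def can_bridge(supplied_samples, frame_samples, supply_tick, next_supply_tick):
    """Check whether one supply covers all frame deadlines before the next supply.

    A frame is due at every tick from supply_tick up to, but not including,
    next_supply_tick.
    """
    frames_needed = next_supply_tick - supply_tick
    frames_available = supplied_samples // frame_samples
    return frames_available >= frames_needed


# FIG. 6: 225 samples supplied at T3, nothing new until T6 (frames due at T3, T4, T5).
fig6_ok = can_bridge(225, 80, supply_tick=3, next_supply_tick=6)
# FIG. 8: 225 samples supplied at T4, new data at T6 (frames due at T4, T5).
fig8_ok = can_bridge(225, 80, supply_tick=4, next_supply_tick=6)
```

With 225 samples only two full 80-sample frames are available, which is one frame short in the FIG. 6 case and sufficient in the FIG. 8 case.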

図９では、ピッチ探索範囲Ｔｐ＝２４０サンプル、最低遅延量Ｔｄ＝３１９サンプルとし、ＰＩＣＯＬＡ処理部１０６による伸長処理が行われない場合の話速変換処理バッファ１０２及び出力バッファ１０７に保持される音声データの遷移について示している。すなわち、図９のタイミングチャートは、最低遅延量Ｔｄ以外については、上述の図７と同様の条件となった場合について示している。 In FIG. 9, the audio data held in the speech speed conversion processing buffer 102 and the output buffer 107 when the pitch search range Tp = 240 samples and the minimum delay amount Td = 319 samples and the expansion processing by the PICOLA processing unit 106 is not performed. It shows about the transition. That is, the timing chart of FIG. 9 shows a case where the conditions are the same as those of FIG. 7 except for the minimum delay amount Td.

In FIG. 9, audio data equal to or greater than the minimum delay amount Td (320 samples) has accumulated in the speech speed conversion processing buffer 102 at timing T4, so it is assumed that the pitch extraction unit 105 extracts a pitch W_1 (pitch candidate) of length Tw_1. The pitch extraction unit 105 determines that the pitch W_1 is unsuitable for the expansion process, and the pitch W_1 is output from the speech speed conversion processing buffer 102 to the output buffer 107. Here, Tw_1 is a period of 45 samples.

At this point, 320 - 45 = 275 samples of audio data remain in the speech speed conversion processing buffer 102, which is more than the pitch search range Tp (240 samples), so the pitch extraction unit 105 can search for a pitch candidate again immediately, still at timing T4. It is assumed that this second pitch candidate search extracts a pitch W_2 (pitch candidate) of length Tw_2, that the pitch extraction unit 105 determines that the pitch W_2 is unsuitable for the expansion process, and that the pitch W_2 is output from the speech speed conversion processing buffer 102 to the output buffer 107. Here, Tw_2 is a period of 45 samples.

As a result, at timing T4 the data output unit 108 has 45 + 45 = 90 samples of audio data available and, unlike the example of FIG. 7, can output an output frame.

Therefore, in the example of FIG. 9, unlike the example of FIG. 7, audio data amounting to at least the pitch search range Tp remains in the speech speed conversion processing buffer 102 even after a pitch candidate is passed unchanged to the output buffer 107, so the pitch extraction unit 105 can perform the pitch search several times within the same frame (the same timing). In other words, setting the minimum delay amount Td as in equation (1) above makes it more likely, compared with the example of FIG. 7, that the output buffer 107 holds at least one output frame of data when the timing for outputting an output frame arrives.
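The repeated search within one frame timing described for FIG. 9 can be sketched as follows. This is only an illustrative model, not the patented implementation: the function name `flush_pitch_candidates`, the fixed list of 45-sample candidates, and the 240-sample comparison case are assumptions introduced here.

```python
def flush_pitch_candidates(buffered, tp, tf, pitch_lengths):
    """Model one frame timing with no expansion.

    While at least tp samples are buffered and the output still lacks one
    frame (tf samples), a rejected pitch candidate is moved unchanged from
    the conversion buffer to the output buffer.
    Returns (samples left in the conversion buffer, samples moved to output).
    """
    moved = 0
    for tw in pitch_lengths:
        if buffered < tp or moved >= tf:
            break
        buffered -= tw
        moved += tw
    return buffered, moved


TP, TF = 240, 80  # values used in the example of FIG. 9

# Starting from 320 buffered samples (Td = 319 reached): two searches fit.
print(flush_pitch_candidates(320, TP, TF, [45, 45, 45]))

# Hypothetical comparison: starting from only 240 samples, one search fits.
print(flush_pitch_candidates(240, TP, TF, [45, 45, 45]))
```

In the first call 90 samples reach the output, enough for one 80-sample output frame; in the second call only 45 samples do, which is the shortfall the larger minimum delay amount is meant to avoid.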

(A-3) Effects of the First Embodiment
According to the first embodiment, the following effects can be achieved.

(A-3-1) In the speech speed conversion device 100, audio data amounting to at least the pitch search range Tp is first accumulated in the speech speed conversion processing buffer 102 before the PICOLA processing unit 106 starts the expansion processing. This makes it easier for the output buffer 107 to hold at least one output frame of audio data.

(A-3-2) The more data the speech speed conversion processing buffer 102 holds initially, the less likely the output buffer 107 is to run out of data. However, if the delay caused by the speech speed conversion processing (the time from when audio data is input to the data input unit 101 until it is output by the data output unit 108) becomes large, the processing quality (real-time performance) deteriorates. Therefore, the speech speed conversion device 100 of this embodiment sets "pitch search range Tp + 1 frame - 1 sample" as the minimum delay amount Td, that is, the smallest delay needed to keep at least one output frame of audio data in the output buffer 107.

Next, the reason why "pitch search range Tp + 1 frame - 1 sample" is desirable as the minimum delay amount Td will be described.

As examples in which the output buffer 107 runs short of one output frame of data, the condition of FIG. 6 described above (hereinafter, the "first condition") and the condition of FIG. 7 described above (hereinafter, the "second condition") were explained. Of these, the second condition can be regarded as the first condition with an expansion rate of 1.0. Consequently, if the problem can be shown to be avoidable under the second condition, it is also avoidable under the first condition. This is because under the first condition the expansion rate is Rs > 1.0 and expansion processing is always performed, so the output buffer 107 receives more data than under the second condition.

Even under the second condition, the problem does not occur when the period of the pitch candidate W (hereinafter, "TW") is equal to the period of one output frame (hereinafter, "TF") or is a multiple of TF. In that case the data supplied to the output buffer 107 is always a multiple of TF, so the input data can be supplied as it is. The problem occurs when the period TW of the pitch candidate W takes any other value. Here, each "period (data length)" is expressed in units of the number of samples of audio data (the sampling interval is assumed to be the same throughout).

The following shows how a desirable value of the minimum delay amount Td, one that avoids the problem, is obtained for the second example described above (the example shown in FIG. 7). In this example the maximum value of TW is Tp / 2, but this is not a limitation, and pitches up to Td may be extracted. In this example Tp >= TF is assumed, with TF = 80 samples and Tp = 240 samples.

First, the case where TW is less than 1 * TF (1 <= TW <= 79) will be described.

In this case, the first pitch search (pitch extraction) shown in FIG. 7 does not put TF of data into the output buffer 107, so pitch extraction must be performed once more within the same frame processing. Let P be the amount of audio data held in the speech speed conversion processing buffer 102 and X its initial value. After the first pitch extraction, (X - 79) <= P <= (X - 1). P must still be at least Tp for the second search, even in the worst case TW = 79, so the required X is given by X - 79 >= 240, that is, X = 319.

Next, the case where TW is greater than 1 * TF and less than 2 * TF (81 <= TW <= 159) will be described.

In this case, after the first pitch extraction the amount held in the speech speed conversion processing buffer 102 is (X - 159) <= P <= (X - 81). The output buffer 107 then holds at least 1 * TF and less than 2 * TF of data, so it can output an output frame. However, when the timing for the next output frame arrives, less than TF of data remains in the output buffer 107, so pitch extraction is always required at the output timing of the next output frame. It is therefore sufficient that, at that timing, the speech speed conversion processing buffer 102 holds at least Tp of data after a further TF of data has been input. The required X is given by X - 159 + 80 = 240, that is, X = 319. This value of X is the minimum delay amount Td, and the same result holds for 2 * TF < TW < 3 * TF and 3 * TF < TW < 4 * TF.

From the above, the smallest Td that makes the problem unlikely can be derived from equation (4) below. Equation (4) assumes TW ≠ n * TF (n = 0, 1, 2, ...) and Tp > TF. In equation (4), "int(TW / TF)" denotes the integer part of TW / TF, so "TW - int(TW / TF) * TF" is the remainder of TW divided by TF. "max{TW - int(TW / TF) * TF}" denotes the maximum of that remainder over the values of TW satisfying TW ≠ n * TF (n = 0, 1, 2, ...), which evaluates to the result shown in equation (5). Therefore, as shown in equation (6), the "pitch search range Tp + 1 frame - 1 sample" described above is obtained as the desirable value of Td.

Td = Tp + max{TW - int(TW / TF) * TF} ... (4)
max{TW - int(TW / TF) * TF} = TF - 1 ... (5)
Td = Tp + TF - 1 ... (6)
As described above, by keeping at least one output frame of audio data in the output buffer 107, the speech speed conversion device 100 can execute the speech speed conversion processing in real time with stable accuracy.
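Equations (4) to (6) can be checked numerically with a short sketch. The function name `min_delay` is an assumption introduced here, and the search over TW is limited to 1 <= TW <= Tp / 2, the maximum pitch length used in the example above.

```python
def min_delay(tp, tf):
    """Evaluate equation (4): Td = Tp + max{TW - int(TW / TF) * TF}.

    TW - int(TW / TF) * TF is the remainder of TW divided by TF.
    Pitch lengths that are exact multiples of TF are skipped, as the
    text assumes TW != n * TF. TW ranges over 1..Tp/2 (an assumption
    taken from the example, where the maximum pitch length is Tp / 2).
    """
    worst_remainder = max(
        tw - (tw // tf) * tf
        for tw in range(1, tp // 2 + 1)
        if tw % tf != 0
    )
    return tp + worst_remainder


TP, TF = 240, 80
td = min_delay(TP, TF)
print(td)                  # minimum delay amount in samples
print(td == TP + TF - 1)   # agreement with equation (6)
```

With Tp = 240 and TF = 80 the worst remainder is 79 (reached at TW = 79), which gives Td = 319 samples, the value used in FIG. 9.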

As a result, using the speech speed conversion device 100 makes speech speed conversion possible in equipment that processes audio in real time, such as televisions, radios, and telephones. In addition, the user can start speech speed conversion at any time while using such equipment and can change to any speech speed even while conversion is in progress.

(B) Second Embodiment
Hereinafter, a second embodiment of the audio signal processing device and program according to the present invention will be described in detail with reference to the drawings. The audio signal processing device of the second embodiment is a speech speed conversion device.

(B-1) Configuration of the Second Embodiment
FIG. 10 is a block diagram showing the functional configuration of the speech speed conversion device 100A of the second embodiment. In FIG. 10, parts that are the same as or correspond to those in FIG. 1 described above are given the same or corresponding reference numerals. The following describes how the second embodiment differs from the first embodiment.

The speech speed conversion device 100A differs from the first embodiment in that a speech section detection unit 109 and a delay recovery unit 110 are added.

In the speech speed conversion device 100 of the first embodiment, as the speech speed conversion processing continues, the audio signal (audio data) is expanded again and again, so the delay grows without limit and the audio data held in the output buffer 107 grows with it. As a result, when the speech speed conversion device 100 is applied to telephone communication, for example in a telephone device, and performs the conversion in real time, communication between the speakers is impaired.

The speech speed conversion device 100A of the second embodiment therefore adds the speech section detection unit 109 and the delay recovery unit 110 to provide a delay recovery function that addresses this problem.

The delay referred to here is, for example, the difference, produced by the expansion processing in the speech speed conversion device, between the time (on the sample time series) corresponding to the audio data most recently input to the speech speed conversion processing buffer 102 and the time corresponding to the audio data being output from the output buffer 107.

The delay recovery function is, for example, a function of the speech speed conversion device that, when the expansion processing has caused a delay of a certain time or more, shortens the delay time by deleting from the speech speed conversion processing buffer 102 the non-speech sections (silent sections) contained in the most recently input audio signal.

The speech section detection unit 109 determines whether one frame of audio data (audio signal) acquired by the data input unit 101 represents a speech section (voiced section) or a non-speech section (silent section). An existing processing configuration can be applied to the detection of voiced sections in the speech section detection unit 109 (for example, a method that uses the power, zero crossings, or correlation function of the input signal as acoustic features). The speech section detection unit 109 may also be incorporated into the data input unit 101, so that speech section detection is performed when the data input unit 101 generates an input frame.
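One existing approach of the kind mentioned above, based on frame power and zero crossings, might look like the following sketch. It is not the detector of this patent: the function name `is_speech_frame`, both threshold values, and the 8 kHz sampling rate in the usage example are assumptions chosen only for illustration.

```python
import math


def is_speech_frame(samples, power_threshold=1.0e-3, max_zero_cross_rate=0.35):
    """Classify one frame as speech (True) or non-speech (False).

    A frame counts as speech when its mean power is high enough and its
    zero-crossing rate is low enough. Both thresholds are hypothetical.
    """
    n = len(samples)
    if n == 0:
        return False
    power = sum(s * s for s in samples) / n
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    zero_cross_rate = crossings / max(n - 1, 1)
    return power >= power_threshold and zero_cross_rate <= max_zero_cross_rate


# One 80-sample frame of silence and one of a 200 Hz tone
# (assumed sampling rate: 8000 Hz).
silence = [0.0] * 80
tone = [0.5 * math.sin(2 * math.pi * 200 * i / 8000) for i in range(80)]
print(is_speech_frame(silence), is_speech_frame(tone))
```

A practical detector would usually add smoothing over several frames so that short pauses inside an utterance are not treated as non-speech sections.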

When the acquired input frame belongs to a non-speech section, the speech section detection unit 109 passes the input frame to the delay recovery unit 110.

When an input frame of a non-speech section is supplied from the speech section detection unit 109, the delay recovery unit 110 controls the speech speed conversion processing buffer 102 so that all the audio data accumulated there is output to the output buffer 107. Then, if the amount of audio data accumulated in the output buffer 107 is at or above a predetermined amount (here, as an example, the minimum delay amount Td), the delay recovery unit 110 deletes (discards) the most recently acquired input frame to recover the delay. If the amount of audio data accumulated in the output buffer 107 is below the predetermined amount (here, as an example, below the minimum delay amount Td), the delay recovery unit 110 supplies the most recently acquired input frame to the output buffer 107.

(B-2) Operation of the Second Embodiment
Next, the operation of the speech speed conversion device 100A of the second embodiment, configured as described above, will be described.

The speech speed conversion device 100A of the second embodiment is the same as the first embodiment except that the operations of the speech section detection unit 109 and the delay recovery unit 110, which implement the delay recovery function described above, are added. Specifically, in the second embodiment these operations are inserted between step S201 and step S202 of the operation of the first embodiment.

FIG. 11 shows the operations of the speech section detection unit 109 and the delay recovery unit 110 for the delay recovery function, which the speech speed conversion device 100A of the second embodiment inserts between steps S201 and S202 described above. The other operations are omitted from FIG. 11 and are the same as in the first embodiment.

In the speech speed conversion device 100A of the second embodiment, when the data input unit 101 acquires an input frame in step S201 described above, the speech section detection unit 109 determines whether the input frame is a speech section (S301).

When the input frame is determined to be a speech section in step S301, the speech speed conversion device 100A inputs the input frame to the speech speed conversion processing buffer 102 in step S202 described above, and the subsequent processing is the same as in the first embodiment.

On the other hand, when the input frame is determined to be a non-speech section in step S301, the speech section detection unit 109 supplies the input frame to the delay recovery unit 110. The delay recovery unit 110 then controls the speech speed conversion processing buffer 102 so that all the accumulated audio data is output to the output buffer 107 (S302).

Next, the delay recovery unit 110 determines whether the amount of audio data accumulated in the output buffer 107 is equal to or greater than the minimum delay amount Td (S303).

When it is determined in step S303 that the amount of audio data accumulated in the output buffer 107 is equal to or greater than the minimum delay amount Td, the delay recovery unit 110 deletes (discards) the most recently acquired input frame to recover the delay (S304). The speech speed conversion device 100A then returns to step S201 described above. In other words, when the total amount of audio data accumulated in the speech speed conversion processing buffer 102 and the output buffer 107 is equal to or greater than the minimum delay amount Td (after step S302 this total equals the amount in the output buffer 107), the delay recovery unit 110 judges that the delay has reached the predetermined level and performs delay recovery processing (discarding the input frame of the non-speech section).

On the other hand, when it is determined in step S303 that the amount of audio data accumulated in the output buffer 107 is less than the minimum delay amount Td, the delay recovery unit 110 supplies the most recently acquired input frame to the output buffer 107 (S305). The speech speed conversion device 100A then returns to step S201 described above.
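The branch structure of steps S301 to S305 can be summarized in a small sketch. The buffers are modeled as plain Python lists of samples, and the function name `handle_input_frame` and its string return values are assumptions introduced for illustration; the speech or non-speech decision of S301 is passed in as a flag.

```python
def handle_input_frame(frame, is_speech, conv_buffer, out_buffer, td):
    """Process one input frame following steps S301 to S305.

    conv_buffer models the speech speed conversion processing buffer 102,
    out_buffer models the output buffer 107, td is the minimum delay
    amount Td in samples. Returns the label of the step that handled
    the frame.
    """
    if is_speech:                     # S301: speech section
        conv_buffer.extend(frame)     # S202: normal path of embodiment 1
        return "S202"
    out_buffer.extend(conv_buffer)    # S302: flush buffer 102 into 107
    conv_buffer.clear()
    if len(out_buffer) >= td:         # S303: enough data already queued
        return "S304"                 # S304: discard the silent frame
    out_buffer.extend(frame)          # S305: keep the silent frame
    return "S305"


conv, out = [0.0] * 100, [0.0] * 300
print(handle_input_frame([0.0] * 80, False, conv, out, td=319), len(out))
```

In the usage line the flush raises the output buffer to 400 samples, which is at least Td = 319, so the silent frame is discarded and the delay shrinks by one frame.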

(B-3) Effects of the Second Embodiment
According to the second embodiment, the following effects can be obtained in addition to those of the first embodiment.

The speech speed conversion device 100A deletes (discards) audio data of non-speech sections while keeping data amounting to the minimum delay amount Td in the output buffer 107. The device can therefore shorten the delay (recover the delay) without creating a situation in which the output buffer 107 lacks one output frame of audio data. Consequently, the speech speed conversion device 100A can perform stable speech speed conversion processing while limiting both the memory required for the output buffer 107 and the delay amount.

(C) Other Embodiments
The present invention is not limited to the embodiments described above, and modified embodiments such as the following are also possible.

(C-1) The speech speed conversion devices of the above embodiments were described only for processing that slows the speech speed. They may also be configured to handle processing that increases the speech speed (for example, by applying an existing speech speed conversion process).

(C-2) In the speech speed conversion devices of the above embodiments, the delay control unit suspends the speech speed conversion processing by the PICOLA processing unit until audio data amounting to the minimum delay amount Td has accumulated in the speech speed conversion processing buffer, and does so only immediately after the device starts processing. The same processing may also be performed at other timings.

For example, when the delay amount in the speech speed conversion device reaches a certain level, the audio data held in the speech speed conversion processing buffer 102 and the output buffer 107 may be refreshed (all data erased), after which the speech speed conversion processing by the PICOLA processing unit is suspended until audio data amounting to the minimum delay amount Td has accumulated in the speech speed conversion processing buffer.

(C-3) In each of the above embodiments, to prevent the delay from growing without limit due to the expansion processing, the delay control unit may be given a function (delay limiting function) that controls the other processing components so as to stop the expansion processing when the delay amount in the speech speed conversion device exceeds a certain amount.

Description of Symbols: 100 ... speech speed conversion device, 101 ... data input unit, 102 ... speech speed conversion processing buffer, 103 ... delay control unit, 104 ... speech speed control unit, 105 ... pitch extraction unit, 106 ... PICOLA processing unit, 107 ... output buffer, 108 ... data output unit.

Claims

An audio signal processing apparatus comprising:
input buffer means for storing audio data of an input audio signal;
speech rate conversion means for extracting a basic period from the audio signal waveform for a search period, the waveform being based on the audio data stored in the input buffer means, and for performing speech rate conversion processing on the audio data stored in the input buffer means using the extracted audio signal waveform of the basic period;
output buffer means for storing the audio data after the speech rate conversion means has performed the speech rate conversion processing;
audio data output means for outputting, at each output interval, an output audio data frame containing audio data for the output interval from among the audio data stored in the output buffer means; and
conversion processing control means for causing the speech rate conversion means to start the speech rate conversion processing after the input buffer means has accumulated audio data for a minimum accumulation period obtained by adding the search period and the output interval and then subtracting one sampling period of the audio data of the output buffer means.

The audio signal processing apparatus according to claim 1, further comprising delay recovery means for performing delay recovery by discarding audio data input to the input buffer means when that audio data is audio data of a non-speech section and the delay amount associated with the speech rate conversion processing is greater than or equal to a predetermined value.

The audio signal processing apparatus according to claim 2, wherein the delay recovery means determines that the delay amount is greater than or equal to the predetermined value when the amount of audio data stored in the input buffer means and the output buffer means is greater than or equal to a predetermined amount.

An audio signal processing program causing a computer to function as:
input buffer means for storing audio data of an input audio signal;
speech rate conversion means for extracting a basic period from the audio signal waveform for a search period, the waveform being based on the audio data stored in the input buffer means, and for performing speech rate conversion processing on the audio data stored in the input buffer means using the extracted audio signal waveform of the basic period;
output buffer means for storing the audio data after the speech rate conversion means has performed the speech rate conversion processing;
audio data output means for outputting, at each output interval, an output audio data frame containing audio data for the output interval from among the audio data stored in the output buffer means; and
conversion processing control means for causing the speech rate conversion means to start the speech rate conversion processing after the input buffer means has accumulated audio data for a minimum accumulation period obtained by adding the search period and the output interval and then subtracting one sampling period of the audio data of the output buffer means.