JP2011033789A

JP2011033789A - Adaptive speech-rate conversion device and program

Info

Publication number: JP2011033789A
Application number: JP2009179254A
Authority: JP
Inventors: Toru Tsugi; 徹都木; Nobumasa Seiyama; 信正清山; Atsushi Imai; 篤今井; Reiko Tako; 礼子田高
Original assignee: Nippon Hoso Kyokai NHK; NHK Engineering Services Inc; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp; NHK Engineering System Inc
Priority date: 2009-07-31
Filing date: 2009-07-31
Publication date: 2011-02-17
Anticipated expiration: 2029-07-31
Also published as: JP5412204B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech-rate conversion device that performs adaptive speech-rate conversion of an input signal, and to provide a program.

SOLUTION: The speech-rate conversion device 1 includes: a physical index calculation unit 2 which, for each segment obtained by dividing an input signal into unit times, calculates a physical index of the input signal; and a speech-rate conversion factor determination unit 3 which determines, according to the physical index calculated by the physical index calculation unit 2, the speech-rate conversion factor to be designated to each segment of the input signal, and performs speech-rate conversion. The speech-rate conversion device 1 determines the speech-rate conversion factor αn to be designated to each segment of the input signal by using one or more "physical indexes" among: a voicing degree Un expressing a relative maximum value obtained by autocorrelation per unit time in the input signal; an irregularity degree Sn expressing the change tendency of the trajectory of the fundamental frequency and pseudo fundamental frequency per unit time in the input signal; and a divided-band power ratio En expressing the ratio between a low-band power component and a high-band power component obtained by band division per unit time in the input signal.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to speech-rate conversion technology for converting the speech rate of an input signal, and in particular to a speech-rate conversion device and program that adaptively convert the speech rate of an input signal.

Several techniques for adaptively converting the speech rate of an input signal have been proposed (see, for example, Patent Documents 1 to 6).

Adaptive speech-rate conversion techniques share a common aim. Given an arbitrary playback speed conversion factor α (a speed multiple), such as 1x (real-time playback) or 2x (playback in half the real time), the speed is not changed uniformly by α over the whole input signal. Instead, parts of the continuous input signal are converted with factors larger or smaller than α to generate the rate-converted speech. The factors can be balanced so that the input signal as a whole plays back in the same time as it would under uniform conversion by α, while the listener perceives the speech as slower and easier to follow than under uniform conversion by α.

The technique of Patent Document 1 broadly comprises the following three elements. (1) The speech rate is slowed where the fundamental frequency of the input signal is high and raised where it is low. (2) Taking each span uttered in a single breath as a unit, the speech rate is slowed at the start of the speech and gradually raised toward its end according to the change in fundamental frequency. (3) Silent intervals between adjacent spans uttered in a single breath are shortened to an extent that does not sound unnatural.

The technique of Patent Document 2 sets silent intervals of at least a certain length in the input signal as pause intervals. For each speech (phrase) interval between pause intervals, the speech rate is slowed at the start point and then raised over a fixed period according to a predetermined decreasing function. When the speech rate is slowed after this fixed period has elapsed, the rate of slowing is varied in consideration of the relative magnitudes of the maximum fundamental frequency in each speech (phrase) interval.

The technique of Patent Document 3 permits, in speech-rate control, short silent intervals within a speech interval bounded by pause intervals to be shortened as well, to an extent that does not sound unnatural. In addition, when a block of the block-wise rate-converted speech coincides with, or lags only slightly behind, the time expected under uniform conversion of the whole input signal by the playback speed conversion factor α, the speech rate of the next block is set as slow as possible. In particular, the further a block lags behind the time expected under uniform conversion by α, the more the subsequent slowing of the speech rate is restrained, and control is performed so that each block of the rate-converted speech deviates as little as possible from the time expected under uniform conversion by α.

The techniques of Patent Documents 4 and 5 are based on dividing the input signal into speech intervals and silent intervals, slowing the speech rate in speech intervals, and shortening silent intervals. Because slowing the speech intervals makes the output longer than the input per unit time, the rate-converted speech must be stored temporarily in memory. These techniques therefore adjust the overall duration according to the memory still available relative to its upper limit, by gradually raising the speech rate for each speech interval or by increasing the amount deleted from silent intervals.

The technique of Patent Document 6 determines the speech rate of an input signal divided into predetermined periods by a coefficient that is inversely proportional to the n-th power of the value of the magnitude (power) or pitch of the audio data in each period.

Patent Document 1: Japanese Patent No. 3249567
Patent Document 2: Japanese Patent No. 3219892
Patent Document 3: Japanese Patent No. 3220043
Patent Document 4: Japanese Patent No. 3357742
Patent Document 5: Japanese Patent No. 3373933
Patent Document 6: Japanese Patent No. 3619946

What Patent Documents 1 to 5 have in common is that the input signal is divided into speech intervals, which contain speech, and silent intervals, which do not. The duration of the speech intervals is locally expanded or compressed based on some information, the silent intervals are shortened, and the overall duration is adjusted as a result. These techniques work well when the input signal contains only a human voice. For an input signal in which speech is mixed with background sound, such as a broadcast programme, however, there is no guarantee whether an interval containing only background sound will be judged a "silent interval" or a "speech interval". When such an interval is misjudged, the shortening of silent intervals is lost, so the expansion ratio of the speech intervals cannot be increased and the rate-converted speech does not become easy to listen to.

As for Patent Document 6, the magnitude (power) of the input signal can be obtained in every interval, but the fundamental frequency can be obtained correctly only in "voiced intervals", where the speaker's vocal cords vibrate. Consequently, for an input signal in which speech is mixed with background sound, an interval containing only background sound has high power but no correctly obtainable fundamental frequency. Although the speech rate should be raised in such a non-speech interval, the high power may instead cause the control to slow it down.

The conventional techniques thus have the drawback that adaptive speech-rate conversion does not behave as expected for input signals in which speech intervals and silent intervals cannot be discriminated accurately, such as signals in which speech is mixed with background sound.

An object of the present invention is to provide a speech-rate conversion device and program capable of stable adaptive speech-rate conversion even for an input signal in which speech is mixed with background sound.

To solve the above problem, the speech-rate conversion device of the present invention is a speech-rate conversion device that performs adaptive speech-rate conversion of an input signal, comprising: a physical index calculation unit that calculates a physical index of the input signal for each segment obtained by dividing the input signal into unit times; and a speech-rate conversion factor determination unit that determines, according to the physical index calculated by the physical index calculation unit, the speech-rate conversion factor to be designated to each segment of the input signal, and performs speech-rate conversion.

In the speech-rate conversion device of the present invention, the physical index calculation unit comprises a voicing degree calculation unit that calculates, as the physical index, a voicing degree representing the relative maximum value obtained by autocorrelation per unit time in the input signal.

In the speech-rate conversion device of the present invention, the physical index calculation unit comprises an irregularity degree calculation unit that calculates, as the physical index, an irregularity degree representing the change tendency of the trajectory of the fundamental frequency and pseudo fundamental frequency per unit time in the input signal.

In the speech-rate conversion device of the present invention, the physical index calculation unit comprises a divided-band power ratio computation unit that calculates, as the physical index, the ratio between the low-band power component and the high-band power component obtained by band division per unit time in the input signal.

In the speech-rate conversion device of the present invention, the speech-rate conversion factor determination unit comprises a speech-rate conversion factor fine adjustment unit that, when a playback speed conversion factor is given for the input signal as a whole, finely adjusts the determined speech-rate conversion factors so that they conform to that playback speed conversion factor.

In the speech-rate conversion device of the present invention, the speech-rate conversion factor determination unit comprises a speech-rate conversion factor fine adjustment unit that determines the speech-rate conversion factor to be designated to each segment of the input signal by using one or more physical indices among the voicing degree, the irregularity degree, and the divided-band power ratio.

In the speech-rate conversion device of the present invention, the speech-rate conversion factor fine adjustment unit allocates the speech-rate conversion factor among one or more of the physical indices (the voicing degree, the irregularity degree, and the divided-band power ratio) according to the type of the input signal.

The present invention is also characterized as a program that causes a computer configured as a speech-rate conversion device performing adaptive speech-rate conversion of an input signal to execute: a step of calculating, for each segment obtained by dividing the input signal into unit times, one or more physical indices among a voicing degree representing the relative maximum value obtained by autocorrelation per unit time in the input signal, an irregularity degree representing the change tendency of the trajectory of the fundamental frequency and pseudo fundamental frequency per unit time in the input signal, and a divided-band power ratio representing the ratio between the low-band power component and the high-band power component obtained by band division per unit time in the input signal; and a step of determining, according to the physical indices calculated in that step, the speech-rate conversion factor to be designated to each segment of the input signal, and performing speech-rate conversion.

According to the present invention, adaptive speech-rate conversion is based on physical indices of the input signal. It can therefore be performed even for input signals in which speech is mixed with background sound, for which the conventional techniques cannot accurately discriminate "speech intervals" from "silent intervals". That is, even for such signals, the conversion stably enhances the impression of slower speech while sounding natural.

FIG. 1 is a block diagram of a speech-rate conversion device according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the operation of the speech-rate conversion device according to the embodiment.
FIG. 3 illustrates the operation of the voicing degree calculation unit in the speech-rate conversion device according to the embodiment: (a) shows the window function applied to the speech waveform of the input signal, and (b) shows the calculation of the voicing degree from the autocorrelation function.
FIG. 4 illustrates the operation of the fundamental frequency / pseudo fundamental frequency irregularity calculation unit in the speech-rate conversion device according to the embodiment.

A speech-rate conversion device according to an embodiment of the present invention is described below. The device of this embodiment includes all of the elements of the invention, but the following description will make clear that several variations are possible.

[Device configuration]
FIG. 1 shows a block diagram of a speech-rate conversion device according to an embodiment of the present invention. The speech-rate conversion device 1 of this embodiment comprises a physical index calculation unit 2, which calculates a physical index of the input signal for each segment obtained by dividing the input signal into unit times, and a speech-rate conversion factor determination unit 3, which determines, according to the physical index calculated by the physical index calculation unit 2, the speech-rate conversion factor αn to be designated to each segment of the input signal and performs speech-rate conversion. The device thereby performs adaptive speech-rate conversion of the input signal. Here n is an integer giving the position of the segment when the input signal is divided from its beginning at intervals of, for example, 5 ms. In the following description this interval, which defines the segment (interval) per unit time, is taken to be 5 ms.

The physical index calculation unit 2 comprises a voicing degree calculation unit 100, a fundamental frequency / pseudo fundamental frequency irregularity calculation unit 200, an irregularity degree calculation unit 210, a frequency band / power computation unit 300, and a divided-band power ratio computation unit 310. The fundamental frequency / pseudo fundamental frequency irregularity calculation unit 200 has a fundamental frequency extraction unit 202, a pseudo fundamental frequency calculation unit 204, and a fundamental frequency trajectory concatenation unit 206. The frequency band / power computation unit 300 has a spectrum calculation unit 302, a band division unit 304, and a power computation unit 306.

The speech-rate conversion factor determination unit 3 comprises a first speech-rate conversion factor designation unit (designation unit a) 120, a second speech-rate conversion factor designation unit (designation unit b) 220, a third speech-rate conversion factor designation unit (designation unit c) 320, and a speech-rate conversion factor fine adjustment unit 400.

In overview, the speech-rate conversion device 1 of this embodiment uses one or more "physical indices" among the following three: the "voicing degree" Un, which represents the relative maximum value obtained by autocorrelation per unit time in the input signal; the "irregularity degree" Sn, which represents the change tendency of the trajectory of the fundamental frequency and pseudo fundamental frequency per unit time in the input signal; and the "divided-band power ratio" En, which represents the ratio between the low-band power component and the high-band power component obtained by band division per unit time in the input signal. From these it determines the speech-rate conversion factor αn to be designated to each segment of the input signal, performs speech-rate conversion, and generates and outputs the rate-converted output signal.

The determination of the speech-rate conversion factor for each interval of the input signal from the physical indices (the "voicing degree" Un, the "irregularity degree" Sn, and the "divided-band power ratio" En) is described below in that order. The "speech-rate conversion factor" used below corresponds to the reciprocal of the temporal expansion ratio applied to a speech interval per unit time of the input signal.
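
Because the speech-rate conversion factor is the reciprocal of the temporal expansion ratio, a segment of length T converted with factor αn plays for T/αn. The following is a minimal sketch of that arithmetic, not code from the patent; the function names and the example factors are hypothetical. It illustrates how segments slowed below a target factor can be balanced by segments sped up above it so that the total duration matches uniform conversion.

```python
SEGMENT_MS = 5.0  # unit time used in the description


def output_duration_ms(factors, segment_ms=SEGMENT_MS):
    """Total playback time when segment n is converted with factors[n].

    A factor below 1.0 stretches the segment (slower speech); a factor
    above 1.0 compresses it (faster speech).
    """
    return sum(segment_ms / a for a in factors)


def uniform_duration_ms(num_segments, alpha, segment_ms=SEGMENT_MS):
    """Playback time when every segment uses the same factor alpha."""
    return num_segments * segment_ms / alpha


# Hypothetical example: two segments slowed to 0.8x, two sped up to 4/3x.
factors = [0.8, 0.8, 4.0 / 3.0, 4.0 / 3.0]
adaptive_ms = output_duration_ms(factors)
uniform_ms = uniform_duration_ms(len(factors), 1.0)
print(adaptive_ms, uniform_ms)  # the two totals agree up to rounding
```

Note that the durations, not the factors, are what must balance: the mean of the example factors is not 1.0, yet the total playback time equals that of uniform 1x playback.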

First, the determination of the speech-rate conversion factor from the voicing degree Un is described with reference to FIGS. 1 and 3. FIG. 3 illustrates the operation of the voicing degree calculation unit 100 in the speech-rate conversion device 1 according to the embodiment: (a) shows the window function applied to the waveform of the input signal, and (b) shows the calculation of the voicing degree from the autocorrelation function.

(Determination of the speech-rate conversion factor from the voicing degree)
For each segment obtained by dividing the waveform of the input signal into predetermined unit times, the voicing degree calculation unit 100 calculates as a "physical index" the voicing degree Un = W(τ)·R(τ)max/R(0). This quantity is defined by the time lag τ from the start of the segment, the reference value R(0) of the autocorrelation function R(τ) at τ = 0, the maximum value R(τ)max of R(τ) for τ > 0, and a weight W(τ) predetermined according to the value of τ that gives R(τ)max. More specifically, for each segment (the n-th interval) of an input signal in which, for example, broadcast speech and background sound are mixed, the voicing degree calculation unit 100 obtains the autocorrelation function R(τ) from the waveform of the input signal, detects the maximum value R(τ)max of R(τ) for τ > 0, calculates the reference value R(0) at τ = 0, and, using the weight W(τ) predetermined according to the value of τ that gives R(τ)max, obtains the voicing degree Un = W(τ)·R(τ)max/R(0). Here τ is the time lag from the start of the n-th interval.

For example, as shown in FIG. 3(a), the waveform x(k) of the input signal is weighted by a window function (a Hamming window h(k)) to extract the weighted waveform x'(k). Next, as shown in FIG. 3(b), the autocorrelation function R(τ) of the weighted waveform x'(k) is computed for the lag τ. The voicing degree Un = W(τ)·R(τ)max/R(0) can then be obtained from the reference value R(0) of the autocorrelation function at τ = 0 and the weight W(τ).
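
The calculation described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patent's implementation: the text does not give the form of the weight W(τ), the analysis frame length, or the lag search range, so the sketch defaults W(τ) to 1 and takes the lag range as parameters measured in samples. All function names are hypothetical.

```python
import math
import random


def hamming(length):
    """Hamming window h(k) of the given length."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (length - 1))
            for k in range(length)]


def autocorrelation(x, lag):
    """R(lag) = sum over k of x[k] * x[k + lag]."""
    return sum(x[k] * x[k + lag] for k in range(len(x) - lag))


def voicing_degree(frame, min_lag, max_lag, weight=lambda lag: 1.0):
    """U = W(tau) * R(tau)max / R(0), with tau searched in [min_lag, max_lag].

    weight stands in for W(tau); its true form is not given in the text.
    """
    windowed = [s * h for s, h in zip(frame, hamming(len(frame)))]
    r0 = autocorrelation(windowed, 0)
    if r0 == 0.0:
        return 0.0  # all-zero frame; this handling is an assumption
    best_lag = max(range(min_lag, max_lag + 1),
                   key=lambda lag: autocorrelation(windowed, lag))
    return weight(best_lag) * autocorrelation(windowed, best_lag) / r0


# A periodic frame (period 40 samples) versus a noise frame.
periodic = [math.sin(2.0 * math.pi * k / 40.0) for k in range(400)]
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(400)]
u_voiced = voicing_degree(periodic, 20, 200)
u_noise = voicing_degree(noise, 20, 200)
print(u_voiced, u_noise)  # the periodic frame scores clearly higher
```

The normalisation by R(0) makes the index independent of signal level, which is the property that lets loud background sound score low as long as it is not periodic.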

Because the voicing degree Un thus represents the relative maximum value obtained by autocorrelation per unit time in the input signal, an alternative calculation is also possible: the number of zero crossings of the input waveform within the unit time (5 ms in this example) is counted, and the reciprocal of the count is taken as the voicing degree Un.
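
The zero-crossing alternative can be sketched in a few lines. The function names are hypothetical, and the value returned when a segment has no zero crossing at all is an assumption, since the text does not cover that case.

```python
def zero_crossings(frame):
    """Count sign changes between consecutive samples."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a < 0.0) != (b < 0.0))


def voicing_degree_from_zero_crossings(frame):
    """Reciprocal of the zero-crossing count within one segment."""
    count = zero_crossings(frame)
    # Assumption: cap the index at 1.0 when there is no crossing,
    # to avoid dividing by zero.
    return 1.0 / count if count > 0 else 1.0


print(voicing_degree_from_zero_crossings([1.0, -1.0, 1.0, -1.0, 1.0]))  # 0.25
print(voicing_degree_from_zero_crossings([1.0, 2.0, 3.0, -1.0, -2.0]))  # 1.0
```

A noise-like segment crosses zero often and so receives a small index, while a low-pitched voiced segment crosses zero rarely and receives a large one.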

Next, the first speech-rate conversion factor designation unit (designation unit a) 120 determines, according to the value of the voicing degree Un, the speech-rate conversion factor αa_n, which defines the expansion ratio for the speech interval per unit time of the input signal, so that the speech rate is slowed when Un is greater than a predetermined threshold and raised when Un is at or below that threshold.

For example, when the voicing degree Un is calculated with the autocorrelation function R(τ) as above, it was found to take values in a range of roughly -0.2 to 1.2 for most input signals. Accordingly, the half-width Ub of the range over which Un is expected to vary is defined (for example, Ub = 0.7), together with a reference value Ua corresponding to the centre of this range (for example, Ua = 0.5). If the speech rate is slowed (αa_n < 1.0) when Un is greater than Ua and raised (αa_n ≥ 1.0) when Un is at or below the threshold Ua, the factor can be expressed as in equation (1).

Here, K is a constant that serves as the reference for how far the speech rate is slowed or raised; for example, K = 1.4 may be used as the constant giving the expansion ratio that corresponds to the predetermined slowest speech-rate conversion factor. Ra is the contribution ratio for the speech-rate conversion factor αa_n designated by the voicing degree Un; it expresses the degree of allocation when the speech-rate conversion factor is allocated on the basis of the physical indices, namely the "voicing degree" Un, the "irregularity degree" Sn, and the "divided-band power ratio" En.
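
Equation (1) itself is not reproduced in this text, so the following is explicitly not the patent's formula. It is a hypothetical mapping chosen only because it satisfies the properties stated above: the factor is below 1.0 when Un exceeds Ua, at or above 1.0 otherwise, reaches the expansion ratio K at the upper edge of the expected range, and has no effect when the contribution ratio Ra is zero. The function name and the clamping step are assumptions.

```python
K = 1.4    # expansion ratio of the slowest setting (example value from the text)
UA = 0.5   # reference value Ua: centre of the expected range of Un
UB = 0.7   # half-width Ub of the expected range of Un


def factor_from_voicing_degree(un, ra=1.0, k=K, ua=UA, ub=UB):
    """Hypothetical mapping from voicing degree Un to a conversion factor.

    NOT equation (1) of the patent, which is missing from this text.
    """
    # Normalised position of Un in its expected range, clamped to [-1, 1].
    position = max(-1.0, min(1.0, (un - ua) / ub))
    # position > 0 gives a factor below 1.0 (slower speech);
    # position <= 0 gives a factor of 1.0 or more (faster speech).
    return k ** (-ra * position)


for un in (-0.2, 0.5, 1.2):
    print(un, factor_from_voicing_degree(un))
```

With these example constants the factor spans 1/K to K, so the most strongly voiced segments are stretched by the expansion ratio 1.4 and the least voiced segments are compressed by the same ratio.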

As described above, the speech-rate conversion factor αa_n can be determined for each unit time of the input signal from the physical index "voicing degree" Un.

Next, the determination of the speech-rate conversion factor from the irregularity degree is described with reference to FIGS. 1 and 4. FIG. 4 illustrates the operation of the fundamental frequency / pseudo fundamental frequency irregularity calculation unit 200 in the speech-rate conversion device according to the embodiment.

(Determination of the speech-rate conversion factor from the irregularity degree)
For the input signal, the fundamental frequency extraction unit 202 determines as "stable intervals" the regions in which the fundamental frequency extracted every unit time (5 ms in this example) changes stably and almost continuously within a predetermined range of variation, and determines the regions between stable intervals as "unstable intervals". It identifies the fundamental frequency within each stable interval and smooths the trajectory formed by the fundamental frequencies of each stable interval so that it becomes smoother still. A low-pass filter with a cutoff frequency of about 3 to 6 Hz is suitable for this smoothing. Any known technique may be used to extract the fundamental frequency every unit time (5 ms in this example); see, for example, Japanese Patent No. 3219868.
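
The split into stable and unstable intervals can be sketched as below. The text speaks only of a "predetermined range of variation", so the frame-to-frame jump limit, the minimum run length, and the function name are all assumptions made for illustration; the low-pass smoothing step is not included.

```python
def stable_intervals(f0, max_jump_hz=10.0, min_frames=5):
    """Return (start, end) index pairs, end exclusive, of stable runs.

    f0[n] is the fundamental-frequency estimate for 5 ms frame n, or None
    when no estimate is available. max_jump_hz and min_frames are
    illustrative thresholds, not values from the patent.
    """
    intervals = []
    start = None
    for n, value in enumerate(f0):
        continues = (start is not None and value is not None
                     and abs(value - f0[n - 1]) <= max_jump_hz)
        if continues:
            continue
        # The current run, if any, ends just before frame n.
        if start is not None and n - start >= min_frames:
            intervals.append((start, n))
        start = n if value is not None else None
    if start is not None and len(f0) - start >= min_frames:
        intervals.append((start, len(f0)))
    return intervals


track = [120.0, 122.0, 121.0, 123.0, 124.0, 125.0, 300.0, 90.0, None,
         200.0, 202.0, 204.0, 203.0, 201.0]
print(stable_intervals(track))  # [(0, 6), (9, 14)]
```

Frames outside the returned intervals (here frames 6 to 8) form the unstable intervals.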

The fundamental frequency extraction unit 202 further outputs the information on the stable and/or unstable intervals, together with the fundamental frequency values of the smoothed trajectory in the stable intervals, to the pseudo fundamental frequency calculation unit 204 and the fundamental frequency trajectory concatenation unit 206.

Note that the fundamental frequency extraction unit 202 discards all fundamental frequency values in the “unstable sections”, that is, the sections in which the extracted fundamental frequency values are unstable, discontinuous, and rapidly changing.

The pseudo fundamental frequency calculation unit 204 interpolates the fundamental frequency values of the smoothed trajectory in the stable sections, supplied from the fundamental frequency extraction unit 202, using an interpolation function such as a spline function. It thereby determines the pseudo fundamental frequency values in the unstable sections and outputs them to the fundamental frequency trajectory linking unit 206. The start and end portions of the input signal to be processed are often not speech sections and therefore become unstable sections for which a pseudo fundamental frequency must be obtained. When such a section is interpolated with a spline function, a prescribed value (for example, 30 Hz, a value so low that it almost never occurs as the fundamental frequency of speech) is set at the start point or end point, and the spline interpolation uses this value together with the fundamental frequency values in the one adjacent stable section.

The fundamental frequency trajectory linking unit 206 links the fundamental frequency values of the smoothed trajectory in the stable sections, supplied from the fundamental frequency extraction unit 202, with the pseudo fundamental frequency values in the unstable sections, supplied from the pseudo fundamental frequency calculation unit 204. It thereby obtains a continuous trajectory (hereinafter the “fundamental frequency trajectory”) consisting of the fundamental frequency and the pseudo fundamental frequency over all sections (every 5 ms in this example) of the input signal to be processed, and sends the per-unit-time fundamental frequency values that make up this trajectory to the unevenness degree calculation unit 210.
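The filling of unstable sections and the linking into one trajectory can be sketched as follows. This is only an illustration under stated assumptions: it uses plain linear interpolation as a stand-in for the spline function named in the text, and the function name `fill_unstable`, the boolean `stable` mask, and the parameter `edge_hz` are hypothetical names introduced here, not taken from the specification.

```python
def fill_unstable(f0, stable, edge_hz=30.0):
    """Return a continuous F0 trajectory, one value per 5 ms frame.

    f0      -- smoothed F0 values; entries in unstable frames are ignored
    stable  -- True where the frame belongs to a stable section
    edge_hz -- prescribed value set at an unstable start or end point
               (30 Hz is the example value given in the text)

    Assumption: linear interpolation replaces the spline of the text.
    """
    n = len(f0)
    # Knots are the stable frames, whose values are kept unchanged.
    knots = [(i, f0[i]) for i in range(n) if stable[i]]
    if not knots:
        raise ValueError("at least one stable frame is required")
    # An unstable start or end is anchored at the prescribed low value.
    if not stable[0]:
        knots.insert(0, (0, edge_hz))
    if not stable[-1]:
        knots.append((n - 1, edge_hz))

    out = list(f0)
    for i, v in knots:
        out[i] = v
    # Fill every frame lying strictly between two consecutive knots.
    for (i0, v0), (i1, v1) in zip(knots, knots[1:]):
        for i in range(i0 + 1, i1):
            t = (i - i0) / (i1 - i0)
            out[i] = v0 + t * (v1 - v0)
    return out


track = fill_unstable(
    [0, 0, 120, 130, 0, 0, 150, 0],
    [False, False, True, True, False, False, True, False],
)
```

Replacing the inner linear step with a cubic spline through the same knots would bring the sketch closer to the interpolation the text describes.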

The unevenness degree calculation unit 210 calculates, as a “physical index”, the unevenness degree that represents the change tendency of the smoothed fundamental frequency trajectory. This trajectory is defined by the fundamental frequency values in the stable sections (regions in which the fundamental frequency extracted from the input signal every unit time changes stably and almost continuously within a predetermined range of variation) and/or the pseudo fundamental frequency values in the unstable sections between them. More specifically, for each value Pn of the fundamental frequency per unit time (every 5 ms in this example) on the fundamental frequency trajectory, the unevenness degree calculation unit 210 samples the value P1 a predetermined time earlier (for example, the value 30 ms before the time of Pn) and the value P2 a predetermined time later (for example, the value 30 ms after the time of Pn), and obtains the average of the front difference (Pn − P1) and the rear difference (Pn − P2) over all sections of the input signal to be processed. It normalizes each of these averages by dividing it by the maximum of the averages over all sections, takes each normalized average as the “unevenness degree” Sn representing the change tendency of the fundamental frequency trajectory, and sends the calculated unevenness degree Sn to the second speech speed conversion magnification designation unit (speech speed conversion magnification designation unit b) 220.

For example, the unevenness degree Sn is close to 0 in sections where the fundamental frequency trajectory is flat, monotonically increasing, or monotonically decreasing. Because the normalization uses the value with the largest absolute value among all the unevenness degrees Sn, each value of Sn lies in the range −1 to 1.
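The calculation of Sn described above can be sketched as follows. The sketch assumes a 5 ms frame step, so the ±30 ms offsets become ±6 frames. The text does not say how the first and last 30 ms are handled; clamping the sample positions to the ends of the trajectory is an assumption made here, and the function name `unevenness_degree` is hypothetical.

```python
def unevenness_degree(f0_track, lag_frames=6):
    """Compute the normalized unevenness degree Sn for every frame.

    f0_track   -- continuous F0 trajectory, one value per 5 ms frame
    lag_frames -- offset of P1 and P2 from Pn (6 frames = 30 ms)
    """
    n = len(f0_track)
    raw = []
    for i, p_n in enumerate(f0_track):
        # Assumption: indices are clamped at the edges of the signal.
        p1 = f0_track[max(i - lag_frames, 0)]
        p2 = f0_track[min(i + lag_frames, n - 1)]
        # Average of the front difference and the rear difference.
        raw.append(((p_n - p1) + (p_n - p2)) / 2.0)
    # Normalize by the largest absolute value so that -1 <= Sn <= 1.
    peak = max((abs(v) for v in raw), default=0.0)
    if peak == 0.0:
        return [0.0] * n
    return [v / peak for v in raw]


s = unevenness_degree([0, 1, 2, 3, 2, 1, 0], lag_frames=1)
flat = unevenness_degree([100.0] * 5)
```

In the small triangular example, the peak frame receives Sn = 1 and the frames on the straight rising and falling slopes receive Sn = 0, which matches the behaviour the text describes for monotonic sections.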

The second speech speed conversion magnification designation unit (speech speed conversion magnification designation unit b) 220 determines the speech speed conversion magnification αb_n, which defines the expansion/contraction rate for the speech section per unit time of the input signal, according to each per-unit-time (every 5 ms in this example) value of the unevenness degree Sn supplied from the unevenness degree calculation unit 210, so that the speech speed is slowed when Sn is positive and increased when Sn is negative. That is, on the fundamental frequency trajectory, the speech speed is slowed where the trajectory is convex like a peak (local maximum) and increased where it forms a valley (local minimum).

For example, the speech speed conversion magnification αb_n can be expressed as in Equation (2).

Here, K is the same as in Equation (1), and Rb is the contribution rate for the speech speed conversion magnification αb_n designated by the unevenness degree Sn. It represents the degree of allocation when the speech speed conversion magnification is allocated based on the physical indices “voicing degree” Un, “unevenness degree” Sn, and “divided band power ratio” En.

As described above, the speech speed conversion magnification αb_n can be determined for each unit time of the input signal from the physical index “unevenness degree” Sn.

Next, determination of the speech speed conversion magnification based on the divided band power ratio En in the frequency band division / power calculation unit 300 is described.

(Determination of speech speed conversion magnification by divided band power ratio)
The spectrum calculation unit 302 converts the time-domain waveform of the input signal into the frequency domain every unit time (5 ms in this example) by an FFT (fast Fourier transform) or the like, obtains the logarithmic power spectrum at each frequency in dB, and sends it to the band division unit 304.

The band division unit 304 divides the logarithmic power spectrum supplied from the spectrum calculation unit 302 into a plurality of predetermined frequency bands and sends the logarithmic power spectrum values of each band to the power calculation unit 306. For example, when dividing into five bands, the bands can be B1: 0 to 300 Hz, B2: 300 to 1500 Hz, B3: 1500 to 3000 Hz, B4: 3000 to 8000 Hz, and B5: 8000 Hz and above. Note that a simple division into two bands may also be used.

From the logarithmic power spectrum values of each band supplied from the band division unit 304, the power calculation unit 306 obtains the normalized power component of a lower band and of a higher band, each arbitrarily selected in advance, and sends the normalized low-band power component and high-band power component to the divided band power ratio calculation unit 310. The normalized power component of a band is obtained by summing the power spectrum values contained in that band and dividing the sum by the number of those values. For example, suppose that with the five-band division above, B2 is selected in advance as the lower band and B4 as the higher band. In this case, the normalized power components of the lower band B2 and the higher band B4 are output to the divided band power ratio calculation unit 310.

The divided band power ratio calculation unit 310 calculates, as a “physical index”, the ratio (divided band power ratio En) between the low-band power component and the high-band power component, both arbitrarily selected in advance, from the logarithmic power spectrum values of the bands into which the input signal is divided. More specifically, it calculates the ratio of the normalized low-band power component to the normalized high-band power component supplied from the power calculation unit 306 and sends this divided band power ratio En to the third speech speed conversion magnification designation unit (speech speed conversion magnification designation unit c) 320. When the normalized low-band and high-band power components supplied from the power calculation unit 306 are already expressed as logarithmic values (dB), En can be obtained as the difference of these values, by subtracting the normalized high-band power from the normalized low-band power.
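The dB-domain calculation of En can be sketched as follows, taking the example bands B2 (300 to 1500 Hz) and B4 (3000 to 8000 Hz) from the text. The function names and the representation of the spectrum as two parallel lists are hypothetical. Treating each band as a half-open interval [lo, hi) is an assumption, since the text does not say which band a bin on a boundary belongs to.

```python
def band_mean_db(freqs_hz, power_db, lo_hz, hi_hz):
    """Normalized power of one band: sum of its dB values / bin count."""
    vals = [p for f, p in zip(freqs_hz, power_db) if lo_hz <= f < hi_hz]
    if not vals:
        raise ValueError("no spectral bins fall inside the band")
    return sum(vals) / len(vals)


def divided_band_power_ratio(freqs_hz, power_db,
                             low_band=(300.0, 1500.0),
                             high_band=(3000.0, 8000.0)):
    """En in dB: normalized low-band power minus normalized high-band power."""
    low = band_mean_db(freqs_hz, power_db, *low_band)
    high = band_mean_db(freqs_hz, power_db, *high_band)
    return low - high


freqs = [100, 500, 1000, 2000, 4000, 6000]
spectrum_db = [-10, -20, -30, -40, -50, -60]
en = divided_band_power_ratio(freqs, spectrum_db)
```

For this toy spectrum the low band averages −25 dB and the high band −55 dB, so En is 30 dB, which lies inside the 10 dB to 40 dB range the specification reports for most input signals.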

The third speech speed conversion magnification designation unit (speech speed conversion magnification designation unit c) 320 determines the speech speed conversion magnification αc_n, which defines the expansion/contraction rate for the speech section per unit time of the input signal, according to the value of the divided band power ratio En, so that the speech speed is slowed when En is greater than a predetermined threshold and increased when En is equal to or less than the threshold.

For example, it was found that the divided band power ratio En takes values in the range of about 10 dB to 40 dB for most input signals. The third speech speed conversion magnification designation unit (speech speed conversion magnification designation unit c) 320 therefore defines a half-width Eb (for example, Eb = 15) of the range of variation the input signal is expected to take, and a reference value Ea (for example, Ea = 25) corresponding to the center of that range. The magnification is then defined as in Equation (3), so that the speech speed is slowed when En is greater than Ea = 25 dB and increased when En is Ea = 25 dB or less.

Here, K is the same as in Equation (1), and Rc is the contribution rate for the speech speed conversion magnification αc_n designated by the divided band power ratio En. It represents the degree of allocation when the speech speed conversion magnification is allocated based on the physical indices “voicing degree” Un, “unevenness degree” Sn, and “divided band power ratio” En described above.

As described above, the speech speed conversion magnification αc_n can be determined for each unit time of the input signal from the physical index “divided band power ratio” En.

As described above, the speech speed conversion apparatus 1 of this embodiment performs speech speed conversion by determining the speech speed conversion magnification of each section of the input signal using one or more of the “physical indices” “voicing degree” Un, “unevenness degree” Sn, and “divided band power ratio” En.

(Advanced adaptive speech speed conversion)
To realize adaptive speech speed conversion, the speech speed conversion apparatus 1 of this embodiment includes a speech speed conversion magnification fine adjustment unit 400, which determines the speech speed conversion magnification of each section of the input signal using one or more of the “physical indices” “voicing degree” Un, “unevenness degree” Sn, and “divided band power ratio” En.

The speech speed conversion magnification fine adjustment unit 400 receives the speech speed conversion magnification αa_n determined by the first speech speed conversion magnification designation unit (speech speed conversion magnification designation unit a) 120, the speech speed conversion magnification αb_n determined by the second speech speed conversion magnification designation unit (speech speed conversion magnification designation unit b) 220, and the speech speed conversion magnification αc_n determined by the third speech speed conversion magnification designation unit (speech speed conversion magnification designation unit c) 320. It adds the values of αa_n, αb_n, and αc_n, which have been allocated at a distribution ratio corresponding to a preset type of input signal (for example, the genre of a program), and performs speech speed conversion for each unit time of the input signal using the resulting speech speed conversion magnification αn (αn = αa_n + αb_n + αc_n).

For example, the values of the contribution rates Ra, Rb, and Rc in Equations (1) to (3) are changed in order to allocate the values of the speech speed conversion magnifications αa_n, αb_n, and αc_n. When the input signal is the audio of a broadcast program, for instance, the allocation can be made by changing the distribution of the contribution rates Ra, Rb, and Rc according to the genre of the program (news, documentary, drama, variety, rakugo, manzai, and so on). This enables adaptive speech speed conversion of higher quality, in terms of ease of listening and naturalness, suited to the genre of the broadcast program. For example, the contribution rates can be set to Ra = 0.5, Rb = 0.3, Rc = 0.2 when the input signal is news audio, and to Ra = 0.2, Rb = 0.6, Rc = 0.2 when the input signal is rakugo or manzai.

When the speech speed is determined based on the “unevenness degree” Sn, that is, based on “the uneven state of the trajectory of the fundamental frequency or pseudo fundamental frequency smoothed over the entire speech”, the method can be applied to viewing the audio of a broadcast program that has first been recorded on a hard disk recorder or the like, but it is not suitable for cases such as real-time viewing of a broadcast program. Therefore, in a case such as real-time viewing, for example when the delay from input to output in the speech speed conversion apparatus 1 is required to be less than 100 ms, the contribution rate Rb of the speech speed conversion magnification αb_n determined by the “unevenness degree” Sn can be set to Rb = 0. In this way, the speech speed conversion apparatus 1 of this embodiment can perform advanced adaptive speech speed conversion by allocating the speech speed at a distribution ratio corresponding to the type of the input signal.
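The selection of contribution rates by genre, with Rb set to 0 for real-time use, can be sketched as follows. The table holds only the two example settings given in the text. The dictionary `GENRE_RATES`, the function `contribution_rates`, and the `real_time` flag are hypothetical names. The text does not say whether Ra and Rc are rescaled when Rb is set to 0, so this sketch leaves them unchanged.

```python
# Example contribution rates (Ra, Rb, Rc) quoted in the text.
GENRE_RATES = {
    "news": (0.5, 0.3, 0.2),
    "rakugo": (0.2, 0.6, 0.2),
    "manzai": (0.2, 0.6, 0.2),
}


def contribution_rates(genre, real_time=False):
    """Return (Ra, Rb, Rc) for a genre; Rb is forced to 0 in real time."""
    ra, rb, rc = GENRE_RATES[genre]
    if real_time:
        # Sn needs the trajectory of the whole signal, so the
        # unevenness-based term is switched off for low-latency use.
        rb = 0.0
    return ra, rb, rc


news = contribution_rates("news")
live = contribution_rates("rakugo", real_time=True)
```

Rates for other genres listed in the text (documentary, drama, variety) are not given there and would have to be chosen separately.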

The speech speed conversion magnification fine adjustment unit 400 also has a fine adjustment function. When a reproduction speed conversion magnification α [times speed] to be applied to the entire input signal is given, it finely adjusts the speech speed conversion magnification αn (αn = αa_n + αb_n + αc_n) assigned to the n-th unit time (every 5 ms in this example), counted from the start of the input signal, so that the resulting signal length matches the reproduction speed conversion magnification α.

For example, let the length of the entire input signal be L [seconds], and suppose an arbitrary reproduction speed conversion magnification α [times speed] is given for the entire signal waveform. To make the length of the entire signal after speech speed conversion equal to L/α [seconds], the speech speed conversion magnification fine adjustment unit 400 calculates the length L0 [seconds] of the entire signal obtained by concatenating the sections converted with the speech speed conversion magnifications αn, and finely adjusts αn according to the following Equation (4) so that the signal length matches the reproduction speed conversion magnification α.

αn = (αa_n + αb_n + αc_n) × L0 / (L / α)   (4)

Note that the reproduction speed conversion magnification α can be set to any value, such as 0.5 to 5.0.

That is, the speech speed conversion magnification fine adjustment unit 400 recalculates the speech speed conversion magnification αn for each unit time according to Equation (4) and performs speech speed conversion with the recalculated values, thereby matching the length of the converted signal waveform to the prescribed length.
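The rescaling of Equation (4) can be checked numerically with the following sketch. It assumes that αn acts as a speed multiple, so that a unit-time section converted at αn lasts (unit time)/αn after conversion. That reading is consistent with Equation (4) and with the target length L/α, but it is an interpretation and is not stated explicitly in the text. The function names are hypothetical.

```python
UNIT_S = 0.005  # unit time of 5 ms, as in the example of the text


def converted_length(alpha_n, unit_s=UNIT_S):
    """Length after conversion, assuming each section lasts unit_s / alpha."""
    return sum(unit_s / a for a in alpha_n)


def fine_adjust(alpha_n, alpha, unit_s=UNIT_S):
    """Rescale every alpha_n by L0 / (L / alpha), following Equation (4)."""
    total_len = len(alpha_n) * unit_s        # L
    l0 = converted_length(alpha_n, unit_s)   # L0
    scale = l0 / (total_len / alpha)
    return [a * scale for a in alpha_n]


raw = [0.8, 1.0, 1.25, 2.0]   # alpha_a_n + alpha_b_n + alpha_c_n per section
adjusted = fine_adjust(raw, alpha=1.5)
target = len(raw) * UNIT_S / 1.5
```

After the adjustment, `converted_length(adjusted)` agrees with `target` up to floating-point rounding, while the relative differences between the sections are preserved.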

Furthermore, when the output is to be aligned as frequently as possible with the timing of speech converted uniformly at α [times speed], αn can be corrected so that the fine adjustment is applied not to the length L of the entire input signal but to the lengths of signals obtained by dividing it into shorter units. For example, the length of the entire input signal is divided into M parts, L = L_1 + L_2 + ... + L_M, and the input signal waveform is divided into the sections L_1, L_2, ..., L_M. For the m-th section, speech speed conversion is first performed using the speech speed conversion magnification αn = αa_n + αb_n + αc_n of each 5 ms portion of that section, the converted portions are concatenated, and the partial length L_m0 of the concatenated converted signal waveform is calculated. Then, in Equation (4), the partial length L_m of the signal waveform is used in place of the length L of the entire input signal, and the partial length L_m0 in place of the length L0 of the concatenated converted signal waveform. Each speech speed conversion magnification αn is thereby recalculated and finely adjusted before speech speed conversion is performed.
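The section-wise variant can be sketched in the same way. As before, the sketch assumes that a unit-time section converted at αn lasts (unit time)/αn. It also assumes equal-sized blocks of `block_frames` frames, whereas the text allows arbitrary section lengths L_1, ..., L_M. All names are hypothetical.

```python
UNIT_S = 0.005  # unit time of 5 ms


def fine_adjust_blocks(alpha_n, alpha, block_frames, unit_s=UNIT_S):
    """Apply Equation (4) per block, with L_m and L_m0 replacing L and L0."""
    adjusted = []
    for start in range(0, len(alpha_n), block_frames):
        block = alpha_n[start:start + block_frames]
        l_m = len(block) * unit_s                 # L_m
        l_m0 = sum(unit_s / a for a in block)     # L_m0
        scale = l_m0 / (l_m / alpha)
        adjusted.extend(a * scale for a in block)
    return adjusted


raw = [0.8, 1.0, 1.25, 2.0, 1.0, 0.5]
out = fine_adjust_blocks(raw, alpha=2.0, block_frames=2)
# Converted length of each two-frame block after the adjustment.
block_lengths = [
    sum(UNIT_S / a for a in out[i:i + 2]) for i in range(0, len(out), 2)
]
```

Each block then lasts its own L_m/α, so the output re-synchronizes with uniformly converted speech at every block boundary rather than only at the end of the signal.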

Various methods have already been proposed for speech speed conversion (waveform expansion and contraction) given the speech speed conversion magnification αn. Methods that preserve the pitch of the voice (fundamental frequency) include, for example, the PICOLA (Pointer Interval Controlled OverLap and Add) method, the TDHS (Time Domain Harmonic Scaling) method, and the PSOLA (Pitch Synchronous OverLap Add) method. In addition, there are the waveform expansion/contraction methods disclosed in Japanese Patent Nos. 2612868, 3083830, and 2955247. Any of these methods may be used.

The advanced adaptive speech speed conversion operation of the speech speed conversion apparatus 1 of this embodiment is described with reference to FIG. 2. FIG. 2 is a flowchart showing the operation of the speech speed conversion apparatus according to one embodiment of the present invention.

In step S1, the speech speed conversion apparatus 1 receives the signal whose speech speed is to be adjusted, together with the parameters required for the adjustment (the contribution rates Ra, Rb, and Rc, which can be defined according to the type of the input signal, and the reproduction speed conversion magnification α). The input signal is supplied to the voicing degree calculation unit 100, the fundamental frequency extraction unit 202, and the spectrum calculation unit 302. The contribution rates Ra, Rb, and Rc are set in the first speech speed conversion magnification designation unit (speech speed conversion magnification designation unit a) 120, the second speech speed conversion magnification designation unit (speech speed conversion magnification designation unit b) 220, and the third speech speed conversion magnification designation unit (speech speed conversion magnification designation unit c) 320, respectively. The reproduction speed conversion magnification α is set in the speech speed conversion magnification fine adjustment unit 400.

First, the speech speed conversion apparatus 1 uses the voicing degree calculation unit 100 to obtain the voicing degree Un, as described above, for the n-th section of the input signal divided into predetermined unit times (step S2). It then uses the first speech speed conversion magnification designation unit (speech speed conversion magnification designation unit a) 120 to determine the speech speed conversion magnification αa_n, which defines the expansion/contraction rate for the signal waveform per unit time of the input signal, according to the value of the voicing degree Un, so that the speech speed is slowed when Un is greater than a predetermined threshold and increased when Un is equal to or less than the threshold (step S3).

Further, the speech speed conversion apparatus 1 uses the fundamental frequency extraction unit 202 to segment the input signal into “stable sections”, in which the fundamental frequency values extracted every unit time change stably and almost continuously within a predetermined range of variation, and “unstable sections”, which are the regions between stable sections. It identifies the fundamental frequency within each stable section, smooths the trajectory formed by the fundamental frequencies of each stable section to determine the fundamental frequency of the “stable sections”, and discards all fundamental frequency values in the “unstable sections” (steps S4 and S5).

Next, the speech speed conversion apparatus 1 uses the pseudo fundamental frequency calculation unit 204 to interpolate the fundamental frequency values of the smoothed trajectory in the stable sections, supplied from the fundamental frequency extraction unit 202, with an interpolation function such as a spline function. It thereby determines the pseudo fundamental frequency values in the unstable sections and replaces the original fundamental frequency values in the unstable sections with these pseudo fundamental frequency values (step S6).

Next, the speech speed conversion apparatus 1 uses the fundamental frequency trajectory linking unit 206 to link the fundamental frequency values of the smoothed trajectory in the stable sections, supplied from the fundamental frequency extraction unit 202, with the pseudo fundamental frequency values in the unstable sections, supplied from the pseudo fundamental frequency calculation unit 204. It thereby obtains the fundamental frequency trajectory, a continuous trajectory consisting of the fundamental frequency and the pseudo fundamental frequency over all sections of the input signal to be processed (step S7).

Next, the speech speed conversion apparatus 1 uses the unevenness degree calculation unit 210 to sample, for each value Pn of the fundamental frequency per unit time on the fundamental frequency trajectory, the value P1 a predetermined time earlier and the value P2 a predetermined time later, and obtains the average of the front difference (Pn − P1) and the rear difference (Pn − P2) over all sections of the input signal to be processed. It normalizes each of these averages by dividing it by the maximum of the averages over all sections, and calculates each normalized average as the “unevenness degree” Sn representing the change tendency of the fundamental frequency trajectory (step S8).

Next, the speech speed conversion apparatus 1 uses the second speech speed conversion magnification designation unit (speech speed conversion magnification designation unit b) 220 to determine the speech speed conversion magnification αb_n, which defines the expansion/contraction rate for the signal waveform per unit time of the input signal, according to each per-unit-time value of the unevenness degree Sn supplied from the unevenness degree calculation unit 210, so that the speech speed is slowed when Sn is positive and increased when Sn is negative (step S9).

Further, the speech speed conversion apparatus 1 uses the spectrum calculation unit 302 to convert the time-domain waveform of the input signal into the frequency domain every unit time and to calculate a spectrum distribution consisting of the logarithmic power spectrum at each frequency (step S10).

Next, the speech speed conversion apparatus 1 uses the band division unit 304 to divide the logarithmic power spectrum supplied from the spectrum calculation unit 302 into a predetermined number of frequency bands (step S11).

Next, the speech speed conversion apparatus 1 uses the power calculation unit 306 to obtain, from the logarithmic power spectrum values of each band supplied from the band division unit 304, the normalized power components of the low band and the high band, each arbitrarily selected in advance (step S12).

Next, the speech speed conversion apparatus 1 uses the divided band power ratio calculation unit 310 to calculate the “divided band power ratio” En, which is the ratio obtained by dividing the normalized low-band power component supplied from the power calculation unit 306 by the likewise normalized high-band power component (step S13).

Next, the speech speed conversion apparatus 1 uses the third speech speed conversion magnification designation unit (designation unit c) 320 to determine the speech speed conversion magnification αcn, which defines the expansion/contraction rate applied to the signal waveform per unit time of the input signal. The magnification is determined from the value of the divided band power ratio En, so that the speech speed is slowed when En is greater than a predetermined threshold and increased when En is at or below that threshold (step S14).

Finally, the speech speed conversion apparatus 1 uses the speech speed conversion magnification fine adjustment unit 400 to combine the per-section speech speed conversion magnifications determined from at least one of the physical indices "voicedness" Un, "unevenness degree" Sn, and "divided band power ratio" En. In accordance with the playback speed conversion magnification α, it determines the allocation used to form the final speech speed conversion magnification and performs speech speed conversion on the input signal (step S15).

Thus, the speech speed conversion apparatus 1 of this embodiment can adaptively control the speech speed conversion magnification αn according to the "voicedness" Un, which indicates the strength of the periodicity of the input signal waveform. The physical index Un can be obtained at every position of the input signal. It can also be obtained when background sound is mixed in, so stable speech speed conversion can be achieved.

Vowel portions of speech normally have a high voicedness Un. Completely silent portions, and background sounds such as music or noise in which the frequency components of many sounds are mixed, generally have low voicedness. The speech speed conversion apparatus 1 of this embodiment can therefore slow the speech speed where Un is high and increase it where Un is low. Even for an input signal mixed with background sound, the speech speed is slowed in vowel portions, which are important for understanding the speech, and increased in completely silent portions and portions containing only background sound. In addition, the conversion remains adaptive while the overall duration is matched to the target time length. Comparisons of slow and fast speech in actual human utterances have shown that it is mainly the vowel portions that expand and contract (see, for example, IEICE Transactions (A), Vol. J67-A, No. 7, July 1984, pp. 629-636). Because the apparatus converts the speech speed according to Un, adaptive speech speed conversion that sounds natural can be expected.
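A voicedness-style index can be sketched as the largest normalized autocorrelation value over a range of non-zero lags. This is a minimal sketch under assumptions: the normalization by frame energy, the lag search range, and the function name are hypothetical and are not taken from this document.

```python
def voicedness(frame, min_lag, max_lag):
    """Largest autocorrelation value over [min_lag, max_lag], normalized by frame energy.

    Periodic frames (such as vowels) give values close to 1;
    silence is assumed to give 0, and noise gives small values.
    """
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return 0.0  # completely silent frame
    best = 0.0
    for lag in range(min_lag, min(max_lag, len(frame) - 1) + 1):
        r = sum(frame[t] * frame[t + lag] for t in range(len(frame) - lag))
        best = max(best, r / energy)
    return best
```

For a 200-sample sine with a 20-sample period, the peak falls at lag 20 and the index is about 0.9, while an all-zero frame yields 0.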

Furthermore, the speech speed conversion apparatus 1 of this embodiment can adaptively control the speech speed conversion magnification αn according to the "unevenness degree" Sn, which expresses the convex or concave state of the fundamental frequency trajectory formed by the fundamental frequency or pseudo fundamental frequency smoothed over the whole of the input speech waveform. The apparatus basically slows the speech speed in sections where the fundamental frequency trajectory is convex like a peak and increases it in sections where the trajectory is concave like a valley, so adaptive speech speed conversion can be performed stably while the overall duration is matched to the target time length.

This differs from the approach of Patent Document 1, which slows the speech speed where the fundamental frequency is high and increases it where the fundamental frequency is low. As described above, the fundamental frequency can be obtained accurately in voiced portions such as vowels, but it cannot be obtained stably in other portions such as background sound. In stable sections, where the extracted fundamental frequency is stable and changes almost continuously, the speech speed conversion apparatus 1 of this embodiment therefore uses the extracted value as the fundamental frequency for determining the speech speed conversion magnification. It also smooths the fundamental frequency trajectory so that the trajectory used for determining the magnification becomes smoother.

In unstable sections, where the extracted fundamental frequency is not stable and changes discontinuously and sharply, the speech speed conversion apparatus 1 of this embodiment discards all fundamental frequency values of the unstable section and obtains a pseudo fundamental frequency by interpolating from the fundamental frequency values of the stable sections, for example with a spline function. The apparatus can thereby obtain a continuous fundamental frequency trajectory, consisting of fundamental or pseudo fundamental frequencies, over every section of the input signal. On this trajectory, the apparatus slows the speech speed in portions that are convex like a peak (local maxima) and increases it in portions that are concave like a valley (local minima), so adaptive speech speed conversion can be performed stably while the overall duration is matched to the target time length.
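The replacement of unstable fundamental frequency values by interpolated pseudo values can be sketched as below. Note the substitution: the text names spline interpolation as an example, whereas this sketch uses plain linear interpolation to stay dependency-free. The function name is hypothetical, and the per-frame stability mask is taken as given because the way stability is judged is not covered here.

```python
def fill_pseudo_f0(f0, stable):
    """Return a continuous F0 trajectory.

    f0     : extracted fundamental frequency per frame
    stable : True where the extracted value is trusted, False where it is discarded
    Discarded frames are filled by linear interpolation between the nearest
    stable frames (nearest stable value at the edges).
    """
    idx = [i for i, ok in enumerate(stable) if ok]
    if not idx:
        raise ValueError("no stable frame to interpolate from")
    out = list(f0)
    for i in range(len(f0)):
        if stable[i]:
            continue
        left = max((j for j in idx if j < i), default=None)
        right = min((j for j in idx if j > i), default=None)
        if left is None:
            out[i] = f0[right]       # before the first stable frame
        elif right is None:
            out[i] = f0[left]        # after the last stable frame
        else:
            w = (i - left) / (right - left)
            out[i] = (1 - w) * f0[left] + w * f0[right]
    return out
```

Smoothing and the convex/concave classification would then operate on the returned trajectory.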

The speech speed conversion apparatus 1 of this embodiment also has an advantage over the approach of Patent Document 1, which slows the speech speed where the fundamental frequency is high and increases it where the fundamental frequency is low. Consider, for example, an input signal such as a manzai comic dialogue by a male-female duo, which consists of mixed speech sections in which the male and female voices alternate rapidly with almost no pauses. With the approach of Patent Document 1, the female voice is high and is therefore always slowed, while the male voice is low and is therefore always sped up. In contrast, the smoothed fundamental frequency trajectory used by the apparatus of this embodiment always shows convex and concave portions, arising from accents and similar features of the utterance, in both the female and the male portions. Regardless of the difference between male and female voices, the apparatus can slow the speech speed in the convex portions of the trajectory and increase it in the concave portions, so the speech speed is controlled adaptively with a fair allocation to both speakers.

Furthermore, the speech speed conversion apparatus 1 of this embodiment can adaptively control the speech speed conversion magnification according to the "divided band power ratio" En of the input signal waveform, that is, the ratio of power components obtained by dividing the low-band side by the high-band side for two particular bands when the frequency spectrum is divided into a plurality of bands. This differs from the techniques of Patent Documents 4 and 5, which determine whether the input signal is a "speech section" or a "silent section" by comparing the power of a plurality of bands of a steady-state frequency spectrum with the power of the corresponding bands of the input signal's frequency spectrum. The divided band power ratio En involves no comparison with the power of a steady-state spectrum. It considers only the frequency spectrum of the input signal at a given instant: that spectrum is divided into bands, and the power ratio is obtained by dividing the low-band side by the high-band side for two of the resulting bands.

The speech speed conversion apparatus 1 of this embodiment has an advantage over the approach of Patent Documents 4 and 5, which divides the input signal into speech sections and silent sections, slows the speech speed in speech sections, and shortens silent sections.

For example, when the techniques of Patent Documents 4 and 5 are applied to an input signal in which fairly loud music or the like is mixed in as background sound, it is difficult, as noted earlier, to distinguish "speech sections" from "silent sections" correctly, and adaptive speech speed conversion cannot be performed. The speech speed conversion apparatus 1 of this embodiment, in contrast, considers only the frequency spectrum of the input signal at a given instant. It divides that spectrum into bands and determines the speech speed conversion magnification from the "divided band power ratio" En, obtained by dividing the low-band side by the high-band side for two of the resulting bands. Because no section classification is made, there is essentially no such thing as a classification error, and the speech speed can be controlled stably. For example, the speech speed can be slowed when the high-band power component is small relative to the low-band power component, and increased when the high-band power component is large relative to the low-band power component. In other words, En takes different values for speech sections, music, noise, silence, and so on in the input signal. By controlling the speech speed based on the value of En, the apparatus can slow the speech speed in speech sections and increase it in non-speech sections such as music, noise, and silence.

Furthermore, the speech speed conversion apparatus 1 of this embodiment determines the speech speed conversion magnification of each section of the input signal using one or more of the "physical indices" "voicedness" Un, "unevenness degree" Sn, and "divided band power ratio" En, which enables more sophisticated adaptive speech speed conversion. For example, a contribution rate (allocation) of 0.5 can be given to the magnification specified by Un, 0.3 to the magnification specified by Sn, and 0.2 to the magnification specified by En, and the final speech speed conversion magnification can be determined by adding the resulting speech speed components. When the input signal to be converted is broadcast sound, and in particular broadcast sound to which the program genre (news, documentary, drama, variety, rakugo, manzai, and so on) is attached as meta-information, a practice that has been actively developed in recent years, the contribution rates (allocation) can be changed according to the genre, enabling adaptive speech speed conversion that is easier to listen to and more natural.
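The weighted combination in the example above can be sketched as follows. The default weights 0.5, 0.3, and 0.2 are the contribution rates given in the text; the function name and the check that the weights sum to 1 are assumptions made for this sketch.

```python
def combine_magnifications(alpha_a, alpha_b, alpha_c, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of the magnifications derived from Un, Sn, and En.

    alpha_a, alpha_b, alpha_c : per-segment magnifications from Un, Sn, En
    weights                   : contribution rates (assumed to sum to 1)
    """
    w_a, w_b, w_c = weights
    if abs(w_a + w_b + w_c - 1.0) > 1e-9:
        raise ValueError("contribution rates are assumed to sum to 1")
    return w_a * alpha_a + w_b * alpha_b + w_c * alpha_c
```

A genre-dependent allocation could be realized by passing a different `weights` tuple for each genre; no concrete per-genre values are given in the text.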

Furthermore, when a predetermined time length is set for the entire input signal, or for each part obtained by dividing it according to a predetermined rule, the speech speed conversion apparatus 1 of this embodiment can perform adaptive speech speed conversion that adjusts the temporal expansion/contraction magnification to fit that time length. Suppose an arbitrary playback speed conversion magnification α [×speed] is given, such as 1× (playback in real time) or 2× (playback in half the real time), and individual parts of the input signal are to be converted at magnifications larger or smaller than α. The apparatus finely adjusts the speech speed conversion magnification of each part so that the total playback time equals that of a uniform conversion at α. As a result, the converted speech has the same duration as speech converted uniformly at the playback speed conversion magnification α.
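One simple way to meet this duration constraint is to scale all per-segment magnifications by a common factor. The sketch below assumes equal-length segments and treats each magnification as a speed factor, so a segment of length d plays in d / r. With N segments, the target is Σ d / r'_n = N·d / α, which holds for r'_n = c·r_n with c = (α / N)·Σ 1/r_n. The uniform scaling and the function name are assumptions for illustration, not the adjustment method of this document.

```python
def fine_adjust(rates, alpha):
    """Scale per-segment speed factors so the total duration matches uniform speed alpha.

    rates : per-segment speed factors for equal-length segments (all > 0)
    alpha : overall playback speed conversion magnification (> 0)
    """
    if not rates:
        return []
    if alpha <= 0 or any(r <= 0 for r in rates):
        raise ValueError("speed factors must be positive")
    n = len(rates)
    scale = alpha * sum(1.0 / r for r in rates) / n
    return [scale * r for r in rates]
```

The relative ordering of the segments is preserved: segments that were slower than others before the adjustment remain slower afterwards.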

Furthermore, as one aspect of the present invention, the speech speed conversion apparatus 1 of this embodiment can be configured as a computer. A program for causing the computer to realize each of the components described above is stored in a storage unit provided inside or outside the computer. Such a storage unit can be realized by an external storage device such as an external hard disk, or by an internal storage device such as ROM or RAM. The control unit of the computer can be realized under the control of a central processing unit (CPU) or the like. That is, the CPU reads from the storage unit, as appropriate, a program describing the processing that realizes the function of each component, and thereby realizes the function of each component on the computer. The functions of the components may also be realized wholly or partly in hardware.

A program describing this processing can be distributed by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM. It can also be distributed by storing it in the storage unit of a server on a network, such as an IP network, and transferring it from the server to other computers over the network.

A computer that executes such a program can, for example, first store the program recorded on a portable recording medium, or transferred from a server, in its own storage unit. As another embodiment, the computer may read the program directly from the portable recording medium and execute processing according to it. The computer may also execute processing according to the received program each time the program is transferred to it from the server.

Embodiments of the present invention have been described in detail above with specific examples, but it will be apparent to those skilled in the art that all kinds of modifications and changes are possible without departing from the scope of the claims. For example, the speech speed conversion apparatus 1 of this embodiment was described as calculating the "physical indices" "voicedness" Un, "unevenness degree" Sn, and "divided band power ratio" En and performing speech speed conversion by allocating among them. The apparatus may instead calculate at least one of Un, Sn, and En and convert the speech speed of the input signal on that basis, for example an apparatus that calculates only Un, an apparatus that calculates only Sn, or an apparatus that calculates only En. The present invention is therefore not limited to the embodiments described above.

The present invention can be applied to any use of speech speed conversion technology, such as listening slowly to television or radio audio in real time, or recording the audio once on a hard disk recorder or the like and playing it back slowly or quickly. For example, visually impaired people have expressed a wish to listen to audio information efficiently, and the present invention allows recorded books for the visually impaired to be played back and heard at high speed. The present invention can also be applied to language learning and speech training systems, where it can be used when creating teaching materials or to present speech to learners at a speech speed converted to suit their level of progress. It is useful for any application that requires speech speed conversion.

Description of Symbols
1 Speech speed conversion apparatus
2 Physical index calculation unit
3 Speech speed conversion magnification determination unit
100 Voicedness calculation unit
200 Fundamental frequency / pseudo fundamental frequency unevenness calculation unit
210 Unevenness degree calculation unit
300 Frequency band / power calculation unit
310 Divided band power ratio calculation unit
202 Fundamental frequency extraction unit
204 Pseudo fundamental frequency calculation unit
206 Fundamental frequency trajectory connection unit
302 Spectrum calculation unit
304 Band division unit
306 Power calculation unit
120 First speech speed conversion magnification designation unit (designation unit a)
220 Second speech speed conversion magnification designation unit (designation unit b)
320 Third speech speed conversion magnification designation unit (designation unit c)
400 Speech speed conversion magnification fine adjustment unit

Claims

A speech speed conversion device that performs adaptive speech speed conversion of an input signal, comprising:
a physical index calculation unit that calculates a physical index of the input signal for each segment obtained by dividing the input signal at every unit time; and
a speech speed conversion magnification determination unit that determines, according to the physical index calculated by the physical index calculation unit, the speech speed conversion magnification to be specified for each segment of the input signal, and performs the speech speed conversion.

The speech speed conversion device according to claim 1, wherein the physical index calculation unit
comprises a voicedness calculation unit that calculates, as the physical index, a voicedness representing the relative maximum value obtained by autocorrelation per unit time in the input signal.

The speech speed conversion device according to claim 1 or 2, wherein the physical index calculation unit
comprises an unevenness degree calculation unit that calculates, as the physical index, an unevenness degree representing the change tendency of the trajectory of the fundamental frequency and pseudo fundamental frequency per unit time in the input signal.

The speech speed conversion device according to any one of claims 1 to 3, wherein the physical index calculation unit
comprises a divided band power ratio calculation unit that calculates, as the physical index, the ratio between a low-band power component and a high-band power component obtained by band division per unit time in the input signal.

The speech speed conversion device according to any one of claims 1 to 4, wherein the speech speed conversion magnification determination unit
comprises a speech speed conversion magnification fine adjustment unit that, when a playback speed conversion magnification is given for the entire input signal, finely adjusts the determined speech speed conversion magnifications so as to match that playback speed conversion magnification.

The speech speed conversion device according to claim 1, wherein the speech speed conversion magnification determination unit
comprises a speech speed conversion magnification fine adjustment unit that determines the speech speed conversion magnification to be specified for each segment of the input signal using one or more physical indices among the voicedness, the unevenness degree, and the divided band power ratio.

The speech speed conversion device according to claim 6, wherein the speech speed conversion magnification fine adjustment unit allocates, according to the type of the input signal, the speech speed conversion magnifications based on one or more physical indices among the voicedness, the unevenness degree, and the divided band power ratio.

A program for causing a computer configured as a speech speed conversion device that performs adaptive speech speed conversion of an input signal to execute:
a step of calculating, for each segment obtained by dividing the input signal at every unit time, one or more physical indices among a voicedness representing the relative maximum value obtained by autocorrelation per unit time in the input signal, an unevenness degree representing the change tendency of the trajectory of the fundamental frequency and pseudo fundamental frequency per unit time in the input signal, and a divided band power ratio representing the ratio between a low-band power component and a high-band power component obtained by band division per unit time in the input signal; and
a step of determining, according to the physical index calculated in the preceding step, the speech speed conversion magnification to be specified for each segment of the input signal, and performing speech speed conversion.