JP3220043B2

JP3220043B2 - Speech rate conversion method and apparatus

Info

Publication number: JP3220043B2
Application number: JP11296197A
Authority: JP
Inventors: 篤今井; 信正清山; 徹都木
Original assignee: Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 1997-04-30
Filing date: 1997-04-30
Publication date: 2001-10-22
Anticipated expiration: 2017-04-30
Also published as: JPH10301598A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a speech rate conversion method and apparatus that, in video equipment, audio equipment, medical equipment, and the like, such as televisions, radios, tape recorders, video tape recorders, video disc players, and hearing aids, achieve the ease of listening expected of speech rate conversion without extending the overall duration.

[0002] [Summary of the Invention] The present invention relates to a speech rate conversion method and apparatus that process human speech to convert its speaking rate in real time. When the speaking rate of the speech being listened to is slowed down, a series of processes is carried out without any loss of information while constantly monitoring, in fixed processing units, the data length of the input speech, the output data length calculated in advance by a conversion function for a given expansion/contraction ratio, and the data length of the speech actually being output.

[0003] Furthermore, for use in television viewing, for example, this method and apparatus aim to minimize the time lag between video and audio caused by stretching the audio. To that end, silent intervals whose length is at or above a variable threshold, set according to the degree of slowness expected of the conversion (the conversion ratio), are shortened as appropriate, and the conversion ratio is changed adaptively according to how far the output data length lags the input data length. In this way the greatest sense of slowness achievable within a given time frame is generated automatically while the utterance time of the converted speech stays approximately equal to that of the original speech.

[0004]

[Prior Art] When a speech rate conversion technique is applied to actual broadcasting, the delay from the original speech can be a problem, for example in emergency news reports. For media accompanied by video in particular, this delay may have an adverse effect, contrary to the benefit expected of speech rate conversion.

[0005] Accordingly, the following techniques have been reported for achieving the speech rate conversion effect (a sense of slowness) without falling behind the original speech. Rather than slowing the speech uniformly, one method suppresses stretching by changing the speaking rate from slow to fast as a function of the elapsed time from the start to the end of an utterance made in one breath, and shortens the silent intervals between sentences as appropriate (Ryu Ikezawa et al., "A method for absorbing the time expansion accompanying speech rate conversion," Spring Meeting of the Acoustical Society of Japan, 1992, 2-6-2, pp. 331-332). Another method implements this technique as real-time processing (Atsushi Imai et al., "A real-time absorption method for the time expansion accompanying speech rate conversion," Proceedings of the 1995 IEICE General Conference, D-694, p. 300).

[0006] The former manually sets a suitable function on the assumption that every utterance style is known in advance. The latter also defines manually the function that gives the ratio, and once set, that function stays fixed.

[0007] Likewise, the shortening of silent intervals is governed only by a manually specified fixed remaining time, and if a large "lag" accumulated, the stretched speech held in the buffer had to be cleared manually.

[0008]

[Problem to Be Solved by the Invention] In conventional speech rate conversion apparatus, therefore, parameters suited to each speaker had to be set by hand, because the speaking style of broadcast speech (speaking rate, use of pauses, and so on) varies from speaker to speaker. As a result there were many controls to operate, the settings themselves were difficult, and the apparatus was too hard for ordinary users to handle.

[0009] In view of the above, the object of claims 1 to 4 is to provide a speech rate conversion method in which the user sets a conversion ratio, chosen from a few guideline levels, only once. The speech rate conversion ratio and the silent intervals are then controlled adaptively according to the set conditions, so that the effect expected of speech rate conversion is obtained stably within the time frame of the actual utterance.

[0010] The object of claims 5 to 8 is to provide a speech rate conversion apparatus in which the user sets a conversion ratio, chosen from a few guideline levels, only once. The speech rate conversion ratio and the silent intervals are then controlled adaptively according to the set conditions, so that the effect expected of speech rate conversion is obtained stably within the time frame of the actual utterance.

[0011]

[Means for Solving the Problem] To achieve the above object, the speech rate conversion method according to claim 1 is characterized in that, for output data obtained by stretching and synthesizing input data at an arbitrary, time-varying ratio, when a silent interval appears and its duration exceeds a predetermined threshold, the time by which the output data has been stretched relative to the input data is reduced by an arbitrary amount not exceeding that stretch time.
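The silence-shortening rule of claim 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `shorten_silence`, the use of seconds as the unit, and the choice to always leave at least `threshold_s` of silence in place are assumptions made for the example.

```python
def shorten_silence(silence_s, delay_s, threshold_s):
    """Return (new_silence_s, new_delay_s) after trimming one silent interval.

    silence_s   -- duration of the silent interval in the output (seconds)
    delay_s     -- accumulated stretch of the output relative to the input
    threshold_s -- only silences longer than this may be trimmed

    Assumption: the part of the silence up to threshold_s is always kept.
    """
    if silence_s <= threshold_s or delay_s <= 0.0:
        return silence_s, delay_s          # nothing to trim
    # Never remove more than the accumulated stretch, and never cut into
    # the part of the silence that lies below the threshold.
    cut = min(delay_s, silence_s - threshold_s)
    return silence_s - cut, delay_s - cut


if __name__ == "__main__":
    # A 1.0 s pause with 0.3 s of accumulated stretch and a 0.4 s threshold.
    print(shorten_silence(1.0, 0.3, 0.4))
```

Capping the cut at the accumulated stretch reflects the claim wording that the reduction stays within the stretch time, so the output never runs ahead of the input.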

[0012] Claim 2 is characterized in that, in the speech rate conversion method of claim 1, when the input data is expanded or contracted and synthesized, the synthesis is performed while successively monitoring the input data length, the target data length calculated by multiplying the input data length by an arbitrary expansion/contraction ratio, and the actual output data length, so that the relationship among them remains consistent. For any time-varying expansion/contraction ratio, no information is lost in the speech portions, and accurate time information on the stretch caused by speech rate conversion is maintained.
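The bookkeeping described in claim 2 can be illustrated with a small sketch. It assumes that stretching is done by repeating whole blocks and ignores the cross-faded connection data that the apparatus uses; the function name `plan_repeats` and the repeat-until-target rule are hypothetical choices for this example.

```python
def plan_repeats(block_lengths, ratios):
    """Decide how often each block is emitted while tracking three lengths.

    block_lengths -- length of each input block in samples
    ratios        -- expansion ratio in force for each block (may vary)

    Returns (repeats, input_len, target_len, output_len).
    """
    input_len = 0        # samples consumed so far
    target_len = 0.0     # input length scaled by the ratio, accumulated
    output_len = 0       # samples actually emitted so far
    repeats = []
    for n, ratio in zip(block_lengths, ratios):
        input_len += n
        target_len += n * ratio
        # Every block is emitted at least once, so no speech is dropped.
        count = 1
        output_len += n
        # Repeat the block only while the output stays within the target.
        while output_len + n <= target_len:
            output_len += n
            count += 1
        repeats.append(count)
    return repeats, input_len, target_len, output_len


if __name__ == "__main__":
    print(plan_repeats([100, 100, 100, 100], [1.5, 1.5, 1.5, 1.5]))
```

Because the target is accumulated block by block, a change of ratio partway through the signal only affects later blocks, and the difference between `output_len` and `input_len` gives the current stretch at any point.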

[0013] Claim 3 is characterized in that, in the speech rate conversion method of claim 1 or claim 2, when the stretch relative to the input data length caused by speech rate conversion is eliminated by allowing silent intervals that exceed a certain threshold to be deleted, the threshold is changed adaptively, taking into account the speech rate conversion ratio in use, the delay time, and so on.

[0014] Claim 4 is characterized in that, in the speech rate conversion method of any one of claims 1, 2, and 3, when speech rate conversion is performed within a limited time frame, the amount of stretch is measured at preset time intervals while successively monitoring the input data length, the target data length calculated by multiplying the input data length by an arbitrary expansion/contraction ratio, and the actual output data length, so that the relationship among them remains consistent. Based on the measurement, the speech rate conversion ratio is temporarily raised when the time difference is small and temporarily lowered when the time difference is large, so that the ratio is changed adaptively.
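The adaptive ratio control of claim 4 can be sketched as a simple update applied at each measurement interval. The thresholds `low_s` and `high_s`, the step size, and the clamping range are illustrative assumptions; the text only states that the ratio rises when the time difference is small and falls when it is large.

```python
def adapt_ratio(ratio, delay_s, low_s=0.5, high_s=2.0, step=0.1,
                min_ratio=1.0, max_ratio=2.0):
    """Return the conversion ratio to use for the next interval.

    ratio   -- current speech rate conversion ratio (1.0 = unchanged)
    delay_s -- measured lag of the output behind the input, in seconds
    All other parameters are hypothetical tuning values.
    """
    if delay_s < low_s:
        ratio += step      # little lag: there is room to slow down further
    elif delay_s > high_s:
        ratio -= step      # large lag: stretch less so the lag can shrink
    # Keep the ratio within the assumed allowed range.
    return min(max(ratio, min_ratio), max_ratio)


if __name__ == "__main__":
    ratio = 1.5
    for delay in (0.1, 0.3, 1.0, 2.5, 3.0):
        ratio = adapt_ratio(ratio, delay)
        print(delay, round(ratio, 2))
```

A dead band between the two thresholds keeps the ratio from oscillating when the lag hovers around a single value.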

[0015] To achieve the above object, the speech rate conversion apparatus according to claim 5 comprises: division processing/connection data generation means for dividing input data into blocks to generate block data and for generating connection data based on each piece of block data; and connection processing means for determining, based on an input desired speech rate, the connection order of the block data and connection data generated by the division processing/connection data generation means, connecting them, and generating output data. The apparatus is characterized in that, when a silent interval appears in the output data obtained by stretching and synthesizing the block data at an arbitrary, time-varying ratio and the duration of this silent interval exceeds a predetermined threshold, the connection processing means reduces the time by which the output data has been stretched relative to the block data by an arbitrary amount not exceeding that stretch time.

[0016] Claim 6 is characterized in that, in the speech rate conversion apparatus of claim 5, when expanding or contracting and synthesizing the input data, the connection processing means performs the synthesis while successively monitoring the input data length, the target data length calculated by multiplying the input data length by an arbitrary expansion/contraction ratio, and the actual output data length, so that the relationship among them remains consistent. For any time-varying expansion/contraction ratio, no information is lost in the speech portions, and accurate time information on the stretch caused by speech rate conversion is maintained.

[0017] Claim 7 is characterized in that, in the speech rate conversion apparatus of claim 5 or claim 6, when the stretch relative to the input data length caused by speech rate conversion is eliminated by allowing silent intervals that exceed a certain threshold to be deleted, the threshold is changed adaptively, taking into account the speech rate conversion ratio in use, the delay time, and so on.

[0018] Claim 8 is characterized in that, in the speech rate conversion apparatus of any one of claims 5, 6, and 7, when speech rate conversion is performed within a limited time frame, the connection processing means measures the amount of stretch at preset time intervals while successively monitoring the input data length, the target data length calculated by multiplying the input data length by an arbitrary expansion/contraction ratio, and the actual output data length, so that the relationship among them remains consistent. Based on the measurement, the speech rate conversion ratio is temporarily raised when the time difference is small and temporarily lowered when the time difference is large, so that the ratio is changed adaptively.

[0019] With the above configuration, in the speech rate conversion method of claim 1, for output data obtained by stretching and synthesizing input data at an arbitrary, time-varying ratio, when a silent interval appears and its duration exceeds a predetermined threshold, the time by which the output data has been stretched relative to the input data is reduced by an arbitrary amount not exceeding that stretch time. As a result, the user need only set a conversion ratio, chosen from a few guideline levels, once; the speech rate conversion ratio and the silent intervals are then controlled adaptively according to the set conditions, and the effect expected of speech rate conversion is obtained stably within the time frame of the actual utterance.

[0020] In claim 2, when the input data is expanded or contracted and synthesized, the synthesis is performed while successively monitoring the input data length, the target data length calculated by multiplying the input data length by an arbitrary expansion/contraction ratio, and the actual output data length, so that the relationship among them remains consistent. For any time-varying expansion/contraction ratio, no information is lost in the speech portions, and accurate time information on the stretch caused by speech rate conversion is maintained. As a result, the user need only set a conversion ratio, chosen from a few guideline levels, once; the speech rate conversion ratio and the silent intervals are then controlled adaptively according to the set conditions, and the effect expected of speech rate conversion is obtained stably within the time frame of the actual utterance.

[0021] In claim 3, when the stretch relative to the input data length caused by speech rate conversion is eliminated, part of each silent interval lasting at least a certain duration is deleted, and the proportion of the silent interval that is retained is changed adaptively according to the speech rate conversion ratio, the amount of stretch, and so on. As a result, the user need only set a conversion ratio, chosen from a few guideline levels, once; the speech rate conversion ratio and the silent intervals are then controlled adaptively according to the set conditions, and the effect expected of speech rate conversion is obtained stably within the time frame of the actual utterance.

[0022] In claim 4, when speech rate conversion is performed within a limited time frame, the amount of stretch is measured at preset time intervals while successively monitoring the input data length, the target data length calculated by multiplying the input data length by an arbitrary expansion/contraction ratio, and the actual output data length, so that the relationship among them remains consistent. Based on the measurement, the speech rate conversion ratio is temporarily raised when the time difference is small and temporarily lowered when the time difference is large, so that the ratio is changed adaptively. As a result, the user need only set a conversion ratio, chosen from a few guideline levels, once; the speech rate conversion ratio and the silent intervals are then controlled adaptively according to the set conditions, and the effect expected of speech rate conversion is obtained stably within the time frame of the actual utterance.

[0023] The speech rate conversion apparatus of claim 5 has division processing/connection data generation means for dividing input data into blocks to generate block data and for generating connection data based on each piece of block data, and connection processing means for determining, based on an input desired speech rate, the connection order of the block data and connection data generated by the division processing/connection data generation means, connecting them, and generating output data. In this apparatus, when a silent interval appears in the output data obtained by stretching and synthesizing the block data at an arbitrary, time-varying ratio and the duration of this silent interval exceeds a predetermined threshold, the connection processing means reduces the time by which the output data has been stretched relative to the block data by an arbitrary amount not exceeding that stretch time. As a result, the user need only set a conversion ratio, chosen from a few guideline levels, once; the speech rate conversion ratio and the silent intervals are then controlled adaptively according to the set conditions, and the effect expected of speech rate conversion is obtained stably within the time frame of the actual utterance.

[0024] In claim 6, when expanding or contracting and synthesizing the input data, the connection processing means performs the synthesis while successively monitoring the input data length, the target data length calculated by multiplying the input data length by an arbitrary expansion/contraction ratio, and the actual output data length, so that the relationship among them remains consistent. For any time-varying expansion/contraction ratio, no information is lost in the speech portions, and accurate time information on the stretch caused by speech rate conversion is maintained. As a result, the user need only set a conversion ratio, chosen from a few guideline levels, once; the speech rate conversion ratio and the silent intervals are then controlled adaptively according to the set conditions, and the effect expected of speech rate conversion is obtained stably within the time frame of the actual utterance.

[0025] In claim 7, when eliminating the stretch relative to the input data length caused by speech rate conversion, the connection processing means deletes part of each silent interval lasting at least a certain duration and changes the proportion of the silent interval that is retained adaptively according to the speech rate conversion ratio, the amount of stretch, and so on. As a result, the user need only set a conversion ratio, chosen from a few guideline levels, once; the speech rate conversion ratio and the silent intervals are then controlled adaptively according to the set conditions, and the effect expected of speech rate conversion is obtained stably within the time frame of the actual utterance.

[0026] In claim 8, when speech rate conversion is performed within a limited time frame, the connection processing means measures the amount of stretch at preset time intervals while successively monitoring the input data length, the target data length calculated by multiplying the input data length by an arbitrary expansion/contraction ratio, and the actual output data length, so that the relationship among them remains consistent. Based on the measurement, the speech rate conversion ratio is temporarily raised when the time difference is small and temporarily lowered when the time difference is large, so that the ratio is changed adaptively. As a result, the user need only set a conversion ratio, chosen from a few guideline levels, once; the speech rate conversion ratio and the silent intervals are then controlled adaptively according to the set conditions, and the effect expected of speech rate conversion is obtained stably within the time frame of the actual utterance.

[0027]

[Embodiment of the Invention] FIG. 1 is a block diagram showing an example of a speech rate conversion apparatus according to an embodiment of the present invention.

[0028] The speech rate conversion apparatus shown in this figure comprises terminal 1, A/D conversion unit 2, analysis processing unit 3, block data division unit 4, block data storage unit 5, connection data generation unit 6, connection data storage unit 7, connection order generation unit 8, audio data connection unit 9, D/A conversion unit 10, and terminal 11. Input speech data from a speaker is analyzed on the basis of the attributes of the speech data, and rate-converted speech data is synthesized using a desired function according to that analysis information. During synthesis, the data length of the input speech data (input data length), the target data length calculated by multiplying it by an arbitrary expansion/contraction ratio, and the data length of the actual output speech data (output data length) are compared so that the processing remains consistent. This prevents loss of speech information even when the expansion/contraction ratio changes, and allows the moment-to-moment time difference between the original speech and the converted speech to be monitored. When the time difference is small, the speech rate conversion ratio is temporarily raised, and when it is large, the ratio is temporarily lowered, so that the ratio is changed adaptively. In addition, the proportion of each silent interval that is retained is changed adaptively on the basis of the speech rate conversion ratio, the amount of stretch, and so on, so that the time difference from the original speech caused by speech rate conversion is eliminated adaptively.

[0029] A/D conversion unit 2 A/D-converts, at a predetermined sampling rate (for example, 32 kHz), the audio signal input at terminal 1, for example a signal from a microphone or from the analog audio output terminal of a television, radio, or other video or audio equipment. It buffers the resulting audio data in a FIFO memory and supplies it, without excess or shortfall, to the subsequent analysis processing unit 3 and block data division unit 4.

[0030] Analysis processing unit 3 analyzes the audio data output from A/D conversion unit 2 and extracts its attributes. Based on these attributes, it generates division information that determines the time length of each block needed for the division of audio data performed in block data division unit 4, and supplies this information to block data division unit 4.

[0031] Here, voiced sound, unvoiced sound, and silence are defined as the attributes of the input audio data. Attributes such as noise or background sound like music are also conceivable, but because it is generally difficult to distinguish noise or background sound signals from speech signals automatically, noise and background sound are also classified into one of the three attributes above for analysis.

[0032] First, to reduce the processing load in analysis processing unit 3, decimation is applied to lower the sampling rate to 16 kHz. Next, to compute the power of the audio data at intervals of about 5 ms, the sum of squares of the data is calculated over a window of about 10 ms. Where this power is below a predetermined threshold P_thr, that portion is judged to be a silent interval.
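The power analysis just described can be sketched in pure Python. Several details are assumptions for illustration: decimation is done by averaging adjacent sample pairs (a crude low-pass filter, whereas a real system would use a proper anti-aliasing filter), the power is converted to decibels so it can be compared with a dB threshold, and the names `decimate_by_2` and `frame_power_db` are hypothetical.

```python
import math


def decimate_by_2(samples):
    """Halve the sampling rate (e.g. 32 kHz -> 16 kHz) by averaging pairs."""
    return [(samples[i] + samples[i + 1]) / 2.0
            for i in range(0, len(samples) - 1, 2)]


def frame_power_db(samples, fs, win_s=0.010, hop_s=0.005, floor=1e-12):
    """Sum of squares over a ~10 ms window every ~5 ms, expressed in dB."""
    win = int(round(win_s * fs))
    hop = int(round(hop_s * fs))
    powers = []
    for start in range(0, len(samples) - win + 1, hop):
        energy = sum(s * s for s in samples[start:start + win])
        # The floor avoids log10(0) on digital silence.
        powers.append(10.0 * math.log10(max(energy, floor)))
    return powers


def silent_frames(powers_db, p_thr_db):
    """Mark frames whose power is below the threshold as silent."""
    return [p < p_thr_db for p in powers_db]


if __name__ == "__main__":
    x32 = [1.0] * 3200                 # 100 ms of a constant signal at 32 kHz
    x16 = decimate_by_2(x32)
    p = frame_power_db(x16, 16000)
    print(len(p), round(p[0], 2))
```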

[0033] The power threshold P_thr is determined from the maximum instantaneous power P_upper and the minimum instantaneous power P_lower, as shown in the following equations.

[0034]

[Equation 1]
\[
P_{\mathrm{thr}} =
\begin{cases}
P_{\mathrm{upper}} - 35, & \text{if } P_{\mathrm{upper}} - P_{\mathrm{lower}} \ge 60\ \mathrm{[dB]} \qquad (1)\\[4pt]
P_{\mathrm{upper}} - 35 + 35\left\{1 - \dfrac{P_{\mathrm{upper}} - P_{\mathrm{lower}}}{60}\right\}, & \text{if } P_{\mathrm{upper}} - P_{\mathrm{lower}} < 60\ \mathrm{[dB]} \qquad (2)
\end{cases}
\]
Here P_thr is subject to an upper limit of P_upper - 13.
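Equations (1) and (2), together with the upper limit, translate directly into a short function. The function name `power_threshold_db` is hypothetical, and all quantities are assumed to be in decibels.

```python
def power_threshold_db(p_upper, p_lower):
    """Silence threshold P_thr from the max and min instantaneous power (dB)."""
    dynamic_range = p_upper - p_lower
    if dynamic_range >= 60.0:
        p_thr = p_upper - 35.0                                   # equation (1)
    else:
        p_thr = p_upper - 35.0 + 35.0 * (1.0 - dynamic_range / 60.0)  # (2)
    # The threshold may not exceed P_upper - 13.
    return min(p_thr, p_upper - 13.0)


if __name__ == "__main__":
    print(power_threshold_db(80.0, 10.0))   # wide dynamic range
    print(power_threshold_db(80.0, 50.0))   # 30 dB dynamic range
    print(power_threshold_db(80.0, 70.0))   # narrow range, limit applies
```

The two branches agree at a dynamic range of exactly 60 dB, so the threshold varies continuously as the range narrows, and the upper limit takes over once the range drops below roughly 22 dB.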

[0035] For intervals whose power is at or above the predetermined threshold P_thr, a decision is made between voiced sound, which is speech accompanied by vocal cord vibration, and unvoiced sound, which is speech without vocal cord vibration. This decision uses not only the power level but also zero-crossing analysis, autocorrelation analysis, and so on.
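As a rough illustration of this three-way decision, the sketch below classifies a frame using the power threshold and the zero-crossing rate only. This is a deliberate simplification: the text combines several analyses, including autocorrelation, and the limit `zcr_max` and the names `zero_crossing_rate` and `classify_frame` are assumptions introduced here.

```python
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0.0) != (b >= 0.0))
    return crossings / max(len(frame) - 1, 1)


def classify_frame(frame, power_db, p_thr_db, zcr_max=0.25):
    """Return 'silence', 'voiced', or 'unvoiced' for one frame.

    zcr_max is a hypothetical limit: voiced speech is dominated by the
    low-frequency pitch and crosses zero rarely, while noise-like unvoiced
    sounds cross zero often.
    """
    if power_db < p_thr_db:
        return "silence"
    if zero_crossing_rate(frame) <= zcr_max:
        return "voiced"
    return "unvoiced"


if __name__ == "__main__":
    fs = 16000
    tone = [math.sin(2.0 * math.pi * 200.0 * n / fs) for n in range(640)]
    hiss = [1.0, -1.0] * 320            # sign flips on every sample
    print(classify_frame(tone, 20.0, 0.0))
    print(classify_frame(hiss, 20.0, 0.0))
    print(classify_frame(tone, -30.0, 0.0))
```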

[0036] When the time length of each block is determined for analyzing the audio data, a predetermined autocorrelation analysis is performed for each attribute to detect periodicity, and the block length is determined from this periodicity. In voiced sections, the pitch period, which is the vibration period of the vocal cords, is detected, and the data is divided so that each pitch period becomes one block. Because the pitch period in voiced sections is distributed over a wide range of about 1.25 ms to 28.0 ms, the pitch period is detected as accurately as possible, for example by performing autocorrelation analyses with long and short window widths. The pitch period is used as the block length in voiced sections in order to prevent the change in voice pitch (a lowering of the voice) that would otherwise result from repeating blocks. In unvoiced sections and silent sections, the block length is determined by detecting periodicity within 5 ms.
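The pitch detection described above can be sketched with a normalized autocorrelation search over the lag range 1.25 ms to 28.0 ms. This is a minimal illustration under assumptions: the function name `estimate_pitch_period`, the single analysis window, and the tie-breaking tolerance `tol` (used to prefer the shortest lag among near-equal peaks) are not taken from the patent, which mentions using windows of different widths without giving details.

```python
import math

def estimate_pitch_period(samples, fs, min_ms=1.25, max_ms=28.0, tol=1e-9):
    """Return the pitch period in samples (hypothetical helper).

    Scores each candidate lag by normalized autocorrelation and returns the
    shortest lag whose score is within `tol` of the best score.
    """
    min_lag = max(1, round(fs * min_ms / 1000.0))
    max_lag = min(round(fs * max_ms / 1000.0), len(samples) - 1)
    if max_lag < min_lag:
        raise ValueError("signal too short for the requested lag range")
    scores = []
    for lag in range(min_lag, max_lag + 1):
        a = samples[:-lag]          # signal
        b = samples[lag:]           # signal delayed by `lag` samples
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
        scores.append(num / den if den > 0.0 else 0.0)
    best = max(scores)
    for i, score in enumerate(scores):
        if score >= best - tol:
            return min_lag + i

# Usage: a 100 Hz tone sampled at 32 kHz has a period of 320 samples (10 ms).
fs = 32000
tone = [math.sin(2.0 * math.pi * 100.0 * n / fs) for n in range(2048)]
period = estimate_pitch_period(tone, fs)
```

On real speech the tolerance would need to be looser and octave errors would have to be handled, which is presumably one reason the description combines several window widths.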

[0037] The block data dividing unit 4 divides the audio data output from the A/D conversion unit 2 according to the block lengths determined by the analysis processing unit 3, and supplies the resulting block-unit audio data and their block lengths to the block data storage unit 5. It also supplies both end portions of each block of audio data, namely a portion of predetermined length (for example, 2 ms) from the start and a portion of predetermined length (for example, 2 ms) before the end, to the connection data generation unit 6.
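Extracting the two end portions of a block can be sketched as follows. The helper name `block_edges` and the behavior for blocks shorter than the edge length (the whole block is returned for both edges) are assumptions for illustration; the patent does not say how such short blocks are treated.

```python
def block_edges(block, fs, edge_ms=2.0):
    """Return (start_portion, end_portion) of a block (hypothetical helper).

    Each portion is edge_ms long, or the whole block if the block is shorter.
    """
    n = min(len(block), round(fs * edge_ms / 1000.0))
    return block[:n], block[len(block) - n:]

# Usage: at 32 kHz, 2 ms corresponds to 64 samples.
head, tail = block_edges(list(range(100)), fs=32000)
```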

[0038] The block data storage unit 5 uses a ring buffer to temporarily store the block-unit audio data and block lengths supplied from the block data dividing unit 4. As needed, it supplies the stored block-unit audio data to the audio data connection unit 9 and the stored block lengths to the connection order generation unit 8.

[0039] For each block, as shown in FIG. 2, the connection data generation unit 6 applies a window to the audio data of the end portion of the immediately preceding block, the start portion of the current block, and the start portion of the immediately following block. It then overlap-adds the end portion of the immediately preceding block with the end portion of the current block, and overlap-adds the start portion of the current block with the start portion of the immediately following block. These results are concatenated to generate connection data for each block, which is supplied to the connection data storage unit 7.
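The windowing and overlap-add step can be illustrated with a simple cross-fade. The patent does not specify the window shape, so the complementary linear ramps used here, like the function name `cross_fade`, are assumptions for this sketch.

```python
def cross_fade(fading_out, fading_in):
    """Overlap-add two equal-length segments (hypothetical helper).

    `fading_out` is weighted by a falling linear window and `fading_in` by the
    complementary rising window, so the two weights always sum to 1.
    """
    if len(fading_out) != len(fading_in):
        raise ValueError("segments must have the same length")
    n = len(fading_out)
    result = []
    for i in range(n):
        w_in = (i + 0.5) / n        # rising window
        w_out = 1.0 - w_in          # falling window
        result.append(fading_out[i] * w_out + fading_in[i] * w_in)
    return result
```

Because the two windows sum to 1, a segment cross-faded with an identical segment is returned unchanged, which keeps the level steady across a joint between repeated blocks.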

[0040] The connection data storage unit 7 uses a ring buffer to temporarily store the connection data for each block supplied from the connection data generation unit 6 and, as needed, supplies the stored connection data to the audio data connection unit 9.

[0041] The connection order generation unit 8 generates the order in which the block-unit audio data and the connection data are connected, so as to realize the speech rate set by the listener. The listener can set a temporal expansion ratio for each attribute through an interface such as a digital volume control. This value is stored in rewritable memory. The value is applied in one of two selectable ways. In the first, it is treated as a fixed expansion ratio (uniform expansion mode). In the second, it is treated as a target while the audio attributes are controlled comprehensively and adaptively so that the time lag does not accumulate beyond a certain amount, which achieves the speech rate conversion effect within a limited time frame (time expansion absorption mode).

[0042] When speech is actually synthesized at the expansion ratio set in the memory, the connection order generation unit 8 tracks in real time the relationship between the input audio data length and the output audio data length at the same instant and the length of the audio data about to be synthesized. This allows the time difference between the utterance time of the original speech and the output time of the converted speech to be monitored continuously, and feeding this information back keeps the time difference automatically below a fixed length. At the same time, for an expansion/contraction ratio that may be changed to any value at any time, the unit can check whether executing it would cause a temporal contradiction (for example, a request that would make the output audio data length shorter than the input audio data length), which prevents loss of speech information during synthesis.

[0043] Next, the processing of the connection order generation unit 8 is described in detail. When the expansion/contraction ratio of the speech is set by an arbitrary function, the audio data length of the processing unit defined by the block data dividing unit 4 (= input data length) is calculated successively from the block lengths supplied from the block data storage unit 5, and this input data length multiplied by the expansion/contraction ratio set by the listener is taken as the target data length. The audio data connection unit 9 connects the audio data so as to match this target data length, and successively feeds back to the connection order generation unit 8 the length of the audio data actually output (= output data length).

[0044] As shown in FIG. 3, the connection order generation unit 8 is provided with an input/output data length monitoring and comparison unit 20, which consists of the following:
an input data length monitoring unit 21 that monitors the input data length;
an output target length calculation unit 22 that calculates the target length of the output data (target data length) produced by speech rate conversion, based on the input data length obtained by the input data length monitoring unit 21 and a value given, for example, by the listener (or by a function memory built into the device), and that automatically corrects this target data length;
a comparison unit 23 that compares the target data length obtained by the output target length calculation unit 22 with the input data length obtained by the input data length monitoring unit 21, sets the target data length equal to the input data length when the target data length is shorter than the input data length, and outputs the target data length unchanged when it is equal to or longer than the input data length;
an output data length monitoring unit 24 that receives already-connected information about the output data from the audio data connection unit 9 and monitors the output data length; and
a comparison unit 25 that compares the output data length obtained by the output data length monitoring unit 24 with the target data length obtained by the comparison unit 23, sets the target data length equal to the output data length when the target data length is shorter than the output data length, and outputs the target data length unchanged when it is equal to or longer than the output data length.
The target length produced in this way is sent to the audio data connection unit 9 as connection order information. Then, as described next, the memory values set for each speech attribute are read at predetermined time intervals, and the target data length is obtained so as to realize the expansion ratio read for each attribute. Based on this target data length and the output data length obtained by the output data length monitoring unit 24, connection information that takes the expansion/contraction information into account is generated moment by moment, and the audio data and connection data of each block are connected as shown in FIG. 4.

[0045] First, the input data length and the target data length are compared successively. When the input data length is equal to or greater than the target data length, the target data length is corrected to match the input data length. When the input data length is less than the target data length, the target data length is left unchanged.

[0046] Next, the target data length is compared with the actual output data length. When the output data length is equal to or greater than the target data length, the target data length is corrected to match the output data length. When the output data length is less than the target data length, the target data length is left unchanged.

[0047] A connection command indicating expansion information, connection information, and the like is generated so as to match the target data length obtained through these comparisons, and is supplied to the audio data connection unit 9.
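The two comparisons in paragraphs [0045] and [0046] amount to never letting the target fall below either the input data length or the data already output. A minimal sketch follows, assuming lengths are counted in samples and that the target is rounded to an integer; the function name `corrected_target_length` is hypothetical.

```python
def corrected_target_length(input_len: int, ratio: float, output_len: int) -> int:
    """Return the target data length after both corrections (hypothetical helper)."""
    # Target data length = input data length x expansion/contraction ratio
    target = round(input_len * ratio)
    # First comparison (comparison unit 23): the target must not be shorter
    # than the input, otherwise speech information would be lost.
    if target < input_len:
        target = input_len
    # Second comparison (comparison unit 25): the target must not be shorter
    # than what has already been output.
    if target < output_len:
        target = output_len
    return target
```

For example, a request with ratio 0.5 on 1000 input samples is corrected back to 1000, which is the "temporal contradiction" case mentioned in paragraph [0042].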

[0048] Next, the conditions for controlling the speech rate conversion ratio in the connection order generation unit 8 are described. When speech rate conversion is to be performed within a limited time frame, such as a broadcast time slot, the input data length and the output data length are monitored successively and the time difference between them is measured at an arbitrarily preset time interval. A function that changes the ratio adaptively can then be set, for example one that temporarily raises the conversion ratio when the delay is small and lowers it when the delay is large.

[0049] For example, in this embodiment, when a silent section of 200 ms or longer appears, the start time of the first voiced sound that appears thereafter is taken as t = 0, and the following cosine function can be used as the function that gives the ratio corresponding to the start time of each voiced sound appearing in the range 0 ≤ t ≤ T.

【００５０】[0050]

[Equation 2]
$$f(t) = \mathit{rs} + 0.5\,(\mathit{rs} - \mathit{re})\left(\cos\frac{\pi t}{T} + 1.0\right) \tag{3}$$
where
t: 0 ≤ t ≤ T
rs: external input value set by the listener (1.0 ≤ rs ≤ 1.6)
re: value given as an initial value (for example, re = 1.0)

[0051] Here, the time difference between the input data length and the output data length is calculated at a fixed time interval, for example every second, and depending on the time difference at that moment, the initial value re is increased from 1.0 in steps of 0.05 or, conversely, decreased to about 0.95. However, if no silent section of 200 ms or longer has appeared by the time the period T has elapsed, a ratio of, for example, 1.0 is applied to the subsequent voiced sections. In this case, a new ratio can also be given using the amount of change in pitch, power, or the like as an index.

[0052] The proportion of each silent section that is retained is also changed adaptively in view of the speech rate conversion ratio, the amount of expansion, and so on. This can likewise be set arbitrarily as a function.

[0053] A permissible shortening limit for silent sections (a value indicating the minimum length that is kept without being removed) may also be set according to the external input value rs. It may be expressed by a function as described above, or it may be set discretely, for example as follows.

[0054]
rs = 1.0: can be reduced to 300 ms
rs = 1.1: can be reduced to 250 ms
rs = 1.2: can be reduced to 230 ms
rs = 1.3: can be reduced to 200 ms
rs = 1.4: can be reduced to 200 ms
rs = 1.5: can be reduced to 150 ms
rs = 1.6: can be reduced to 100 ms
The limits may be set in this way, for example.
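The discrete settings above map naturally onto a lookup table. In this sketch the names `SILENCE_FLOOR_MS` and `shortened_silence_ms` are hypothetical, and the function returns the shortest length a silent section may be cut to; how much of that allowance is actually used at a given moment is left to the adaptive control described in the text.

```python
# Minimum silence length (ms) kept for each rs setting, keyed by rs x 10
# to avoid comparing floating-point keys.
SILENCE_FLOOR_MS = {10: 300, 11: 250, 12: 230, 13: 200, 14: 200, 15: 150, 16: 100}

def shortened_silence_ms(duration_ms: float, rs: float) -> float:
    """Return the shortest permitted length of a silent section (hypothetical helper)."""
    key = round(rs * 10)
    if key not in SILENCE_FLOOR_MS:
        raise ValueError("rs must be one of 1.0, 1.1, ..., 1.6")
    # Sections already shorter than the limit are left as they are.
    return min(duration_ms, SILENCE_FLOOR_MS[key])
```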

[0055] Silent sections are shortened by moving the pointer to an arbitrary address in the ring buffer. In this embodiment, the pointer is moved to the start of the voiced sound immediately following the silent section, which prevents loss of speech information.
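Skipping silence by moving a ring buffer pointer can be sketched as below. The class `BlockRing`, its methods, and the attribute label "silent" are all hypothetical; the sketch also simplifies the described behavior by stopping at the first non-silent block rather than specifically at a voiced one.

```python
class BlockRing:
    """Fixed-capacity ring buffer of (attribute, samples) blocks (illustrative)."""

    def __init__(self, capacity: int):
        self.slots = [None] * capacity
        self.read = 0      # read pointer
        self.write = 0     # write pointer
        self.count = 0     # number of stored blocks

    def push(self, attribute: str, samples: list) -> None:
        if self.count == len(self.slots):
            raise OverflowError("ring buffer is full")
        self.slots[self.write] = (attribute, samples)
        self.write = (self.write + 1) % len(self.slots)
        self.count += 1

    def pop(self):
        if self.count == 0:
            raise IndexError("ring buffer is empty")
        item = self.slots[self.read]
        self.read = (self.read + 1) % len(self.slots)
        self.count -= 1
        return item

    def skip_silence(self) -> int:
        """Move the read pointer past leading silent blocks; return how many were skipped."""
        skipped = 0
        while self.count > 0 and self.slots[self.read][0] == "silent":
            self.read = (self.read + 1) % len(self.slots)
            self.count -= 1
            skipped += 1
        return skipped

# Usage: two silent blocks are skipped and the voiced block is read next.
ring = BlockRing(4)
ring.push("silent", [0, 0])
ring.push("silent", [0, 0])
ring.push("voiced", [3, 1, 4])
skipped = ring.skip_silence()
next_block = ring.pop()
```

Only the pointer moves, so no sample data is copied or altered, and the speech that follows the silence is read out intact.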

[0056] The audio data connection unit 9 reads block-unit audio data from the block data storage unit 5 in the connection order determined by the connection order generation unit 8, expands the audio data of the designated blocks, and reads connection data from the connection data storage unit 7. While regulating the connection processing so that the FIFO memory provided in the D/A conversion unit 10 neither overflows nor underflows, it connects the audio data and the connection data to generate output audio data and supplies this to the D/A conversion unit 10.

[0057] The D/A conversion unit 10 buffers the output audio data supplied from the audio data connection unit 9 in its FIFO memory, D/A-converts the output audio data at a predetermined sampling rate (for example, 32 kHz) to generate an output audio signal, and outputs this signal from terminal 11.

[0058] As described above, in this embodiment, input audio data from a speaker is analyzed according to the attributes of the audio data, and when speech-rate-converted audio data is synthesized using a desired function according to the analysis information, the processing is carried out while comparing the input data length, the target data length calculated by multiplying it by an arbitrary expansion/contraction ratio, and the actual output audio data length so that no contradiction arises. As a result, no speech information is lost even when the expansion/contraction ratio changes. In addition, the constantly changing time difference between the original speech and the converted speech is monitored, and the ratio is changed adaptively, for example by temporarily raising the speech rate conversion ratio when the time difference is small and temporarily lowering it when the time difference is large. Furthermore, the proportion of silent sections that is retained is changed adaptively based on the speech rate conversion ratio, the amount of expansion, and so on, so that the time difference from the original speech caused by speech rate conversion is eliminated adaptively. Consequently, the user only needs to set, once, a conversion ratio chosen from several guideline levels; the speech rate conversion ratio and the silent sections are then controlled adaptively according to the set conditions, and the effect expected of speech rate conversion is obtained stably within the time frame of the actual utterance.

[0059] As a result, the optimum speech rate conversion effect can be provided automatically for each speaker, even in broadcast programs where speakers change frequently. With a very simple operation, elderly people and people with hearing or visual impairments who find fast speech hard to follow can listen slowly and stably, without time delay, to the audio of real-time emergency news and of media accompanied by video, such as television.

【００６０】[0060]

[Effects of the Invention] As described above, according to claims 1 to 4 of the present invention, the user only needs to set, once, a conversion ratio chosen from several guideline levels; the speech rate conversion ratio and the silent sections are then controlled adaptively according to the set conditions, and the effect expected of speech rate conversion is obtained stably within the time frame of the actual utterance.

[0061] Likewise, according to claims 5 to 8, the user only needs to set, once, a conversion ratio chosen from several guideline levels; the speech rate conversion ratio and the silent sections are then controlled adaptively according to the set conditions, and the effect expected of speech rate conversion is obtained stably within the time frame of the actual utterance.

[Brief description of the drawings]

[FIG. 1] A block diagram showing an example of a speech rate conversion device to which the present invention is applied.

[FIG. 2] A schematic diagram showing how the connection data generation unit shown in FIG. 1 generates the connection data used when the same block is connected repeatedly.

[FIG. 3] A block diagram showing a detailed configuration example of the input/output data length monitoring and comparison unit in the connection order generation unit shown in FIG. 1.

[FIG. 4] A schematic diagram showing an example of the connection order generated by the connection order generation unit shown in FIG. 1.

[Explanation of symbols]

1 Terminal
2 A/D conversion unit
3 Analysis processing unit (division processing/connection data generation means)
4 Block data dividing unit (division processing/connection data generation means)
5 Block data storage unit (division processing/connection data generation means)
6 Connection data generation unit (division processing/connection data generation means)
7 Connection data storage unit (division processing/connection data generation means)
8 Connection order generation unit (connection processing means)
9 Audio data connection unit (connection processing means)
10 D/A conversion unit
11 Terminal
12 Speech rate conversion device
20 Input/output data length monitoring and comparison unit
21 Input data length monitoring unit
22 Output target length calculation unit
23 Comparison unit
24 Output data length monitoring unit
25 Comparison unit

Continued from front page
(56) References: NHK Giken R&D No. 40, May 1996, pp. 15-26, "New real-time speech rate conversion device"
(58) Fields searched (Int. Cl.7, DB name): G10L 21/04

Claims

(57) [Claims]

1. A speech rate conversion method characterized in that, when a silent section appears in output data obtained by expanding/contracting and synthesizing input data at an arbitrary ratio that changes over time, and the duration of the silent section exceeds a predetermined threshold, the expansion time of the output data relative to the input data is shortened by an arbitrary time within that expansion time.

2. The speech rate conversion method according to claim 1, characterized in that, when the input data is expanded/contracted and synthesized, the synthesis is performed while successively monitoring the input data length, the target data length calculated by multiplying the input data length by an arbitrary expansion/contraction ratio, and the actual output data length so that the relationship among them involves no contradiction, whereby no information in the speech portions is lost for any expansion/contraction ratio that changes over time, and accurate time information on the expansion caused by speech rate conversion is retained.

3. The speech rate conversion method according to claim 1 or 2, characterized in that, when the expansion relative to the input data length caused by speech rate conversion is to be eliminated and removal of silent sections exceeding a certain threshold is permitted, the threshold is changed adaptively in consideration of the applied speech rate conversion ratio, the delay time, and the like.

4. The speech rate conversion method according to claim 1, characterized in that, when speech rate conversion is performed within a limited time frame, the amount of expansion is measured at preset time intervals while the input data length, the target data length calculated by multiplying the input data length by an arbitrary expansion/contraction ratio, and the actual output data length are monitored successively so that the relationship among them involves no contradiction, and, based on the measurement result, the speech rate conversion ratio is changed adaptively by temporarily raising it when the time difference is small and temporarily lowering it when the time difference is large.

5. A speech rate conversion device comprising: division processing/connection data generation means that divides input data into blocks and generates connection data based on each block of data; and connection processing means that, based on an input desired speech rate, determines the connection order of the block data and the connection data generated by the division processing/connection data generation means, connects them, and generates output data, the device being characterized in that, when a silent section appears in the output data obtained by expanding/contracting and synthesizing the block data at an arbitrary ratio that changes over time, and the duration of the silent section exceeds a predetermined threshold, the expansion time of the output data relative to the block data is shortened by an arbitrary time within that expansion time.

6. The speech rate conversion device according to claim 5, characterized in that, when expanding/contracting and synthesizing the input data, the connection processing means performs the synthesis while successively monitoring the input data length, the target data length calculated by multiplying the input data length by an arbitrary expansion/contraction ratio, and the actual output data length so that the relationship among them involves no contradiction, whereby no information in the speech portions is lost for any expansion/contraction ratio that changes over time, and accurate time information on the expansion caused by speech rate conversion is retained.

7. The speech rate conversion device according to claim 5, characterized in that, when the expansion relative to the input data length caused by speech rate conversion is to be eliminated and removal of silent sections exceeding a certain threshold is permitted, the threshold is changed adaptively in consideration of the applied speech rate conversion ratio, the delay time, and the like.

8. The speech rate conversion device according to claim 5, characterized in that, when speech rate conversion is performed within a limited time frame, the connection processing means measures the amount of expansion at preset time intervals while successively monitoring the input data length, the target data length calculated by multiplying the input data length by an arbitrary expansion/contraction ratio, and the actual output data length so that the relationship among them involves no contradiction, and, based on the measurement result, changes the speech rate conversion ratio adaptively by temporarily raising it when the time difference is small and temporarily lowering it when the time difference is large.