JP2008107706A

JP2008107706A - Speech speed conversion apparatus and program

Info

Publication number: JP2008107706A
Application number: JP2006292470A
Authority: JP
Inventors: Yuji Hisaminato; 裕司久湊
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2006-10-27
Filing date: 2006-10-27
Publication date: 2008-05-08

Abstract

PROBLEM TO BE SOLVED: To prevent the waveform of a voice signal from becoming discontinuous after processing, without imposing an excessive processing load on the speech speed conversion apparatus, when speech speed is converted by applying time-axis companding to the voice signal.

SOLUTION: The speech speed conversion apparatus comprises: a calculation means for calculating a value indicating the time rate of change of the logarithm of the sound volume represented by the voice signal to be processed; a discrimination means for determining, among the plurality of frames constituting the voice signal, that speech speed conversion is prohibited for each frame in which the magnitude of the value calculated by the calculation means exceeds a prescribed threshold and for a prescribed number of frames following that frame; and a speech speed conversion means which outputs the voice signal unchanged for the frames in which speech speed conversion is prohibited, and which outputs the voice signal with waveforms inserted or deleted so as to achieve the specified speech speed for the other frames.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a technique for applying time-axis companding (compression and expansion along the time axis) to an audio signal.

Various techniques have been proposed for applying time-axis companding to an audio signal in order to adjust the speech speed of the speech it represents. For example, Non-Patent Document 1 discloses an algorithm called PICOLA, which computes the autocorrelation of the speech while varying the frame length, regards the frame length giving the highest correlation as the period of the speech, and converts the speech speed by inserting or deleting waveforms in units of that period.
Non-Patent Document 1: Naotaka Morita and Fumitada Itakura, "Expansion and compression of speech along the time axis using the overlap-add method with pointer movement control (PICOLA), and its evaluation", Proceedings of the Acoustical Society of Japan, pp. 149-150, October 1986
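The period search described above can be sketched as follows. This is a minimal illustration, not the implementation from Non-Patent Document 1: the function names `estimate_period` and `delete_one_period`, the lag bounds, and the linear cross-fade weights are all assumptions made for the example.

```python
import math

def estimate_period(samples, min_lag, max_lag):
    """Return the lag (in samples) whose normalized autocorrelation is highest."""
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        n = len(samples) - lag
        if n <= 0:
            break
        a, b = samples[:n], samples[lag:]
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
        score = num / den if den > 0 else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def delete_one_period(samples, start, period):
    """Shorten the signal by one period: cross-fade two adjacent periods into one."""
    first = samples[start:start + period]
    second = samples[start + period:start + 2 * period]
    blended = [(1 - i / period) * x + (i / period) * y
               for i, (x, y) in enumerate(zip(first, second))]
    return samples[:start] + blended + samples[start + 2 * period:]

# Hypothetical test tone with a period of 50 samples.
tone = [math.sin(2 * math.pi * i / 50) for i in range(400)]
period = estimate_period(tone, 20, 80)
shorter = delete_one_period(tone, 100, period)
```

Cross-fading rather than simply cutting one period keeps the waveform continuous at the splice when the signal is close to periodic, which is exactly the assumption that fails at sharp volume changes.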

However, if the technique disclosed in Non-Patent Document 1 is applied naively, audible problems can arise in portions where the volume changes sharply and in the consonant portions of plosives. Specifically, if a waveform is inserted or deleted where the volume changes sharply, the change in volume becomes discontinuous there, and an abnormal thumping noise may be heard. Likewise, if a waveform is inserted or deleted in the consonant portion of a plosive, the consonant may be heard more than once.

One conceivable way to avoid such waveform discontinuities is to apply speech recognition to the audio signal to be processed, identify in advance the locations where discontinuities are likely to occur, such as plosive consonants, and exclude those locations from time-axis companding. However, speech recognition generally requires substantial hardware resources (in other words, it imposes a high processing load), so it is difficult to apply to terminal devices with low processing capability, such as mobile phones.

The present invention has been made in view of the above problem. Its object is to provide a technique that, when speech speed is converted by applying time-axis companding to an audio signal, avoids waveform discontinuities in the processed audio signal without imposing an excessive processing load on the apparatus performing the conversion.

To solve the above problem, the present invention provides a speech speed conversion apparatus comprising: a discrimination means that identifies, from the temporal change in the volume of the sound represented by the audio signal to be processed, the frames corresponding to the rising portion and the falling portion of the sound, and determines that each such frame and a predetermined number of frames before and after it are speech speed conversion prohibited frames, in which speech speed conversion is prohibited; and a speech speed conversion means that outputs unchanged the frames determined by the discrimination means to be speech speed conversion prohibited frames, while outputting the other frames after inserting or deleting waveforms so as to achieve the designated speech speed.

To solve the above problem, the present invention also provides a program that causes a computer apparatus to execute: a first step of identifying, from the temporal change in the volume of the sound represented by the audio signal to be processed, the frames corresponding to the rising portion and the falling portion of the sound, and determining that each such frame and a predetermined number of frames before and after it are speech speed conversion prohibited frames, in which speech speed conversion is prohibited; and a second step of outputting unchanged the frames determined in the first step to be speech speed conversion prohibited frames, while outputting the other frames after inserting or deleting waveforms so as to achieve the designated speech speed.

According to the present invention, when speech speed is converted by applying time-axis companding to an audio signal, waveform discontinuities in the processed audio signal can be avoided without imposing an excessive processing load on the apparatus performing the conversion.

The best mode for carrying out the present invention is described below with reference to the drawings.
(A: Configuration)
FIG. 1 is a diagram showing an example of the hardware configuration of a speech speed conversion apparatus 10 according to an embodiment of the present invention.
A sound source (not shown) that outputs an analog audio signal S-in(t) is connected to the input terminal CH-in in FIG. 1, and the audio signal output from this sound source is input to the terminal.
Although not shown in detail in FIG. 1, the audio signal S-in(t) input to the input terminal CH-in is converted into a digital signal by an A/D conversion circuit (not shown), cut into frames of a predetermined length (8 milliseconds in this embodiment), and delivered to the delay processing circuit 110 and the volume change rate calculation unit 120. In the following, the audio signal converted into digital data and cut into frames is also written "S-in(t)".
In the following, it is assumed that an audio signal S-in(t) having the waveform shown in FIG. 2(a) is input to the input terminal CH-in. This embodiment describes the case where the audio signal input to the input terminal CH-in is analog, but it may of course be digital, in which case no A/D conversion circuit is needed.
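As a rough illustration of the framing step, the sketch below cuts a list of digitized samples into consecutive 8-millisecond frames. The sample rate of 8000 Hz, the function name `split_into_frames`, and the choice to drop a trailing partial frame are assumptions for the example; the text does not specify them.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=8):
    """Cut samples into consecutive, non-overlapping frames of frame_ms milliseconds."""
    frame_len = sample_rate_hz * frame_ms // 1000  # samples per frame
    # A trailing partial frame is dropped in this sketch (an assumption).
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]

# At an assumed 8000 Hz, an 8 ms frame holds 64 samples.
frames = split_into_frames(list(range(200)), sample_rate_hz=8000)
```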

The delay processing circuit 110 in FIG. 1 delays the input audio signal S-in(t) by a delay time Δt of about several tens of milliseconds and outputs it to the speech speed conversion unit 140. The delay is applied so that speech speed conversion can be performed on the signal from Δt in the past while still appearing to the user to run in substantially real time. Because Δt is only about several tens of milliseconds, it has almost no audible effect.
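In software, a delay of this kind could be modeled as a fixed-length queue of frames. The sketch below is only an illustration: the class name `FrameDelay` and the delay of three frames (24 ms with 8 ms frames) are assumptions, not values from the text.

```python
from collections import deque

class FrameDelay:
    """Delay a stream of frames by a fixed number of frames."""

    def __init__(self, delay_frames):
        # Pre-fill with None so that the first outputs are empty placeholders.
        self._buffer = deque([None] * delay_frames)

    def push(self, frame):
        """Store the newest frame and return the frame from delay_frames ago."""
        self._buffer.append(frame)
        return self._buffer.popleft()

delay = FrameDelay(delay_frames=3)
delayed = [delay.push(frame_index) for frame_index in range(5)]
```

Here each pushed value stands in for a whole frame; the first three outputs are `None` because nothing has yet aged by three frames.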

The volume change rate calculation unit 120 in FIG. 1 sequentially receives the digitized audio signal S-in(t) frame by frame, calculates for each frame a value ΔP(t) representing the degree of temporal change in the volume of the speech represented by the signal, and delivers it to the determination unit 130.
More specifically, the volume change rate calculation unit 120 obtains the envelope of the audio signal S-in(t) received frame by frame, thereby determining the volume of the signal for each frame (see FIG. 2(b)). It then calculates, for each frame, the first derivative (FIG. 2(d)) of the common logarithm of the volume (FIG. 2(c)) as ΔP(t) and delivers it to the determination unit 130.
The first derivative of the common logarithm of the volume (FIG. 2(c)) is used as ΔP(t) in order to identify accurately the rising portions, where the volume increases sharply, and the falling portions, where it decreases sharply, in the speech represented by S-in(t). However, if the rising and falling portions can be identified from the first derivative of the volume itself, that derivative may of course be used as ΔP(t) instead.

The volume change rate calculation unit 120 according to this embodiment calculates ΔP(t) for the t-th frame according to the following equation (1).
ΔP(t) = {ln(10) × (P(t) − P(t−1))} / ln(P(t)) … (1)
In equation (1), ln() denotes the natural logarithm, P(t) is the value obtained by normalizing the volume of the t-th frame by a predetermined reference volume, and P(t−1) is the value obtained by normalizing the volume of the immediately preceding frame by a predetermined reference volume.
This embodiment uses the first derivative of the common logarithm of the volume as the value ΔP(t) indicating the degree of temporal change in the volume of each frame and calculates it according to equation (1). Other methods may of course be used, for example obtaining ΔP(t) by referring to a table that stores, for each of a plurality of different antilogarithm values, the value of the first derivative of the common logarithm at that value.
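One way to picture the per-frame calculation is the sketch below. It is not equation (1) as printed: it takes the backward difference of the common logarithm of the normalized volume, log10 P(t) − log10 P(t−1), which is one reading of "the first derivative of the common logarithm of the volume". The peak-amplitude envelope, the `floor` guard against taking the logarithm of zero, and the function names are assumptions for the example.

```python
import math

def frame_volume(frame):
    """Peak absolute amplitude of a frame, used here as a simple envelope value."""
    return max(abs(x) for x in frame)

def log_volume_change(volumes, reference=1.0, floor=1e-6):
    """Backward difference of log10(volume / reference) for each frame.

    The first frame has no predecessor, so its change is reported as 0.0.
    """
    deltas = [0.0]
    previous = math.log10(max(volumes[0] / reference, floor))
    for volume in volumes[1:]:
        current = math.log10(max(volume / reference, floor))
        deltas.append(current - previous)
        previous = current
    return deltas

# Hypothetical per-frame volumes: quiet, a sharp rise, steady, then a fall.
volumes = [0.01, 0.01, 0.1, 1.0, 1.0, 0.1]
deltas = log_volume_change(volumes)
```

Working in the logarithmic domain makes a tenfold rise from a quiet level count the same as a tenfold rise from a loud level, which is why onsets in quiet passages are not missed.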

Although not shown in detail in FIG. 1, the determination unit 130 includes a comparator and a buffer with enough capacity to store the number of frames corresponding to the delay time Δt.
The determination unit 130 receives each ΔP(t) delivered from the volume change rate calculation unit 120 and stores it in the buffer in order of receipt. It also reads from the buffer the value indicating the degree of volume change at the time Δt earlier (that is, ΔP(t−Δt)), determines whether the magnitude of ΔP(t−Δt) exceeds a predetermined threshold th, and outputs a control signal CS according to the result.
More specifically, the determination unit 130 determines that each frame in which the magnitude of ΔP(t) exceeds the threshold th (the portions delimited by broken lines in FIG. 2(d)) and a predetermined number of frames before and after it (the hatched portions in FIG. 2(d)) are frames in which speech speed conversion is prohibited (hereinafter "speech speed conversion prohibited frames"). For these frames it outputs to the speech speed conversion unit 140 a control signal CS with the signal value "1" (hereinafter the "speech speed conversion prohibition signal"). For the other frames, it outputs to the speech speed conversion unit 140 a control signal CS with the signal value "0" (hereinafter the "speech speed conversion permission signal"). In this way, the speech speed conversion unit 140 is informed of which of the frames constituting the audio signal S-in(t) are speech speed conversion prohibited frames.
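The determination just described amounts to thresholding |ΔP(t)| and widening each hit by a guard interval. The sketch below works offline on a complete list of ΔP values rather than through a delay buffer, and the function name `prohibited_flags`, the threshold of 0.5, and the guard of one frame are assumptions for the example.

```python
def prohibited_flags(deltas, threshold, guard=1):
    """Return one control value per frame: 1 = conversion prohibited, 0 = permitted."""
    count = len(deltas)
    flags = [0] * count
    for index, delta in enumerate(deltas):
        if abs(delta) > threshold:
            # Mark the frame itself and `guard` frames on either side.
            for neighbor in range(max(0, index - guard), min(count, index + guard + 1)):
                flags[neighbor] = 1
    return flags

# Hypothetical ΔP values: a two-frame rise followed by a one-frame fall.
deltas = [0.0, 0.0, 1.0, 1.0, 0.0, -1.0, 0.0, 0.0, 0.0]
flags = prohibited_flags(deltas, threshold=0.5, guard=1)
```

Marking frames before a hit is what requires the look-ahead: a real-time version can only do so because the audio itself is held back by Δt.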

The speech speed conversion unit 140 is, for example, a DSP. Following the speech speed conversion algorithm disclosed in Non-Patent Document 1 (that is, PICOLA), it inserts into or deletes from S-in(t−Δt) the number of periods of waveform corresponding to the speech speed designated by the user via an operation unit (not shown). The resulting audio signal S-out(t−Δt) is converted into an analog signal by a D/A conversion circuit (not shown) and output externally via the output terminal CH-out. A sound emitting device (not shown) such as a speaker is connected to the output terminal CH-out and emits sound corresponding to the audio signal processed by the speech speed conversion apparatus 10.

However, the speech speed conversion unit 140 according to this embodiment differs from the technique disclosed in Non-Patent Document 1 in that, when the control signal CS delivered from the determination unit 130 for the frame being processed is the speech speed conversion prohibition signal (that is, when the frame is a speech speed conversion prohibited frame), it outputs S-in(t−Δt) unchanged as S-out(t−Δt′). The delay of the output signal S-out is written "Δt′" because the speech speed conversion performed up to that point may have made the delay differ from Δt.
Consequently, in the speech speed conversion apparatus 10 according to this embodiment, no waveform is inserted or deleted in portions where the volume changes sharply (for example, the rising and falling portions of a sound), and the problem pointed out for Non-Patent Document 1 does not arise. The consonant portion of a plosive generally shows a large volume change, so the determination unit 130 outputs the speech speed conversion prohibition signal for such portions, and the defect of the consonant being heard more than once becomes extremely rare. The volume change in the consonant portion of a plosive is not always large, but where it is small, a waveform discontinuity has almost no audible effect.
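The gating behavior can be sketched as a loop that either passes a frame through or hands it to a converter. In the sketch below, `convert_speech_speed` and `drop_every_other_sample` are hypothetical names, and the converter is a deliberately crude stand-in chosen only to make the effect visible; it is not PICOLA.

```python
def convert_speech_speed(frames, flags, convert_frame):
    """Pass prohibited frames through unchanged; convert all other frames."""
    output = []
    for frame, flag in zip(frames, flags):
        if flag == 1:
            output.extend(frame)                 # prohibited: output as is
        else:
            output.extend(convert_frame(frame))  # permitted: apply conversion
    return output

def drop_every_other_sample(frame):
    """Stand-in converter for illustration only (NOT the PICOLA algorithm)."""
    return frame[::2]

frames = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
flags = [0, 1, 0]  # the middle frame is a prohibited frame
result = convert_speech_speed(frames, flags, drop_every_other_sample)
```

Because permitted frames change length while prohibited frames do not, the output position drifts relative to the input, which corresponds to the output delay Δt′ differing from Δt.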

As described above, the speech speed conversion apparatus 10 according to this embodiment avoids waveform discontinuities in portions where the volume changes sharply when it converts speech speed by applying time-axis companding to an audio signal. In addition, the apparatus identifies the portions to be excluded from waveform insertion and deletion from the rate of change of the volume alone, without complex processing such as speech recognition, so no excessive load is placed on the speech speed conversion apparatus 10.
Thus, according to this embodiment, when speech speed is converted by applying time-axis companding to an audio signal, waveform discontinuities in the processed audio signal can be avoided without imposing an excessive processing load on the apparatus performing the conversion.

(B: Modifications)
One embodiment of the present invention has been described above, but it may of course be modified as described below.
(1) The embodiment above prohibits speech speed conversion for each frame in which the magnitude of ΔP(t) exceeds a predetermined threshold and for a predetermined number of frames before and after it. However, the length of the section in which speech speed conversion is prohibited may differ between the rising portion of a sound (where ΔP(t) is positive) and the falling portion (for example, the prohibited section may be made longer for the falling portion).
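Modification (1) could be sketched by choosing the guard length from the sign of ΔP(t). The function name `prohibited_flags_asymmetric` and the guard lengths of one frame for rises and two frames for falls are assumptions for the example.

```python
def prohibited_flags_asymmetric(deltas, threshold, rise_guard=1, fall_guard=2):
    """Like a symmetric guard, but with a separate guard length for rises and falls."""
    count = len(deltas)
    flags = [0] * count
    for index, delta in enumerate(deltas):
        if abs(delta) <= threshold:
            continue
        # Positive ΔP(t) is a rising portion; negative is a falling portion.
        guard = rise_guard if delta > 0 else fall_guard
        for neighbor in range(max(0, index - guard), min(count, index + guard + 1)):
            flags[neighbor] = 1
    return flags

# Hypothetical ΔP values: a rise at frame 3 and a fall at frame 8.
deltas = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, -1.0, 0.0, 0.0, 0.0]
flags = prohibited_flags_asymmetric(deltas, threshold=0.5)
```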

(2) In the embodiment above, each of the following is implemented as a hardware module, and the modules are combined to form the speech speed conversion apparatus according to the present invention: a calculation means (the volume change rate calculation unit 120) that calculates, for each frame, a value indicating the time rate of change of the logarithm of the volume of the speech represented by the audio signal to be processed; a discrimination means (the determination unit 130) that determines that speech speed conversion is prohibited for each frame, among the plurality of frames constituting the audio signal, in which the magnitude of the value calculated by the calculation means exceeds a predetermined threshold, and for a predetermined number of frames following that frame; and a speech speed conversion means (the speech speed conversion unit 140) that outputs unchanged the frames for which speech speed conversion is determined to be prohibited, while outputting the other frames after inserting or deleting waveforms so as to achieve the designated speech speed. These means may of course be implemented as software modules instead.

Specifically, a program that causes a CPU (Central Processing Unit) to function as the volume change rate calculation means, the discrimination means, and the speech speed conversion means (for example, a program that causes the CPU to execute the speech speed conversion processing shown in FIG. 3) may be installed on a general-purpose computer apparatus such as a personal computer so that the computer apparatus functions as the speech speed conversion apparatus according to the present invention.

FIG. 1 is a block diagram showing a configuration example of the speech speed conversion apparatus 10 according to an embodiment of the present invention.
FIG. 2 is a diagram showing the stages of the speech speed conversion processing according to this embodiment.
FIG. 3 is a flowchart showing the flow of the speech speed conversion processing according to modification (1).

Explanation of symbols

10 ... speech speed conversion apparatus, CH-in ... input terminal, 110 ... delay circuit, 120 ... volume change rate calculation unit, 130 ... determination unit, 140 ... speech speed conversion unit, CH-out ... output terminal.

Claims

A speech speed conversion apparatus comprising:
a discrimination means that identifies, from the temporal change in the volume of the sound represented by the audio signal to be processed, the frames corresponding to the rising portion and the falling portion of the sound, and determines that each such frame and a predetermined number of frames before and after it are speech speed conversion prohibited frames, in which speech speed conversion is prohibited; and
a speech speed conversion means that outputs unchanged the frames determined by the discrimination means to be speech speed conversion prohibited frames, while outputting the other frames after inserting or deleting waveforms so as to achieve the designated speech speed.

The speech speed conversion apparatus according to claim 1, further comprising a calculation means that calculates, for each frame, a value indicating the time rate of change of the logarithm of the volume of the speech represented by the audio signal,
wherein the discrimination means determines that, among the plurality of frames constituting the audio signal, each frame in which the magnitude of the value calculated by the calculation means exceeds a predetermined threshold, and a predetermined number of frames before and after that frame, are speech speed conversion prohibited frames.

The speech speed conversion apparatus according to claim 1, further comprising a calculation means that calculates, for each frame, a value obtained by dividing the temporal change in the volume of the speech represented by the audio signal by that volume,
wherein the discrimination means determines that, among the plurality of frames constituting the audio signal, each frame in which the magnitude of the value calculated by the calculation means exceeds a predetermined threshold, and a predetermined number of frames before and after that frame, are speech speed conversion prohibited frames.

A program that causes a computer apparatus to execute:
a first step of identifying, from the temporal change in the volume of the sound represented by the audio signal to be processed, the frames corresponding to the rising portion and the falling portion of the sound, and determining that each such frame and a predetermined number of frames before and after it are speech speed conversion prohibited frames, in which speech speed conversion is prohibited; and
a second step of outputting unchanged the frames determined in the first step to be speech speed conversion prohibited frames, while outputting the other frames after inserting or deleting waveforms so as to achieve the designated speech speed.