WO2006077626A1 - Speech speed changing method, and speech speed changing device - Google Patents

Speech speed changing method, and speech speed changing device

Info

Publication number
WO2006077626A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
section
protection
speech speed
voice signal
Prior art date
Application number
PCT/JP2005/000549
Other languages
French (fr)
Japanese (ja)
Inventor
Hitoshi Sasaki
Hiroshi Katayama
Rika Nishiike
Original Assignee
Fujitsu Limited
Application filed by Fujitsu Limited
Priority to PCT/JP2005/000549 (WO2006077626A1)
Priority to JP2006553780A (JP4630876B2)
Priority to EP05703786A (EP1840877A4)
Publication of WO2006077626A1
Priority to US11/778,720 (US7912710B2)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/04 Time compression or expansion
    • G10L21/043 Time compression or expansion by changing speed
    • G10L21/045 Time compression or expansion by changing speed using thinning out or insertion of a waveform

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Telephone Function (AREA)

Abstract

A speech speed changing method stores an input voice signal in a buffer and changes the speech speed as follows: in a voiced section, in which the power of the input voice signal exceeds a threshold value, the voice signal read from the buffer is left as it is or expanded; in a silent section, the voice signal read from the buffer is left as it is, compressed, or deleted. In this method, the speech head protection section, which is set ahead of a voiced section, is made equal to the accumulated amount of the buffer, limited by a predetermined limit value. If a voiced section lies within the speech head protection section, compression or deletion of the voice signal is inhibited, or the compression ratio is adjusted, to protect the speech head. The delay is thereby kept to a minimum and the occurrence of cut-off speech heads is reduced.

Description

Specification
Speech speed conversion method and speech speed conversion device
Technical field
[0001] The present invention relates to a speech speed conversion method and a speech speed conversion device, and more particularly to a speech speed conversion method and a speech speed conversion device that change the playback speed of speech without changing its pitch.
Background art
[0002] Techniques have been proposed that make a conversation easier to follow by slowing the playback speed of the other party's speech, that is, the speech speed, without changing the pitch of the voice. Simply slowing the speech speed, however, introduces a delay equal to the amount by which the speech was slowed.
[0003] To solve this problem, techniques have been proposed that eliminate the delay by shortening the silent sections (sections containing no sound such as a human voice) that occur during a conversation, or by increasing the speech speed in silent sections.
[0004] FIG. 1 is a block diagram of an example of a conventional speech speed conversion device. In the figure, a digital voice signal is input to terminal 10 frame by frame, one frame being 20 ms, and is supplied to a voiced/silent determination unit 11 and a speech speed conversion unit 12.
[0005] The voiced/silent determination unit 11 learns the noise level during initial silence, such as before an utterance begins, and sets the learned silence level plus, for example, 4 dB as the voiced threshold. It compares the input voice signal with the voiced threshold, determines any section in which the voice signal is at or above the threshold to be a voiced determination section, and supplies the determination result to a speech speed determination unit 13.
[0006] The speech speed determination unit 13 receives the accumulated amount (number of accumulated frames) from an input accumulation amount calculation unit 14 and has a speech head protection section (a fixed number of frames) set in it. It determines the speech speed from the voiced determination result, the accumulated amount, and the speech head protection section, and supplies this speech speed to the speech speed conversion unit 12 and the input accumulation amount calculation unit 14.
[0007] The speech speed conversion unit 12 writes the input voice signal into a buffer, reads the voice signal out of the buffer according to the speech speed from the speech speed determination unit 13, and outputs it from terminal 15. The input accumulation amount calculation unit 14 calculates, from the speech speed supplied by the speech speed determination unit 13, the amount accumulated in the buffer of the speech speed conversion unit 12 and supplies it to the speech speed determination unit 13.
[0008] FIG. 2 shows the speech speed determination table of the speech speed determination unit 13. In a voiced section, the speech speed is 0.5x (2x expansion); however, if the processing delay is 1 second (= 50 frames) or more, the speech speed is 1x. In the speech head protection section, that is, when a voiced determination section lies within the following 3 frames, the speech speed is 1x. In the speech tail protection section, that is, when a voiced determination section lies within the past 10 frames, the speech speed is 1x. In the pause holding section, that is, within 10 frames after speech tail protection ends, the speech speed is 1x. In the silence deletion section, that is, anywhere outside the above sections, the voice signal is deleted to close the gap; however, if there is no processing delay, the speech speed is 1x.
[0009] Patent Document 1 describes converting the speech speed of a speech section sandwiched between non-speech sections of at least a certain length so that its opening portion is slower than a predetermined playback speed and the speed gradually returns to the predetermined playback speed toward the end.
Patent Document 1: Japanese Laid-Open Patent Publication No. 2001-222300
Disclosure of the invention
Problems to be solved by the invention
[0010] However, when silent sections are shortened or the speech speed in silent sections is increased, the accuracy of the voiced/silent determination must be taken into account. In a noisy environment, for example, the determination may be wrong. In a noise-free environment, voiced and silent portions are determined relatively accurately even at the head and tail of an utterance. In a noisy environment, however, the noise level may approach or exceed the power at the head or tail of the utterance, in which case the head or tail is buried in noise.
[0011] For this reason, it is difficult to determine voiced and silent portions accurately in a noisy environment. For example, low-power portions such as the head of an utterance, its tail, and unvoiced consonants are more likely to be misjudged as silent even though they belong to a voiced section.
[0012] If silent sections are shortened or the speech speed is increased on the basis of such misjudgments, problems arise such as dropouts in the sound and excessive shortening of the duration of silences.
[0013] FIG. 3(A) shows, as a solid line, the approximate change over time of the input voice signal power (volume). Noise of steady power is superimposed on the voice signal, and the voiced threshold is set to that noise level + 4 dB. The lower part of FIG. 3(A) shows the determination result for each section; for the speech head protection section only the portion measured from the speech head is shown, and for the speech tail protection section only the portion measured from the speech tail. The first, second, fifth, and sixth utterances from the left are determined to be voiced sections, but the third and fourth utterances are buried in noise and are determined to be silent sections.
[0014] The third utterance escapes deletion thanks to speech tail protection, but the head of the fourth utterance is cut off because the fixed speech head protection section is short. FIG. 3(B) shows the voice signal power after speech speed conversion.
[0015] Section (1) in FIG. 3(B): it is assumed that a processing delay (input accumulation) of 10 frames from speech speed conversion already exists at the start.
[0016] Sections (2) and (3): the first and second utterances are determined to be voiced, so they are expanded 2x (1/2 speed). Between sections (2) and (3), the output is at 1x speed because of speech head protection and speech tail protection.
[0017] Section (4): the third utterance is determined to be silent, but because it falls within the speech tail protection and pause holding sections it is output at 1x speed. The silent section that follows is also output at 1x speed while within the pause holding section and is deleted thereafter.
[0018] Section (5): the fourth utterance is determined to be silent, and only part of it receives speech head protection. Because the speech speed conversion delay (input accumulation) at this point is large enough, only the protected portion is output at 1x speed and the rest is deleted, so the speech head is cut off.
[0019] Section (6): the fifth utterance is determined to be voiced, so it is expanded 2x.
[0020] Conventionally, a fixed-length speech head protection section is set, so a delay equal to the speech head protection must be inserted (added). For stored audio, such as answering-machine recordings on a telephone, sufficient speech head protection can be set. When speech speed is converted during a real-time call, however, the delay must be kept to a minimum, so a sufficiently long speech head protection section cannot be set and the speech head may be cut off.
[0021] The present invention has been made in view of the above, and its general object is to provide a speech speed conversion method and a speech speed conversion device that keep the delay to a minimum and reduce the occurrence of cut-off speech heads.
Means for solving the problem
[0022] To achieve this object, the present invention provides a speech speed conversion method that accumulates an input voice signal in a buffer and converts the speech speed by leaving unchanged or expanding the voice signal read from the buffer in voiced sections, in which the power of the input voice signal exceeds a threshold, and by leaving unchanged, compressing, or deleting the voice signal read from the buffer in silent sections. In this method, the speech head protection section set ahead of a voiced section is made equal to the accumulated amount of the buffer, limited by a predetermined limit value, and if a voiced section lies within the speech head protection section, speech head protection is performed by prohibiting compression or deletion of the voice signal or by adjusting the compression ratio.
Effect of the invention
[0023] With this speech speed conversion method, the delay can be kept to a minimum and the occurrence of cut-off speech heads can be reduced.
Brief description of the drawings
[0024] [FIG. 1] Block diagram of an example of a conventional speech speed conversion device.
[FIG. 2] Speech speed determination table of the speech speed determination unit of the conventional speech speed conversion device.
[FIG. 3] Conventional input voice signal power and voice signal power after speech speed conversion.
[FIG. 4] Block diagram of the first embodiment of the speech speed conversion device of the present invention.
[FIG. 5] Speech speed determination table of the speech speed determination unit in the first embodiment.
[FIG. 6] Input voice signal power and voice signal power after speech speed conversion according to the present invention.
[FIG. 7] Voiced/silent determination table of the voiced/silent determination unit in the second embodiment.
[FIG. 8] Speech speed determination table of the speech speed determination unit in the second embodiment.
[FIG. 9] Block diagram of the third embodiment of the speech speed conversion device of the present invention.
[FIG. 10] Speech speed determination table of the speech speed determination unit in the fourth embodiment.
Explanation of reference numerals
[0025] 20, 26 Terminals
21 Voiced/silent determination unit
22 Speech speed conversion unit
23 Speech speed determination unit
24 Input accumulation amount calculation unit
25, 31 Speech head protection section determination unit
30 Estimated SNR determination unit
Best mode for carrying out the invention
[0026] Embodiments of the present invention are described below with reference to the drawings.
<First embodiment>
FIG. 4 is a block diagram of the first embodiment of the speech speed conversion device of the present invention. In the figure, a digital voice signal is input to terminal 20 frame by frame, one frame being 20 ms, and is supplied to a voiced/silent determination unit 21 and a speech speed conversion unit 22.
[0027] The voiced/silent determination unit 21 learns the noise level during initial silence, such as before an utterance begins, and sets the learned silence level plus, for example, 4 dB as the voiced threshold. It determines any section in which the input voice signal is at or above the voiced threshold to be a voiced determination section and supplies the determination result to a speech speed determination unit 23. For simplicity, the voiced determination here uses power (volume) alone, but it may instead use features such as frequency characteristics, and a fixed value may be used as the voiced threshold.
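The threshold rule of the voiced/silent determination unit can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the dB reference (full scale 1.0), and the choice to learn the noise level by averaging per-frame dB values are all assumptions; only the "noise level + 4 dB" threshold and the "at or above" comparison come from the text.

```python
import math

def frame_power_db(samples):
    """Mean power of one frame in dB relative to full scale 1.0 (assumed reference)."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(max(mean_square, 1e-12))  # floor avoids log(0)

def learn_noise_level_db(initial_frames):
    """Assumed learning rule: average the dB power of frames taken during initial silence."""
    levels = [frame_power_db(frame) for frame in initial_frames]
    return sum(levels) / len(levels)

def is_voiced(frame, noise_level_db, margin_db=4.0):
    """A frame is voiced when its power is at or above noise level + margin."""
    return frame_power_db(frame) >= noise_level_db + margin_db
```

A fixed threshold, which the text also allows, would simply replace `noise_level_db + margin_db` with a constant.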
[0028] The speech speed determination unit 23 receives the accumulated amount (number of accumulated frames) from an input accumulation amount calculation unit 24 and the speech head protection section (a variable number of frames) from a speech head protection section determination unit 25. It determines the speech speed from the voiced determination result, the accumulated amount, and the speech head protection section, and supplies this speech speed to the speech speed conversion unit 22 and the input accumulation amount calculation unit 24.
[0029] The speech speed conversion unit 22 writes the input voice signal into a buffer, reads the voice signal out of the buffer according to the speech speed from the speech speed determination unit 23, and outputs it from terminal 26. In a deletion section the data is simply discarded. To slow the speech speed, each frame is divided into, for example, about four subframes, and each subframe is played back repeatedly according to the expansion ratio. For 2x expansion, each subframe is played twice. For 1.5x expansion, odd-numbered subframes are played once and even-numbered subframes are played twice. A common approach here, as described in Japanese Patent No. 3147562, is to shift the join position on the basis of information such as correlation so that the subframes connect smoothly.
[0030] Instead of deleting the voice signal, the speech speed conversion unit 22 may compress it by increasing the speech speed. To compress by doubling the speech speed, for example, odd-numbered subframes are played once and even-numbered subframes are deleted.
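The subframe repetition and thinning described above can be illustrated with a short sketch. The function and mode names are hypothetical, the frame length is assumed to be divisible by the subframe count, and subframes are joined by plain concatenation; the correlation-based smoothing of the join position mentioned in the text is deliberately left out.

```python
def split_subframes(frame, count=4):
    """Split a frame into `count` equal subframes (length assumed divisible by count)."""
    size = len(frame) // count
    return [frame[i * size:(i + 1) * size] for i in range(count)]

# Play counts for (odd-numbered, even-numbered) subframes, numbering from 1.
PLAY_COUNTS = {
    "expand_2x": (2, 2),    # every subframe played twice
    "expand_1.5x": (1, 2),  # odd once, even twice
    "normal": (1, 1),       # 1x speed
    "compress_2x": (1, 0),  # odd once, even deleted
}

def convert_frame(frame, mode, count=4):
    """Return the output samples for one input frame under the given mode."""
    odd_plays, even_plays = PLAY_COUNTS[mode]
    out = []
    for number, sub in enumerate(split_subframes(frame, count), start=1):
        plays = odd_plays if number % 2 == 1 else even_plays
        out.extend(list(sub) * plays)
    return out
```

For an 8-sample frame `[0, 1, ..., 7]`, `convert_frame(frame, "compress_2x")` keeps subframes 1 and 3 and returns `[0, 1, 4, 5]`.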
[0031] The input accumulation amount calculation unit 24 calculates, from the speech speed supplied by the speech speed determination unit 23, the amount accumulated in the buffer of the speech speed conversion unit 22 and supplies it to the speech speed determination unit 23 and the speech head protection section determination unit 25. Specifically, for deletion the accumulated amount and the delay decrease by the number of frames deleted, and when the speech speed is 0.5x the accumulated amount increases by 20 ms per frame. The updated accumulated amount is used to determine the speech speed of the next frame.
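As a rough sketch of this bookkeeping, the accumulated amount can be tracked in milliseconds and updated once per processed frame. The function name and action labels are hypothetical, and clamping at zero is an added assumption; the per-frame changes (one frame less on deletion, 20 ms more at 0.5x speed, no change at 1x) follow the text.

```python
FRAME_MS = 20  # one frame is 20 ms in the description

def update_accumulated_ms(accumulated_ms, action):
    """Return the accumulated amount after one frame is handled with `action`."""
    if action == "delete":
        # A deleted frame removes one frame of backlog (never below zero).
        return max(0, accumulated_ms - FRAME_MS)
    if action == "expand_2x":
        # At 0.5x speed the backlog grows by 20 ms per frame.
        return accumulated_ms + FRAME_MS
    if action == "normal":
        # At 1x speed the backlog is unchanged.
        return accumulated_ms
    raise ValueError(f"unknown action: {action}")
```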
[0032] The speech head protection section determination unit 25 determines the speech head protection section (a variable number of frames) according to the accumulated amount. For example, when the accumulated amount (which corresponds to the delay from speech speed conversion) is 10 frames or less, the accumulated amount (number of accumulated frames) is used as the speech head protection section. When the accumulated amount is 10 frames or more, the speech head protection section is 10 frames.
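In other words, the speech head protection length is the accumulated amount capped at a limit. A one-function sketch (the function and parameter names are hypothetical; the limit of 10 frames is the example value from the text):

```python
def head_protection_frames(accumulated_frames, limit=10):
    """Speech head protection length: the accumulated frame count, capped at `limit`."""
    return min(max(accumulated_frames, 0), limit)
```

With no backlog the protection length is 0, so no extra delay is introduced just to protect the speech head.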
[0033] FIG. 5 shows the speech speed determination table of the speech speed determination unit 23 in the first embodiment. In a voiced section, the speech speed is 0.5x (2x expansion). However, if the processing delay is 1 second (= 50 frames) or more, deletion of the voice signal is prohibited and the speech speed is 1x.
[0034] In the speech head protection section, that is, when a voiced determination section lies within the number of frames determined by the speech head protection section determination unit 25, deletion of the voice signal is prohibited and the speech speed is 1x. Instead of prohibiting deletion, the compression ratio may be adjusted.
[0035] In the speech tail protection section, that is, when a voiced determination section lies within the past 10 frames, deletion of the voice signal is prohibited and the speech speed is 1x.
[0036] In the pause holding section, that is, the N-frame pause holding section after speech tail protection ends, deletion of the voice signal is prohibited and the speech speed is 1x. Here N = 13 - (length of the speech head protection section), with an upper limit of 10 frames and a lower limit of 5 frames on N.
[0037] The silence deletion section is any section other than the above; if there is a processing delay, the voice signal is deleted. If there is no processing delay, the speech speed is 1x.
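The decision table of the first embodiment can be sketched as a single function. This is an illustrative reading, not the patent's implementation: the function names and the way look-ahead and look-back are passed in (`frames_until_voiced`, `frames_since_voiced`, with `None` meaning "no voiced frame known") are assumptions. The numeric values (0.5x, the 50-frame limit, the 10-frame tail, and N = 13 minus the head protection length clamped to 5..10) come from the text.

```python
def pause_hold_frames(head_protection):
    """N = 13 - head protection length, clamped to the range 5..10 frames."""
    return max(5, min(10, 13 - head_protection))

def decide_speed(voiced_now, frames_until_voiced, frames_since_voiced,
                 accumulated_frames):
    """Return 0.5 (2x expansion), 1.0 (unchanged), or None (delete the frame)."""
    head = min(accumulated_frames, 10)   # variable speech head protection
    tail = 10                            # fixed speech tail protection
    pause = pause_hold_frames(head)      # pause holding follows the tail
    if voiced_now:
        return 1.0 if accumulated_frames >= 50 else 0.5
    if frames_until_voiced is not None and frames_until_voiced <= head:
        return 1.0                       # speech head protection
    if frames_since_voiced is not None and frames_since_voiced <= tail + pause:
        return 1.0                       # tail protection, then pause holding
    # Silence deletion section: delete only when there is a backlog to recover.
    return None if accumulated_frames > 0 else 1.0
```

Because the pause holding length shrinks as the head protection grows, a large backlog lengthens head protection while starting deletion earlier after each utterance.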
[0038] FIG. 6(A) shows, as a solid line, the approximate change over time of the input voice signal power (volume). Noise of steady power is superimposed on the voice signal, and the voiced threshold is set to that noise level + 4 dB. The lower part of FIG. 6(A) shows the determination result for each section; for the speech head protection section only the portion measured from the speech head is shown, and for the speech tail protection section only the portion measured from the speech tail. The first, second, fifth, and sixth utterances from the left are determined to be voiced sections, but the third and fourth utterances are buried in noise and are determined to be silent sections.
[0039] FIG. 6(B) shows the voice signal power after speech speed conversion.
[0040] Section (1) in FIG. 6(B): it is assumed that a processing delay (input accumulation) of 10 frames from speech speed conversion already exists at the start.
[0041] Sections (2) and (3): the first and second utterances are determined to be voiced sections, so they are expanded 2x (1/2 speed). Between sections (2) and (3), the output is at 1x speed because of speech head protection and speech tail protection.
[0042] Section (4): in the silent section following the third utterance, deletion starts earlier than in the conventional device by the amount by which the pause holding section (1x speed) has been shortened.
[0043] Section (5): the speech head protection for the fourth utterance is longer, so the cut-off speech head is eliminated.
[0044] Section (6): the fifth utterance is determined to be voiced, so it is expanded 2x.
[0045] Silent sections need to be shortened only when a delay has occurred, that is, when unprocessed voice signal data has accumulated. Therefore, by setting the speech head protection section according to the amount accumulated in the buffer of the speech speed conversion unit 22, capped at a predetermined value, speech head protection can be applied without increasing the delay. In addition, by varying the pause holding section according to the speech head protection section, more accurate speech head protection than before can be achieved without increasing the delay when the buffer accumulation is large.
<Second embodiment>
In the second embodiment, the operations of the voiced/silent determination unit 21 and the speech speed determination unit 23 shown in the block diagram of FIG. 4 differ from those of the first embodiment, so the operations of these two units are described.
[0046] FIG. 7 shows the voiced/silent determination table of the voiced/silent determination unit 21 in the second embodiment. The voiced/silent determination unit 21 learns the noise level during initial silence, such as before an utterance begins, sets the learned silence level plus, for example, 4 dB as the voiced threshold, and sets the learned silence level + 1 dB as the silence certainty determination value.
[0047] The voiced/silent determination unit 21 determines a section in which the input voice signal is at or above the voiced threshold to be a voiced determination section. If the input voice signal is at or below the voiced threshold and at or above the silence certainty determination value, it determines the section to be a silent section of low certainty; if the signal is at or below the silence certainty determination value, it determines the section to be a silent section of high certainty. It supplies the determination result to the speech speed determination unit 23.
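A minimal sketch of this three-way classification follows. The function name, the label strings, and the default margins (4 dB for the voiced threshold, 1 dB for the silence certainty determination value) are illustrative assumptions. The text's ranges overlap at the boundaries ("at or below" and "at or above"); here ties are resolved toward the more protective class, which is also an assumption.

```python
def classify_frame(power_db, noise_level_db,
                   voiced_margin_db=4.0, certainty_margin_db=1.0):
    """Classify one frame from its power and the learned noise level (both in dB)."""
    if power_db >= noise_level_db + voiced_margin_db:
        return "voiced"
    if power_db >= noise_level_db + certainty_margin_db:
        # Between the two thresholds: may be low-power speech buried in noise.
        return "silent_low_certainty"
    return "silent_high_certainty"
```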
[0048] FIG. 8 shows the speech speed determination table of the speech speed determination unit 23 in the second embodiment. In a voiced section, the speech speed is 0.5x (2x expansion). However, if the processing delay is 1 second (= 50 frames) or more, deletion of the voice signal is prohibited and the speech speed is 1x.
[0049] In the speech head protection section, that is, when a voiced determination section lies within the number of frames determined by the speech head protection section determination unit 25, or when the number of frames determined by the speech head protection section determination unit 25 is less than 10 and there is a silent section of low certainty, deletion of the voice signal is prohibited and the speech speed is 1x. Instead of prohibiting deletion, the compression ratio may be adjusted.
[0050] In the speech tail protection section, that is, when a voiced determination section lies within the past 10 frames, deletion of the voice signal is prohibited and the speech speed is 1x.
[0051] In the pause holding section, that is, the 10-frame pause holding section after speech tail protection ends, deletion of the voice signal is prohibited and the speech speed is 1x.
[0052] The silence deletion section is any section other than the above; if there is a processing delay, the voice signal is deleted. If there is no processing delay, the speech speed is 1x.
[0053] In this way, when the speech head protection section is shorter than 10 frames, the current frame is made a candidate for deletion or 1x speed only when the silence reliability of the current frame is high. This reduces the problem that the speech head is easily cut off when the speech head protection section is relatively short.
<Third embodiment>
FIG. 9 shows a block diagram of a third embodiment of the speech speed conversion device of the present invention. In the figure, parts identical to those in FIG. 4 are given the same reference numerals.
[0054] In FIG. 9, a digital voice signal in frame units, with one frame being 20 ms, is input to terminal 20 and supplied to the voiced/silent determination unit 21, the speech speed conversion unit 22, and the estimated SNR calculation unit 27.
[0055] The voiced/silent determination unit 21 learns the noise level during initial silence, such as before the start of an utterance, and sets the learned silence level plus, for example, 4 dB as the voiced threshold. It determines a section in which the input voice signal is at or above the voiced threshold to be a voiced determination section, and supplies the determination result to the speech speed determination unit 23. For simplicity, voiced determination here uses power (volume) only, but it may instead use features such as frequency characteristics, and a fixed value may be used as the voiced threshold.
[0056] The estimated SNR determination unit 30 estimates the SNR (signal-to-noise ratio) and determines whether the estimated SNR is high or low. As one method of estimating and judging the SNR, the difference between the maximum power (volume) and the minimum power over the past 30 seconds is obtained. If the difference exceeds a threshold (for example, 15 dB), the estimated SNR is regarded as high; if it is at or below the threshold, the estimated SNR is regarded as low.
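The SNR judgment of paragraph [0056] can be sketched with a sliding window of per-frame power values. This is an illustrative sketch only: the class name `SnrEstimator` is hypothetical, and it assumes that power is supplied once per 20 ms frame in dB, so that 30 seconds corresponds to 1500 frames.

```python
from collections import deque


class SnrEstimator:
    """Judges SNR as high or low from the spread of recent frame power."""

    def __init__(self, window_frames=1500, threshold_db=15.0):
        # 1500 frames x 20 ms = 30 s (assumed framing).
        self.powers = deque(maxlen=window_frames)
        self.threshold_db = threshold_db

    def update(self, power_db):
        """Add one frame's power in dB; return True if the SNR is high."""
        self.powers.append(power_db)
        spread = max(self.powers) - min(self.powers)
        return spread > self.threshold_db
```

Because the deque has a fixed `maxlen`, the oldest frame drops out automatically once the window is full, so a loud passage stops influencing the judgment 30 seconds after it ends.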
[0057] The speech speed determination unit 23 is supplied with the accumulation amount (number of accumulated frames) from the input accumulation amount calculation unit 24 and with the speech-head protection section (a variable number of frames) from the speech-head protection section determination unit 31. It determines the speech speed according to the voiced determination result, the accumulation amount, and the speech-head protection section, and supplies this speech speed to the speech speed conversion unit 22 and the input accumulation amount calculation unit 24.
[0058] The speech speed conversion unit 22 writes the input voice signal to a buffer, reads the voice signal from the buffer according to the speech speed supplied by the speech speed determination unit 23, and outputs it from terminal 26. In a deletion section, the data is simply discarded. To slow the speech speed, each frame is divided into, for example, about four subframes, and each subframe is played back repeatedly according to the expansion factor. For 2x expansion, each subframe is played twice. For 1.5x expansion, odd-numbered subframes are played once and even-numbered subframes are played twice.
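The subframe repetition described in paragraph [0058] can be sketched as follows. The function name `stretch_frame` is hypothetical, a frame is represented as a plain list of samples, and only the expansion factors named in the text (plus 1x) are handled. Subframes are numbered from 1, so "odd" and "even" follow the wording above.

```python
def stretch_frame(frame, factor, n_sub=4):
    """Expand one frame by repeating subframes; factor is 1.0, 1.5, or 2.0."""
    if len(frame) % n_sub != 0:
        raise ValueError("frame length must be a multiple of n_sub")
    size = len(frame) // n_sub
    subframes = [frame[i * size:(i + 1) * size] for i in range(n_sub)]

    out = []
    for index, sub in enumerate(subframes, start=1):
        if factor == 2.0:
            repeats = 2                           # every subframe twice
        elif factor == 1.5:
            repeats = 2 if index % 2 == 0 else 1  # even twice, odd once
        elif factor == 1.0:
            repeats = 1
        else:
            raise ValueError("unsupported expansion factor")
        out.extend(sub * repeats)
    return out
```

With an 8-sample frame, 2x expansion yields 16 samples and 1.5x expansion yields 12. Plain repetition can leave discontinuities at subframe boundaries; the sketch makes no attempt to smooth them.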
[0059] The input accumulation amount calculation unit 24 calculates the amount accumulated in the buffer of the speech speed conversion unit 22 on the basis of the speech speed supplied by the speech speed determination unit 23, and supplies it to the speech speed determination unit 23 and the speech-head protection section determination unit 31. Specifically, on deletion, the accumulation amount and the delay decrease by the number of deleted frames; when the speech speed is set to 0.5x, the accumulation amount increases by 20 ms for each frame. The updated accumulation amount is used to determine the speech speed of the next frame.
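The bookkeeping of paragraph [0059] can be sketched in units of frames. The names `update_accumulation`, `simulate`, and the `DELETE` marker are hypothetical. The sketch assumes one decision per 20 ms input frame, so a 0.5x frame adds one frame of backlog (20 ms of input produces 40 ms of output) and a deleted frame removes one.

```python
DELETE = "delete"  # hypothetical marker for a discarded frame


def update_accumulation(acc_frames, speed):
    """Return the buffer backlog, in frames, after one frame is processed."""
    if speed == DELETE:
        return max(acc_frames - 1, 0)  # one frame of delay recovered
    if speed == 0.5:
        return acc_frames + 1          # 20 ms of extra output is queued
    if speed == 1.0:
        return acc_frames              # input and output stay in step
    raise ValueError("unsupported speed")


def simulate(speeds, acc_frames=0):
    """Apply a sequence of per-frame decisions and record the backlog."""
    history = []
    for speed in speeds:
        acc_frames = update_accumulation(acc_frames, speed)
        history.append(acc_frames)
    return history
```

For instance, `simulate([0.5, 0.5, 1.0, DELETE, DELETE, DELETE])` gives `[1, 2, 2, 1, 0, 0]`: two stretched frames build up a backlog of two frames, which the following deletions then remove.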
[0060] The speech-head protection section determination unit 31 determines the speech-head protection section (a variable number of frames) according to the accumulation amount and the estimated SNR. For example, when the estimated SNR is low and the accumulation amount (which corresponds to the delay in speech speed conversion) is 10 frames or less, the accumulation amount (number of accumulated frames) is used as the speech-head protection section. When the accumulation amount is 10 frames or more, the speech-head protection section is set to 10 frames.
[0061] When the estimated SNR is high and the accumulation amount is 3 frames or less, the accumulation amount (number of accumulated frames) is used as the speech-head protection section. When the accumulation amount is 3 frames or more, the speech-head protection section is set to 3 frames.
[0062] In this embodiment, when the estimated SNR is high there is little risk of mistaking the head of an utterance for silence, so setting an excessively long protection section can be avoided.
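The rule of paragraphs [0060] and [0061] amounts to clamping the accumulation amount at an SNR-dependent limit. A minimal sketch, with the hypothetical name `head_protection_frames` and the example limits of 10 and 3 frames taken from the text:

```python
def head_protection_frames(acc_frames, snr_is_high,
                           limit_low_snr=10, limit_high_snr=3):
    """Return the speech-head protection length in frames."""
    limit = limit_high_snr if snr_is_high else limit_low_snr
    # The protection section can never exceed the buffered amount,
    # because only buffered frames can still be protected.
    return min(acc_frames, limit)
```

With 7 accumulated frames this returns 7 at low SNR and 3 at high SNR. With an empty buffer it returns 0 in both cases.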
<Fourth embodiment>
In the fourth embodiment, the operations of the voiced/silent determination unit 21 and the speech speed determination unit 23 shown in the block diagram of FIG. 4 differ from those of the third embodiment, so the operations of these two units are described.
[0063] The voiced/silent determination table of the voiced/silent determination unit 21 in the fourth embodiment is as shown in FIG. 7. The voiced/silent determination unit 21 learns the noise level during initial silence, such as before the start of an utterance, sets the learned silence level plus, for example, 4 dB as the voiced threshold, and sets the learned silence level + 1 dB as the silence certainty determination value.
[0064] The voiced/silent determination unit 21 determines a section in which the input voice signal is at or above the voiced threshold to be a voiced determination section. If the input voice signal is at or below the voiced threshold and at or above the silence certainty determination value, the section is determined to be a low-certainty silent section; if it is at or below the silence certainty determination value, the section is determined to be a high-certainty silent section. The determination result is supplied to the speech speed determination unit 23.
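The three-way judgment of paragraphs [0063] and [0064] can be sketched as a comparison against two levels derived from the learned noise floor. The function name `classify_frame` and the string labels are hypothetical. Because the text uses "at or above" and "at or below" on both sides of each boundary, the sketch assumes that a value exactly on a boundary falls into the upper class.

```python
def classify_frame(power_db, noise_db,
                   voiced_offset_db=4.0, certainty_offset_db=1.0):
    """Classify one frame as 'voiced', 'silent_low', or 'silent_high'."""
    voiced_threshold = noise_db + voiced_offset_db    # learned level + 4 dB
    certainty_value = noise_db + certainty_offset_db  # learned level + 1 dB

    if power_db >= voiced_threshold:
        return "voiced"
    if power_db >= certainty_value:
        return "silent_low"   # close to the threshold: silence is uncertain
    return "silent_high"      # well below the threshold: silence is certain
```

With a learned noise level of -50 dB, the voiced threshold is -46 dB and the certainty value is -49 dB, so a frame at -47 dB is classed as low-certainty silence.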
[0065] FIG. 10 shows the speech speed determination table of the speech speed determination unit 23 in the fourth embodiment. In a voiced section, the speech speed is set to 0.5x (2x expansion). However, when the processing delay time is 1 second (= 50 frames) or more, deletion of the voice signal is prohibited and the speech speed is set to 1x.
[0066] In the speech-head protection section, that is, when a voiced determination section exists within the number of frames determined by the speech-head protection section determination unit 25, deletion of the voice signal is prohibited and the speech speed is set to 1x. However, when the current frame and the following 3 frames are all high-certainty silent sections, speech-head protection is not performed.
[0067] In the speech-tail protection section, that is, when a voiced determination section exists within the past 10 frames, deletion of the voice signal is prohibited and the speech speed is set to 1x. Instead of prohibiting deletion, the compression rate may be adjusted.
[0068] In the pause holding section, that is, the 10 frames following the end of speech-tail protection, deletion of the voice signal is prohibited and the speech speed is set to 1x.
[0069] The silence deletion section is any section other than those above. When there is a processing delay, the voice signal is deleted. When there is no processing delay, the speech speed is set to 1x.
[0070] In this embodiment, when the silence certainty of the current frame and the following 3 frames is high, there is little risk of mistaking the head of an utterance for silence, so setting an excessively long protection section can be avoided.
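The exception in paragraph [0066] can be sketched as a check on the buffered frames ahead of the current one. The function name `head_protection_applies` and the string labels are hypothetical. The sketch assumes that `lookahead[0]` is the current frame and that protection is kept whenever fewer than four frames are available to confirm high-certainty silence.

```python
def head_protection_applies(lookahead, head_protect_frames):
    """Decide whether the current silent frame gets speech-head protection.

    lookahead: frame classes ('voiced', 'silent_low', 'silent_high'),
    starting with the current frame and followed by buffered frames.
    """
    # A voiced frame must lie within the protection window ahead.
    window = lookahead[:head_protect_frames + 1]
    if "voiced" not in window[1:]:
        return False
    # Exception: current frame and the next 3 are all certain silence.
    if len(lookahead) >= 4 and all(c == "silent_high" for c in lookahead[:4]):
        return False
    return True
```

For example, four high-certainty silent frames followed by a voiced frame give `False` even with a 5-frame protection window, so the current frame becomes a candidate for deletion.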
[0071] The speech-head protection section determination units 25 and 31 correspond to the speech-head protection section determination means recited in the claims; the speech speed determination unit 23 corresponds to the speech-head protection means and the pause holding section setting means; the voiced/silent determination unit 21 corresponds to the silence certainty determination means; and the estimated SNR determination unit 30 corresponds to the signal-to-noise ratio estimation means.

Claims

[1] A speech speed conversion method in which an input voice signal is accumulated in a buffer and the speech speed is converted such that, in a voiced section in which the power of the input voice signal exceeds a threshold, the voice signal read from the buffer is output as is or expanded, and, in a silent section, the voice signal read from the buffer is output as is, compressed, or deleted, wherein
a speech-head protection section set ahead of the voiced section is set to the accumulation amount of the buffer limited by a predetermined limit value, and
when the voiced section lies within the speech-head protection section, speech-head protection is performed by prohibiting compression or deletion of the voice signal or by adjusting the compression rate.
[2] The speech speed conversion method according to claim 1, wherein
the length of a pause holding section, which is set after the end of a speech-tail protection section of predetermined length following the voiced section, is set according to the length of the speech-head protection section.
[3] The speech speed conversion method according to claim 1 or 2, wherein
a silence certainty is determined in a silent section in which the power of the input voice signal is below the threshold, and, when the silence certainty of a silent section within the speech-head protection section is low, speech-head protection is performed by prohibiting compression or deletion of the voice signal or by adjusting the compression rate.
[4] The speech speed conversion method according to any one of claims 1 to 3, wherein
the signal-to-noise ratio of the input voice signal is estimated, and
the limit value for the speech-head protection section when the estimated signal-to-noise ratio is higher than a fixed value is set smaller than the limit value for the speech-head protection section when the estimated signal-to-noise ratio is lower than the fixed value.
[5] A speech speed conversion device that accumulates an input voice signal in a buffer and converts the speech speed such that, in a voiced section in which the power of the input voice signal exceeds a threshold, the voice signal read from the buffer is output as is or expanded, and, in a silent section, the voice signal read from the buffer is output as is, compressed, or deleted, the device comprising:
speech-head protection section determination means for setting a speech-head protection section, which is set ahead of the voiced section, to the accumulation amount of the buffer limited by a predetermined limit value; and
speech-head protection means for performing speech-head protection, when the voiced section lies within the speech-head protection section, by prohibiting compression or deletion of the voice signal or by adjusting the compression rate.
[6] The speech speed conversion device according to claim 5, further comprising
pause holding section setting means for setting the length of a pause holding section, which is set after the end of a speech-tail protection section of predetermined length following the voiced section, according to the length of the speech-head protection section.
[7] The speech speed conversion device according to claim 5 or 6, further comprising
silence certainty determination means for determining a silence certainty in a silent section in which the power of the input voice signal is below the threshold, wherein
the speech-head protection means performs speech-head protection, when the silence certainty of a silent section within the speech-head protection section is low, by prohibiting compression or deletion of the voice signal or by adjusting the compression rate.
[8] The speech speed conversion device according to any one of claims 5 to 7, further comprising
signal-to-noise ratio estimation means for estimating the signal-to-noise ratio of the input voice signal, wherein
the speech-head protection section determination means sets the limit value for the speech-head protection section when the estimated signal-to-noise ratio is higher than a fixed value smaller than the limit value for the speech-head protection section when the estimated signal-to-noise ratio is lower than the fixed value.
[9] A speech speed conversion device that accumulates an input voice signal in a buffer and, when compressing or expanding the voice signal read from the buffer, converts the speech speed so that a voiced section in which the power of the input voice signal exceeds a threshold is slower than a silent section in which the power is below the threshold, the device comprising:
speech-head protection section determination means for setting a speech-head protection section, which is set ahead of the voiced section, to the accumulation amount of the buffer limited by a predetermined limit value; and
speech-head protection means for performing speech-head protection, when the voiced section lies within the speech-head protection section, by prohibiting compression or deletion of the voice signal or by adjusting the compression rate.

Priority Applications (4)

Application Number Priority Date Filing Date Title
PCT/JP2005/000549 WO2006077626A1 (en) 2005-01-18 2005-01-18 Speech speed changing method, and speech speed changing device
JP2006553780A JP4630876B2 (en) 2005-01-18 2005-01-18 Speech speed conversion method and speech speed converter
EP05703786A EP1840877A4 (en) 2005-01-18 2005-01-18 Speech speed changing method, and speech speed changing device
US11/778,720 US7912710B2 (en) 2005-01-18 2007-07-17 Apparatus and method for changing reproduction speed of speech sound

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2005/000549 WO2006077626A1 (en) 2005-01-18 2005-01-18 Speech speed changing method, and speech speed changing device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US11/778,720 Continuation US7912710B2 (en) 2005-01-18 2007-07-17 Apparatus and method for changing reproduction speed of speech sound

Publications (1)

Publication Number Publication Date
WO2006077626A1 true WO2006077626A1 (en) 2006-07-27

Family

ID=36692024

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2005/000549 WO2006077626A1 (en) 2005-01-18 2005-01-18 Speech speed changing method, and speech speed changing device

Country Status (4)

Country Link
US (1) US7912710B2 (en)
EP (1) EP1840877A4 (en)
JP (1) JP4630876B2 (en)
WO (1) WO2006077626A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4583781B2 (en) * 2003-06-12 2010-11-17 アルパイン株式会社 Audio correction device
EP1770688B1 (en) * 2004-07-21 2013-03-06 Fujitsu Limited Speed converter, speed converting method and program
JP4390289B2 (en) * 2007-03-16 2009-12-24 国立大学法人電気通信大学 Playback device
US9269366B2 (en) * 2009-08-03 2016-02-23 Broadcom Corporation Hybrid instantaneous/differential pitch period coding
FR2979465B1 (en) * 2011-08-31 2013-08-23 Alcatel Lucent METHOD AND DEVICE FOR SLOWING A AUDIONUMERIC SIGNAL
JP5977528B2 (en) * 2012-01-31 2016-08-24 シャープ株式会社 SPEED SPEED CONVERSION DEVICE, SPEED SPEED CONVERSION METHOD, AND PROGRAM
US10878835B1 (en) * 2018-11-16 2020-12-29 Amazon Technologies, Inc System for shortening audio playback times

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4591928A (en) 1982-03-23 1986-05-27 Wordfit Limited Method and apparatus for use in processing signals
JPH0573089A (en) * 1991-09-18 1993-03-26 Matsushita Electric Ind Co Ltd Speech reproducing method
JPH06337696A (en) * 1993-05-28 1994-12-06 Matsushita Electric Ind Co Ltd Device and method for controlling speed conversion
EP0643380A2 (en) 1993-09-10 1995-03-15 Hitachi, Ltd. Speech speed conversion method and apparatus
JP2000305580A (en) * 1999-04-23 2000-11-02 Roland Corp Silence determination method and device and computer readable recording medium
JP2001056696A (en) * 1999-08-18 2001-02-27 Nippon Telegr & Teleph Corp <Ntt> Method and device for voice storage and reproduction
JP2001222300A (en) * 2000-02-08 2001-08-17 Nippon Hoso Kyokai <Nhk> Voice reproducing device and recording medium
GB2396271A (en) 2002-12-10 2004-06-16 Motorola Inc A user terminal and method for voice communication

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2612868B2 (en) * 1987-10-06 1997-05-21 日本放送協会 Voice utterance speed conversion method
US5475791A (en) * 1993-08-13 1995-12-12 Voice Control Systems, Inc. Method for recognizing a spoken word in the presence of interfering speech
US6216103B1 (en) * 1997-10-20 2001-04-10 Sony Corporation Method for implementing a speech recognition system to determine speech endpoints during conditions with background noise
US6711536B2 (en) * 1998-10-20 2004-03-23 Canon Kabushiki Kaisha Speech processing apparatus and method
US6453291B1 (en) * 1999-02-04 2002-09-17 Motorola, Inc. Apparatus and method for voice activity detection in a communication system
US6324509B1 (en) * 1999-02-08 2001-11-27 Qualcomm Incorporated Method and apparatus for accurate endpointing of speech in the presence of noise
US6377931B1 (en) * 1999-09-28 2002-04-23 Mindspeed Technologies Speech manipulation for continuous speech playback over a packet network
US6885987B2 (en) * 2001-02-09 2005-04-26 Fastmobile, Inc. Method and apparatus for encoding and decoding pause information
JP4583781B2 (en) * 2003-06-12 2010-11-17 アルパイン株式会社 Audio correction device
US7412376B2 (en) * 2003-09-10 2008-08-12 Microsoft Corporation System and method for real-time detection and preservation of speech onset in a signal
US20050114118A1 (en) * 2003-11-24 2005-05-26 Jeff Peck Method and apparatus to reduce latency in an automated speech recognition system
US20050227657A1 (en) * 2004-04-07 2005-10-13 Telefonaktiebolaget Lm Ericsson (Publ) Method and apparatus for increasing perceived interactivity in communications systems
EP1770688B1 (en) * 2004-07-21 2013-03-06 Fujitsu Limited Speed converter, speed converting method and program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP1840877A4

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008107706A (en) * 2006-10-27 2008-05-08 Yamaha Corp Speech speed conversion apparatus and program
WO2009011021A1 (en) * 2007-07-13 2009-01-22 Panasonic Corporation Speaking speed converting device and speaking speed converting method
US8392197B2 (en) 2007-08-22 2013-03-05 Nec Corporation Speaker speed conversion system, method for same, and speed conversion device
WO2009025142A1 (en) * 2007-08-22 2009-02-26 Nec Corporation Speaker speed conversion system, its method and speed conversion device
JP2009210712A (en) * 2008-03-03 2009-09-17 Yamaha Corp Sound processor and program
JP2010210947A (en) * 2009-03-10 2010-09-24 Panasonic Electric Works Co Ltd Voice speed conversion device
JP2010266778A (en) * 2009-05-18 2010-11-25 Panasonic Corp Reproduction device
WO2011027437A1 (en) * 2009-09-02 2011-03-10 富士通株式会社 Voice reproduction device and voice reproduction method
JPWO2011027437A1 (en) * 2009-09-02 2013-01-31 富士通株式会社 Audio playback apparatus and audio playback method
US8457955B2 (en) 2009-09-02 2013-06-04 Fujitsu Limited Voice reproduction with playback time delay and speed based on background noise and speech characteristics
JP2013148654A (en) * 2012-01-18 2013-08-01 Nippon Hoso Kyokai <Nhk> Speech speed conversion device and program thereof, and recording medium recording program
JP2014115546A (en) * 2012-12-12 2014-06-26 Fujitsu Ltd Speech processing device, speech processing method, and speech processing program
JP2014157331A (en) * 2013-02-18 2014-08-28 Nippon Hoso Kyokai <Nhk> Speech speed conversion device, method and program

Also Published As

Publication number Publication date
US7912710B2 (en) 2011-03-22
EP1840877A1 (en) 2007-10-03
JPWO2006077626A1 (en) 2008-06-12
EP1840877A4 (en) 2008-05-21
US20070265839A1 (en) 2007-11-15
JP4630876B2 (en) 2011-02-09

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase (Ref document number: 2006553780; Country of ref document: JP)
WWE Wipo information: entry into national phase (Ref document number: 2005703786; Country of ref document: EP)
WWE Wipo information: entry into national phase (Ref document number: 11778720; Country of ref document: US)
NENP Non-entry into the national phase (Ref country code: DE)
WWP Wipo information: published in national office (Ref document number: 2005703786; Country of ref document: EP)
WWP Wipo information: published in national office (Ref document number: 11778720; Country of ref document: US)