JP3553828B2

JP3553828B2 - Voice storage and playback method and voice storage and playback device

Info

Publication number: JP3553828B2
Application number: JP23097299A
Authority: JP
Inventors: 享邦西田; 昌洋渡辺; みづほ井上; 義武鈴木
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1999-08-18
Filing date: 1999-08-18
Publication date: 2004-08-11
Anticipated expiration: 2019-08-18
Also published as: JP2001056696A

Description

【０００１】
【発明の属する技術分野】
本発明は，音声通信システム等において，自然な通話の実現を可能とした音声蓄積再生方法および音声蓄積再生装置に関する。
【０００２】
【従来の技術】
音声通信を半２重通信路やネットワーク上で行うときなど効率的に音声を伝送するために，いわゆるボイス（音声）スイッチを用い，音声を送るときには音声回線を開いて音声を送り，無音時には音声回線を閉じて他のユーザが音声を送信できるようにしたり，他のデータを送受できるようにしたシステムがある。このシステムでは，送信側において，音声パワー計測によって音声パワーがある閾値を越えたときに，語頭と判断して音声回線を開くようにしたり，音素認識技術を用いることにより語頭検出精度を高め，語頭（話頭）切断を防止していた。
【０００３】
しかしながら，背景雑音が大きなときには，語頭における音声パワーは背景雑音に対し小さく，また，音声認識率も低くなることから，語頭検出誤りによる欠落が生じやすくなり，音声通話は非常に不自然かつ不明瞭になり易いという問題点があった。
【０００４】
図５に「北見」と発声したときの波形と音声パワー，および音声スイッチがＯＮしている期間の例を示す。図５の例から明らかなように，語頭の「ｋ」の部分は音声パワーが閾値に達していないため，音声スイッチのＯＮが遅れ，これにより話頭の「き」の「ｋ」の部分が欠落することがわかる。このように，話頭切断は，話頭が子音部のような音声パワーの小さいときに生じ易く，母音など音声パワーの大きなときには生じにくい。日本語を考えると，通常，音声は子音＋母音の組合せが多い。そこで，上記問題点を解決するために，常に音声を一旦蓄積しておき，音声パワーの大きな母音部等で語頭が検出されたときに，ある一定期間さかのぼった時点から音声を再生し語頭欠落を防止する方法がある。
【０００５】
【発明が解決しようとする課題】
しかしながら，音声の蓄積により，音声遅延が生じ通話は非常に不自然なものになってしまう。通常人間が会話するときには，相手の発言が終ってから発声することが多いと考えられる。例えば，１００ｍｓｅｃの音声蓄積を行うことを考えると，発声者が，発声し終ってから相手に発声の終了がわかるまでに，回線遅延がなかったとして１００ｍｓｅｃかかり，その後，相手が発声し最初の発声が相手の発声開始を知るまでに１００ｍｓｅｃかかることになる。そのため，会話間の無音部分は，合計２００ｍｓｅｃとなり，スムーズな会話が阻害される。
【０００６】
本発明は，以上のような音声スイッチを実現するときに問題となる話頭切断を防止し，かつ音声遅延を生じさせないで自然な通話を可能とすることを目的とする。
【０００７】
【課題を解決するための手段】
本発明は，上記課題を解決するため，話頭部分では，音声蓄積部から過去の音声を話速変換することにより速く再生し，音声蓄積部に過去の音声データがなくなったところで，入力されている音声データを等速度で再生し，語尾において音声遅延が生じないようにする。上記方法により，先に示した音声を蓄積しておいて話頭切断を防止するだけのシステムを用いたときの会話間の無音部分は，多くても１００ｍｓｅｃとなり，スムーズに会話することが可能となる。
【０００８】
これにより，話頭切断による会話の不明瞭さを防止し，音声遅延による会話の不自然さを防止することができる。
【０００９】
ところで，話速を変換する装置として，特開平８−８３０９５号公報「話速変換方法および装置」や，特開平８−２０２３９１号公報「話速変換装置」に記載されているものがある。これらは，受聴者の聞き取り能力に合った話速度で入力音声信号を出力する装置であり，主に話速度を下げる制御を行う。また，話速度は，音素や音声処理フレームに対しては変動するが，一つの文といった大枠で，話速度が設定されるので，本発明のように，話頭部分で速く再生し，途中から等速再生し語尾において遅延をなくすことはできない。
【００１０】
【発明の実施の形態】
図１は，本発明の構成例を示すブロック図である。図１において，１は音声を入力し，入力音声が音声区間かどうかを判別する音声検出部，２は入力された音声を蓄積する音声蓄積部であるリングバッファ，３は入力音声をリングバッファ２に格納し入力ポインタおよび出力ポインタを更新する制御を行うリングバッファ制御部，４は音声検出部１において音声が検出されたときに，どのくらい時間をさかのぼった時点の蓄積された音声から再生するかを決め，リングバッファ２に蓄積された音声のうち話頭部分を速く再生し，入力音声に追いついたところで等速度再生する制御を行う話速制御部，５は話速制御部４の制御のもとにリングバッファ２に蓄積された音声の話速を変換する話速変換器を表す。
【００１１】
図１の装置に入力された音声は，音声検出部１において音声パワー等が計測され，リングバッファ２に蓄えられる。また，音声検出部１は，常に背景雑音パワーを計測し，音声区間検出のための閾値を動的に変化させる。
【００１２】
発声者が発声しないときには，入力音声は常に過去のデータを保持しながら次々とリングバッファ２に蓄えられる。音声検出部１で音声が検出されると通知がリングバッファ制御部３および話速制御部４へ送られる。リングバッファ制御部３では，今現在書き込まれている音声データの格納されているポインタ，および過去の音声データが書き込まれているポインタを把握しているので，過去の音声データが存在していること，またどのくらい過去のデータが蓄えられているかを話速制御部４に通知する。
【００１３】
話速制御部４では，リングバッファ制御部３から受け取ったデータにより，話速変換器５に話速度を通知し，ある特定の時間内に過去のデータを全て再生し，今現在書き込まれている入力音声データのポインタに過去の入力音声データのポインタが追いつくようにする。
【００１４】
例えば，蓄積されている過去のデータが，１００ｍｓｅｃ分あり，１００ｍｓｅｃで追いつくようにしようと考えると，再生速度は２倍ということになり，この情報を話速変換器５に通知する。逆に，目標とする時間を設定せず，話速変換器５に通知する話速度は，常に２倍とすることも考えられる。
【００１５】
ところで，通常人間が話速を調整するとき，無音部や母音部の長さが大きく変化するが，子音部の速度は変化しない。逆に子音部の速度を変化させずにポーズ部や母音部の速度を変化させても，聞き取りに大きな劣化は生じない。つまり，音素により認知できる最小の継続時間が違うので，音声検出部１に音素認識を用いたときには，再生データの音素によって細かく動的に話速度を変化させることで，さらに違和感のない通話が可能となる。
【００１６】
そこで，音素認識を用いたときには，蓄積されている音声データのどこからどこまでがどの音素なのかという情報も蓄積されているので，それぞれの区間における音素に対する最小継続時間が保証される再生速度を話速変換器５に通知する。ただし，あまりにも大きな速度になると違和感が増大するので，もし，あらかじめ定められた最大速度を越えるようなときには，最大速度を話速変換器５に通知する。例えば，「おーがき」と発声したときに，「おがき」と再生されることを防ぐ役割を持たせる。
【００１７】
加えて，先に説明した認知できる最小の継続時間は，ポーズ部，無音摩擦音，破裂音，母音等，ある似通った音素間での違いは小さいので，厳密に，処理量が大きな音素認識をせず，ポーズ部，無声摩擦音，破裂音，母音等といった処理量の小さな音素の大分類を用いて，再生速度を決定することも考えられる。
【００１８】
話速変換器５は，リングバッファ２から音声フレームデータを取り出し，話速制御部４から指定された速度に応じてフレームデータを圧縮することで，フレームデータ数を減少させる。音声出力では，定期的な周期でフレームデータの１サンプル毎に再生されるので，フレームデータの削減により，話速度が大きくなる。入力音声データに追いついたところで，話速度を入力音声と同じとする。
【００１９】
図２に「北見」と発声したときの話速の変化再生される音声の例を示す。図２（ａ）は，音素認識せずにパワーのみで音声を検出し，一定速度で現在の音声データに追いつくように再生をしたとき，図２（ｂ）は，音素認識を行い，音素の種類によって再生速度を変化させる可変速度で再生をしたときの様子を示している。便宜上，音声データの単位をフレームと呼ぶ。また，簡単のため速度変化を音声フレーム数を間引くことにより表現している。「＊」は無音部を表す。
【００２０】
図２（ａ）のとき，１５フレームの「ａ」で入力音声に再生音声が到達するが，そこに到達するまで，再生速度を２倍にして再生し，その後は，入力音声に対し等速度で再生する。図２（ｂ）のとき，１〜２フレーム目の「＊」は，破裂音に先行する無音部なので，フレームを１つ飛ばした速度で再生する。３〜４フレームの「ｋ」は，子音部なので，そのまま再生する。５〜８フレームの「ｉ」は，人間が母音を認知するのに必要なフレーム数を確保するために，例えば３フレームにして再生する。９〜１０フレームの「＊」は，１フレームにする。１１〜１２フレームの「ｔ」は，子音部なのでそのまま再生する。１３〜１７フレームの「ａ」は，「ｉ」と同様の理由により５フレームを３フレームにして再生する。これ以降過去の音声データはなくなるので，そのまま再生する。
【００２１】
図２からわかるように，語尾において入力音声フレームと再生音声フレームとは一致しているので，語尾において音声遅延はなくなる。また，語尾においては，音声区間終端が検出されたときには，リングバッファ制御部３は，出力音声データのポインタ（以下，出力ポインタという）を停止させる。入力音声データのポインタ（以下，入力ポインタという）が，出力ポインタに追いついた時点で，出力ポインタを進ませる。これにより，音声終端が検出され再生が終った後に，すぐに音声区間が検出されたときに，２重に音声が再生されることを防ぐ。
【００２２】
音素認識せずにパワーのみで音声検出をする場合の制御例を図３に，音素認識をして，音素の種類により再生速度を変化させる場合の制御例を図４に示す。
【００２３】
音素認識をせずに，音声（音響）パワーのみで音声検出をする場合，まず，音声検出部１では，音声区間を音声パワーと閾値との大小比較により検出する（Ｓ１）。リングバッファ２には，リングバッファ制御部３によって常時入力音声が蓄積される。話速制御部４は，音声検出部１から通知を受け，話速変換器５が参照する話速レジスタ（図示省略）に目標話速度を設定する（Ｓ２）。
【００２４】
話速変換器５は，話速制御部４の制御のもとにリングバッファ２から音声フレームデータを取り込み（Ｓ３），話速レジスタ値に準じた話速変換を行う（Ｓ４）。その変換した音声フレームデータを出力バッファ（図示省略）へ書き出し（Ｓ５），リングバッファ２の出力ポインタをインクリメントする（Ｓ６）。
【００２５】
リングバッファ２の入力ポインタが出力ポインタに追いついたかどうかをチェックし（Ｓ７），追いついていない場合，ステップＳ３へ戻って，同様に目標話速度の速い速度による音声再生出力を繰り返す。入力ポインタが出力ポインタに追いついた場合には，話速レジスタに等速度を設定して（Ｓ８），ステップＳ３へ戻り，入力音声の速度と同じ速度で音声を再生する。以上の処理を音声区間が終了するまで繰り返す。
【００２６】
音素認識を行い，音素の種類により再生速度を変化させる場合の制御は，図４に示すように行われる。この方法では，あらかじめ音素に対する最小継続時間が格納されたテーブル１０を用意しておく。
【００２７】
まず，音声検出部１では，入力音声について音素認識を行い，その認識結果によって音声区間を検出する（Ｓ１０）。このとき音声パワーも考慮し，音声パワーによる音声区間の検出を併用してもよい。リングバッファ２には，リングバッファ制御部３によって，常時入力音声が蓄積され，入力ポインタがその都度更新される。また，音声検出部１による音素認識の結果も併せてリングバッファ２に蓄積される。
【００２８】
音声区間が検出されると，リングバッファ２から音声フレームデータを取り込み（Ｓ１１），それに対応する音素認識結果を話速制御部４に取り込む。話速制御部４は，先に処理していた音声フレームデータの音素と今から処理しようとしている音声フレームデータの音素は同じかどうかを判定する（Ｓ１２）。同じ場合には，ステップＳ１４へ進む。違う音素であれば，ステップＳ１３へ進み，音素の継続時間を調べ，音素に対する最小継続時間テーブル１０から最小継続時間を読み出し，所定の最高話速度を越えないように求められた話速度を話速レジスタに設定する（Ｓ１３）。その後，ステップＳ１４へ進む。
【００２９】
話速変換器５は，リングバッファ２から取り込まれた音声フレームデータについて，話速レジスタ値に準じた話速変換を行う（Ｓ１４）。その変換した音声フレームデータを出力バッファ（図示省略）へ書き出し（Ｓ１５），リングバッファ２の出力ポインタをインクリメントする（Ｓ１６）。
【００３０】
リングバッファ２の入力ポインタが出力ポインタに追いついたかどうかをチェックし（Ｓ１７），追いついていない場合，ステップＳ１１へ戻って，同様に可変速度による音声再生出力を繰り返す。入力ポインタが出力ポインタに追いついた場合には，話速レジスタに等速度を設定して（Ｓ１８），リングバッファ２から次の音声フレームデータを取り込み，ステップＳ１４へ戻って，入力音声の速度と同じ速度で音声を再生する。以上の処理を音声区間が終了するまで繰り返す。
【００３１】
【発明の効果】
以上のとおり，本発明により，音声スイッチを実現するときの問題となる話頭切断を防止し，なおかつ音声遅延を生じさせず，自然な通話を実現することができるようになる。
【図面の簡単な説明】
【図１】本発明の構成例を示すブロック図である。
【図２】話速変換の様子を示す図である。
【図３】音素認識せずにパワーのみで音声検出をする場合の制御フローを示す図である。
【図４】音素の種類により再生速度を変化させる場合の制御フローを示す図である。
【図５】音声波形と音声パワー，および音声スイッチの動作の関係を説明する図である。
【符号の説明】
１音声検出部
２リングバッファ
３リングバッファ制御部
４話速制御部
５話速変換器

[0001]
[Technical field of the invention]
The present invention relates to a voice storage and playback method and a voice storage and playback device that enable natural conversation in voice communication systems and the like.
[0002]
[Prior art]
To transmit voice efficiently, for example when voice communication is carried out over a half-duplex channel or a network, some systems use a so-called voice switch: the voice line is opened to send voice while the user is speaking, and closed during silence so that other users can transmit voice or other data can be sent and received. In such systems, the transmitting side measures the voice power and, when the power exceeds a certain threshold, judges that a word has begun and opens the voice line; alternatively, phoneme recognition is used to improve the accuracy of word-onset detection. Both approaches aim to prevent clipping of the beginning of speech (onset clipping).
[0003]
However, when the background noise is loud, the voice power at the beginning of a word is small relative to the noise and the speech recognition rate also drops. Word-onset detection errors therefore readily cause dropouts, and the call tends to become very unnatural and unclear.
[0004]
FIG. 5 shows an example of the waveform and voice power when "Kitami" is uttered, together with the period during which the voice switch is ON. As is apparent from FIG. 5, the voice power of the initial "k" does not reach the threshold, so the voice switch turns ON late and the "k" of the initial "ki" is lost. Onset clipping thus tends to occur when speech begins with a low-power sound such as a consonant, and is less likely when it begins with a high-power sound such as a vowel. In Japanese, speech usually consists largely of consonant + vowel combinations. One way to address this problem is therefore to always store the voice temporarily and, when the word onset is detected in a high-power portion such as a vowel, to start playback from a point a fixed period earlier, thereby preventing loss of the word onset.
[0005]
[Problems to be solved by the invention]
However, storing the voice introduces a voice delay, which makes the call very unnatural. People normally tend to start speaking after the other party has finished. For example, with 100 msec of voice storage, even with no line delay it takes 100 msec after the speaker finishes before the other party learns that the utterance has ended. When the other party then speaks, it takes another 100 msec before the first speaker learns that the other party has started. The silent gap between turns therefore totals 200 msec, which hinders smooth conversation.
[0006]
An object of the present invention is to prevent the onset clipping that is a problem when implementing such a voice switch, while enabling natural calls without introducing voice delay.
[0007]
[Means for Solving the Problems]
To solve the above problem, the present invention plays the beginning of speech quickly by applying speech speed conversion to the past voice held in the voice storage unit, and, once no past voice data remains in the voice storage unit, plays the incoming voice data at normal speed (the same speed as the input), so that no voice delay occurs at the end of the utterance. With this method, the silent gap between turns is at most 100 msec, compared with the 200 msec of the system described above that merely stores voice to prevent onset clipping, so conversation can proceed smoothly.
[0008]
This prevents both the loss of clarity caused by onset clipping and the unnaturalness caused by voice delay.
[0009]
Devices for converting speech speed are described in JP-A-8-83095, "Speech speed conversion method and device", and JP-A-8-202391, "Speech speed conversion device". These devices output the input voice signal at a speech speed matched to the listening ability of the listener, and mainly perform control to lower the speech speed. Although the speech speed varies at the level of phonemes or speech processing frames, it is set over a broad unit such as a whole sentence. They therefore cannot, as in the present invention, play the beginning of speech quickly, switch to normal-speed playback partway through, and eliminate the delay at the end of the utterance.
[0010]
[Embodiments of the invention]
FIG. 1 is a block diagram showing a configuration example of the present invention. In FIG. 1, 1 is a voice detection unit that receives voice and determines whether the input voice is a voice section; 2 is a ring buffer serving as the voice storage unit that stores the input voice; 3 is a ring buffer control unit that stores the input voice in the ring buffer 2 and updates the input pointer and the output pointer; 4 is a speech speed control unit that, when voice is detected by the voice detection unit 1, decides how far back in the stored voice playback should start, and controls playback so that the beginning of the speech stored in the ring buffer 2 is played quickly and playback switches to normal speed once it catches up with the input voice; and 5 is a speech speed converter that converts the speech speed of the voice stored in the ring buffer 2 under the control of the speech speed control unit 4.
[0011]
Voice input to the device of FIG. 1 has its voice power and other properties measured by the voice detection unit 1 and is stored in the ring buffer 2. The voice detection unit 1 also continuously measures the background noise power and dynamically changes the threshold used for voice section detection.
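The following is a minimal sketch of power-based voice detection with a threshold that follows the background noise. The names `detect_voice`, `margin`, and `alpha`, the smoothing formula, and the choice to update the noise estimate only during non-voice frames are all illustrative assumptions; the source does not specify how the threshold is adapted.

```python
def detect_voice(frame_powers, margin=4.0, alpha=0.1):
    """Return one boolean per frame: True where the frame is judged to be voice.

    margin and alpha are illustrative values, not taken from the source.
    """
    noise = frame_powers[0]              # initial background-noise estimate
    flags = []
    for power in frame_powers:
        threshold = noise * margin       # threshold tracks the noise estimate
        is_voice = power > threshold
        if not is_voice:
            # Assumption: refresh the noise estimate only on non-voice frames.
            noise = (1 - alpha) * noise + alpha * power
        flags.append(is_voice)
    return flags


# The weak frame with power 3.0 (for example an initial consonant) stays
# below the threshold, which is the onset clipping the buffer compensates for.
flags = detect_voice([1.0, 1.0, 1.2, 3.0, 20.0, 25.0, 1.0])
```

In this sketch the low-power onset frame is not detected, so a detector of this kind on its own would clip the beginning of the word.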
[0012]
While the speaker is not speaking, the input voice is stored in the ring buffer 2 continuously, always retaining the past data. When the voice detection unit 1 detects voice, a notification is sent to the ring buffer control unit 3 and the speech speed control unit 4. Because the ring buffer control unit 3 keeps track of the pointer at which voice data is currently being written and the pointer at which past voice data was written, it notifies the speech speed control unit 4 that past voice data exists and how much past data is stored.
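A ring buffer with an input pointer and an output pointer, as described above, might be sketched as follows. The class and method names are hypothetical, and overwriting the oldest frame when the buffer is full is an assumption about how "always retaining the past data" is realized.

```python
class RingBuffer:
    """Fixed-size frame store with an input pointer and an output pointer."""

    def __init__(self, capacity):
        self.frames = [None] * capacity
        self.capacity = capacity
        self.in_ptr = 0     # next position to write
        self.out_ptr = 0    # next position to read
        self.count = 0      # frames stored and not yet read

    def write(self, frame):
        self.frames[self.in_ptr] = frame
        self.in_ptr = (self.in_ptr + 1) % self.capacity
        if self.count == self.capacity:
            # Buffer full: the oldest frame was overwritten, so skip past it.
            self.out_ptr = (self.out_ptr + 1) % self.capacity
        else:
            self.count += 1

    def backlog(self):
        """Number of past frames available, as reported to the speed control."""
        return self.count

    def read(self):
        if self.count == 0:
            raise IndexError("no stored frames")
        frame = self.frames[self.out_ptr]
        self.out_ptr = (self.out_ptr + 1) % self.capacity
        self.count -= 1
        return frame


rb = RingBuffer(4)
for frame in range(6):      # frames 0 and 1 are overwritten
    rb.write(frame)
```

After the six writes, the buffer holds frames 2 to 5, so `rb.backlog()` reports four frames and the first `rb.read()` returns frame 2.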
[0013]
Based on the data received from the ring buffer control unit 3, the speech speed control unit 4 notifies the speech speed converter 5 of the speech speed, so that all the past data is played within a specified time and the pointer to the past input voice data catches up with the pointer to the input voice data currently being written.
[0014]
For example, if 100 msec of past data is stored and playback is to catch up within 100 msec, the playback speed is 2x, and this value is notified to the speech speed converter 5. Alternatively, it is also conceivable not to set a target time and always to notify the speech speed converter 5 of a fixed speech speed of 2x.
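The arithmetic behind the 2x figure can be made explicit. During the catch-up time, new input keeps arriving, so the player must cover both the stored backlog and the voice that arrives in the meantime. The function name and signature below are illustrative, not from the source.

```python
def catch_up_speed(backlog_ms, target_ms):
    """Playback speed needed to clear `backlog_ms` of stored voice in `target_ms`.

    While catching up, another `target_ms` of input arrives, so the total
    material to play is backlog_ms + target_ms.
    """
    if target_ms <= 0:
        raise ValueError("target_ms must be positive")
    return (backlog_ms + target_ms) / target_ms


speed = catch_up_speed(100, 100)   # the example in the text: 2.0
```

A longer catch-up time gives a gentler speed, for example 1.5x when the same 100 msec backlog is cleared over 200 msec.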
[0015]
When a person adjusts speaking speed, the lengths of silent portions and vowels change considerably, while the speed of consonants does not change. Conversely, changing the speed of pauses and vowels without changing the speed of consonants does not greatly degrade intelligibility. In other words, the minimum perceivable duration differs from phoneme to phoneme. When the voice detection unit 1 uses phoneme recognition, varying the speech speed finely and dynamically according to the phoneme of the data being played therefore makes an even more natural call possible.
[0016]
When phoneme recognition is used, information on which span of the stored voice data corresponds to which phoneme is also stored, so the speech speed converter 5 is notified of a playback speed that guarantees the minimum duration of the phoneme in each section. However, an excessively high speed increases the sense of unnaturalness, so if the speed would exceed a predetermined maximum speed, the maximum speed is notified to the speech speed converter 5 instead. For example, this serves to prevent an utterance of "Ōgaki" (with a long first vowel) from being played back as "Ogaki".
[0017]
In addition, the minimum perceivable duration described above differs little among similar phonemes within a class such as pauses, unvoiced fricatives, plosives, or vowels. It is therefore also conceivable to determine the playback speed using a coarse classification of phonemes (pause, unvoiced fricative, plosive, vowel, and so on), which requires little processing, instead of strict phoneme recognition, which requires heavy processing.
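As a rough sketch of the rule described above, the playback speed of a segment can be derived from its duration, the minimum duration of its phoneme class, and a maximum speed. The table values, the 2x maximum, and the names `MIN_FRAMES` and `segment_speed` are hypothetical; the source gives no concrete minimum durations.

```python
# Hypothetical minimum durations, in frames, per coarse phoneme class.
MIN_FRAMES = {"pause": 1, "plosive": 2, "vowel": 3}


def segment_speed(phoneme_class, duration_frames, max_speed=2.0):
    """Fastest speed that still leaves the class's minimum duration,
    clamped to the range [1.0, max_speed]."""
    speed = duration_frames / MIN_FRAMES[phoneme_class]
    return max(1.0, min(speed, max_speed))


vowel_speed = segment_speed("vowel", 4)       # 4 frames -> 3 frames, about 1.33x
long_vowel_speed = segment_speed("vowel", 12) # would be 4x, clamped to 2.0
```

The clamp to `max_speed` is what keeps a long vowel from being compressed all the way down to its minimum duration.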
[0018]
The speech speed converter 5 takes voice frame data from the ring buffer 2 and compresses it according to the speed specified by the speech speed control unit 4, thereby reducing the amount of frame data. At the voice output, the frame data is played one sample at a time at a fixed period, so reducing the frame data increases the speech speed. Once playback catches up with the input voice data, the speech speed is set equal to that of the input voice.
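The simplest way to reduce the amount of frame data is to drop frames at regular intervals. The sketch below does only that; `thin_frames` is a hypothetical name, and the source does not say which compression method the converter actually uses, so this should be read as an illustration of the effect rather than of the device.

```python
def thin_frames(frames, speed):
    """Keep roughly len(frames) / speed frames, evenly spaced (speed >= 1)."""
    if speed < 1.0:
        raise ValueError("speed must be at least 1.0")
    if not frames:
        return []
    n_out = max(1, round(len(frames) / speed))
    # Pick n_out evenly spaced source indices.
    return [frames[int(i * len(frames) / n_out)] for i in range(n_out)]


fast = thin_frames(list(range(8)), 2.0)   # every second frame is kept
```

Because the output plays frames at a fixed period, halving the number of frames halves the playback time, which is the 2x case.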
[0019]
FIG. 2 shows an example of how the speech speed changes, and of the voice that is played back, when "Kitami" is uttered. FIG. 2(a) shows the case where voice is detected by power alone, without phoneme recognition, and playback proceeds at a fixed speed so as to catch up with the current voice data. FIG. 2(b) shows the case where phoneme recognition is performed and playback proceeds at a variable speed that changes with the type of phoneme. For convenience, the unit of voice data is called a frame. For simplicity, the speed change is represented by thinning out voice frames. "*" denotes a silent portion.
[0020]
In FIG. 2(a), the played voice reaches the input voice at the "a" of frame 15. Until that point playback runs at 2x speed, and thereafter at the same speed as the input voice. In FIG. 2(b), the "*" in frames 1 to 2 is the silent portion preceding a plosive, so it is played at a speed that skips one frame. The "k" in frames 3 to 4 is a consonant, so it is played as is. The "i" in frames 5 to 8 is played as, for example, 3 frames, to secure the number of frames a person needs to perceive a vowel. The "*" in frames 9 to 10 is reduced to 1 frame. The "t" in frames 11 to 12 is a consonant, so it is played as is. The "a" in frames 13 to 17 is reduced from 5 frames to 3 for the same reason as the "i". Beyond this point no past voice data remains, so the voice is played as is.
[0021]
As FIG. 2 shows, the input voice frame and the played voice frame coincide at the end of the utterance, so there is no voice delay there. In addition, when the end of the voice section is detected, the ring buffer control unit 3 stops the pointer to the output voice data (hereinafter, the output pointer). When the pointer to the input voice data (hereinafter, the input pointer) catches up with the output pointer, the output pointer is advanced again. This prevents the same voice from being played twice when a new voice section is detected immediately after the end of voice was detected and playback finished.
[0022]
FIG. 3 shows an example of control in the case where voice detection is performed only with power without phoneme recognition, and FIG. 4 shows an example of control in the case where phoneme recognition is performed and the reproduction speed is changed depending on the type of phoneme.
[0023]
When voice detection uses voice (acoustic) power alone, without phoneme recognition, the voice detection unit 1 first detects a voice section by comparing the voice power with a threshold (S1). The ring buffer control unit 3 stores the input voice in the ring buffer 2 at all times. On receiving the notification from the voice detection unit 1, the speech speed control unit 4 sets the target speech speed in a speech speed register (not shown) that the speech speed converter 5 refers to (S2).
[0024]
Under the control of the speech speed control unit 4, the speech speed converter 5 fetches voice frame data from the ring buffer 2 (S3) and performs speech speed conversion according to the value in the speech speed register (S4). It writes the converted voice frame data to an output buffer (not shown) (S5) and increments the output pointer of the ring buffer 2 (S6).
[0025]
A check is then made as to whether the input pointer and the output pointer of the ring buffer 2 have met (S7). If they have not, the process returns to step S3 and playback at the faster target speech speed is repeated. If they have, normal speed is set in the speech speed register (S8) and the process returns to step S3, so that the voice is played at the same speed as the input voice. This processing is repeated until the voice section ends.
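The loop of steps S2 to S8 can be simulated with plain lists. This is a simplified sketch under several assumptions: one input frame arrives per tick, one frame is output per tick, "fast" playback is modeled as advancing the output pointer by `fast_speed` positions (frame thinning), and the names `play_with_catch_up`, `onset_index`, and `lookback` are hypothetical.

```python
def play_with_catch_up(frames, onset_index, lookback, fast_speed=2):
    """Return the frames played, one per tick, from voice detection onward.

    frames[onset_index] is the frame during which voice is detected; playback
    starts `lookback` frames earlier and runs fast until it meets the input.
    """
    out_ptr = max(0, onset_index - lookback)   # S2: start in the stored past
    speed = fast_speed
    played = []
    for in_ptr in range(onset_index, len(frames)):  # frame in_ptr arrives now
        if out_ptr >= in_ptr:                  # S7: output has met the input
            speed = 1                          # S8: switch to normal speed
            out_ptr = in_ptr
        played.append(frames[out_ptr])         # S3 to S5: fetch, convert, write
        out_ptr += speed                       # S6: advance the output pointer
    return played


played = play_with_catch_up(list(range(17)), onset_index=4, lookback=4)
```

In this run the first four ticks play every second stored frame, and from frame 8 onward the output is identical to the input, so the last frame played is the last frame received and no delay remains at the end.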
[0026]
Control for performing phoneme recognition and changing the reproduction speed according to the type of phoneme is performed as shown in FIG. In this method, a table 10 in which the minimum duration for a phoneme is stored in advance is prepared.
[0027]
First, the voice detection unit 1 performs phoneme recognition on the input voice and detects a voice section from the recognition result (S10). The voice power may also be taken into account here, so that detection of the voice section by voice power is used in combination. The ring buffer control unit 3 stores the input voice in the ring buffer 2 at all times and updates the input pointer each time. The result of phoneme recognition by the voice detection unit 1 is stored in the ring buffer 2 as well.
[0028]
When a voice section is detected, voice frame data is fetched from the ring buffer 2 (S11), and the corresponding phoneme recognition result is passed to the speech speed control unit 4. The speech speed control unit 4 determines whether the phoneme of the voice frame data about to be processed is the same as that of the voice frame data processed previously (S12). If it is the same, the process proceeds to step S14. If it is a different phoneme, the process proceeds to step S13, where the duration of the phoneme is checked, the minimum duration is read from the minimum duration table 10, and a speech speed determined so as not to exceed the predetermined maximum speech speed is set in the speech speed register (S13). The process then proceeds to step S14.
[0029]
The speech speed converter 5 performs speech speed conversion on the voice frame data fetched from the ring buffer 2 according to the value in the speech speed register (S14). It writes the converted voice frame data to an output buffer (not shown) (S15) and increments the output pointer of the ring buffer 2 (S16).
[0030]
A check is then made as to whether the input pointer and the output pointer of the ring buffer 2 have met (S17). If they have not, the process returns to step S11 and playback at variable speed is repeated. If they have, normal speed is set in the speech speed register (S18), the next voice frame data is fetched from the ring buffer 2, and the process returns to step S14, so that the voice is played at the same speed as the input voice. This processing is repeated until the voice section ends.
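The phoneme-dependent part of this flow (S12 to S14) can be sketched on a sequence of per-frame phoneme labels. The minimum-duration values, the 2x maximum speed, and the function name are hypothetical, and the sketch leaves out the pointer check of S17 and S18 by treating the whole labeled sequence as stored past voice.

```python
import math
from itertools import groupby

# Hypothetical contents of the minimum duration table 10, in frames.
MIN_FRAMES = {"*": 1, "k": 2, "t": 2, "i": 3, "a": 3}


def variable_speed_playback(labels, max_speed=2.0):
    """Shorten each run of identical phoneme labels as far as allowed."""
    out = []
    for phoneme, run in groupby(labels):       # S12: a new phoneme begins
        n_in = len(list(run))
        # S13: keep at least the minimum duration, and never compress
        # by more than max_speed.
        n_out = min(n_in, max(MIN_FRAMES[phoneme], math.ceil(n_in / max_speed)))
        out.extend([phoneme] * n_out)          # S14: thinned run
    return out


# Frame labels for "Kitami" as described for FIG. 2(b): 17 input frames.
result = "".join(variable_speed_playback(list("**kkiiii**ttaaaaa")))
```

With these assumed table values the 17 input frames become 12 output frames, with the same per-segment counts as the FIG. 2(b) description: silences reduced to 1 frame, consonants kept at 2, and vowels reduced to 3.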
[0031]
[Effect of the invention]
As described above, the present invention prevents the onset clipping that is a problem when implementing a voice switch, without introducing voice delay, and thus makes natural calls possible.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a configuration example of the present invention.
FIG. 2 is a diagram showing a state of speech speed conversion.
FIG. 3 is a diagram illustrating a control flow in a case where voice detection is performed using only power without performing phoneme recognition.
FIG. 4 is a diagram showing a control flow when the reproduction speed is changed depending on the type of phoneme.
FIG. 5 is a diagram illustrating a relationship between a sound waveform, a sound power, and an operation of a sound switch.
[Explanation of symbols]
1 Voice detection unit
2 Ring buffer
3 Ring buffer control unit
4 Speech speed control unit
5 Speech speed converter

Claims

A voice storage and playback method comprising:
a step of inputting voice;
a step of determining whether the input voice is a voice section;
a step of storing the input voice in voice storage means;
a step of, when a voice section is detected, starting playback of the stored voice from the position indicated by an output pointer of the voice storage means and, while advancing the output pointer, playing the beginning portion of the stored voice quickly and, upon catching up with the input voice, playing at normal speed; and
a step of, when the end of the voice section is detected, stopping the output pointer until voice for a predetermined time has been stored, and advancing the output pointer once that voice has been stored.

A voice storage and playback device comprising:
voice detection means for inputting voice and determining whether the input voice is a voice section;
voice storage means for storing the input voice;
voice storage control means for controlling storage of the input voice in the voice storage means and, when the voice detection means detects the end of a voice section, stopping an output pointer until voice for a predetermined time has been stored and advancing the output pointer once that voice has been stored;
speech speed conversion control means for, when voice is detected by the voice detection means, starting playback of the stored voice from the position indicated by the output pointer and, while advancing the output pointer, controlling playback so that the beginning portion of the voice stored in the voice storage means is played quickly and, upon catching up with the input voice, playback proceeds at normal speed; and
speech speed conversion means for converting the speech speed of the voice stored in the voice storage means under the control of the speech speed conversion control means.

The voice storage and playback method according to claim 1, wherein the voice section is detected based on voice power, and
the beginning portion of the stored voice is played at a fixed speed higher than the speed of the input voice.

The voice storage and playback method according to claim 1, wherein phoneme recognition is performed on the input voice, and
the beginning portion of the stored voice is played at a speed determined based on the phoneme recognition result.