JP2001142477A

JP2001142477A - Voiced sound generator and voice recognition device using it

Info

Publication number: JP2001142477A
Application number: JP32280399A
Authority: JP
Inventors: Daisuke Mori (森 大輔)
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1999-11-12
Filing date: 1999-11-12
Publication date: 2001-05-25

Abstract

PROBLEM TO BE SOLVED: To provide a voiced sound generator that has a simple configuration, requires no prior learning of the user's voice, and can generate a voiced sound from a whisper. SOLUTION: Whispering voice data output from a whispering voice input part 10 and delayed voice data output from a delay device 12 are supplied to an adder 11, which adds them and outputs the sum. Voice data circulates around a closed loop that runs from the adder 11 through the delay device 12 and back to the adder 11; on each pass, whispering voice data newly output from the whispering voice input part 10 is added to it, and voiced sound data is thereby generated.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a voiced sound forming device and a speech recognition device using it. More specifically, it relates to a voiced sound forming device that forms a voiced sound audible as normal speech from whispered speech, speech devoiced in mid-conversation, and the like (hereinafter, "whisper"), and to a speech recognition device that uses the voiced sound forming device to improve the recognition rate for whispers.

[0002]

2. Description of the Related Art

Mobile phones have spread rapidly in recent years. One of the resulting social problems is that people using mobile phones on trains speak loudly and disturb those around them. In addition, advances in speech recognition technology have begun to make voice input possible in the man-machine interfaces of word processors, personal computers, and the like; that is, commands and text can be entered by voice as well as by keyboard. However, in an environment where many personal computers are installed, such as an office, operators find one another's voice input noisy, and voice input has therefore been slow to spread.

[0003] To solve such problems, Japanese Patent Application Laid-Open No. 7-219567 proposes a device that converts a whisper into normal speech. The conventional voice conversion device described in that publication is explained below. FIG. 8 is a block diagram showing its configuration. In FIG. 8, the conventional voice conversion device includes an input voice analysis unit 511, a whisper voice analysis unit 512, a normal voice analysis unit 513, a mapping function estimation unit 514, a spectrum conversion unit 515, and a voice synthesis unit 516. The operation of the conventional voice conversion device configured in this way is described below.

[0004] First, for learning, a whisper (a specific phoneme whispered by the user) and normal speech (the same phoneme uttered normally) are input to the whisper voice analysis unit 512 and the normal voice analysis unit 513, respectively. Each of these units analyzes its input frame by frame, extracts the spectrum information of the input voice, and outputs that information to the mapping function estimation unit 514. The mapping function estimation unit 514 estimates a mapping function that relates the pair of spectrum information groups output by the whisper voice analysis unit 512 and the normal voice analysis unit 513 (that is, the spectrum information of the whisper and the spectrum information of the normal speech paired with it), and stores the mapping function together with the pair of spectrum information groups as a learning result. The spectrum information can be obtained with a known speech processing technique, for example by applying linear prediction analysis (LPC analysis) to the input voice frame by frame.
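The frame-by-frame linear prediction analysis mentioned above can be illustrated with a short sketch. It uses the standard autocorrelation method with the Levinson-Durbin recursion, which is one common way to obtain LPC coefficients (a compact description of a frame's spectral envelope); the cited publication does not say which variant it uses. The function names, the prediction order, and the test frame are assumptions made for this example.

```python
def autocorrelation(frame, max_lag):
    """Return the autocorrelation values r[0..max_lag] of one frame."""
    n = len(frame)
    return [
        sum(frame[i] * frame[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def lpc_coefficients(frame, order):
    """Levinson-Durbin recursion on the frame's autocorrelation.

    Returns (a, err): a[k - 1] is the weight on x[n - k] in the predictor
    x[n] ~ sum_k a[k - 1] * x[n - k], and err is the residual energy.
    """
    r = autocorrelation(frame, order)
    a = [0.0] * order
    err = r[0]
    for i in range(order):
        if err <= 0.0:  # silent or perfectly predictable frame: stop early
            break
        # Reflection coefficient for prediction order i + 1.
        k = (r[i + 1] - sum(a[j] * r[i - j] for j in range(i))) / err
        updated = a[:]
        updated[i] = k
        for j in range(i):
            updated[j] = a[j] - k * a[i - 1 - j]
        a = updated
        err *= 1.0 - k * k
    return a, err


# Hypothetical test frame: a decaying exponential, which a first-order
# predictor with weight 0.9 describes almost exactly.
frame = [0.9 ** n for n in range(200)]
coeffs, residual = lpc_coefficients(frame, order=2)
```

For this test frame the first coefficient comes out close to 0.9 and the second close to zero, as expected for a signal that a first-order predictor already explains.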

[0005] In this way, a whisper and the normal speech paired with it are first input to the whisper voice analysis unit 512 and the normal voice analysis unit 513, respectively, to extract a pair of spectrum information groups. The pair is then input to the mapping function estimation unit 514, which estimates a mapping function relating the two. The resulting pair of spectrum information groups and the mapping function relating them are accumulated in the mapping function estimation unit 514. By learning many such pairs of whispers and normal speech in advance, a database of learning results on the mapping between whispers and normal speech is built up in the mapping function estimation unit 514.

[0006] Once learning is complete, whispers can be converted into normal speech. When a whisper to be converted is input to the input voice analysis unit 511, it undergoes the same voice analysis as that performed by the whisper voice analysis unit 512, and its spectrum information is extracted. The spectrum information extracted by the input voice analysis unit 511 is supplied to the spectrum conversion unit 515. The spectrum conversion unit 515 maps this spectrum information using the mapping function indicated by the mapping function estimation unit 514; that is, by referring to the mapping database accumulated in the mapping function estimation unit 514, it converts the spectrum information of the whisper supplied from the input voice analysis unit 511 into the spectrum information of the normal speech paired with that whisper. The normal speech spectrum information obtained in this way is input to the voice synthesis unit 516, which synthesizes a voice having a spectrum equivalent to normal speech from the input spectrum information and outputs the synthesized voice.

[0007] Next, a conventional speech recognition device is described. FIG. 9 is a block diagram showing an example of its configuration. Such a speech recognition device is described in, for example, Japanese Patent Application Laid-Open No. 60-53998. In FIG. 9, the conventional speech recognition device includes an input voice analysis unit 523, a similarity calculation unit 524, a phoneme coefficient memory 525, a determination unit 526, and a word dictionary memory 527. The conventional speech recognition device configured in this way is described below.

[0008] First, a voice is input to the input voice analysis unit 523. Like the input voice analysis unit 511 of the conventional voice conversion device described above, the input voice analysis unit 523 applies linear prediction analysis (LPC analysis) to the input voice frame by frame, extracts spectrum information from it, and outputs the information to the similarity calculation unit 524. The phoneme memory 525 (the phoneme coefficient memory 525 of FIG. 9) stores, in advance, a plurality of phoneme names in association with the spectrum information corresponding to each phoneme. For each item of spectrum information stored in the phoneme memory 525, the similarity calculation unit 524 calculates the distance to the spectrum information from the input voice analysis unit 523 and selects the item with the shortest distance, that is, the stored spectrum information most similar to the spectrum information from the input voice analysis unit 523. It then reads the phoneme name corresponding to the selected spectrum information from the phoneme memory 525 and outputs it. The similarity calculation unit 524 thus outputs phoneme names one after another, and this series of phoneme names is supplied to the determination unit 526.
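The shortest-distance selection performed by the similarity calculation unit can be sketched as a nearest-neighbor lookup. The text does not name the distance measure, so Euclidean distance is assumed here; the memory contents, the three-element feature vectors, and the function names are hypothetical and serve only as illustration.

```python
import math

# Hypothetical phoneme memory: phoneme name -> stored spectrum information.
PHONEME_MEMORY = {
    "a": [1.0, 0.2, 0.1],
    "i": [0.1, 1.0, 0.3],
    "u": [0.2, 0.1, 1.0],
}


def spectral_distance(x, y):
    """Euclidean distance between two equal-length feature vectors (assumed measure)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))


def nearest_phoneme(spectrum, memory=PHONEME_MEMORY):
    """Return the phoneme name whose stored spectrum is closest to `spectrum`."""
    return min(memory, key=lambda name: spectral_distance(spectrum, memory[name]))


# One spectrum vector per analysis frame yields one phoneme name per frame.
frames = [[0.9, 0.3, 0.1], [0.2, 0.8, 0.4], [0.1, 0.2, 0.9]]
phoneme_names = [nearest_phoneme(f) for f in frames]
```

Running the lookup once per frame produces the series of phoneme names that is handed to the determination unit.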

[0009] The word dictionary memory 527 stores a plurality of words in association with the phoneme sequence corresponding to each word. The determination unit 526 compares the phoneme sequence formed by the series of phoneme names supplied from the similarity calculation unit 524 with each phoneme sequence stored in the word dictionary memory 527 and selects the phoneme sequence with the highest similarity. It then reads the word corresponding to the selected phoneme sequence from the word dictionary memory 527. The word read out in this way is output as the recognition result for the input voice.
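The dictionary lookup by the determination unit can be sketched in the same spirit. The text only says that the most similar phoneme sequence is selected; using edit distance (fewest insertions, deletions, and substitutions) as the similarity measure is an assumption of this example, as are the dictionary entries and function names.

```python
# Hypothetical word dictionary: word -> phoneme sequence.
WORD_DICTIONARY = {
    "aka": ["a", "k", "a"],
    "aki": ["a", "k", "i"],
    "umi": ["u", "m", "i"],
}


def edit_distance(a, b):
    """Levenshtein distance between two sequences (assumed similarity measure)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def recognize_word(phonemes, dictionary=WORD_DICTIONARY):
    """Return the word whose stored phoneme sequence is closest to `phonemes`."""
    return min(dictionary, key=lambda w: edit_distance(phonemes, dictionary[w]))


exact = recognize_word(["a", "k", "i"])
noisy = recognize_word(["u", "n", "i"])  # one misrecognized phoneme
```

With these entries, the exact sequence maps to "aki" and the sequence with one wrong phoneme still maps to "umi", which shows why a similarity search is used rather than an exact match.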

[0010]

Problems to Be Solved by the Invention

In the conventional speech recognition device described above, the input voice is subjected directly to linear prediction analysis (LPC analysis) when spectrum information is extracted, so the recognition rate is low when the input voice is a whisper. Compared with normal speech, a whisper has a low volume level, particularly in its vowel portions, so its S/N ratio (signal-to-noise level ratio) is poor; moreover, the spectrum of the vowel portions tends to be unstable during utterance. Conventionally, to recognize a whisper, the conventional voice conversion device described above, which converts a whisper into normal speech, has been placed in front of the conventional speech recognition device. That is, the whisper is converted into normal speech by the conventional voice conversion device and then input to the conventional speech recognition device.

[0011] However, to estimate the mapping function between whispers and normal speech, the conventional voice conversion device requires many pairs of a whisper and its corresponding normal speech to be input before the device is used. In other words, it takes effort to teach the device the user's voice in advance. Because circuits and memory for learning are needed, the device configuration is also complicated. In addition, when learning is performed in advance in this way, voice conversion may fail when the user changes, because voice characteristics differ from person to person. Learning the voices of many users in advance takes a great deal of effort and also requires a large amount of memory.

[0012]

SUMMARY OF THE INVENTION

Therefore, an object of the present invention is to provide a voiced sound forming device that has a simple configuration, requires no prior learning of the user's voice, and can form a voiced sound from a whisper, and to provide a speech recognition device using it.

[0013]

Means for Solving the Problems and Effects of the Invention

A first invention is a voiced sound forming device for forming, from a whisper whose spectrum has a relatively small level difference between the amplitude level at the formants and the amplitude level in bands other than the formants, a voiced sound whose spectrum has a relatively large such level difference. The device comprises: a voice input unit that receives a whisper and outputs a whisper voice signal corresponding to it; a delay unit for delaying a voice signal; and an adder that receives the whisper voice signal output from the voice input unit and the delayed voice signal output from the delay unit, adds the two, and outputs the sum. The voice signal output from the adder is input to the delay unit. Each time the voice signal circulating around the closed loop from the adder through the delay unit back to the adder completes one pass, the whisper voice signal newly output from the voice input unit is added to it, and a voiced sound signal corresponding to the voiced sound is thereby formed.

[0014] According to the first invention, a voiced sound can be formed simply by feeding the whisper voice signal into a closed loop that runs from the adder through the delay unit and back to the adder and letting it circulate there (that is, by passing the whisper voice signal through a comb filter made up of the adder and the delay unit). This removes the need for the prior learning required by the conventional voice conversion device and makes it possible to provide a voiced sound forming device with a simple configuration that is easy to build into a mobile phone or similar equipment.

[0015] A second invention is the first invention further comprising an output terminal for outputting the voice signal output from the adder to the outside as the voiced sound signal corresponding to the voiced sound.

[0016] A third invention is the first invention further comprising an output terminal for outputting the voice signal output from the delay unit to the outside as the voiced sound signal corresponding to the voiced sound.

[0017] According to the second or third invention, the voiced sound signal formed as it circulates around the closed loop can be output to the outside. Comparing the two, the second invention outputs the voiced sound signal with less delay than the third.

[0018] A fourth invention is the first invention further comprising a delay length selection unit for which a plurality of delay time lengths are prepared in advance and which selects one of them and sets it in the delay unit, wherein the delay unit delays the voice signal output from the adder by the delay time length set by the delay length selection unit.

[0019] According to the fourth invention, the delay length selection unit holds a plurality of delay time lengths prepared in advance, selects one of them, and sets the selection in the delay unit, so the pitch of the voiced sound to be formed can be chosen from the prepared options. As a result, different voiced sound frequencies can be used for different purposes. For example, a voiced sound with a relatively low pitch may be formed when the user is male, and one with a relatively high pitch when the user is female.
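A minimal sketch of such a delay length selection follows. The preset names and delay values are hypothetical (this section gives no concrete figures); the only relation relied on is the general property that a loop with delay T reinforces components at multiples of 1/T, so the fundamental of the formed sound is 1/T.

```python
# Hypothetical presets: option name -> delay time length in milliseconds.
PRESET_DELAYS_MS = {
    "low": 10.0,   # fundamental of 100 Hz
    "high": 5.0,   # fundamental of 200 Hz
}


def select_delay_ms(preset):
    """Return the delay time length to set in the delay unit for a preset."""
    return PRESET_DELAYS_MS[preset]


def pitch_hz(delay_ms):
    """Fundamental frequency of the voiced sound formed with this delay."""
    return 1000.0 / delay_ms


low_pitch = pitch_hz(select_delay_ms("low"))
high_pitch = pitch_hz(select_delay_ms("high"))
```

A lookup table of this kind keeps the selection unit trivial: choosing a voice register reduces to loading one stored number into the delay unit.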

[0020] A fifth invention is the first invention further comprising a delay length input unit that receives an instruction specifying an arbitrary pitch, calculates the delay time length needed to form a voiced sound of that pitch, and sets that delay time length in the delay unit, wherein the delay unit delays the voice signal output from the adder by the delay time length set by the delay length input unit.

[0021] According to the fifth invention, the delay length input unit receives an instruction specifying a pitch, calculates the corresponding delay time length, and sets the result in the delay unit, so the pitch of the voiced sound to be formed can be specified freely. As a result, a voiced sound whose pitch changes over time can be formed; for example, a voiced sound with a melody (musical scale) or with intonation.
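The calculation performed by the delay length input unit can be sketched as follows, assuming the delay is simply the period of the requested pitch (T = 1/f) and that a digital delay line rounds it to a whole number of samples. The function names, the 8000 Hz sampling rate, and the note frequencies are illustrative assumptions, not values from the text.

```python
def delay_ms_for_pitch(pitch_hz):
    """Delay time length in milliseconds equal to one period of the pitch."""
    if pitch_hz <= 0:
        raise ValueError("pitch must be positive")
    return 1000.0 / pitch_hz


def delay_samples_for_pitch(pitch_hz, sample_rate_hz):
    """Nearest whole-sample delay for a digital delay line (at least 1)."""
    if pitch_hz <= 0:
        raise ValueError("pitch must be positive")
    return max(1, round(sample_rate_hz / pitch_hz))


# A hypothetical three-note melody: the delay is recomputed for each note.
melody_hz = [220.0, 246.94, 261.63]
melody_delays = [delay_samples_for_pitch(f, 8000) for f in melody_hz]
```

Because the delay is rounded to whole samples, the pitch actually produced deviates slightly from the requested one; a higher sampling rate reduces that error.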

[0022] A sixth invention is a voiced sound forming device for forming, from a whisper whose spectrum has a relatively small level difference between the amplitude level at the formants and the amplitude level in bands other than the formants, a voiced sound whose spectrum has a relatively large such level difference. The device comprises: a voice input unit that receives a whisper and outputs a whisper voice signal corresponding to it; a comb filter unit whose output characteristic has output peaks appearing discretely at a constant frequency interval; and an output terminal for outputting to the outside the voiced sound signal formed when the whisper voice signal output from the voice input unit passes through the comb filter unit.

[0023] According to the sixth invention, a voiced sound signal can be formed simply by passing the whisper voice signal through the comb filter unit. This is because, in a whisper voice signal, the spectrum in bands other than the formants fluctuates relatively randomly, so few effective components pass through the comb filter unit, and the amplitude level in those bands consequently becomes low compared with the amplitude level at the formants. This removes the need for the prior learning required by the conventional voice conversion device and makes it possible to provide a voiced sound forming device with a simple configuration that is easy to build into a mobile phone or similar equipment.

[0024] A seventh invention is the sixth invention wherein the constant frequency interval is determined in relation to the pitch of the voiced sound to be formed.

[0025] According to the seventh invention, a voiced sound of a desired pitch can be formed by using a comb filter unit with a suitable output characteristic (frequency interval between output peaks). Furthermore, by using a comb filter unit whose output characteristic can be changed, the pitch of the voiced sound to be formed can also be varied over time.
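The link between the peak spacing and the pitch can be checked numerically. The sketch below evaluates the magnitude response of a feedback comb filter y[n] = x[n] + g·y[n − D], whose transfer function is 1 / (1 − g·z^(−D)). The feedback gain g = 0.9 is an assumption added so that the peaks are finite; the text describes plain addition, which corresponds to g = 1. The function name and the numeric settings are likewise illustrative.

```python
import cmath
import math


def comb_gain(freq_hz, delay_samples, sample_rate_hz, feedback=0.9):
    """Magnitude of 1 / (1 - feedback * z^-D) at the given frequency."""
    phase = 2.0 * math.pi * freq_hz * delay_samples / sample_rate_hz
    return 1.0 / abs(1.0 - feedback * cmath.exp(-1j * phase))


# With D = 80 samples at 8000 Hz the delay is 10 ms, so output peaks are
# expected every 100 Hz and troughs halfway between them.
peak_100 = comb_gain(100.0, 80, 8000)
peak_200 = comb_gain(200.0, 80, 8000)
trough_150 = comb_gain(150.0, 80, 8000)
```

The peaks sit at multiples of the sampling rate divided by D, so changing the delay moves every peak at once, which is how the pitch of the formed sound is shifted.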

[0026] An eighth invention is a voiced sound forming device for forming, from a whisper whose spectrum has a relatively small level difference between the amplitude level at the formants and the amplitude level in bands other than the formants, a voiced sound whose spectrum has a relatively large such level difference. The device comprises: a voice input unit that receives a whisper and outputs a whisper voice signal corresponding to it; and a voiced sound forming unit that forms a voiced sound signal corresponding to the voiced sound from the whisper voice signal output from the voice input unit. The voiced sound forming unit delays the whisper voice signal output from the voice input unit, adds the delayed whisper voice signal to the whisper voice signal output from the voice input unit, and, while successively delaying the summed voice signal, further adds the whisper voice signal newly output from the voice input unit, thereby forming the voiced sound signal corresponding to the voiced sound.

[0027] According to the eighth invention, a voiced sound signal can be formed from a whisper voice signal. This removes the need for the prior learning required by the conventional voice conversion device and makes it possible to provide a voiced sound forming device with a simple configuration that is easy to build into a mobile phone or similar equipment.

[0028] A ninth invention is a speech recognition device for recognizing a whisper whose spectrum has a relatively small level difference between the amplitude level at the formants and the amplitude level in bands other than the formants. The device comprises: a voice input unit that receives a whisper and outputs a whisper voice signal corresponding to it; a voiced sound forming unit that forms, from the whisper voice signal output from the voice input unit, a voiced sound signal corresponding to a voiced sound whose spectrum has a relatively large such level difference; and a speech recognition unit that recognizes the whisper on the basis of the voiced sound signal output from the voiced sound forming unit.

[0029] According to the ninth invention, the whisper voice signal to be recognized is first converted into a voiced sound signal by the voiced sound forming unit, and that voiced sound signal is then input to the speech recognition unit for recognition. A speech recognition device with an improved recognition rate for whispers can therefore be provided.

[0030] A tenth invention is the ninth invention wherein the voiced sound forming unit delays the whisper voice signal output from the voice input unit, adds the delayed whisper voice signal to the whisper voice signal output from the voice input unit, and, while successively delaying the summed voice signal, further adds the whisper voice signal newly output from the voice input unit, thereby forming the voiced sound signal corresponding to the voiced sound.

[0031] According to the tenth invention, a voiced sound signal can be formed from a whisper voice signal. The configuration of the speech recognition device is therefore simpler than when a conventional voice conversion device, which converts a whisper into a voiced sound by referring to learning results, is placed in front of the speech recognition unit.

[0032]

Embodiments of the Invention

Embodiments of the present invention are described below with reference to the drawings.

(First Embodiment) FIG. 1 is a block diagram showing the configuration of a voiced sound forming device according to a first embodiment of the present invention. In FIG. 1, the voiced sound forming device according to the first embodiment includes a whisper voice input unit 10, an adder 11, and a delay unit 12. The adder 11 and the delay unit 12 together form a comb filter unit (not shown).

[0033] The whisper voice input unit 10 consists of a microphone, an amplifier, an analog-to-digital converter, and the like; it converts an input whisper into whisper voice data and outputs the data. The adder 11 adds the output data of the whisper voice input unit 10 and the output data of the delay unit 12 and outputs the sum. The delay unit 12 delays the output data of the adder 11 by a predetermined time (T milliseconds). That is, it receives the output data of the adder 11, holds the data for the predetermined delay time length (T milliseconds), and then outputs it.

[0034] The operation of the voiced sound forming device configured in this way is described below. When the user whispers, the whisper is input to the whisper voice input unit 10, which converts it into whisper voice data. The whisper voice data output from the whisper voice input unit 10 passes through the adder 11 to the delay unit 12, is held there for the predetermined delay time length (T milliseconds), and is then fed back to the adder 11. The adder 11 adds the whisper voice data newly output from the whisper voice input unit 10 to the data fed back in this way (that is, the whisper voice data output from the whisper voice input unit 10 T milliseconds earlier).

[0035] In other words, the whisper voice data output from the whisper voice input unit 10 is input to the comb filter unit formed by the adder 11 and the delay unit 12 and circulates around a closed loop passing through the adder 11 and the delay unit 12. On each pass, whisper voice data newly output from the whisper voice input unit 10 is added to the whisper data circulating around the closed loop, and a voiced sound signal is thereby gradually formed. If the output data of the adder 11 or of the delay unit 12 is tapped at this point, the voiced sound signal formed from the whisper voice signal is obtained.
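The closed loop of adder 11 and delay unit 12 can be sketched as a sample-by-sample loop implementing y[n] = x[n] + g·y[n − D]. This is a minimal illustration rather than the embodiment's actual circuit: the function name, the use of a Python deque as the delay line, and the `feedback` parameter are assumptions. The text describes plain addition, which corresponds to `feedback=1.0`; values below 1.0 are an added option that keeps the circulating signal from growing without bound.

```python
from collections import deque


def form_voiced(whisper, delay_samples, feedback=1.0):
    """Run whisper samples through the adder/delay loop.

    Returns (adder_out, delay_out): the signal tapped at the adder output
    and the signal tapped at the delay unit output.
    """
    if delay_samples < 1:
        raise ValueError("delay_samples must be at least 1")
    # Delay line pre-filled with silence; index 0 is the oldest sample.
    delay_line = deque([0.0] * delay_samples, maxlen=delay_samples)
    adder_out, delay_out = [], []
    for x in whisper:
        delayed = delay_line[0]      # adder output from D samples ago
        y = x + feedback * delayed   # adder: new input plus fed-back data
        delay_line.append(y)         # oldest sample drops out automatically
        adder_out.append(y)
        delay_out.append(delayed)
    return adder_out, delay_out


# A single impulse keeps reappearing every D samples as it circulates.
impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
at_adder, at_delay = form_voiced(impulse, delay_samples=3)
damped, _ = form_voiced(impulse, delay_samples=3, feedback=0.5)
```

The two returned lists correspond to the two tap points mentioned in the text; the delay unit tap carries the same repeating pattern but starts D samples later.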

[0036] Next, the principle by which a voiced sound is formed from a whisper in the comb filter unit is explained in more detail. FIG. 2 schematically shows the principle by which a voiced sound is formed from a whisper in the comb filter unit consisting of the adder 11 and the delay unit 12 of FIG. 1. In FIG. 2, (a) shows the frequency-amplitude spectrum of a whisper, (b) that of the comb filter unit, and (c) that of the voiced sound formed in the comb filter.

[0037] Some supplementary explanation of the spectrum of normal speech and of the spectrum (output characteristic) of a conventional comb filter is in order. Each vowel has its own characteristic frequency components, and the bands of these characteristic components are generally called formants. The spectrum of normal speech is continuous and has prominent peaks at the formants. A conventional comb filter composed of a delay unit and an adder, on the other hand, is used, for example, in television receiver circuits to separate the luminance signal from the chrominance signal and, as its name suggests, has a spectrum shaped like the teeth of a comb.

[0038] Returning to FIG. 2, the frequency-amplitude spectrum of a whisper is continuous, as shown in (a), and has peaks at the formants (denoted F1 and F2 in the figure). Unlike the spectrum of normal speech, however, the formant peaks are not very pronounced; that is, the difference between the peak level at a formant and the level in the bands outside the formants is relatively small compared with normal speech.

[0039] In the spectrum of the comb filter unit composed of the adder 11 and the delay unit 12, on the other hand, peaks appear discretely at regular frequency intervals, as shown in (b). The frequency interval between adjacent peaks is the value F [Hz] determined by the delay time length of the delay unit 12 (= T milliseconds); F is the reciprocal of T, with T expressed in seconds.
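The relation between the delay and the peak spacing can be checked with a short derivation. The following is a sketch under stated assumptions: the loop is modelled as a discrete-time feedback comb with sampling frequency f_s, a delay of D = f_s T samples, and a loop gain g. The gain g is not part of the text, which describes plain addition (g = 1).

```latex
\begin{align}
y[n] &= x[n] + g\,y[n-D], \qquad D = f_s T \\
H(z) &= \frac{1}{1 - g\,z^{-D}} \\
\left|H\!\left(e^{j 2\pi f / f_s}\right)\right|^2
     &= \frac{1}{1 - 2g\cos(2\pi f T) + g^2} \\
\text{peaks where } \cos(2\pi f T) = 1:\quad
f_k &= \frac{k}{T} = kF, \qquad k = 0, 1, 2, \dots
\end{align}
```

For T = 10 ms = 0.010 s this gives F = 1/0.010 = 100 Hz, so the peaks fall at 0, 100, 200, ... Hz. For 0 < g < 1 the peak gain is 1/(1 − g) and the trough gain is 1/(1 + g); as g approaches 1 the peaks grow without bound.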

[0040] When a whisper with a spectrum like that of FIG. 2(a) is input to the comb filter unit with the output characteristic of (b), the formant peaks become progressively more pronounced as the whisper circulates within the comb filter. As a result, a voiced sound of pitch F [Hz] is formed whose spectrum, as shown in (c), has prominent peaks at the formants. This is because, in a whisper, the spectrum outside the formant bands fluctuates relatively randomly, so few effective components pass through the comb filter unit; consequently the amplitude level in the bands outside the formants becomes lower than the amplitude level at the formants.

[0041] Next, the delay unit 12 is described in detail. The delay unit 12 is implemented with a shift register or a random access memory. If the sampling frequency of the whole voiced sound forming apparatus of this embodiment is, for example, 10 [kHz] and the delay time length T is 10 milliseconds, the pitch F of the voiced sound formed is 100 [Hz]. The delay unit 12 then uses 100 memory locations (words). If T is 10.1 milliseconds, 101 memory locations (words) are used.
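The arithmetic in this paragraph can be reproduced with a small helper. This is only an illustrative sketch; the function names are assumptions, and `Fraction` is used so that values such as 10.1 ms do not pick up floating-point rounding.

```python
from fractions import Fraction


def delay_length(sampling_hz, delay_ms):
    """Number of memory words needed to hold delay_ms of signal."""
    return Fraction(str(sampling_hz)) * Fraction(str(delay_ms)) / 1000


def pitch_hz(delay_ms):
    """Pitch F of the formed voiced sound: the reciprocal of the delay T."""
    return 1000 / Fraction(str(delay_ms))
```

With a 10 kHz sampling frequency, `delay_length(10_000, 10)` gives 100 words and `delay_length(10_000, 10.1)` gives 101 words, matching the figures in the text, while `pitch_hz(10)` gives 100 Hz.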

[0042] If T is 10.05 milliseconds, the delay time length T can no longer be represented by an integer number of memory locations. In that case, it suffices to interpolate between the data stored in the 100th memory location and the data stored in the 101st. Alternatively, the phase delay of an all-pass filter may be controlled. Realizing a desired delay even when the required number of memory locations is fractional rather than an integer is a well-known technique.

[0043] FIG. 3 is a block diagram showing an example of the configuration of the delay unit 12 shown in FIG. 1 (when a random access memory is used). In FIG. 3, the delay unit 12 includes a distributor 40, an address generator 41, a writer 42, a delay memory 50 implemented with a random access memory, a first reader 52, a second reader 53, adders 60 and 63, and multipliers 61 and 62.

[0044] When the sampling frequency is 10 [kHz] as above and the predetermined delay time length T is 10.05 milliseconds, the delay length L required in terms of memory data is 100.5. When the voiced sound forming operation starts, the distributor 40 outputs the integer part M (= 100) of the delay length L to the address generator 41, the fractional part α of the delay length L to the multiplier 61, and the complement (1−α) of the fractional part to the multiplier 62.

[0045] Every sampling period (= 0.1 millisecond), the address generator 41 outputs a write address WP (= address i) and a read address RP (= address i−M). The write address WP is supplied to the writer 42, and the read address RP is supplied to the first reader 52 and the adder 60. The adder 60 adds 1 to the read address RP and supplies the result (RP+1) to the second reader 53.

[0046] The writer 42 writes the output of the adder 11 of FIG. 1 into the delay memory 50 at the supplied write address WP. At the same time, the first reader 52 reads the data stored at the supplied read address RP from the delay memory 50 and outputs it to the multiplier 61, while the second reader 53 reads the data stored at the supplied read address (RP+1) from the delay memory 50 and outputs it to the multiplier 62.

[0047] The multiplier 61 multiplies the data supplied by the first reader 52 by the fractional part α of the delay length supplied by the distributor 40, while the multiplier 62 multiplies the data supplied by the second reader 53 by the complement (1−α) of the fractional part supplied by the distributor 40. The adder 63 adds the product from the multiplier 61 to the product from the multiplier 62, and the sum becomes the output of the delay unit 12 of FIG. 1. In this way, the multipliers 61 and 62 and the adder 63 perform an interpolation based on the fractional part α of the delay length, using the data at the integer delay M (= 100) and at the integer delay M+1 (= 101).
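The interpolation between the two stored samples can be sketched as follows. This is a hedged illustration using standard linear interpolation, not a transcription of FIG. 3: the function name, the list-based `history` argument, and the choice to weight the older sample by α and the newer one by (1−α) are assumptions. For the α = 0.5 example in the text both weights are equal, so the assignment makes no difference there.

```python
import math


def fractional_delay_read(history, delay_length):
    """Read a sample delay_length steps old, where history[-1] is the newest.

    Assumes standard linear interpolation between the samples that are
    M and M + 1 steps old (M = integer part, alpha = fractional part).
    """
    integer_part = math.floor(delay_length)   # M
    fraction = delay_length - integer_part    # alpha
    newer = history[-1 - integer_part]        # delayed by M samples
    older = history[-2 - integer_part]        # delayed by M + 1 samples
    return (1.0 - fraction) * newer + fraction * older
```

On a ramp signal, where sample n has the value n, reading 100.5 steps back from sample 199 returns 98.5, which is the value the ramp would have at that fractional position.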

[0048] Because the address generator 41 increments the write address WP and the read address RP by 1 every sampling period (= 0.1 millisecond), the delay unit 12 always outputs data corresponding to the data supplied by the adder 11 one delay time length T (= 10.05 milliseconds) earlier. Normally, when the write address WP or the read address RP reaches a predetermined value, the address generator 41 resets that address. Using the delay memory 50 as a so-called ring memory in this way makes effective use of its finite storage capacity.
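The ring-memory addressing can be sketched like this. It is a simplified illustration with an integer delay only; the class name `RingDelay` and its interface are assumptions, and the modulo operation stands in for the address reset described in the text.

```python
class RingDelay:
    """Integer-sample delay built on a fixed-size ring memory (illustrative)."""

    def __init__(self, size, delay):
        if not 0 < delay < size:
            raise ValueError("delay must satisfy 0 < delay < size")
        self.memory = [0.0] * size
        self.size = size
        self.delay = delay
        self.write_pos = 0  # plays the role of the write address WP

    def step(self, sample):
        """Store one sample and return the sample written `delay` steps ago."""
        self.memory[self.write_pos] = sample
        # Read address RP = WP - M, wrapped into the memory range.
        read_pos = (self.write_pos - self.delay) % self.size
        out = self.memory[read_pos]
        # Advance by one per sampling period, wrapping at the end.
        self.write_pos = (self.write_pos + 1) % self.size
        return out
```

An 8-word memory with a delay of 3 keeps working after the write address wraps past the end of the memory, which is the point of treating it as a ring.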

[0049] As described above, according to this embodiment, a voiced sound can be formed simply by passing the whisper input to the whispered voice input unit 10 through the comb filter unit composed of the adder 11 and the delay unit 12. This makes it possible to provide a voiced sound forming apparatus of simple configuration that does not require prior learning, unlike conventional voice conversion apparatuses, and that is easy to incorporate into mobile phones and similar devices.

[0050] The voiced sound formed while circulating through the comb filter unit is obtained by tapping either the output of the adder 11 or the output of the delay unit 12. Tapping the output of the adder 11 gives a smaller delay in the output timing of the voiced sound, so where that delay is a concern, it is preferable to take the output of the adder 11 as the voiced sound output.

[0051] (Second Embodiment) FIG. 4 is a block diagram showing the configuration of a voiced sound forming apparatus according to a second embodiment of the present invention. In FIG. 4, the voiced sound forming apparatus according to the second embodiment includes a whispered voice input unit 10, an adder 11, a delay unit 12, and a delay length selection unit 13. On receiving a delay length selection instruction, the delay length selection unit 13 selects one of a plurality of predetermined delay lengths (here, two) and sets that delay length in the delay unit 12. The whispered voice input unit 10, the adder 11, and the delay unit 12 are the same as those in FIG. 1.

[0052] The delay length selection unit 13 is provided in advance with a plurality of choices for the delay time length T (here, two: T1 and T2). Here, the delay length selection unit 13 is implemented by a switch (not shown) for switching between T1 and T2, and the user operates this switch to select either T1 or T2. In addition, the delay length selection unit 13 has a memory (not shown) in which the delay length L corresponding to each delay time length (here, L1 corresponding to T1 and L2 corresponding to T2) is stored in advance. The delay length L can be obtained from the delay time length T, the sampling frequency of the whole voiced sound forming apparatus, and the pitch of the voiced sound to be formed.

[0053] When a delay length selection is instructed, the delay length selection unit 13 first selects T1 (= 10 milliseconds) or T2 (= 5 milliseconds) as the delay time length T according to the instruction, then reads the delay length L corresponding to the selection from the memory and outputs it to the delay unit 12. The delay unit 12 thereby applies the delay time length corresponding to the set delay length L to the data supplied by the adder 11 before outputting it; that is, it holds the data supplied by the adder 11 for the delay time length corresponding to the set delay length L and then outputs it back to the adder 11.

[0054] After the delay length has been set, the voiced sound forming apparatus operates as in the first embodiment, except that the delay unit 12 operates on the basis of the delay length L selectively set by the delay length selection unit 13. That is, the whispered voice data output from the whispered voice input unit 10 is input to the comb filter unit composed of the adder 11 and the delay unit 12 and circulates there. On each pass, the whispered voice data newly output from the whispered voice input unit 10 is added, and a voiced sound is thereby formed. The voiced sound formed in this way has the pitch corresponding to the delay length selectively set through the delay length selection unit 13: 100 [Hz] when T1 is selected as the delay time length, and 200 [Hz] when T2 is selected.
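The selection step amounts to a table lookup, which can be sketched as below. The names are illustrative assumptions, and the stored values L1 = 100 and L2 = 50 assume that the 10 kHz sampling frequency of the first embodiment also applies here (10 kHz × 10 ms and 10 kHz × 5 ms).

```python
SAMPLING_HZ = 10_000  # assumed: carried over from the first embodiment

# Pre-stored delay lengths in samples: L1 for T1 = 10 ms, L2 for T2 = 5 ms.
DELAY_LENGTHS = {"T1": 100, "T2": 50}


def select_delay_length(choice):
    """Return the stored delay length for the switch position 'T1' or 'T2'."""
    return DELAY_LENGTHS[choice]


def pitch_for(choice):
    """Pitch in Hz of the voiced sound formed with the selected delay length."""
    return SAMPLING_HZ / select_delay_length(choice)
```

Under these assumptions `pitch_for("T1")` is 100 Hz and `pitch_for("T2")` is 200 Hz, the two pitches named in the text.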

[0055] As described above, according to this embodiment, a voiced sound can be formed simply by passing the whisper input to the whispered voice input unit 10 through the comb filter unit composed of the adder 11 and the delay unit 12. This makes it possible to provide a voiced sound forming apparatus of simple configuration that does not require prior learning, unlike conventional voice conversion apparatuses, and that is easy to incorporate into mobile phones and similar devices.

[0056] Furthermore, because the delay length selection unit 13 stores a plurality of delay lengths in advance, selects one of them, and sets the selection in the delay unit 12, the pitch of the voiced sound to be formed can be chosen from options prepared in advance. For example, a 100 [Hz] voiced sound can be formed when the user is male and a 200 [Hz] voiced sound when the user is female.

[0057] Although this embodiment switches between voiced sounds of two pitches, more pitches can be made available by increasing the number of options.

[0058] (Third Embodiment) FIG. 5 is a block diagram showing the configuration of a voiced sound forming apparatus according to a third embodiment of the present invention. In FIG. 5, the voiced sound forming apparatus according to the third embodiment includes a whispered voice input unit 10, an adder 11, a delay unit 22, and a delay length input unit 23. The delay length input unit 23 receives an instruction specifying the pitch of the voiced sound to be formed, calculates the delay length needed to obtain a voiced sound of the specified pitch, and sets the result in the delay unit 22.

[0059] The function of the delay length input unit 23 can be realized, for example, with the keyboard of an electronic musical instrument, a microprocessor, and a program memory. The program memory stores a program for calculating the delay length corresponding to a pitch. When the user specifies a pitch through the keyboard, the microprocessor calculates the delay length by operating according to this program.

[0060] The delay unit 22 receives the output data of the adder 11, holds it for the delay time length set by the delay length input unit 23, and then outputs it (the differences from the delay unit 12 of FIG. 1 are described later). The whispered voice input unit 10 and the adder 11 are the same as those in FIG. 1.

[0061] The operation of the voiced sound forming apparatus configured as described above is as follows. When an instruction specifying the pitch of the voiced sound to be output is input to the delay length input unit 23, the delay length input unit 23 calculates the delay length needed to form a voiced sound of the specified pitch (= F [Hz]). This delay length L can be calculated by dividing the sampling frequency of the whole voiced sound forming apparatus by the pitch F [Hz].
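The calculation performed by the delay length input unit is a single division, sketched here together with a helper that converts a sequence of pitches into delay lengths. The function names and the example pitches are illustrative assumptions.

```python
def delay_length_for_pitch(sampling_hz, pitch_hz):
    """Delay length L in samples: the sampling frequency divided by the pitch F."""
    if pitch_hz <= 0:
        raise ValueError("pitch must be positive")
    return sampling_hz / pitch_hz


def delay_lengths_for_melody(sampling_hz, pitches_hz):
    """Delay lengths for a sequence of pitches, e.g. notes played on a keyboard."""
    return [delay_length_for_pitch(sampling_hz, f) for f in pitches_hz]
```

At a 10 kHz sampling frequency, pitches of 100, 125 and 200 Hz map to delay lengths of 100, 80 and 50 samples. Pitches that do not divide the sampling frequency evenly give fractional delay lengths, which is where interpolation in the delay unit comes in.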

[0062] The delay length input unit 23 sets the delay length L calculated in this way in the delay unit 22. The delay unit 22 thereby applies the delay time length corresponding to the set delay length L to the data supplied by the adder 11 before outputting it; that is, it holds the data supplied by the adder 11 for the delay time length corresponding to the set delay length L and then outputs it back to the adder 11.

[0063] After the delay length has been set, the voiced sound forming apparatus operates as in the first embodiment, except that the delay unit 22 operates on the basis of the delay length L arbitrarily set by the delay length input unit 23. That is, the whispered voice data output from the whispered voice input unit 10 is input to the comb filter unit composed of the adder 11 and the delay unit 22 and circulates there. On each pass, new whispered voice data from the whispered voice input unit 10 is added, and a voiced sound is thereby formed. The voiced sound formed in this way has the pitch corresponding to the delay length arbitrarily set by the delay length input unit 23.

[0064] To form a voiced sound of the desired pitch, the delay unit 22 is configured as follows. FIG. 6 is a block diagram showing an example of the configuration of the delay unit 22 shown in FIG. 5 (when a random access memory is used). In FIG. 6, the delay unit 22 includes a distributor 40, an address generator 41, a writer 42, a delay memory 54 implemented with a random access memory, a first reader 52, a second reader 53, adders 60 and 63, and multipliers 61 and 62.

[0065] That is, the delay unit 22 shown in FIG. 6 is the same as the delay unit 12 shown in FIG. 3 except that it has a delay memory 54 in place of the delay memory 50. The delay memory 50 and the delay memory 54 differ in the minimum memory area they reserve. In the delay memory 50, the memory area (M+1) from address (i−M−1) to address i is reserved as a minimum, whereas in the delay memory 54 the memory area (LL) from address (i−N) to address i is reserved as a minimum.

[0066] The memory area LL is obtained by dividing the sampling frequency of the whole apparatus by FL, where FL is the lowest pitch that can be specified to the delay length input unit 23. For example, when the lowest pitch FL is 50 [Hz] and the sampling frequency of the whole apparatus is 10 [kHz], the memory area LL is 200. In the delay unit 22, the memory area LL is always reserved as the minimum required memory area during the voiced sound forming operation. As a result, even if the pitch of the voiced sound being formed is changed during the voiced sound forming operation, the data supplied by the adder 11 can be held for the delay time length corresponding to the new pitch.
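The benefit of reserving the memory for the lowest pitch up front can be sketched as follows. This is a simplified illustration, not the circuit of FIG. 6: the class name `RetunableDelay` is an assumption, the delay is rounded to whole samples rather than interpolated, and one extra word is allocated so that the longest delay can be read after the current sample has been written.

```python
import math


class RetunableDelay:
    """Delay whose length can change while running (illustrative sketch)."""

    def __init__(self, sampling_hz, lowest_pitch_hz):
        self.sampling_hz = sampling_hz
        # LL: the memory area needed for the lowest selectable pitch FL.
        self.capacity = math.ceil(sampling_hz / lowest_pitch_hz)
        self.memory = [0.0] * (self.capacity + 1)
        self.write_pos = 0
        self.delay = self.capacity

    def set_pitch(self, pitch_hz):
        """Retune to a new pitch; only the read offset changes."""
        delay = round(self.sampling_hz / pitch_hz)
        if not 1 <= delay <= self.capacity:
            raise ValueError("pitch outside the supported range")
        self.delay = delay

    def step(self, sample):
        """Store one sample and return the one written `delay` steps ago."""
        size = len(self.memory)
        self.memory[self.write_pos] = sample
        out = self.memory[(self.write_pos - self.delay) % size]
        self.write_pos = (self.write_pos + 1) % size
        return out
```

Because the memory always holds the most recent LL samples, switching from 100 Hz to 200 Hz in mid-stream immediately returns the sample from 50 steps back without reallocating or clearing anything.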

[0067] As described above, according to this embodiment, a voiced sound can be formed simply by passing the whisper input to the whispered voice input unit 10 through the comb filter unit composed of the adder 11 and the delay unit 22. This makes it possible to provide a voiced sound forming apparatus of simple configuration that does not require prior learning, unlike conventional voice conversion apparatuses, and that is easy to incorporate into mobile phones and similar devices.

[0068] Furthermore, because the delay length input unit 23 receives an instruction specifying a pitch, calculates the delay length, and sets the result in the delay unit 22, the pitch of the voiced sound to be formed can be specified arbitrarily. For example, it becomes possible to form a voiced sound whose pitch changes over time, that is, a voiced sound with a melody (musical scale) or with intonation.

[0069] (Fourth Embodiment) FIG. 7 is a block diagram showing the configuration of a voice recognition apparatus according to a fourth embodiment of the present invention. In FIG. 7, the voice recognition apparatus according to the fourth embodiment includes a voiced sound forming unit 1 and a voice recognition unit 2. The voiced sound forming unit 1 has the same configuration as any one of the voiced sound forming apparatuses according to the first to third embodiments. The voice recognition unit 2 has, for example, the same configuration as the conventional voice recognition apparatus shown in FIG. 9.

[0070] A whisper is input to the voiced sound forming unit 1, which forms a voiced sound from it by the same operation as in the first to third embodiments. Compared with the whisper, the voiced sound formed in this way has a more stable vowel spectrum and a higher S/N ratio. The voiced sound formed by the voiced sound forming unit 1 is input to the voice recognition unit 2, which recognizes the input voice. The recognition operation performed by the voice recognition unit 2 is the same as in the conventional apparatus, but because the input is not the whisper itself but the voiced sound formed from it as described above, a higher recognition rate is obtained.

[0071] The voiced sound forming unit 1 may also output, in addition to the voiced sound, the delay length L generated in the course of forming it, so that both the voiced sound and the delay length L are input to the voice recognition unit 2. The analysis performed by the input voice analysis unit 523 of the voice recognition unit 2 (see FIG. 9) is then carried out on frames whose size is synchronized with the pitch of the input voice, which reduces analysis error.
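One way to picture pitch-synchronized framing is to cut the signal into frames of one pitch period each, using the delay length L passed along with the voiced sound. This is only a sketch under assumptions: the text does not state how the frame size is derived from L, so the one-period frame, the rounding of L to whole samples, and the dropping of a trailing partial frame are all choices made for illustration.

```python
def pitch_synchronous_frames(samples, delay_length):
    """Split samples into consecutive frames of one pitch period each.

    Assumes frame size = delay length L rounded to whole samples; a
    trailing partial frame is dropped.
    """
    frame_size = max(1, round(delay_length))
    last_start = len(samples) - frame_size
    return [samples[start:start + frame_size]
            for start in range(0, last_start + 1, frame_size)]
```

With L = 100, a 250-sample signal yields two full frames of 100 samples, each covering exactly one period of the formed voiced sound.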

[0072] As described above, according to this embodiment, the whisper to be recognized (whispered voice signal) is first converted into a voiced sound (voiced sound signal) by the voiced sound forming unit 1, and that voiced sound (voiced sound signal) is then input to the voice recognition unit 2 for recognition. This makes it possible to provide a voice recognition apparatus with an improved recognition rate for whispers. Moreover, incorporating such a voice recognition apparatus makes it possible to realize voice command input equipment that does not disturb others nearby even when used in places such as offices where many people work.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of a voiced sound forming apparatus according to a first embodiment of the present invention.

FIG. 2 is a diagram schematically illustrating the principle by which a voiced sound is formed from a whisper in the comb filter unit composed of the adder 11 and the delay unit 12 of FIG. 1.

FIG. 3 is a block diagram showing an example of the configuration of the delay unit 12 shown in FIG. 1 (when a random access memory is used).

FIG. 4 is a block diagram showing the configuration of a voiced sound forming apparatus according to a second embodiment of the present invention.

FIG. 5 is a block diagram showing the configuration of a voiced sound forming apparatus according to a third embodiment of the present invention.

FIG. 6 is a block diagram showing an example of the configuration of the delay unit 22 shown in FIG. 5 (when a random access memory is used).

FIG. 7 is a block diagram showing the configuration of a voice recognition apparatus according to a fourth embodiment of the present invention.

FIG. 8 is a block diagram showing an example of the configuration of a conventional voice conversion apparatus.

FIG. 9 is a block diagram showing an example of the configuration of a conventional voice recognition apparatus.

[Explanation of symbols]

1 ... Voiced sound forming unit
2 ... Voice recognition unit
10 ... Whispered voice input unit
11 ... Adder
12, 22 ... Delay unit
13 ... Delay length selection unit
23 ... Delay length input unit

Claims

[Claims]

1. A voiced sound forming apparatus that forms, from a whisper having a spectrum in which the level difference between the amplitude level at the formants and the amplitude level in bands other than the formants is relatively small, a voiced sound having a spectrum in which that level difference is relatively large, the apparatus comprising: a voice input unit that receives the whisper and outputs a whispered voice signal corresponding to the whisper; a delay unit that delays a voice signal; and an adder that receives the whispered voice signal output from the voice input unit and the delayed voice signal output from the delay unit, adds them together, and outputs the result, wherein the voice signal output from the adder is input to the delay unit, and, each time the voice signal circulating around the closed loop that runs from the adder through the delay unit and back to the adder completes one pass, the whispered voice signal newly output from the voice input unit is added to it, whereby a voiced sound signal corresponding to the voiced sound is formed.

2. The voiced sound forming apparatus according to claim 1, further comprising an output terminal for outputting a voice signal output from the adder as a voiced sound signal corresponding to the voiced sound to the outside.

3. The voiced sound forming device according to claim 1, further comprising an output terminal for outputting a voice signal output from the delay unit as a voiced sound signal corresponding to the voiced sound to the outside.

4. The voiced sound forming apparatus according to claim 1, further comprising a delay length selection unit in which a plurality of delay time lengths are prepared in advance and which selects one of the plurality of delay time lengths and sets it in the delay unit, wherein the delay unit delays the voice signal output from the adder by the delay time length set by the delay length selection unit.

5. The voiced sound forming device according to claim 1, further comprising a delay length input unit that receives an arbitrary pitch, calculates the delay time length required to form a voiced sound having that pitch, and sets that delay time length for the delay unit, wherein the delay unit delays the voice signal output from the adder by the delay time length set by the delay length input unit.

6. A voiced sound forming device for forming, from a whisper having a spectrum in which the level difference between the amplitude level in the formants and the amplitude level in bands other than the formants is relatively small, a voiced sound having a spectrum in which that level difference is relatively large, the device comprising: a voice input unit that receives the whisper and outputs a whispered voice signal corresponding to the whisper; a comb filter unit having an output characteristic in which output peaks appear discretely at predetermined frequency intervals; and an output terminal for outputting, to the outside, a voiced sound signal formed by passing the whispered voice signal from the voice input unit through the comb filter unit.

7. The voiced sound forming device according to claim 6, wherein the predetermined frequency interval is determined in relation to the pitch of the voiced sound to be formed.

8. A voiced sound forming device for forming, from a whisper having a spectrum in which the level difference between the amplitude level in the formants and the amplitude level in bands other than the formants is relatively small, a voiced sound having a spectrum in which that level difference is relatively large, the device comprising: a voice input unit that receives the whisper and outputs a whispered voice signal corresponding to the whisper; and a voiced sound forming unit that forms, from the whispered voice signal output from the voice input unit, a voiced sound signal corresponding to the voiced sound, wherein the voiced sound forming unit delays the whispered voice signal output from the voice input unit, adds the delayed whispered voice signal to the whispered voice signal output from the voice input unit, and, while successively delaying the voice signal resulting from the addition, further adds the whispered voice signal newly output from the voice input unit, thereby forming the voiced sound signal corresponding to the voiced sound.

9. A voice recognition device for recognizing a whisper having a spectrum in which the level difference between the amplitude level in the formants and the amplitude level in bands other than the formants is relatively small, the device comprising: a voice input unit that receives the whisper and outputs a whispered voice signal corresponding to the whisper; a voiced sound forming unit that forms, from the whispered voice signal output from the voice input unit, a voiced sound signal corresponding to a voiced sound having a spectrum in which the level difference is relatively large; and a voice recognition unit that recognizes the whisper based on the voiced sound signal output from the voiced sound forming unit.

10. The voice recognition device according to claim 9, wherein the voiced sound forming unit delays the whispered voice signal output from the voice input unit, adds the delayed whispered voice signal to the whispered voice signal output from the voice input unit, and, while successively delaying the voice signal resulting from the addition, further adds the whispered voice signal newly output from the voice input unit, thereby forming the voiced sound signal corresponding to the voiced sound.
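
For illustration only, the closed loop recited in claims 1 and 8, the pitch-to-delay calculation of claim 5, and the comb-shaped characteristic of claims 6 and 7 can be sketched in discrete time as a recursive delay-and-add loop, y[n] = x[n] + g·y[n − D]. This is a minimal sketch, not the implementation described in the specification. The sampled-signal framing, the sample rate, all function names, and the `feedback_gain` parameter g are assumptions introduced here. The claims recite plain addition, which corresponds to g = 1; a gain below 1 is used only to evaluate the frequency response, which would otherwise be unbounded at the peaks.

```python
import cmath
import math


def delay_from_pitch(pitch_hz, sample_rate_hz):
    """Delay length in samples so that one trip around the loop lasts one
    pitch period (hypothetical helper; cf. the delay length input unit)."""
    if pitch_hz <= 0 or sample_rate_hz <= 0:
        raise ValueError("pitch and sample rate must be positive")
    return max(1, round(sample_rate_hz / pitch_hz))


def form_voiced(whisper, delay_samples, feedback_gain=1.0):
    """Delay-and-add loop: y[n] = x[n] + g * y[n - D].

    feedback_gain=1.0 is the plain addition recited in the claims; any other
    value is an assumption added here for illustration.
    """
    out = []
    for n, x in enumerate(whisper):
        # Output of the delay unit: the adder output from D samples earlier.
        delayed = out[n - delay_samples] if n >= delay_samples else 0.0
        # Adder: new whispered sample plus the signal circulating in the loop.
        out.append(x + feedback_gain * delayed)
    return out


def loop_gain_at(freq_hz, delay_samples, sample_rate_hz, feedback_gain=0.9):
    """Magnitude of H = 1 / (1 - g * exp(-j*w*D)) for the loop above.

    Requires g < 1; with g = 1 the response is unbounded at the peaks.
    """
    omega = 2.0 * math.pi * freq_hz / sample_rate_hz
    return abs(1.0 / (1.0 - feedback_gain * cmath.exp(-1j * omega * delay_samples)))


if __name__ == "__main__":
    fs = 8000                      # assumed sample rate in Hz
    d = delay_from_pitch(100, fs)  # target pitch of 100 Hz
    print("delay in samples:", d)
    print("impulse response:", form_voiced([1, 0, 0, 0, 0, 0], 2))
    print("gain at 100 Hz:", loop_gain_at(100, d, fs))
    print("gain at 50 Hz:", loop_gain_at(50, d, fs))
```

Under these assumptions, a delay of D samples places the response peaks at integer multiples of fs/D, so choosing D from the desired pitch spaces the peaks at that pitch, which is the relationship claim 7 describes. With g = 1 the circulating signal never decays, so a practical design would likely need some attenuation or a bounded input length.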