JP3008404B2

JP3008404B2 - Voice recognition device

Info

Publication number: JP3008404B2
Application number: JP1055658A
Authority: JP
Inventors: 誠阿久根
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 1989-03-08
Filing date: 1989-03-08
Publication date: 2000-02-14
Anticipated expiration: 2015-02-14
Also published as: JPH02235099A

Description

[Detailed description of the invention]

[Industrial field of application]

The present invention relates to a speech recognition apparatus, and in particular to a speech recognition apparatus intended to obtain phoneme feature information useful for phoneme recognition from the temporal change in the power of input speech.

[Conventional technology]

Conventionally, in phoneme recognition and speech recognition, the temporal change in the power level of the input speech has been used as one of the parameters for discriminating voiced and unvoiced sections or for judging the presence or absence of a phoneme.

[Problems to be solved by the invention]

The above determination has been made by comparing the power level of the input speech with a predetermined threshold. For example, when the power level of the input speech rises above the threshold, the onset of speech or of a phoneme is judged to have occurred, and when it falls below the threshold, the offset of speech or of a phoneme is judged to have occurred. In other words, in the conventional technology the power of the input speech was used only to detect the presence or absence of speech or of a phoneme, and more effective use of it has been desired.

Accordingly, an object of the present invention is to provide a speech recognition apparatus that can obtain information useful for phoneme recognition from the temporal change in the power of input speech.

[Means for solving the problem]

The present invention is configured to detect a peak in the temporal change of the power of input speech; detect a peak edge, at which the power of the input speech is smaller by a predetermined amount than the power at the peak; generate parameters representing acoustic elements based on the input speech; generate phoneme boundary information based on those parameters; and recognize the phonemes of the input speech based on the parameters, with reference to the phoneme boundary information and to information on the energy concentration section determined by the peak edges.

[Action]

The power of the input speech is divided into predetermined time units; within each time unit, the temporal change in the power level is obtained and a peak is detected; then a point at which the power is smaller than the peak by a predetermined amount, that is, a peak edge, is detected. Based on the peak edges, phoneme feature information useful for phoneme recognition, such as an energy concentration section or a burst section, can be obtained, and the accuracy of phoneme recognition can be improved.

[Embodiment]

An embodiment of the present invention will be described below with reference to FIGS. 1 to 6.

FIG. 1 shows an example of a speech recognition apparatus according to the present invention.

An audio signal from a microphone 1 is supplied to an A/D conversion circuit 4 via an amplifier 2 and a low-pass filter 3. In the A/D conversion circuit 4, the audio signal is converted into, for example, a 12-bit digital audio signal at a sampling frequency of 12.5 kHz. This digital audio signal is supplied to acoustic analysis means 5.

The acoustic analysis means 5 comprises transient detection parameter generation means 51 having a band-pass filter bank, logarithmic power detection means 52 for detecting speech power, zero-cross rate calculation means 53, means 54 for calculating the first-order PARCOR coefficient, which indicates the correlation between adjacent samples, means 55 for calculating the slope of the power spectrum, and means 56 for detecting the fundamental period of the speech.

The transient detection parameter is used to detect the transient and stationary character of the input speech. It expresses the amount of change in the speech spectrum and is defined as the sum, over the channels (frequencies), of the variance in the time direction within a block. That is, the speech spectrum Si(n) is first gain-normalized by its average value Savg(n) in the frequency direction. [Equation (1), the definition of Savg(n), is not reproduced in this text.]

Here, i denotes the channel number and q the number of channels (the number of band-pass filters). The information of each of the q channels is sampled in the time direction; the block of q-channel information at a single point in time is called a frame, and n denotes the number of the frame used for recognition.

The gain-normalized speech spectrum is given by Equation (2) [not reproduced in this text], and the transient detection parameter T(n) is defined as the sum, over the channels, of the variance in the time direction within the block [n−M, n+M], which consists of the frame itself and the M frames before and after it, 2M+1 frames in total. [Equation (3), the definition of T(n), is not reproduced in this text.]

Here, the quantity given by Equation (4) [not reproduced in this text] is the average value in the time direction within the block for each channel.

In practice, changes near the center of the block [n−M, n+M] tend to pick up fluctuations of the sound or noise, so they are excluded from the calculation of the transient detection parameter T(n), and Equation (3) is modified accordingly into Equation (5) [not reproduced in this text].

In Equation (5), the transient detection parameter T(n) is obtained with a = 1, M = 28, m = 3, and q = 32. For example, for the input speech "asa" (あさ), a transient detection parameter T(n) like that shown in FIG. 2A is obtained.
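Because the equations themselves are missing from this text, the following Python sketch is only one possible reading of the verbal definition of T(n), not the patent's formula. It assumes that gain normalization means subtracting the across-channel average (as would suit log-domain channel outputs), that the "variance" is taken as a sum of squared deviations from the block average of each channel, that frames with |j| < m around the block center are the ones excluded, and that `a` acts as a plain scale factor. The function names are hypothetical.

```python
def gain_normalize(frame):
    """Subtract the across-channel average (assumed log-domain normalization)."""
    avg = sum(frame) / len(frame)
    return [s - avg for s in frame]


def transient_parameter(frames, n, M=28, m=3, a=1.0):
    """Sum over channels of the time-direction spread inside block [n-M, n+M].

    Assumptions (not confirmed by the source): frames with |j| < m around the
    block center are skipped, and `a` is applied as a simple scale factor.
    """
    if n - M < 0 or n + M >= len(frames):
        raise ValueError("block [n-M, n+M] must lie inside the recording")
    block = [gain_normalize(frames[n + j]) for j in range(-M, M + 1)]
    q = len(block[0])                      # number of channels
    total = 0.0
    for i in range(q):
        channel = [f[i] for f in block]
        mean_i = sum(channel) / len(channel)   # time-direction average
        for j in range(-M, M + 1):
            if abs(j) < m:
                continue                   # ignore frames near the block center
            total += (channel[j + M] - mean_i) ** 2
    return a * total
```

Under these assumptions a pure change in overall level leaves the result at zero, while a change in spectral shape across the block makes it positive, which matches the stated purpose of separating transient from stationary stretches.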

The other parameters, such as the logarithmic power shown in FIG. 2B, the zero-cross rate shown in FIG. 2C, the first-order PARCOR coefficient shown in FIG. 2D, the slope of the power spectrum shown in FIG. 2E, and the fundamental period shown in FIG. 2F, are computed in the same way as the transient detection parameter T(n): a window extending M frames before and after a given point in time (frame) is considered, the window is shifted successively in the time direction one sample point at a time, and the computation is carried out within each window. FIG. 2G shows the speech waveform of the input speech "asa" (あさ) together with an example of phoneme boundary candidates.
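As a rough illustration of this windowed computation, the sketch below evaluates three of the named parameters on a window that slides one sample at a time. It is a minimal sketch, not the patent's circuitry: the function names and the `half_width` argument (a window half-width counted in samples) are hypothetical, and the first-order PARCOR coefficient is computed here as the normalized lag-1 autocorrelation r(1)/r(0), whose sign convention differs between texts.

```python
import math


def log_power(window):
    """Logarithmic power of one window in dB (a small floor avoids log(0))."""
    mean_square = sum(x * x for x in window) / len(window)
    return 10.0 * math.log10(mean_square + 1e-12)


def zero_cross_rate(window):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(window, window[1:]) if (a < 0) != (b < 0))
    return crossings / (len(window) - 1)


def first_order_parcor(window):
    """Normalized lag-1 autocorrelation r(1)/r(0) of the window."""
    r0 = sum(x * x for x in window)
    r1 = sum(a * b for a, b in zip(window, window[1:]))
    return r1 / r0 if r0 else 0.0


def sliding(samples, half_width, func):
    """Apply func to the window centered on each sample, one sample per step."""
    out = []
    for c in range(half_width, len(samples) - half_width):
        out.append(func(samples[c - half_width:c + half_width + 1]))
    return out
```

For example, `sliding(samples, 64, zero_cross_rate)` would give one zero-cross rate per sample position for which a full window is available.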

Each parameter obtained from the acoustic analysis means 5 is supplied to phoneme recognition means 8 as a parameter for recognition processing. The parameters output from the means 51 to 55 are also supplied, as segmentation parameters, to feature point extraction means 61 of first segmentation means 6. In addition, the output of the band-pass filter bank in the means 51 is supplied to a peaking processing circuit 11.

As shown in FIG. 3, the peaking processing circuit 11 comprises a power calculation circuit 14, a peak detection circuit 12, and a peak edge detection circuit 13.

When the digital outputs representing the power of the input speech are supplied sequentially from the band-pass filter bank 51 to the power calculation circuit 14, the average value of the digital outputs is obtained for each frame. When these average values are supplied in time series to the peak detection circuit 12, peaks of the power level are detected along the time axis. If the resulting temporal change of the power level is, for example, as shown in FIG. 4, peaks PP1 and PP2 are detected.
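The two steps can be sketched in Python as follows. This is a minimal sketch under assumptions: the source describes detection as taking place within predetermined time units, and here the peak of a unit is simply taken to be its largest frame level; the function names and the `unit_len` parameter (time-unit length in frames) are hypothetical.

```python
def frame_power(frames):
    """Average the band-pass channel outputs of each frame."""
    return [sum(f) / len(f) for f in frames]


def detect_peaks(levels, unit_len):
    """Return the index of the largest level inside each time unit.

    Taking the maximum of each unit is an assumption; the source does not
    spell out the peak criterion.
    """
    peaks = []
    for start in range(0, len(levels), unit_len):
        unit = levels[start:start + unit_len]
        offset = max(range(len(unit)), key=unit.__getitem__)
        peaks.append(start + offset)
    return peaks
```

A real detector would probably also reject units whose maximum sits on the unit boundary or is too low to be speech; those checks are left out to keep the sketch short.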

When the peaks PP1 and PP2 and the temporal change of the power level are supplied to the peak edge detection circuit 13 in the next stage, the points on both sides of each peak PP at which the power level has dropped by a predetermined amount, for example 3 dB, are detected as peak edges PE1, PE2, and PE3, and are supplied to the phoneme recognition means 8 via a terminal 15.
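A minimal Python sketch of this edge search is given below. It assumes the power levels are already expressed in dB per frame and that an edge is the first frame, walking outward from the peak, whose level is at least the given drop below the peak; the function name and parameters are hypothetical.

```python
def peak_edges(levels_db, peak, drop_db=3.0):
    """Walk outward from `peak` to the first frame at least drop_db below it.

    Returns (left, right); an entry is None when no such frame exists on
    that side of the peak.
    """
    target = levels_db[peak] - drop_db
    left = next(
        (i for i in range(peak - 1, -1, -1) if levels_db[i] <= target), None
    )
    right = next(
        (i for i in range(peak + 1, len(levels_db)) if levels_db[i] <= target), None
    )
    return left, right
```

With levels `[0, 2, 5, 9, 10, 8, 6, 3]` and the peak at index 4, the target level is 7 dB, so the edges fall at indices 2 and 6.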

Setting the peak edges PE makes it possible to obtain phoneme feature information that corresponds to the waveform of the power level of the input speech. As shown in FIG. 4, the detection of the peaks PP and peak edges PE described above is carried out by dividing the digital output into predetermined time units T and processing each time unit T separately.

For example, in the waveform shown in FIG. 5, the points that are 3 dB below the peak PP6 are detected as peak edges PE61 and PE62. The interval between the two peak edges PE61 and PE62 is then taken as an energy concentration section TE. This is a section in which the energy of the input speech power is concentrated at a high level, and it is judged to be a phoneme section, that is, a section in which a phoneme is being produced. In this case, the region outside the energy concentration section TE can also be judged to be a silent section or a pause section. In this way, information from which the presence or absence of a phoneme can be judged is obtained.
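One way to picture this judgment is as a per-frame labeling, sketched below in Python. The function name, the label strings, and the choice to treat both edge frames as part of the section are assumptions made for illustration only.

```python
def label_frames(num_frames, sections):
    """Mark frames inside any energy concentration section as 'phoneme'.

    `sections` holds (left_edge, right_edge) frame indices; both ends are
    treated as inclusive here, which is an assumption.
    """
    labels = ["pause"] * num_frames
    for left, right in sections:
        for i in range(left, right + 1):
            labels[i] = "phoneme"
    return labels
```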

Next, in the waveform shown in FIG. 6, peak edges PE71 and PE72 are detected for the peak PP7. The interval between the peak edges PE71 and PE72 is detected as being shorter than a predetermined time length TC. When a peak PP and its peak edges PE are detected within a time shorter than the predetermined time length TC in this way, the interval between the peak edges PE71 and PE72 is judged to be a burst section TPE. This yields information from which plosives and affricates, such as those in "pa" and "pu" (パ, プ), can be identified.
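The duration test can be sketched in Python as below. The source gives no value for TC and no frame period, so `tc_ms` and `frame_period_ms` are hypothetical parameters, and treating "burst" and "energy concentration" as mutually exclusive labels is a simplification made for this sketch.

```python
def classify_section(left_edge, right_edge, frame_period_ms, tc_ms):
    """Label the span between two peak edges by its duration.

    Returns (kind, duration_ms), where kind is 'burst' when the span is
    shorter than tc_ms and 'energy_concentration' otherwise.
    """
    duration_ms = (right_edge - left_edge) * frame_period_ms
    kind = "burst" if duration_ms < tc_ms else "energy_concentration"
    return kind, duration_ms
```

For instance, with a 5 ms frame period and a 30 ms limit, edges two frames apart give `("burst", 10.0)`, while edges thirty frames apart give `("energy_concentration", 150.0)`.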

In this example, a point 3 dB below the peak PP is detected as the peak edge PE, but the invention is not limited to this; taking 3 dB as the reference, the point whose level is closer to that reference may be detected as the peak edge PE. Likewise, although this example uses the band-pass filter bank 51, the invention is not limited to this, and an FFT may be used instead.
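The "closer level" variant is described only briefly, so the Python sketch below is one interpretation: of the two frames that straddle the 3 dB crossing on one side of the peak, it returns the one whose level lies nearer to the 3 dB reference. The function name and the `step` argument (+1 to search to the right of the peak, -1 to search to the left) are hypothetical.

```python
def closest_edge(levels_db, peak, step, drop_db=3.0):
    """Pick the frame nearest in level to (peak level - drop_db) on one side.

    step=+1 searches to the right of the peak, step=-1 to the left.
    Returns None when the level never falls to the reference on that side.
    """
    target = levels_db[peak] - drop_db
    i = peak + step
    while 0 <= i < len(levels_db) and levels_db[i] > target:
        i += step
    if not 0 <= i < len(levels_db):
        return None
    prev = i - step                        # last frame still above the reference
    if prev != peak and abs(levels_db[prev] - target) < abs(levels_db[i] - target):
        return prev
    return i
```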

In this way, a peak edge PE whose power is 3 dB smaller than the peak PP obtained from the temporal change of the power level of the input speech is detected, and phoneme feature information such as the energy concentration section TE and the burst section TPE can be obtained from the peak edges PE. The information on the peaks PP and peak edges PE, together with information such as the energy concentration section TE and the burst section TPE, is supplied to the phoneme recognition means 8 as phoneme feature information. The accuracy of phoneme recognition can thereby be improved.

The first segmentation means 6 extracts general feature points in order to obtain phoneme boundary candidates from the segmentation parameters. In this example, the following seven kinds of feature points are used.

Rise point - a point at which the parameter changes from a flat portion toward increasing values
Fall point - a point at which the parameter becomes flat after decreasing
Increase change point - a point at which the rate of increase changes
Decrease change point - a point at which the rate of decrease changes
Peak point - the position of a peak
Positive zero-cross point - a point at which the parameter crosses the zero level while increasing
Negative zero-cross point - a point at which the parameter crosses the zero level while decreasing

The feature point extraction means 61 extracts feature points for each parameter with reference to feature point information from feature point information storage means 62. In each of the parameters in FIGS. 2A to 2E, the positions marked by vertical lines along the time axis are the positions of the feature points. The parameters obtained from the first segmentation means 6, with their feature points attached, are supplied to feature point integration processing means 71 of second segmentation means 7.
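Three of the seven feature point types have definitions that are unambiguous enough to sketch directly. The Python function below finds peak points and positive and negative zero-cross points in a single parameter track; the remaining types depend on flatness and slope criteria that the source does not specify, so they are omitted. The function name and label strings are hypothetical.

```python
def feature_points(values):
    """Return (index, kind) pairs for peak and zero-cross points of one track."""
    points = []
    for i in range(1, len(values)):
        prev, cur = values[i - 1], values[i]
        if prev < 0 <= cur:
            points.append((i, "positive_zero_cross"))   # crossing while rising
        elif prev >= 0 > cur:
            points.append((i, "negative_zero_cross"))   # crossing while falling
        if i + 1 < len(values) and prev < cur > values[i + 1]:
            points.append((i, "peak"))                  # strict local maximum
    return points
```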

The second segmentation means 7 comprises the feature point integration processing means 71, phoneme boundary feature detection means 72, feature point integration information storage means 73, and phoneme boundary feature information storage means 74.

Because the feature points obtained by the first segmentation means 6 may be displaced in position, or may go undetected, from one parameter to another, the feature point integration processing means 71 consolidates the feature points of the individual parameters with reference to feature point integration information from the feature point integration information storage means 73, and thereby determines the phoneme boundary candidates. The feature point integration information specifies which parameter's feature points are to be given priority.
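One simple way such a priority-based consolidation could work is sketched below in Python. The clustering rule (points closer together than a tolerance belong to the same boundary), the `tolerance` parameter, and the parameter names used as dictionary keys are all assumptions; the source states only that the integration information says which parameter takes priority.

```python
def integrate_feature_points(points_by_param, priority, tolerance):
    """Merge nearby feature points into phoneme boundary candidates.

    Points whose positions lie within `tolerance` frames of the previous
    point join the same cluster; each cluster takes the position reported
    by its highest-priority parameter (earlier in `priority` wins).
    """
    rank = {name: r for r, name in enumerate(priority)}
    tagged = sorted(
        (pos, rank[name])
        for name, positions in points_by_param.items()
        for pos in positions
    )
    candidates = []
    cluster = []
    for pos, r in tagged:
        if cluster and pos - cluster[-1][0] > tolerance:
            candidates.append(min(cluster, key=lambda p: p[1])[0])
            cluster = []
        cluster.append((pos, r))
    if cluster:
        candidates.append(min(cluster, key=lambda p: p[1])[0])
    return candidates
```

A parameter that missed a boundary simply contributes no point to that cluster, so undetected points are tolerated as long as at least one parameter reports the boundary.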

The phoneme boundary feature detection means 72 obtains the phoneme boundary feature of each phoneme boundary candidate. In this example, eight kinds of phoneme boundary features are used.

Rise from silence (S-R)
Consonantal to vocalic (C-V)
Consonantal to consonantal (C-C)
Vocalic to vocalic (V-V)
Fall to vocalic (V-F)
Vocalic to consonantal (V-C)
Fall to silence (F-S)
Sound to silence (S-S)

The phoneme boundary feature information storage means 74 stores these eight kinds of phoneme boundary feature information, and the phoneme boundary feature detection means 72 detects the phoneme boundary feature of each phoneme boundary candidate with reference to the information from the storage means 74. As a result, as shown in FIG. 2G, the phoneme boundary feature is indicated next to the vertical line of each phoneme boundary candidate.

The second segmentation means 7 outputs, as phoneme section information, the phoneme boundary candidate information together with the corresponding phoneme boundary feature information, and this phoneme section information is supplied to the phoneme recognition means 8.
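As an illustration of what such phoneme section information might look like as data, the Python sketch below pairs each boundary candidate with one of the eight feature codes and groups consecutive candidates into sections. The class names, field names, and the `sections` helper are hypothetical; the source does not describe a data format.

```python
from dataclasses import dataclass
from enum import Enum


class BoundaryFeature(Enum):
    """The eight phoneme boundary feature codes listed in the text."""
    S_R = "rise from silence"
    C_V = "consonantal to vocalic"
    C_C = "consonantal to consonantal"
    V_V = "vocalic to vocalic"
    V_F = "fall to vocalic"
    V_C = "vocalic to consonantal"
    F_S = "fall to silence"
    S_S = "sound to silence"


@dataclass(frozen=True)
class BoundaryCandidate:
    frame: int                  # position of the candidate on the time axis
    feature: BoundaryFeature    # feature detected at that candidate


def sections(candidates):
    """Pair consecutive boundary candidates into (start, end) sections."""
    ordered = sorted(candidates, key=lambda c: c.frame)
    return list(zip(ordered, ordered[1:]))
```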

The phoneme recognition means 8 treats the parameters supplied from the acoustic analysis means 5 as recognition parameters and carries out phoneme recognition while referring to the phoneme feature information from the peaking processing circuit 11 and the phoneme section information from the second segmentation means 7. The phoneme recognition means 8 outputs the recognized phoneme symbols, which are supplied to continuous speech and large-vocabulary speech recognition means in a subsequent stage.

Although this embodiment has been described as a hardware configuration, the first and second segmentation means 6 and 7, the arithmetic portion of the acoustic analysis means 5, the peaking processing circuit 11, the phoneme recognition means 8, and so on may instead be implemented by a computer.

[Effect of the invention]

According to the present invention, peaks and peak edges can be newly extracted from the temporal change in the power of the input speech, and phoneme feature information of the input speech, such as an energy concentration section or a burst section, can be obtained from the peak edges. Furthermore, because this phoneme feature information can be used in phoneme recognition, the power of the input speech is no longer limited, as in the conventional technology, to discriminating voiced from unvoiced speech and detecting the presence or absence of a phoneme, but can also contribute to improving the accuracy of phoneme recognition.

[Brief description of the drawings]

FIG. 1 is a block diagram showing an embodiment of the present invention; FIG. 2 is a set of waveform diagrams for explaining the embodiment; FIG. 3 is a block diagram showing the peaking processing circuit; and FIGS. 4 to 6 are explanatory diagrams for explaining the embodiment.

Explanation of main reference numerals in the drawings: 8: phoneme recognition means; 11: peaking processing circuit; 12: peak detection circuit; 13: peak edge detection circuit; 14: power calculation circuit.

Continuation of front page

(56) References: JP-A-62-141595 (JP, A); JP-A-59-111700 (JP, A); JP-A-60-39691 (JP, A); JP-A-63-168700 (JP, A); JP-A-59-123893 (JP, A); JP-A-61-185799 (JP, A); JP-A-58-130395 (JP, A); JP-A-2-232699 (JP, A); JP-A-1-170998 (JP, A); JP-B-62-58518 (JP, B2); JP-B-5-88840 (JP, B2)

(58) Fields investigated (Int. Cl.7, DB name): G10L 3/00 - 9/20; JICST file (JOIS)

Claims

(57) [Claims]

1. A speech recognition device comprising: peak detecting means for detecting a peak of a temporal change in the power of input speech; peak edge detecting means for detecting a peak edge at which the power of the input speech is smaller by a predetermined amount than the power at the peak; acoustic analysis means for generating parameters representing acoustic elements based on the input speech; phoneme boundary information generating means for generating phoneme boundary information based on the parameters; and phoneme recognition means for recognizing phonemes of the input speech based on the parameters, with reference to the phoneme boundary information and to information on an energy concentration section determined by the peak edges.