JPH096381A - Voice word recognition method

Info

Publication number: JPH096381A
Application number: JP7154165A
Authority: JP
Inventors: Hirooki Itou; 博起伊藤; Yoshimasa Sawada; 喜正沢田
Original assignee: Meidensha Corp; Meidensha Electric Manufacturing Co Ltd
Current assignee: Meidensha Corp; Meidensha Electric Manufacturing Co Ltd
Priority date: 1995-06-21
Filing date: 1995-06-21
Publication date: 1997-01-10

Abstract

PURPOSE: To enable silence recognition adapted to the environment, without re-learning, even when the voice input environment changes, by performing phoneme recognition processing that forcibly assigns silence data to the frames in a detected silent section. CONSTITUTION: A silent section detection part uses the power information of the input voice to obtain the logarithmic power of each frame, and detects any voice section whose logarithmic power is at or below a threshold as a silent section. The input feature vector time series is converted into a phoneme label time series by a neural-network phoneme classifier, which also recognizes silence as one phoneme. When the input environment at recognition time differs from the one used for learning, the voice waveform is affected by disturbances such as microphone noise. Silence data are therefore forcibly assigned, in the labeled phoneme recognition result, to the frames in the silent section obtained by the silent section detection part. Even when silence is not recognized as silence, the subsequent word recognition part is not affected.

Description

Detailed Description of the Invention

[0001]

[Industrial Field of Application] The present invention relates to a speech word recognition method that eliminates the environment dependence of silent portions.

[0002]

[Prior Art] The concept of current speech recognition technology is shown in FIG. 7. For word recognition, the speech waveform is sampled at fixed time intervals and converted into a time series of multidimensional feature vectors, such as spectra, before being processed. Each feature vector is associated with some phoneme label, and silence is also treated as one phoneme. Likewise, the words to be recognized are converted into time series of multidimensional feature vectors in advance and registered in the computer as standard patterns. Here too, for words in which silence occurs within the word, silence is treated as one phoneme.

[0003] In the speech recognition process, the similarity between the input feature vector time series and the feature vector time series of each standard pattern is computed for all standard patterns, and the word of the most similar standard pattern is taken as the recognized word.

[0004] In general, however, the input feature vector time series cannot be compared directly with that of a standard pattern. The time a person takes to utter a sentence or word differs between individuals, and even when the same person utters the same word, the duration varies considerably with the day and the speaker's mood. Moreover, the utterance time does not stretch or shrink uniformly; it varies nonlinearly.

[0005] In the DP matching method, dynamic programming is used to warp the time axis so that the feature vector time series of the input speech best matches that of the standard pattern, and the similarity is computed after this warping.

[0006] The concept of DP matching is shown in FIG. 8. In the figure, the horizontal axis represents the input speech and the vertical axis represents the standard pattern of a word registered in the computer in advance. Here, both the input speech and the standard pattern are assumed to be described as time series of phoneme labels rather than of feature vectors.
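As an illustration only, matching an input phoneme label sequence against registered standard patterns by dynamic programming might be sketched in Python as follows. The sketch assumes a local cost of 0 for matching labels and 1 otherwise, and a simple symmetric step pattern; neither is stated in the specification. The names `dp_distance` and `recognize`, the label "sil" for silence, and the example dictionary are hypothetical.

```python
def dp_distance(input_labels, template_labels):
    """Accumulated mismatch cost of the best DP alignment of two label sequences."""
    n, m = len(input_labels), len(template_labels)
    inf = float("inf")
    # cost[i][j]: best cost of aligning the first i input frames
    # with the first j template labels.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumed local cost: 0 if the labels agree, 1 otherwise.
            local = 0.0 if input_labels[i - 1] == template_labels[j - 1] else 1.0
            # The three predecessors let the time axis stretch or shrink nonlinearly.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognize(input_labels, templates):
    """Return the word whose standard template has the smallest DP distance."""
    return min(templates, key=lambda word: dp_distance(input_labels, templates[word]))


# Hypothetical dictionary of standard templates ("sil" marks silence).
templates = {
    "ippun": ["i", "sil", "p", "u", "n"],
    "ichi": ["i", "ch", "i"],
}
frames = ["i", "i", "sil", "sil", "p", "u", "u", "n"]
print(recognize(frames, templates))
```

Because the input is longer than the template, several input frames map onto one template label; this is the time-axis warping that a frame-by-frame comparison cannot provide.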

[0007]

[Problem to be Solved by the Invention] In the conventional speech recognition method described above, silence can also be learned as one phoneme. For example, to recognize a word such as ippun ("one minute"), in which a silent portion always exists between "i" and "pu", a template that accounts for the silence, such as i-pun, is created in advance. However, input speech data are digitized by passing the actual speech through a microphone and an A/D converter. As a result, the waveform, especially in silent portions, usually differs depending on the capture method and the characteristics of the capture equipment.

[0008] Consequently, when the input environment of the phoneme data used for learning differs from the voice input environment in which speech recognition is actually used, a silent portion may be misrecognized as another phoneme.

[0009] Therefore, to make trained neural network parameters recognize silence correctly in another environment, the only options are to prepare again exactly the same input environment as at learning time, or conversely to re-learn the silent portions on the target system. Either option takes considerable effort.

[0010] The present invention has been made in view of the above. Its object is to provide a speech word recognition method that can recognize silence in a manner adapted to the environment, without re-learning, even when the voice input environment changes.

[0011]

[Means for Solving the Problem] According to the present invention, (1) the logarithmic power P(i) of each frame of the input speech is obtained by

[0012]

[Equation 2]

[0013] and the method is characterized by performing: silent section detection processing, in which a speech section where the logarithmic power P(i) is at or below a predetermined threshold θ is taken as a silent section; phoneme recognition processing, in which feature extraction is performed on each frame by discrete Fourier transform, the input feature vector time series is converted into a phoneme label time series to recognize phonemes, and, within the phoneme recognition result, silence data are forcibly assigned to the frames in the silent section detected by the silent section detection processing; and word recognition processing, in which the similarity between the phoneme label data obtained by the phoneme recognition processing and each standard template of phoneme label time series held as a dictionary is computed by the DP matching method, and the result of the standard template with the closest similarity is output as the speech recognition result. (2) The predetermined threshold θ is set by recording speech data of phonemes with low speech power and speech data of a silent state, computing the frame power of each set of speech data, and taking the central value between the averages of the computed frame powers. (3) The method is characterized by performing: phoneme recognition processing, in which feature extraction is performed on each frame of the input speech by discrete Fourier transform and the input feature vector time series is converted into a phoneme label time series; and word recognition processing, in which the silent portions in the standard templates of phoneme label time series held as a dictionary are replaced with a phoneme as which silence is likely to be misrecognized, the similarity between the phoneme label data obtained by the phoneme recognition processing and each standard template is computed by the DP matching method, and the result of the standard template with the closest similarity is output as the speech recognition result. (4) The phoneme as which silence is likely to be misrecognized is the most frequent phoneme result obtained by recording speech data in a silent state, performing feature extraction on the speech data by discrete Fourier transform, and converting the input feature vector time series into a phoneme label time series.

[0014]

[Operation] In the phoneme recognition processing of the inventions of claims 1 and 2, silence data are forcibly assigned to the frames in the detected silent section. Therefore, even when silence, which is easily affected by noise, is not recognized as the silence phoneme, it does not cause a wrong word to be recognized.

[0015] In the word recognition processing of the invention of claim 3, the silent portions in the standard templates are replaced with a phoneme as which silence is likely to be misrecognized. In the word recognition processing of the invention of claim 4, the silent portions in the standard templates are replaced with the most frequent phoneme result obtained when data are recorded in a silent state. The phoneme recognized for a silent portion of the input speech is therefore very likely to match the substituted phoneme most closely, and misrecognition of words is avoided.

[0016]

[Embodiments] An embodiment of the present invention is described below with reference to the drawings. (Embodiment 1) FIG. 1 shows a flowchart of the method of this embodiment. The input speech (step S₁) is first input to the silent section detection unit (step S₂), which detects the silent sections of the uttered speech. Next, phoneme recognition is performed in the phoneme recognition unit (step S₃). The labeled data are then passed to the word recognition unit, where word recognition is performed (step S₄), and the recognition result is obtained (step S₅).

[0017] The silent section detection unit, the phoneme recognition unit, and the word recognition unit are each described in detail below. (1) Silent section detection unit: Silent sections are detected from the speech data using the power information of the speech. (1-1) The logarithmic power of each frame of the input speech is obtained by the following equation.

[0018]

[Equation 3]

[0019] Here, P(i) is the logarithmic power of the i-th frame, and AM_k is the amplitude of the k-th speech sample in the section contained in the i-th frame.

[0020] (1-2) A speech section in which the logarithmic power is at or below the threshold θ is detected as a silent section, as shown in FIG. 2. Depending on the value of θ, fricatives such as the phonemes (s), (f), and (h) have low speech power, so their sections may be detected as silent sections. It is therefore assumed here that θ is the optimum value that can distinguish silent sections from voiced sections.
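A minimal Python sketch of steps (1-1) and (1-2) is given here for illustration. The equation for P(i) is not reproduced in this text, so the sketch assumes P(i) = 10·log10(Σ_k AM_k²) over the samples of frame i; that form, the non-overlapping framing, the small constant added before the logarithm, and the names `frame_log_power` and `detect_silence` are all assumptions rather than the specification's definition.

```python
import math


def frame_log_power(samples, frame_len):
    """Logarithmic power of each non-overlapping frame (assumed formula)."""
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(a * a for a in frame)
        # 1e-12 avoids log10(0) on an all-zero frame (an added safeguard).
        powers.append(10.0 * math.log10(energy + 1e-12))
    return powers


def detect_silence(powers, theta):
    """Mark a frame as silent when its logarithmic power is at or below theta."""
    return [p <= theta for p in powers]


# One quiet frame followed by one loud frame (made-up amplitudes).
samples = [0.001] * 4 + [0.5, -0.5, 0.5, -0.5]
powers = frame_log_power(samples, frame_len=4)
print(detect_silence(powers, theta=-30.0))
```

The result depends entirely on θ: a threshold set too high would also flag low-power fricatives as silent, which is the concern raised in (1-2).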

[0021] (1-3) The threshold θ is obtained according to the flowchart of FIG. 3. First, on the system that will actually perform recognition, several utterances containing the low-power phonemes (s) and (h) are spoken and recorded as speech data (step S₁). The recorded data range must not contain any silent portion. Specifically, a recording start signal and a recording end signal are given, and the speaker keeps uttering between them. Next, speech data are recorded in the same way with no speech input (silent state) (step S₃). Then, from the data obtained in steps S₁ and S₃, the frame power within each speech section is computed by the method using formula (1) above (steps S₂, S₄). The average of the frame powers is computed for each recording (step S₅). Finally, the central value between the two frame power averages obtained in this way is set as the threshold θ (step S₆).
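The threshold procedure of FIG. 3 might be sketched as follows. The sketch reads "the central value" of the two averages as their midpoint, which is an interpretation; the function name `choose_threshold` and the sample values are hypothetical.

```python
def choose_threshold(speech_frame_powers, silence_frame_powers):
    """Midpoint between the mean frame power of low-power speech and of silence."""
    mean_speech = sum(speech_frame_powers) / len(speech_frame_powers)
    mean_silence = sum(silence_frame_powers) / len(silence_frame_powers)
    # Assumed reading of "central value": halfway between the two averages.
    return (mean_speech + mean_silence) / 2.0


# Made-up frame powers for a recording of (s)/(h) sounds and a silent recording.
theta = choose_threshold([-20.0, -24.0], [-50.0, -54.0])
print(theta)
```

Because both recordings are made on the target system, the resulting θ reflects that system's own noise floor without retraining the phoneme classifier.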

[0022] (2) Phoneme recognition unit: Phoneme recognition is performed on the speech data and phoneme labels are assigned. (2-1) For each frame, feature extraction is performed by DFT (Discrete Fourier Transform) to obtain a feature label time series.
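For illustration, per-frame DFT feature extraction might look like the following Python sketch. The specification does not say which quantities derived from the DFT form the feature vector; using the magnitude spectrum up to the Nyquist bin is an assumption, and the name `dft_magnitudes` is hypothetical.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitude of DFT bins 0..N/2 for one frame of real-valued samples."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        # Direct evaluation of X[k] = sum_t x[t] * exp(-2*pi*i*k*t/N).
        x_k = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(x_k))
    return spectrum


# One period of a cosine sampled at four points: energy only in bin 1.
features = dft_magnitudes([1.0, 0.0, -1.0, 0.0])
print(features)
```

Applying this function to every frame in turn yields the time series that is passed to the phoneme classifier. A direct DFT costs O(N²) per frame; a practical system would normally use an FFT.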

[0023] (2-2) A neural-network phoneme classifier converts the input feature vector time series into a phoneme label time series. Silence is also recognized as one phoneme. If the voice input environment used when training the neural network differs from the input environment in which speech is actually input and recognized using the trained result, disturbances such as microphone noise arising before the speech data reach the voice input unit may affect the speech waveform. The waveform of silent portions is especially susceptible. In that case, silence may not be recognized as silence and may be misrecognized as another phoneme.

[0024] (2-3) The present invention solves this problem as follows. In the labeled phoneme recognition result, silence data are forcibly assigned to the frames of the silent sections obtained by the silent section detection unit, as shown in FIG. 4. As a result, even when silence has not been recognized as silence, the subsequent word recognition unit is not affected. By setting the threshold θ of (1-2) and (1-3) above to the optimum value for the system, the environment dependence of silent sections is eliminated.
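The forced assignment of (2-3) can be sketched in a few lines of Python. The sketch assumes one phoneme label and one silence flag per frame; the function name `force_silence` and the label "sil" are hypothetical.

```python
def force_silence(phoneme_labels, silence_flags, silence_label="sil"):
    """Overwrite the classifier's label with silence on every frame flagged as silent."""
    return [
        silence_label if is_silent else label
        for label, is_silent in zip(phoneme_labels, silence_flags)
    ]


# The classifier mislabeled the two silent frames as "a" (made-up example).
labels = ["i", "i", "a", "a", "p", "u"]
flags = [False, False, True, True, False, False]
print(force_silence(labels, flags))
```

The classifier's output for the flagged frames is discarded, so its behavior on noisy silence no longer reaches the word recognition unit.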

[0025] (3) Word recognition unit: The similarity between the phoneme label data obtained by the phoneme recognition unit and every standard template prepared in advance is computed by DP matching, and the result of the closest standard template is output as the speech recognition result.

[0026] (Embodiment 2) FIG. 5 shows a flowchart of the method of this embodiment. Phoneme recognition is first performed on the input speech (step S₁) by the phoneme recognition unit (step S₂). The labeled data are then passed to the word recognition unit, where word recognition is performed (step S₃), and the recognition result is obtained (step S₄). The phoneme recognition unit and the word recognition unit are each described in detail below.

[0027] (1) Phoneme recognition unit: Phoneme recognition is performed on the speech data and phoneme labels are assigned. (1-1) For each frame, feature extraction is performed by DFT (Discrete Fourier Transform) to obtain a feature label time series.

[0028] (1-2) A neural-network phoneme classifier converts the input feature vector time series into a phoneme label time series. Silence is also recognized as one phoneme.

[0029] (2) Word recognition unit: The similarity between the phoneme label data obtained by the phoneme recognition unit and every standard template prepared in advance is computed, and the result of the closest standard template is output as the speech recognition result.

[0030] (2-1) The silent portions in the standard templates are replaced with a phoneme as which silence is likely to be misrecognized. Silence then no longer needs to be recognized as silence when the voice input environment differs.

[0031] The method by which the word recognition unit sets the phoneme as which silence is likely to be misrecognized is described below with the flowchart of FIG. 6. First, on the system that will actually recognize speech, speech data are recorded for a fixed period with no speech input (silent state) (step S₁). Next, phoneme labels are assigned using the neural-network phoneme classifier (step S₂). Then, based on the phoneme results labeled in step S₂, the most frequent phoneme recognition result is set as the phoneme as which silence is likely to be misrecognized (steps S₃, S₄).
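The substitution of (2-1) and the procedure of FIG. 6 might be sketched as follows. The sketch takes the classifier's labels for a silent recording as given; the names `most_frequent_label` and `replace_silence`, the label "sil", and the example data are hypothetical.

```python
from collections import Counter


def most_frequent_label(labels_from_silence):
    """Label the classifier output most often while only silence was recorded."""
    return Counter(labels_from_silence).most_common(1)[0][0]


def replace_silence(templates, substitute, silence_label="sil"):
    """Copy of the templates with every silence label replaced by the substitute."""
    return {
        word: [substitute if p == silence_label else p for p in labels]
        for word, labels in templates.items()
    }


# Made-up classifier output for a silent recording on the target system.
substitute = most_frequent_label(["a", "sil", "a", "u", "a"])
adapted = replace_silence({"ippun": ["i", "sil", "p", "u", "n"]}, substitute)
print(substitute, adapted)
```

Only the dictionary changes in this embodiment: the classifier is left as trained, and the templates are rewritten to expect whatever label it tends to emit for silence on the target system.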

[0032] (2-2) The similarity between the phoneme label data obtained by the phoneme recognition unit and every standard template prepared in advance is computed by DP matching, and the result of the closest standard template is output as the speech recognition result (step S₄ in FIG. 5).

[0033]

[Effects of the Invention] As described above, the present invention provides the following excellent effects. (1) According to the inventions of claims 1 to 4, when the voice input environment is changed, silence can be recognized in a manner adapted to the environment without re-learning.

[0034] (2) In the phoneme recognition processing of the inventions of claims 1 and 2, silence data are forcibly assigned to the frames in the detected silent section. Therefore, even when silence, which is easily affected by noise, is not recognized as the silence phoneme, it does not cause a wrong word to be recognized.

[0035] (3) In the word recognition processing of the invention of claim 3, the silent portions in the standard templates are replaced with a phoneme as which silence is likely to be misrecognized. In the word recognition processing of the invention of claim 4, the silent portions in the standard templates are replaced with the most frequent phoneme result obtained at data recording time. Silence therefore no longer needs to be recognized as silence when the voice input environment differs. Moreover, the phoneme recognized for a silent portion of the input speech is very likely to match the substituted phoneme most closely, so misrecognition of words is avoided.

[Brief description of drawings]

FIG. 1 is a flowchart showing an embodiment of the present invention.

FIG. 2 is a graph for explaining the operation of the silent section detection unit in the embodiment.

FIG. 3 is a flowchart showing the method of setting the threshold θ in the embodiment.

FIG. 4 is an explanatory diagram showing an example in which silent sections are applied to a recognition result in the embodiment.

FIG. 5 is a flowchart of another embodiment of the present invention.

FIG. 6 is a flowchart explaining the main part of the other embodiment.

FIG. 7 is a conceptual diagram of a speech recognition method.

FIG. 8 is a graph explaining DP matching.

Claims

[Claims]

1. A speech word recognition method characterized by performing: silent section detection processing, in which the logarithmic power P(i) of each frame of input speech is obtained by the following equation, where P(i) is the logarithmic power of the i-th frame and AM_k is the amplitude of the k-th speech sample in the section contained in the i-th frame, and a speech section in which the logarithmic power P(i) is at or below a predetermined threshold θ is taken as a silent section; phoneme recognition processing, in which feature extraction is performed on each frame by discrete Fourier transform, the input feature vector time series is converted into a phoneme label time series to recognize phonemes, and, within the phoneme recognition result, silence data are forcibly assigned to the frames in the silent section detected by the silent section detection processing; and word recognition processing, in which the similarity between the phoneme label data obtained by the phoneme recognition processing and each standard template of phoneme label time series held as a dictionary is computed by the DP matching method, and the result of the standard template with the closest similarity is output as the speech recognition result.

2. The speech word recognition method according to claim 1, wherein the predetermined threshold θ is set by recording speech data of phonemes with low speech power and speech data of a silent state, computing the frame power of each set of speech data, and taking the central value between the averages of the computed frame powers.

3. A speech word recognition method characterized by performing: phoneme recognition processing, in which feature extraction is performed on each frame of input speech by discrete Fourier transform and the input feature vector time series is converted into a phoneme label time series; and word recognition processing, in which the silent portions in the standard templates of phoneme label time series held as a dictionary are replaced with a phoneme as which silence is likely to be misrecognized, the similarity between the phoneme label data obtained by the phoneme recognition processing and each standard template is computed by the DP matching method, and the result of the standard template with the closest similarity is output as the speech recognition result.

4. The speech word recognition method according to claim 3, wherein the phoneme as which silence is likely to be misrecognized is the most frequent phoneme result obtained by recording speech data in a silent state, performing feature extraction on the speech data by discrete Fourier transform, and converting the input feature vector time series into a phoneme label time series.