JPH05119798A

JPH05119798A - Word recognition system

Info

Publication number: JPH05119798A
Application number: JP3282841A
Authority: JP
Inventors: Kazuhiko Okashita; 和彦岡下
Original assignee: Sekisui Chemical Co Ltd
Current assignee: Sekisui Chemical Co Ltd
Priority date: 1991-10-29
Filing date: 1991-10-29
Publication date: 1993-05-18

Abstract

PURPOSE: To provide a word recognition system that is robust against nonlinear expansion and contraction of speech in the time direction, caused by differences in speaking habits among speakers and by variation in a single speaker's utterances, and that both reduces the amount of computation (speeds up processing) and improves recognition accuracy. CONSTITUTION: A word recognition system 10, which recognizes words from input speech using a neural network 17, obtains the power of the input speech frame by frame, finds the frames corresponding to a plurality of peaks in the time series of the power, distinguishes those frames from the other frames, divides the speech into blocks of frames (between the start point of the speech and a peak, between one peak and the next, and between a peak and the end point), and inputs the mean frequency characteristic of the speech in each block to the neural network 17.

Description

Detailed Description of the Invention

[0001]

[Industrial Field of Application] The present invention relates to a word recognition method that uses, as the input for recognition, averages of frequency characteristics corresponding to the phonemes of the input speech, and uses a neural network as the recognition technique.

[0002]

[Prior Art] In a conventional word recognition method using a neural network, as described in Japanese Patent Application No. 1-98376, feature parameters of the input speech are calculated for each frame, the speech interval is divided into equal lengths of time with each length treated as one block, and the average of the feature parameters within each block is used as an input parameter.

[0003]

[Problems to Be Solved by the Invention] The conventional method divides the input speech into equal blocks without considering the nonlinear expansion and contraction of the speech or its phoneme boundaries, and uses the average of the feature parameters within each block as an input parameter. It is therefore sensitive to differences in speaking habits among speakers and to time-direction variation in a single speaker's utterances, it is difficult to reduce the amount of computation (speed up processing), and it is difficult to recognize similar words.

[0004] An object of the present invention is to provide a word recognition method that is robust against nonlinear expansion and contraction of speech in the time direction, caused by differences in speaking habits among speakers and by variation in a single speaker's utterances, and that simultaneously reduces the amount of computation (speeds up processing) and improves recognition accuracy.

[0005]

[Means for Solving the Problems] The invention according to claim 1 is a word recognition method for recognizing a word from input speech using a neural network, in which power is obtained for each frame of the input speech; frames corresponding to a plurality of peaks are obtained from the time series of the power; the frames so obtained are distinguished from the other frames; the speech is divided into blocks, namely a block of frames between the start point of the speech and a peak, a block of frames corresponding to a peak, a block of frames between one peak and the next, and a block of frames between a peak and the end point; and the values obtained by averaging the frequency characteristics of the speech within each block are used as the input to the neural network.

[0006] The invention according to claim 2 is the invention according to claim 1 in which, further, a neural network having input units corresponding to the number of divisions of the input speech is used as the neural network.

[0007] The invention according to claim 3 is the invention according to claim 1 in which, further, a single neural network is used as the neural network.

[0008]

[Operation] Redundant feature quantities (frequency characteristics) are averaged and replaced by a single representative point, and the input speech is divided with a simple phoneme classification taken into account. The method is therefore robust against nonlinear expansion and contraction of speech in the time direction, caused by differences in speaking habits among speakers and by variation in a single speaker's utterances, and it simultaneously reduces the amount of computation (speeds up processing) and improves recognition accuracy.

[0009] The "neural network" in the present invention is explained in items (1) to (4) below.

[0010] (1) Neural networks can be broadly classified by structure into two types: the hierarchical network shown in FIG. 4(A) and the mutually connected network shown in FIG. 4(B). The present invention may be built with either type, but the hierarchical network is the more useful because a simple learning algorithm, described later, has been established for it.

[0011] (2) Network structure
As shown in FIG. 5, the hierarchical network has a layered structure consisting of an input layer, an intermediate layer, and an output layer. Each layer is composed of one or more units. The only connections are forward connections, input layer → intermediate layer → output layer; there are no connections within a layer.
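The forward-only connection pattern described above can be illustrated with a minimal Python sketch. The weight layout (`layers[k][j][i]`), the function name `forward`, and the use of a sigmoid as the transfer function are assumptions made for illustration; the source does not specify them.

```python
import math

def forward(layers, inputs):
    """Propagate `inputs` through a purely feed-forward network.

    `layers` is a list of weight matrices (hypothetical layout):
    layers[k][j][i] is the weight from unit i of layer k to unit j
    of layer k + 1. There are no connections within a layer.
    """
    activations = list(inputs)
    for weights in layers:
        activations = [
            # Weighted sum of the previous layer, then a sigmoid (assumed).
            1.0 / (1.0 + math.exp(-sum(w * a for w, a in zip(row, activations))))
            for row in weights
        ]
    return activations
```

With all weights set to zero, every unit receives a sum of 0 and the sigmoid returns 0.5, which makes the structure easy to check by hand.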

[0012] (3) Unit structure
As shown in FIG. 6, a unit is a model of a neuron in the brain, and its structure is simple. It receives inputs from other units, takes their sum, transforms the sum according to a fixed rule (the transfer function), and outputs the result. Each connection to another unit carries a variable weight representing the strength of that connection.
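A single unit as described here can be sketched in a few lines of Python. The name `unit_output` and the default sigmoid transfer function are assumptions; the source only states that the weighted inputs are summed and passed through a transfer function.

```python
import math

def unit_output(inputs, weights, transfer=None):
    """Output of one unit: transfer(sum of weight * input)."""
    if transfer is None:
        # Assumed default; the patent does not name a specific function.
        transfer = lambda s: 1.0 / (1.0 + math.exp(-s))
    total = sum(w * x for w, x in zip(weights, inputs))
    return transfer(total)
```

Passing `transfer=lambda s: s` exposes the raw weighted sum, which is convenient when checking the weights by hand.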

[0013] (4) Learning (backpropagation)
Training a network means bringing its actual output closer to a target value (the desired output); in general, this is done by changing the transfer function and the weights of each unit shown in FIG. 6.

[0014] As the learning algorithm, for example, the backpropagation described in Rumelhart, D.E., McClelland, J.L. and the PDP Research Group, Parallel Distributed Processing, The MIT Press, 1986, can be used.
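As a heavily simplified illustration of how the weights are moved toward the target, the sketch below applies one gradient step to a single sigmoid output unit under a squared-error criterion. This is only the output-layer update rule of backpropagation, not the full algorithm of the cited reference; the function name, the learning rate, and the choice of sigmoid are assumptions.

```python
import math

def update_output_unit(weights, inputs, target, learning_rate=0.5):
    """One gradient step for a single sigmoid unit (illustrative only)."""
    total = sum(w * x for w, x in zip(weights, inputs))
    y = 1.0 / (1.0 + math.exp(-total))
    # Error term: (target - output) times the sigmoid derivative y * (1 - y).
    delta = (target - y) * y * (1.0 - y)
    return [w + learning_rate * delta * x for w, x in zip(weights, inputs)]
```

Starting from zero weights with a target of 1, the step raises the weights on the active inputs, so the unit's output moves from 0.5 toward the target.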

[0015]

[Embodiments] FIG. 1 is a schematic diagram showing the word recognition device used in the first embodiment of the present invention, FIG. 2 is a schematic diagram showing the word recognition device used in the second embodiment, FIG. 3 is a schematic diagram showing how the input speech is divided, FIG. 4 is a schematic diagram showing neural networks, FIG. 5 is a schematic diagram showing a hierarchical neural network, and FIG. 6 is a schematic diagram showing the structure of a unit.

[0016] (First Embodiment) (see FIGS. 1 and 3)
As shown in FIG. 1, the word recognition device 10 comprises a speech input unit 11, a power calculation unit 12, a peak position detection unit 13, an n-channel bandpass filter 14, a block division and averaging unit 15, a network selection unit 16, neural networks 17A, 17B, 17C, ..., and a determination unit 18.

[0017] (A) Learning
1. Input creation
The registered vocabulary was 100 words, and the speech for each word was input to the speech input unit 11. Power is obtained for each frame from the amplitude of each word's speech, giving a time series of power for each word. At this point, smoothing is performed by averaging over several frames.
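The per-frame power calculation and the multi-frame smoothing can be sketched as follows. The source does not give the power formula, the frame length, or the smoothing width, so mean squared amplitude over non-overlapping frames and a 3-frame moving average are assumptions, as are the function names.

```python
def frame_power(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame (assumed formula)."""
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        powers.append(sum(s * s for s in frame) / frame_len)
    return powers

def smooth(values, width=3):
    """Moving average over `width` frames; the window is clipped at the edges."""
    half = width // 2
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half):i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed
```

Smoothing before peak picking reduces the number of spurious local maxima caused by frame-to-frame fluctuation.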

[0018] Using the peak position detection unit 13, the frames corresponding to the local maxima (peaks) of the power are obtained from the time series of the power.
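A minimal sketch of peak picking on the power time series is shown below. It treats a frame as a peak when its power is strictly greater than that of both neighbours; the patent does not state the exact criterion, so this rule (which ignores flat plateaus and the first and last frames) and the name `find_peaks` are assumptions.

```python
def find_peaks(power):
    """Indices of frames whose power is a strict local maximum."""
    peaks = []
    for i in range(1, len(power) - 1):
        if power[i - 1] < power[i] > power[i + 1]:
            peaks.append(i)
    return peaks
```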

[0019] Using the block division and averaging unit 15, as shown in FIG. 3, the frames so obtained are distinguished from the other frames, and the speech is divided into blocks: a block of frames between the start point of the speech and a peak, a block of frames corresponding to a peak, a block of frames between one peak and the next, and a block of frames between a peak and the end point.
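The division into blocks can be sketched as a function that turns a list of peak frame indices into half-open frame ranges. Treating each peak block as exactly one frame is an assumption (the source says only "frames corresponding to a peak"), and the function name and the `(start, stop)` representation are hypothetical.

```python
def split_into_blocks(n_frames, peaks):
    """Return (start, stop) frame ranges: before-peak, peak, between, ..., after-peak."""
    blocks = []
    previous_stop = 0
    for p in peaks:
        blocks.append((previous_stop, p))  # frames leading up to this peak
        blocks.append((p, p + 1))          # the peak frame itself (assumed single frame)
        previous_stop = p + 1
    blocks.append((previous_stop, n_frames))  # frames from the last peak to the end point
    return blocks
```

Under these assumptions an utterance with P peaks always yields 2P + 1 blocks, so the number of blocks depends on the word rather than on its duration.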

[0020] The speech within each block is passed through the n-channel bandpass filter 14 and averaged band by band in the block division and averaging unit 15, giving the input to the neural network.
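The band-by-band averaging within each block can be sketched as below. Here `features[t][c]` is assumed to hold the output of bandpass channel `c` at frame `t`, and `blocks` is a list of half-open `(start, stop)` frame ranges; both representations and the function name are hypothetical.

```python
def block_averages(features, blocks):
    """Average each channel over each block and concatenate the results."""
    n_channels = len(features[0])
    inputs = []
    for start, stop in blocks:
        frames = features[start:stop]
        for ch in range(n_channels):
            inputs.append(sum(f[ch] for f in frames) / len(frames))
    return inputs
```

The resulting vector has n × (number of blocks) elements, one per channel per block, regardless of how many frames each block contains.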

[0021] 2. Learning
The input obtained above is fed to whichever of the neural networks 17A, 17B, 17C, ... is selected by the network selection unit 16, namely the one whose input units correspond to the number of blocks into which the input speech was divided. The neural networks 17A, 17B, 17C, ... are prepared one for each number of divisions found among the registered words; each network has an input layer with n (channels) × (number of divisions of the registered word) units and an output layer with 100 units.
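The sizing rule above can be made concrete with a small sketch that lists the input and output layer sizes needed for each block count seen in the registered words. The function name and the dictionary representation are assumptions for illustration.

```python
def layer_sizes_by_block_count(block_counts, n_channels, n_words=100):
    """Map each distinct block count to (input units, output units)."""
    return {k: (n_channels * k, n_words) for k in sorted(set(block_counts))}
```

For example, with 16 channels, words that split into 3 blocks need a network with 48 input units and words that split into 5 blocks need one with 80, each with 100 output units.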

[0022] The 100 words are numbered and mapped to the 100 units of the output layer. Each of the networks 17A, 17B, 17C, ... is then trained thoroughly so that, for the input obtained above, the output unit corresponding to the word has the value 1 and all other units have the value 0.
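The training target described here is a one-of-N vector. A minimal sketch, with the hypothetical name `one_hot_target`:

```python
def one_hot_target(word_index, n_words=100):
    """Target vector: 1 for the unit assigned to the word, 0 for all others."""
    target = [0.0] * n_words
    target[word_index] = 1.0
    return target
```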

[0023] (B) Evaluation
1. Input creation
For the speech captured by the speech input unit 11, the power calculation unit 12 obtains the power for each frame from the amplitude of each word's speech, giving a time series of power for each word. At this point, smoothing is performed by averaging over several frames.

[0024] Using the peak position detection unit 13, the frames corresponding to the local maxima (peaks) of the power are obtained from the time series of the power.

[0025] Using the block division and averaging unit 15, as shown in FIG. 3, the frames so obtained are distinguished from the other frames, and the speech is divided into blocks: a block of frames between the start point of the speech and a peak, a block of frames corresponding to a peak, a block of frames between one peak and the next, and a block of frames between a peak and the end point.

[0026] The speech within each block is passed through the n-channel bandpass filter 14 and averaged band by band in the block division and averaging unit 15, giving the input to the neural network.

[0027] 2. Evaluation
The network selection unit 16 selects the trained network 17A, 17B, 17C, ... that corresponds to the number of divisions, and the values obtained above are input to the selected neural network.

[0028] When no neural network corresponds to the number of divisions, the speaker is prompted to input the speech again.

[0029] The determination unit 18 determines the input word from the values of the units in the output layer of the neural network 17A, 17B, 17C, ....
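The evaluation flow of the first embodiment (select a network by block count, reject if none exists, otherwise decide from the output units) can be sketched as follows. Representing each trained network as a Python callable, returning `None` to signal "prompt for re-input", and choosing the unit with the largest output are all assumptions; the source says only that the word is determined from the output unit values.

```python
def recognize(networks, inputs, n_blocks):
    """Return the index of the recognized word, or None if re-input is needed.

    `networks` maps a block count to a callable that returns the
    output-layer values for a given input vector (hypothetical interface).
    """
    network = networks.get(n_blocks)
    if network is None:
        return None  # no network for this block count: ask the speaker to repeat
    outputs = network(inputs)
    # Assumed decision rule: pick the output unit with the largest value.
    return max(range(len(outputs)), key=lambda i: outputs[i])
```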

[0030] (C) Experiment
With 100 recognition words (personal names) and one specific speaker, the following Experiment 1 (conventional method) and Experiment 2 (method of the present invention) were carried out.

[0031] Experiment 1
The input speech is divided into equal lengths of time (8 blocks). The average of the frequency characteristics within each block (obtained with a 16-channel bandpass filter) was used as the input to the neural network, and learning and evaluation were carried out.

[0032] Experiment 2
Power is obtained for each frame of the input speech, frames corresponding to a plurality of peaks are obtained from the time series of the power, the frames so obtained are distinguished from the other frames, and the speech is divided into blocks: a block of frames between the start point of the speech and a peak, a block of frames corresponding to a peak, a block of frames between one peak and the next, and a block of frames between a peak and the end point. The values obtained by averaging the frequency characteristics of the speech within each block were then used as the input to the neural network, and learning and evaluation were carried out.

[0033] In Experiment 2, which used the method of the present invention, the error rate was about 3/5 of that in Experiment 1, which used the conventional method.

[0034] That is, it was confirmed that, because the method of the present invention averages redundant feature quantities (frequency characteristics) and replaces them with a single representative point, and divides the input speech with a simple phoneme classification taken into account, it is robust against nonlinear expansion and contraction of speech in the time direction, caused by differences in speaking habits among speakers and by variation in a single speaker's utterances, and simultaneously reduces the amount of computation (speeds up processing) and improves recognition accuracy.

[0035] (Second Embodiment) (see FIGS. 2 and 3)
The word recognition device 20 differs from the word recognition device 10 in that only a single neural network 17 is used as the neural network.

[0036] (A) Learning
1. Input creation
The registered vocabulary was 100 words, and the speech for each word was input to the speech input unit 11. Power is obtained for each frame from the amplitude of each word's speech, giving a time series of power for each word. At this point, smoothing is performed by averaging over several frames.

[0037] Using the peak position detection unit 13, the frames corresponding to the local maxima (peaks) of the power are obtained from the time series of the power.

[0038] Using the block division and averaging unit 15, as shown in FIG. 3, the frames so obtained are distinguished from the other frames, and the speech is divided into blocks: a block of frames between the start point of the speech and a peak, a block of frames corresponding to a peak, a block of frames between one peak and the next, and a block of frames between a peak and the end point.

[0039] The speech within each block is passed through the n-channel bandpass filter 14 and averaged band by band in the block division and averaging unit 15, giving the input to the neural network.

[0040] 2. Learning
The input obtained above is fed to the neural network 17. The neural network 17 has an input layer with n (channels) × (the largest number of blocks among the division counts of the registered words) units and an output layer with 100 units. When input units of the neural network are left over, 0 is assigned to the surplus units.

[0041] The 100 words are numbered and mapped to the 100 units of the output layer. The network 17 is then trained thoroughly so that, for the input obtained above, the output unit corresponding to the word has the value 1 and all other units have the value 0.

[0042] (B) Evaluation
1. Input creation
For the speech captured by the speech input unit 11, the power calculation unit 12 obtains the power for each frame from the amplitude of each word's speech, giving a time series of power for each word. At this point, smoothing is performed by averaging over several frames.

[0043] Using the peak position detection unit 13, the frames corresponding to the local maxima (peaks) of the power are obtained from the time series of the power.

[0044] Using the block division and averaging unit 15, as shown in FIG. 3, the frames so obtained are distinguished from the other frames, and the speech is divided into blocks: a block of frames between the start point of the speech and a peak, a block of frames corresponding to a peak, a block of frames between one peak and the next, and a block of frames between a peak and the end point.

[0045] The speech within each block is passed through the n-channel bandpass filter 14 and averaged band by band in the block division and averaging unit 15, giving the input to the neural network.

[0046] 2. Evaluation
The values obtained above are input to the trained network 17. When input units of the neural network 17 are left over, 0 is assigned to the surplus units. When there are too few input units, the speaker is prompted to input the speech again.
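The handling of surplus and insufficient input units in the single-network embodiment can be sketched as a padding step. The function name, the flat list representation of the input vector, and the use of `None` to signal "prompt for re-input" are assumptions made for illustration.

```python
def pad_inputs(inputs, n_channels, max_blocks):
    """Fit a block-averaged input vector to a fixed-size input layer.

    Returns the vector padded with zeros, or None when the utterance
    produced more blocks than the network has input units for.
    """
    size = n_channels * max_blocks
    if len(inputs) > size:
        return None  # too few input units: the speaker is asked to repeat
    return list(inputs) + [0.0] * (size - len(inputs))
```

Padding lets one network accept words with different block counts, at the cost of input units that carry no information for shorter words.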

[0047] The determination unit 18 determines the input word from the values of the units in the output layer of the neural network 17.

[0048] (C) Experiment
With 100 recognition words (personal names) and one specific speaker, the following Experiment 1 (conventional method) and Experiment 2 (method of the present invention) were carried out.

[0049] Experiment 1
The input speech is divided into equal lengths of time (8 blocks). The average of the frequency characteristics within each block (obtained with a 16-channel bandpass filter) was used as the input to the neural network, and learning and evaluation were carried out.

[0050] Experiment 2
Power is obtained for each frame of the input speech, frames corresponding to a plurality of peaks are obtained from the time series of the power, the frames so obtained are distinguished from the other frames, and the speech is divided into blocks: a block of frames between the start point of the speech and a peak, a block of frames corresponding to a peak, a block of frames between one peak and the next, and a block of frames between a peak and the end point. The values obtained by averaging the frequency characteristics of the speech within each block were then used as the input to the neural network, and learning and evaluation were carried out.

[0051] In Experiment 2, which used the method of the present invention, the error rate was approximately 2/3 of that in Experiment 1, which used the conventional method.

[0052] That is, it was confirmed that, because the method of the present invention averages redundant feature quantities (frequency characteristics) and replaces them with a single representative point, and divides the input speech with a simple phoneme classification taken into account, it is robust against nonlinear expansion and contraction of speech in the time direction, caused by differences in speaking habits among speakers and by variation in a single speaker's utterances, and simultaneously reduces the amount of computation (speeds up processing) and improves recognition accuracy.

[0053]

[Effect of the Invention] As described above, the present invention provides a word recognition method that is robust against nonlinear expansion and contraction of speech in the time direction, caused by differences in speaking habits among speakers and by variation in a single speaker's utterances, and that simultaneously reduces the amount of computation (speeds up processing) and improves recognition accuracy.

[Brief description of drawings]

[FIG. 1] A schematic diagram showing the word recognition device used in the first embodiment of the present invention.

[FIG. 2] A schematic diagram showing the word recognition device used in the second embodiment of the present invention.

[FIG. 3] A schematic diagram showing how the input speech is divided.

[FIG. 4] A schematic diagram showing neural networks.

[FIG. 5] A schematic diagram showing a hierarchical neural network.

[FIG. 6] A schematic diagram showing the structure of a unit.

[Explanation of symbols]

10, 20: word recognition device
11: speech input unit
12: power calculation unit
13: peak position detection unit
14: bandpass filter
15: block division and averaging unit
16: network selection unit
17, 17A, 17B, 17C: neural network
18: determination unit

Claims

[Claims]

1. A word recognition method for recognizing a word from input speech using a neural network, characterized in that power is obtained for each frame of the input speech; frames corresponding to a plurality of peaks are obtained from the time series of the power; the frames so obtained are distinguished from the other frames; the speech is divided into blocks, namely a block of frames between the start point of the speech and a peak, a block of frames corresponding to a peak, a block of frames between one peak and the next, and a block of frames between a peak and the end point; and the values obtained by averaging the frequency characteristics of the speech within each block are input to the neural network.

2. The word recognition method according to claim 1, wherein a neural network having input units corresponding to the number of divisions of the input speech is used as the neural network.

3. The word recognition method according to claim 1, wherein a single neural network is used as the neural network.