JPS62119600A

JPS62119600A - Word voice recognition equipment

Info

Publication number: JPS62119600A
Application number: JP60260325A
Authority: JP
Inventors: 教幸藤本; 佐藤　泰雄
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1985-11-20
Filing date: 1985-11-20
Publication date: 1987-05-30
Anticipated expiration: 2011-07-31
Also published as: JP2520392B2

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

[Detailed Description of the Invention]

[Overview]

Speech recognition is the technology behind the voice input device, that is, an input terminal that accepts speech directly. Recognition is comparatively easy to implement when the speaker utters words one at a time, and speaker-dependent word recognition devices in particular have been put to practical use in various fields.

In this type of word speech recognition device, the standard pattern of a word is matched against the feature pattern of the input speech. Even for the same word spoken by the same speaker, the acoustic features of some phonemes vary over time, so matching is generally performed after correcting for the expansion and contraction of the utterance duration.

There are nonlinear and linear methods for this time warping. The nonlinear method generally uses dynamic programming (DP). The DP method selects the time-axis warping function with an optimization algorithm so that the error between the standard pattern and the input pattern is minimized. This improves recognition performance, but the amount of processing is large and matching takes time. The linear time-warping matching method, by contrast, involves no optimization algorithm, so the amount of processing is small and the processing time is short, but recognition performance is lower.

In such a word speech recognition device, the present invention divides the target words into subsets and stores in a memory circuit, as a table, selection information indicating for each subset whether the DP method or the linear time-warping matching method is to be used. The invention is characterized in that, for each subset, the table is consulted to select either the linear time-warping matching method, which requires little matching computation, or the DP method, which requires more matching computation but gives better recognition performance. The choice for each subset can thus be made in advance according to the situation: whether the number of words to be recognized is large or small, whether a misrecognition would be fatal or would cause little trouble, and whether recovery is easy. As a result, the device can handle both cases where response time requirements are strict and cases where recognition performance requirements are strict.

[Industrial application field]

The present invention relates to a speech recognition device, the basic component of a voice input device, and in particular to the configuration of a speaker-dependent word speech recognition device that recognizes, word by word, speech uttered in discrete units such as syllables or words.

More specifically, the present invention relates to the configuration of a word speech recognition device in which the words are divided into subsets in advance and, for each subset, either the linear time-warping matching method, which requires little matching computation, or the DP method, which requires more matching computation but gives better recognition performance, can be used selectively.

[Traditional technique]

With advances in integration technology, voice input devices, which accept speech directly, have come into practical use among the various input terminals serving as man-machine interfaces. A voice input device speeds up data entry and lets people who are not skilled at operating input devices enter data by voice. The basis of such a device is speech recognition technology. Sentences spoken naturally are very difficult to recognize, because accent, intonation, and similar factors deform their acoustic characteristics in complex ways. The first devices to reach practical use were therefore so-called discrete word recognition devices, in which the speaker separates the utterance into units such as syllables or words and the device recognizes each unit individually. The vocabulary is usually a few hundred words or fewer, but even a vocabulary this small is useful for applications such as product inspection in factories.

A discrete word recognition device detects the boundary of each word and recognizes the words one after another. When standard patterns obtained by analyzing one particular person's voice are used, a high recognition rate is obtained for that speaker's input. A speaker-dependent speech recognition device therefore uses a training function that rebuilds the standard patterns for each speaker, and it can reach a recognition rate of 99% or more after the whole vocabulary has been trained several times.

The recognition device always contains a matching unit that compares the standard patterns with the feature pattern of the speech currently being input. The input speech pattern is a time series obtained by extracting features from the input source speech at a fixed frame period. The standard patterns are stored in the dictionary unit as a word dictionary. Each is likewise a time series of features extracted from source speech at a fixed frame period, prepared in advance by the training described above. The input speech is compared with each pattern in the word dictionary to decide which word it is.

Conventional matching methods of this kind are either linear or nonlinear. When the input speech pattern is compared with a standard pattern in word matching, the time axis must be normalized, because even the same word spoken by the same speaker expands and contracts along the time axis. In general this expansion and contraction is nonlinear. Linear matching aligns the time axes at a constant expansion ratio, so the processing is simple, but the recognition rate falls. Nonlinear matching adjusts for the nonlinear expansion and contraction: the warping function that normalizes the time axis is chosen so that the error between the input pattern and the standard pattern is minimized.

If this optimization selects the warping function by minimizing the error over every combination of the time-series data of the input pattern and the standard pattern, an enormous amount of computation is required. Dynamic programming, that is, DP matching, is generally used to reduce this computation substantially. Even so, DP matching requires considerably more computation than linear matching, and recognition takes correspondingly longer.

[Problem that the invention seeks to solve]

The present invention eliminates these drawbacks of conventional word speech recognition devices. It provides a word speech recognition device in which the words are divided into subsets and, for each subset, a table is consulted to select either the linear time-warping matching method, which requires little matching computation, or the DP method, which requires more matching computation but gives better recognition performance. The device can therefore handle both cases where response time requirements are strict and cases where recognition performance requirements are strict.

[Means to try to solve problems]

According to the present invention, this object is achieved by providing a word speech recognition device comprising: an acoustic analysis unit that receives a speech signal, extracts features of the speech, and performs section detection; a dictionary unit that stores word standard patterns analyzed in advance through the acoustic analysis unit; a first matching unit that linearly matches the feature pattern of the speech signal output from the acoustic analysis unit against the word standard patterns of the dictionary unit; a second matching unit that nonlinearly matches the word standard patterns from the dictionary unit against the feature pattern of the speech signal; selection means that selectively transfers the output of the acoustic analysis unit to the dictionary unit, the first matching unit, or the second matching unit; control means that controls the acoustic analysis unit, the dictionary unit, the first and second matching units, and the selection means via a computation processing unit; and means for setting in advance in a table, for each subset, whether to use the simple method of the first matching unit, which requires little matching computation, or the method of the second matching unit, which requires more matching computation but gives better recognition performance.

[Operation]

The words are divided into subsets, and a table is consulted that indicates, for each subset, whether the linear time-warping matching method or the DP method is to be used.

[Example]

The present invention will now be described with reference to the drawings.

The speech recognition device of FIG. 1 acoustically analyzes the speech input 1 and extracts the linguistic features of the words it contains. Standard patterns of these features, for words spoken by a specific speaker, are stored in advance in the dictionary 7. The feature pattern of the current speech input is compared with them, and the recognition decision is based on their similarity.

The speech input 1 from the microphone is fed to the preprocessing unit 2, where its high-frequency components are emphasized. If the subsequent processing is digital, the preprocessing unit 2 also converts the analog speech input into a digital signal with an A/D converter.

The high-frequency-emphasized speech input is analyzed acoustically in the parameter calculation unit 3, which in particular computes the frequency spectrum envelope of the speech. The spectrum envelope is analyzed with a bank of bandpass filters BPF and a rectifying and smoothing circuit connected to each filter, as shown in FIG. 2. The bandpass filter bank BPF divides the speech frequency band into about 12 sub-bands. Rectifying and smoothing the output of each filter yields the signal power in each band as a DC voltage. The rectified outputs of n bandpass filters form an n-dimensional vector A1, A2, ..., An, which represents the frequency spectrum envelope of the speech.

The output of the parameter calculation unit 3 is fed to the section detection unit 4, which detects the start and end of each word using a power threshold. The power of the speech input is computed; the point where the power rises above the threshold is taken as the start of a word, and the point where it falls back below the threshold as the end of the word. The words are thus separated, and recognition can be carried out on them one at a time.

The output of the section detection unit 4 is fed to the switching unit 6, which selectively transfers the feature pattern of each word obtained by the parameter calculation unit 3 and the section detection unit 4, in particular the spectrum envelope pattern, to the dictionary unit 7, the linear matching unit 9, or the DP matching unit 8. The standard patterns stored in the dictionary unit 7 are obtained in advance by acoustically analyzing words of known linguistic content, spoken by the specific speaker, through the preprocessing unit 2, the parameter calculation unit 3, and the section detection unit 4. Each word standard pattern is a time series of feature parameters obtained by analyzing the word over its whole duration. For example, if the duration of word A is T_A, the standard pattern of word A is normally recorded as time-series data by sampling the bandpass filter outputs in time within T_A. As shown in FIG. 3, this gives a matrix whose horizontal axis is time (frames) and whose vertical axis is the channel of each band; each element is an output of the parameter calculation unit, that is, the spectrum envelope value of that channel. The dictionary consists of one such matrix per word.
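The threshold-based section detection described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the circuit of the patent: the names `frame_power` and `detect_word_sections`, the use of the sum of the band outputs as the frame power, and the threshold value are all hypothetical choices made for the example.

```python
def frame_power(band_outputs):
    """Power of one frame, assumed here to be the sum of the rectified,
    smoothed band outputs A1..An (an assumption for illustration)."""
    return sum(band_outputs)


def detect_word_sections(powers, threshold):
    """Return (start, end) frame-index pairs, with end exclusive.

    A word starts at the first frame whose power exceeds the threshold
    and ends at the first later frame whose power is back at or below it.
    """
    sections = []
    start = None
    for i, p in enumerate(powers):
        if start is None and p > threshold:
            start = i                      # power rose above the threshold
        elif start is not None and p <= threshold:
            sections.append((start, i))    # power fell back below it
            start = None
    if start is not None:                  # input ended inside a word
        sections.append((start, len(powers)))
    return sections
```

Each detected section would then be cut out of the frame sequence and passed on as the feature pattern of one word.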

The feature of the present invention is that the similarity between the standard patterns constructed in this way and the feature pattern of the speech input 1 currently arriving from the microphone is evaluated selectively by either the linear matching unit 9 or the DP matching unit 8.

When the standard patterns stored in the dictionary unit 7 are compared with the feature pattern of the speech input arriving through the preprocessing unit 2, the parameter calculation unit 3, and the section detection unit 4, the acoustic features of some phonemes in the input speech vary over time. Moreover, even for the same word spoken by the same speaker, the duration of the word expands and contracts. This variation in duration must be corrected so that the standard pattern and the input feature pattern are compared in the state where they are closest. This correction of duration is the normalization of the time axis, and the methods of comparing the standard pattern with the input feature pattern differ in how they perform it.

Suppose the speech input to be recognized is analyzed by the parameter calculation unit 3 and the section detection unit 4 using the same bandpass filter bank BPF that was used to analyze the standard patterns stored in the dictionary unit 7, and that time-sampling the output gives the pattern X = x1, x2, ..., xm. That is, the input pattern X consists of m time-series patterns. Likewise, the standard pattern Y consists of the time-series patterns y1, y2, ..., yn. Each time-series pattern corresponds to one column of the matrix shown in FIG. 3 and is therefore a vector whose elements are the outputs of the bandpass filters.

When the input pattern X is matched against the standard pattern Y, the length of X is m whereas the length of Y is n, so the time-series patterns cannot be compared one to one. Because even speech from the same speaker expands and contracts along the time axis, the time axis must be normalized before comparison. This expansion and contraction is generally nonlinear. One can either adopt nonlinear matching, which follows the nonlinear expansion and contraction, or linear matching, which forces a correspondence on the time axis at a constant expansion ratio.

The linear matching unit 9 uses a matching method that aligns the time axes at a constant expansion ratio; it is fast, but the recognition rate is lower. Nonlinear time-warping matching aligns the time axes while adjusting for nonlinear expansion and contraction. Dynamic programming, that is, DP matching, is used for this computation, and the DP matching unit 8 is the processing unit based on it. For example, when the input pattern X has eight time-series patterns x1 to x8 and the standard pattern Y has only five, y1 to y5, linear matching and nonlinear matching determine the correspondence between sample points as shown in FIG. 4 and FIG. 5, respectively.

As shown in FIG. 4(a) and (b), linear matching normalizes time so that, when the sample indices of the input pattern X and the standard pattern Y are plotted on the horizontal and vertical axes respectively, the curve showing the correspondence between the time-series patterns is a straight line. In FIG. 4(b), x1 and x2 are compared with y1 of the standard pattern, x3 with y2, x4 and x5 with y3, x6 with y4, and x7 and x8 with y5. The path of this correspondence is a straight line, which means the samples are thinned out so that the matching is linear. The linear matching unit 9 performs linear matching in this way.
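A constant-ratio alignment of this kind can be sketched in Python as follows. The patent text gives no formula for the straight-line mapping, so the midpoint rule in `linear_alignment`, the city-block frame distance, and all function names are assumptions made for illustration.

```python
def linear_alignment(m, n):
    """Map each of m input frames to one of n standard frames along a
    straight line (0-based indices).

    Frame i is assigned to floor((i + 0.5) * n / m), computed in integer
    arithmetic. This midpoint rule is an assumption, not the patent's.
    """
    return [((2 * i + 1) * n) // (2 * m) for i in range(m)]


def frame_distance(a, b):
    """City-block distance between two band-output vectors (assumed metric)."""
    return sum(abs(p - q) for p, q in zip(a, b))


def linear_match_distance(x, y):
    """Total distance between input pattern x and standard pattern y when
    the frames are paired by the fixed straight-line alignment."""
    pairs = enumerate(linear_alignment(len(x), len(y)))
    return sum(frame_distance(x[i], y[j]) for i, j in pairs)
```

Only one frame distance is computed per input frame, which is why this method is cheap: the alignment depends on the two lengths alone and never on the content of the patterns.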

In the nonlinear matching shown in FIG. 5(a) and (b), the correspondence is nonlinear, as shown in FIG. 5(b): x1, x2, and x3 correspond to y1, x4 corresponds to y2, x5 and x6 correspond to y3, x7 corresponds to y4, and x8 corresponds to y5.

In this case the curve U is a nonlinear path. An optimization algorithm is used so that the best path is selected. This algorithm is generally based on the least-squares concept: the monotonically increasing function U is chosen so that the error between the input pattern X and the standard pattern Y is minimized. A direct least-squares approach would compute all the correlations between the input time-series patterns X and the standard time-series patterns Y, so finding the optimal warping function U would take a very long time. Dynamic programming (the DP method) is therefore generally used to reduce the amount of computation substantially.
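One way to write down this selection of the warping function is given below. The notation is an assumption made for illustration: the text only says that a monotonically increasing U is chosen to minimize the error in the least-squares sense, so the squared Euclidean norm is an assumed error measure, and the monotonicity is written as non-decreasing because several input frames may map to the same standard frame.

```latex
U^{*} \;=\; \arg\min_{U} \sum_{i=1}^{m} \bigl\lVert x_i - y_{U(i)} \bigr\rVert^{2},
\qquad 1 \le U(1) \le U(2) \le \dots \le U(m) \le n
```

Here x_i and y_j are the frame vectors of the input pattern X and the standard pattern Y, and m and n are their lengths.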

Rather than computing vector distances for every combination of the time-series patterns of the standard pattern and the input pattern, the DP method starts from the initial value of the warping function U and computes only the vector distances for neighboring time-series patterns, optimizing step by step in a recursive manner to obtain U. Because the DP method, which selects the warping function U that minimizes the error between the input pattern and the standard pattern, includes an optimization algorithm, it requires more computation than the linear matching method. Since it optimizes the time warping, however, its recognition performance is very good. In the present invention, therefore, the linear time-warping matching method and the DP matching method are selected under the control of the switching unit 6 so that processing load and recognition performance are balanced optimally.
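The step-by-step optimization can be sketched in Python as a textbook dynamic-programming alignment. The patent does not state its local path constraints or distance measure, so the three allowed predecessor steps, the city-block frame distance, the fixed endpoints, and the name `dp_match` are all assumptions made for illustration.

```python
def frame_distance(a, b):
    """City-block distance between two band-output vectors (assumed metric)."""
    return sum(abs(p - q) for p, q in zip(a, b))


def dp_match(x, y):
    """Align input pattern x with standard pattern y by dynamic programming.

    Returns (total_distance, path), where path is a list of 0-based
    (input_index, standard_index) pairs from (0, 0) to (m-1, n-1).
    """
    m, n = len(x), len(y)
    inf = float("inf")
    cost = [[inf] * n for _ in range(m)]
    back = [[None] * n for _ in range(m)]
    cost[0][0] = frame_distance(x[0], y[0])
    for i in range(m):
        for j in range(n):
            if i == 0 and j == 0:
                continue
            # Only the three neighbouring cells are considered at each step.
            candidates = []
            if i > 0:
                candidates.append((cost[i - 1][j], (i - 1, j)))
            if j > 0:
                candidates.append((cost[i][j - 1], (i, j - 1)))
            if i > 0 and j > 0:
                candidates.append((cost[i - 1][j - 1], (i - 1, j - 1)))
            best_cost, best_prev = min(candidates)
            cost[i][j] = frame_distance(x[i], y[j]) + best_cost
            back[i][j] = best_prev
    # Trace the optimal path back from the end point.
    path = []
    cell = (m - 1, n - 1)
    while cell is not None:
        path.append(cell)
        cell = back[cell[0]][cell[1]]
    path.reverse()
    return cost[m - 1][n - 1], path
```

The nested loops evaluate a frame distance for every (input frame, standard frame) pair, m × n in total, compared with m for a fixed straight-line alignment. That difference is the processing cost the text trades against recognition performance.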

The recognition result obtained by the linear matching unit 9 or the DP matching unit 8 is transferred through the control unit 10 to the host computer 5, where it is processed as appropriate. Each part of the word speech recognition device of FIG. 1 is controlled through the control unit 10 in accordance with control commands from the host computer 5.

As explained above, the linear time-warping matching method requires little matching computation but gives lower recognition performance, whereas the DP method requires more computation but gives better recognition performance. The feature of the word speech recognition device of the present invention is that the switching unit 6 is operated by a control signal from the control unit 10, under commands from the host computer, so that the subset being recognized decides whether the linear time-warping matching method or the DP method is used. Here, subsets are the partial vocabularies obtained by dividing the words by type, as shown for example in FIG. 6. Subset 1 holds words for Japanese place names and contains a relatively large number of words, subset 2 holds the 26 words for the letters of the alphabet, and subset 3 holds the 10 words for the digits 0 to 9. Specifying a subset reduces the number of candidates to be matched. In a recognition device that allows subsets to be specified in this way, predetermined subsets are matched by the simple linear time-warping matching method, which requires little matching computation, and the other subsets by the DP method, which requires more matching computation but gives better recognition performance. A flowchart of the operation of the word speech recognition device of the present invention is shown in FIG. 7.

When operation starts, a subset is specified first.

For example, to recognize the symbol string A 0038 shown at the bottom of FIG. 6, subset 2 is selected first, because the initial A belongs to the alphabet set. The next four words are digits, so subset 3 is selected for them. This reduces the set of candidates to be matched. After the subset has been specified, speech is input; for example, the speaker reads out A, 0, 0, 3, 8 as separate words. After the speech input, the number of words to be recognized is input from the host computer 5. It is known in advance that the first word is an utterance from subset 2, the alphabet. Depending on the subset being recognized, the device uses the table of FIG. 1 to decide whether to perform DP matching or linear matching. For example, the alphabet is subset 2, and DP matching is selected by referring to a table such as the one shown in FIG. 8. After DP matching, the recognition result is output, and the host computer 5 carries out recognition processing based on that result. If recognition is finished, the computer stops; if not, it decides whether the subset should be changed.
As a result of the comparison, a recognition result is output, and the host computer 5 executes recognition processing based on the recognition result. If recognition has been completed, the computer will stop, but if recognition has not been completed, it will be determined whether or not to change the subcent.

In the example of FIG. 6, the letter A is followed by four digits, so the specification is changed from subset 2 to subset 3 and speech is input again. In the same way, the device determines from the subset whether to perform DP matching or linear matching and then recognizes the word. If a place-name word follows the digits, subset 1 is selected. The table is consulted in this case too, and with the table of FIG. 8 linear matching is used.
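The flow described above (specify a subset, look up the method in the table, match only against that subset's words) can be sketched in Python as follows. This is a hedged illustration rather than the device's control program. The table entries for subsets 1 and 2 follow the text, but the entry for subset 3 is a placeholder assumption, and `toy_distance`, `MATCHERS`, and `recognize` are hypothetical names with a stand-in matcher in place of real linear and DP matching.

```python
DP, LINEAR = 1, 0  # logic 1 = DP method, logic 0 = linear method

# Subset number -> method flag. Subsets 1 (place names) and 2 (alphabet)
# follow the text; the entry for subset 3 (digits) is an assumption.
METHOD_TABLE = {1: LINEAR, 2: DP, 3: LINEAR}


def toy_distance(pattern, standard):
    """Stand-in matcher: sum of absolute differences (illustration only)."""
    return sum(abs(a - b) for a, b in zip(pattern, standard))


# Both flags use the same toy matcher here; a real device would plug in a
# linear time-warping matcher and a DP matcher.
MATCHERS = {LINEAR: toy_distance, DP: toy_distance}


def recognize(utterances, dictionary, table=METHOD_TABLE, matchers=MATCHERS):
    """Recognize a sequence of (subset_number, feature_pattern) pairs.

    For each utterance the table selects the matching method, and only the
    words of the specified subset are considered as candidates.
    """
    results = []
    for subset_no, pattern in utterances:
        flag = table[subset_no]            # table lookup per subset
        match = matchers[flag]
        candidates = dictionary[subset_no]
        word = min(candidates, key=lambda w: match(pattern, candidates[w]))
        results.append((word, "DP" if flag == DP else "linear"))
    return results
```

For the string A 0038, the caller would pass one utterance tagged with subset 2 followed by four tagged with subset 3, mirroring the subset change described in the text.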

As described above, in the present invention all words are grouped into subsets according to word type, and a table indicates, for each subset, whether the DP method or the linear time-warping matching method is to be used. That is, inside the table 11 shown in the block diagram of FIG. 1, each subset number is associated with logic 1, indicating that the DP method is used, or logic 0, indicating that the linear time-warping matching method is used, as in the table shown in FIG. 8. DP matching or linear matching is then performed by referring to this table. In a recognition device that allows subset specification, the correspondence between subsets and method selection is determined in advance, as in the table shown in FIG. 8, according to the situation: whether the number of words to be recognized is large or small, whether misrecognition would be fatal, whether misrecognition poses little problem, or whether recovery is easy. This makes it possible to cope both with cases where the response-time requirement is strict and with cases where the recognition-performance requirement is strict.
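The table lookup described above can be sketched as follows. This is a minimal illustration, not the circuit of table 11: the names `Method`, `METHOD_TABLE`, and `select_method` are hypothetical. The entries for subset 1 (linear) and subset 2 (DP) follow the description; the entries for subsets 3 and 4 are assumptions made only to complete the example.

```python
from enum import Enum


class Method(Enum):
    LINEAR = 0  # logic 0: linear time-warping matching
    DP = 1      # logic 1: DP matching


# Subset number -> logic value, fixed in advance.
METHOD_TABLE = {
    1: 0,  # place names: linear matching (per the description)
    2: 1,  # alphabet: DP matching (per the description)
    3: 0,  # digits: ASSUMED value, not stated in the description
    4: 1,  # ASSUMED value, not stated in the description
}


def select_method(subset_no: int) -> Method:
    """Look up the matching method for the specified subset."""
    return Method(METHOD_TABLE[subset_no])


for subset_no in sorted(METHOD_TABLE):
    print(subset_no, select_method(subset_no).name)
```

Keeping the choice in a table means the trade-off between response time and recognition performance can be changed per subset without touching either matching routine.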

[Effect of the invention]

As described above, the present invention stores in a storage circuit, as a table, the setting for each subset of whether to select the linear time-warping matching method, which requires little matching processing, or the DP method, which requires more matching processing but gives better recognition performance. The method is selected by referring to this table when a subset is specified. The invention therefore has the effect of coping both with cases where the response-time requirement is strict and with cases where the recognition-performance requirement is strict.
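The trade-off between the two methods can be made concrete with a small sketch. It is not the matching circuitry of the invention: `linear_match` uses simple index scaling and `dp_match` is a textbook dynamic time warping recurrence, both chosen here as assumptions for illustration, as are the city-block frame distance and the normalizations. The linear version visits one input frame per reference frame, while the DP version fills a grid of input frames by reference frames, which is where its larger processing amount comes from.

```python
from typing import List, Sequence

Frame = Sequence[float]  # one feature vector per frame


def frame_dist(a: Frame, b: Frame) -> float:
    """City-block distance between two feature vectors (assumed metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def linear_match(inp: List[Frame], ref: List[Frame]) -> float:
    """Linear time warping: scale the input time axis onto the reference. O(m)."""
    n, m = len(inp), len(ref)
    if n == 0 or m == 0:
        raise ValueError("patterns must contain at least one frame")
    total = 0.0
    for j in range(m):
        i = (j * (n - 1)) // (m - 1) if m > 1 else 0  # proportional frame index
        total += frame_dist(inp[i], ref[j])
    return total / m


def dp_match(inp: List[Frame], ref: List[Frame]) -> float:
    """Nonlinear time warping by dynamic programming (generic DTW). O(n * m)."""
    n, m = len(inp), len(ref)
    if n == 0 or m == 0:
        raise ValueError("patterns must contain at least one frame")
    inf = float("inf")
    prev = [inf] * m
    for i in range(n):
        cur = [inf] * m
        for j in range(m):
            if i == 0 and j == 0:
                best = 0.0
            else:
                best = min(
                    prev[j] if i > 0 else inf,                # input advances
                    cur[j - 1] if j > 0 else inf,             # reference advances
                    prev[j - 1] if i > 0 and j > 0 else inf,  # both advance
                )
            cur[j] = frame_dist(inp[i], ref[j]) + best
        prev = cur
    return prev[m - 1] / (n + m)


reference = [[0.0], [1.0], [2.0]]
# Same word with an unevenly lengthened first sound.
uttered = [[0.0], [0.0], [0.0], [0.0], [1.0], [2.0]]
print("linear:", linear_match(uttered, reference))
print("dp:    ", dp_match(uttered, reference))
```

With the unevenly stretched input, the DP distance is zero because the warping path absorbs the stretch, whereas the linear distance is positive. This matches the reasoning above: linear matching is cheaper but less tolerant of non-uniform changes in duration.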

[Brief explanation of drawings]

FIG. 1 is a block diagram of the word speech recognition device of the present invention; FIG. 2 is a block diagram of the parameter calculation section of the word speech recognition device of the present invention; FIG. 3 shows a standard pattern stored in the dictionary section of the word speech recognition device of the present invention; FIG. 4 shows an embodiment of the linear time-warping matching method of the word speech recognition device of the present invention; FIG. 5 shows an embodiment of the nonlinear time-warping matching method (DP method) of the word speech recognition device of the present invention; FIG. 6 is a flowchart showing the flow of operation of the word speech recognition device of the present invention; FIG. 7 is a flowchart showing the flow of operation of the word speech recognition device of the present invention; and FIG. 8 shows an embodiment of the reference table of the word speech recognition device of the present invention.

Description of symbols: 1... speech input section, 2... preprocessing section, 3... parameter calculation section, 4... section detection section, 5... host computer, 6... switching section, 7... dictionary section, 8... DP matching section, 9... linear matching section, 11... table.

[Drawings: FIG. 3 (frame axis), FIG. 4(a), FIG. 4(b), FIG. 5(a), FIG. 5(b), FIG. 6, FIG. 7, FIG. 8. Legible drawing labels: subset 1 (entries illegible); subset 2: A, B, C, D, ...; subset 3: zero, ichi, ni, san, ...; subset 4.]

Claims

[Claims]

(1) A word speech recognition device comprising: an acoustic analysis section that receives a speech signal, extracts speech features, and performs section detection; a dictionary section that stores word standard patterns analyzed in advance through the acoustic analysis section; a first matching unit, with a small processing amount, that matches the feature pattern of the speech signal output through the acoustic analysis section against the word standard patterns of the dictionary section; a second matching unit, with a large processing amount, that matches the feature pattern against the standard patterns; a selection section that selectively transfers the output of the acoustic analysis section to the dictionary section, the first matching unit, or the second matching unit; and control means for controlling the acoustic analysis section, the dictionary section, the first and second matching units, and the selection section via a calculation processing section; the device further comprising means for setting in advance in a table, for each subset, which is to be used: the method using the first matching unit or the method using the second matching unit, which has better recognition performance.

(2) The word speech recognition device according to claim 1, wherein the first matching unit performs linear matching.

(3) The word speech recognition device according to claim 1, wherein the second matching unit performs nonlinear matching.