JPS6131880B2

Info

Publication number: JPS6131880B2
Application number: JP55048082A
Authority: JP
Inventors: Isamu Nose
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1980-04-14
Filing date: 1980-04-14
Publication date: 1986-07-23
Also published as: JPS56144498A

Description

[Detailed description of the invention]

The present invention relates to a speech recognition device with a high recognition rate.

A conventional speech recognition device is shown in FIG. 1. In FIG. 1, 1 is a microphone, 2 is a filter analysis unit, 3 is a power detection unit, 4 is a sample data storage memory, 5 is a speech section extraction unit, 6 is a feature extraction unit, 7 is a speech feature registration memory, and 8 is an identification unit. Recognition devices are broadly classified, according to the speaker, into specific-speaker (speaker-dependent) and unspecified-speaker (speaker-independent) devices. In a specific-speaker device, the speaker utters each target word once or several times to register the features of his or her own voice in advance (hereinafter, registration mode). An unspecified-speaker device has no such registration step. Most devices currently on the market are of the specific-speaker type, which is described below with reference to FIG. 1.

Input speech is converted into an electrical signal by microphone 1 and separated into frequency components by filter analysis unit 2. Filter analysis unit 2 generally consists of a bank of band-pass filters, a bank of full-wave rectifiers, a bank of low-pass filters, a multiplexer, an A/D converter and so on. It divides the speech band of about 200 Hz to 5 kHz among about 10 to 15 filters and takes out each filter output at a period of 10 to 20 ms (hereinafter, sample data). This process is a common technique and not a direct element of the invention, so it is not illustrated. Sample data can be expressed as values with positive and negative polarity, as values of one polarity, or as absolute values (positive-to-negative peak-to-peak); for convenience, the absolute-value representation is assumed below.

The sample data are sent sequentially to power detection unit 3. When the sum or the maximum of the filter outputs in a sample reaches a predetermined threshold, this is taken as the beginning of a speech section, and the subsequent sample data are stored sequentially in sample data storage memory 4. Once data for a fixed period have been stored, this sequence ends and speech section extraction unit 5 operates.

Speech section extraction unit 5 detects the start and end of the speech section anew. One conceivable method, like the storage method above, uses the speech power with two thresholds: the start is the first sample of a run of samples exceeding threshold 1 that lasts for a certain time, and the end is the first sample of a run of samples at or below threshold 2 that lasts for a certain time, or the sample immediately before it. The interval between the two is the speech section. Once the speech section is determined, feature extraction unit 6 divides it into equal parts and takes the average of each filter output within each division as the feature. In registration mode, this feature is stored in registration memory 7.
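
The start and end rule of speech section extraction unit 5 (a run above threshold 1 marks the start; a run at or below threshold 2 marks the end) can be illustrated with a minimal sketch. The function name `find_speech_section`, the parameter names `thr_start`, `thr_end` and `hold`, and the choice of "one sample before the silent run" as the end point are assumptions made for illustration; the patent leaves these details open.

```python
from typing import List, Optional, Tuple


def find_speech_section(power: List[float], thr_start: float, thr_end: float,
                        hold: int) -> Optional[Tuple[int, int]]:
    """Return (start, end) sample indices of the speech section, or None.

    power     : one power value per sample (e.g. sum of the filter outputs)
    thr_start : threshold 1 (hypothetical name)
    thr_end   : threshold 2 (hypothetical name)
    hold      : number of samples a run must last to count
    """
    n = len(power)
    start = None
    # Start: first sample of the first run of `hold` samples above thr_start.
    for i in range(n - hold + 1):
        if all(p > thr_start for p in power[i:i + hold]):
            start = i
            break
    if start is None:
        return None
    # End: the sample just before the first run of `hold` samples
    # at or below thr_end that begins after the start run.
    for i in range(start + hold, n - hold + 1):
        if all(p <= thr_end for p in power[i:i + hold]):
            return start, i - 1
    return start, n - 1  # no trailing silence found: use the last sample


if __name__ == "__main__":
    demo = [0, 0, 5, 6, 7, 1, 6, 7, 0, 0, 0, 0]
    print(find_speech_section(demo, thr_start=3, thr_end=2, hold=2))
```

Requiring the run to last `hold` samples is what keeps a brief dip inside a word (index 5 in the demo data) from ending the section early.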
Once every phrase has been registered, newly uttered phrases can be identified. The operation of identification unit 8 is as follows. Let the registered features be $B^w(u, v)$ and the features of the uttered phrase to be identified be $A(u, v)$, where $w$ is the number of the registered phrase, $v$ is the number assigned sequentially to the divisions within the speech section, and $u$ is the number assigned sequentially to the filter outputs. The distance $M^w$ between $B^w(u, v)$ and $A(u, v)$ is defined as follows: [Formula]. $M^w$ is a measure of dissimilarity: $M^w$ is computed for every registered phrase, and the phrase corresponding to the $w$ that minimizes $M^w$ is the identification result. This process is hereinafter called identification mode.

However, even when the same speaker utters the same phrase, the utterance duration stretches and shrinks from one utterance to the next, and often only in parts. Linear matching, which divides the speech section equally, frequently cannot cope with this, and similar phrases are very difficult to distinguish.

The object of the present invention is to eliminate these drawbacks. The invention thins out the data in vowel sections, silent sections and other sections that stretch and shrink greatly, divides the speech section equally to obtain average features, and makes a two-stage decision consisting of ordinary matching in the first stage and partial matching that can cope with similar phrases. It is described in detail below.

FIG. 2 shows a first embodiment of the invention. The stages up to the speech section extraction unit are omitted because they are not the direct subject of the invention. 10 is a resampling circuit, 20 is a resampled data storage unit, 30 is a feature calculation unit, 40 is a feature storage unit, 50 is a registered feature storage unit, 60 is a first matching unit, and 70 is a second matching unit. In operation, resampling circuit 10 scans the extracted speech section data (not shown) from the start and performs stationarity detection (corresponding to vowel parts), detection of the maximum filter output within each sample (detection of silent parts), and normalization of the utterance power.

FIG. 3 is a detailed block diagram of resampling circuit 10. 100 is a one-sample data storage unit, 101 is a maximum value detection unit, 102 is a maximum value register, 103 is a comparator, 104 is an adder circuit, 105 is a sum register, 106 is a normalization unit, 107 is a normalized data storage unit, 108 is a difference polarity calculation unit, 109 is a current polarity register, 110 is a previous polarity register, 111 is a match detection unit, 20 is the resampled data storage unit, and 112 is a match counting unit.

One-sample data storage unit 100 stores the data of the extracted speech section one sample at a time, starting from the beginning. When one sample has been stored, maximum value detection unit 101 and adder circuit 104 scan its values in turn and store the maximum and the sum in maximum value register 102 and sum register 105, respectively. Comparator 103 compares the output of maximum value register 102 with a predetermined constant and passes the result to normalization unit 106. When the maximum is not smaller than the constant, normalization unit 106 computes, for each filter, the ratio (%) of the value in one-sample data storage unit 100 to the value in sum register 105.
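
The per-sample normalization carried out by units 101 to 106 can be sketched as follows: each filter output becomes a percentage of the sum of all outputs, and a sample whose maximum stays below the constant is treated as silence and output as zeros. The function name `normalize_frame` and the parameter name `min_peak` are hypothetical, and the guard against a zero sum is an addition of this sketch, not something the patent states.

```python
from typing import List


def normalize_frame(frame: List[float], min_peak: float) -> List[float]:
    """Normalize one sample (one value per filter) to percentages of its sum."""
    peak = max(frame)    # role of maximum value detection unit 101
    total = sum(frame)   # role of adder circuit 104
    if peak < min_peak or total == 0:
        # Silent sample: no ratio is computed and zeros are output.
        # (The total == 0 check is an extra safety guard of this sketch.)
        return [0.0] * len(frame)
    return [100.0 * f / total for f in frame]


if __name__ == "__main__":
    print(normalize_frame([10, 30, 60], min_peak=20))
    print(normalize_frame([1, 2, 3], min_peak=20))
```

Because every sample is scaled by its own sum, a loud and a quiet utterance of the same sound yield the same normalized spectrum shape.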
If the maximum is smaller than the constant, no ratio is computed and "0" is output. The output values are stored in normalized data storage unit 107. That is, let $F_n(k)$ be the output of each filter stored in one-sample data storage unit 100, where $n$ is the filter number and $k$ is the sample number within the speech section data. With $l$ filters, the maximum is

$$\mathrm{MAX}(k) = \max\{F_1(k), F_2(k), \ldots, F_n(k), \ldots, F_l(k)\}$$

and the output $\mathrm{NORM}_n(k)$ of normalization unit 106 is

(1) if $\mathrm{MAX}(k) \ge$ constant: $\mathrm{NORM}_n(k) = F_n(k) / \mathrm{ADD}(k) \times 100$ (%)
(2) if $\mathrm{MAX}(k) <$ constant: $\mathrm{NORM}_n(k) = 0$

where $\mathrm{ADD}(k)$ is the sum held in sum register 105.

Next, the resampling operation is described. It detects stationarity in the time series of sample data (it is well known that, in speech data, vowel parts are generally stationary while consonant and transitional parts are non-stationary) and thins out the samples in stationary parts. Difference polarity calculation unit 108 refers to the data in normalized data storage unit 107, computes the differences between the outputs of adjacent filters, and expresses the polarity of each difference as one of three values. When $\mathrm{MAX}(k) \ge$ constant, that is, when $\mathrm{NORM}_n(k) = 0$ does not hold for all $n$, it operates as follows. The difference is

$$D_n(k) = \mathrm{NORM}_n(k) - \mathrm{NORM}_{n+1}(k), \quad n = 1, 2, \ldots, l-1.$$

(1) If $|D_n(k)| \ge \Delta d$ (where $\Delta d$ is a predetermined constant): the difference polarity is $S_n(k) = S^+$ if $D_n(k) \ge 0$, and $S_n(k) = S^-$ if $D_n(k) < 0$.
(2) If $|D_n(k)| < \Delta d$: $S_n(k) = S^0$.

Here $S^+$, $S^0$ and $S^-$ are represented in 2 bits, for example $S^+ = (0, 1)$, $S^0 = (0, 0)$, $S^- = (1, 0)$. In this way the code sequence of the differences of one sample (the slope identification code sequence) $S_1(k), S_2(k), \ldots, S_{l-1}(k)$ is obtained. When $\mathrm{MAX}(k) <$ constant, that is, when all $\mathrm{NORM}_n(k) = 0$, the difference polarity is not calculated, and a code sequence that cannot occur in the difference polarity calculation is output instead, for example $S^\times, S^\times, S^\times, \ldots$, where $S^\times = (1, 1)$.

The calculated difference polarity is set in current polarity register 109, and at the same time the previous contents of current polarity register 109 are set in previous polarity register 110. In its initial state (before processing of an utterance begins), the previous polarity register is assumed to hold a code sequence that cannot occur in the difference polarity calculation, such as the $S^\times, S^\times, S^\times, \ldots$ of the example above. Match detection unit 111 detects whether the contents of current polarity register 109 and previous polarity register 110 match; that is, it detects whether the slope identification code sequences (slope identification code groups) adjacent on the time axis match each other completely. If they do not match, the sample is regarded as a non-stationary point, and the one-sample normalized data in normalized data storage unit 107 are stored in resampled data storage unit 20. If they match, match counting unit 112 counts the number of consecutive matches, and only when the count reaches a predetermined number are the contents of normalized data storage unit 107 stored in resampled data storage unit 20 and the count reset to "0". This normalized resampling is carried out over all the speech section data.

When resampling and normalization have been completed for the speech section sample data, feature calculation unit 30 divides the speech section of the resampled data into equal parts and takes as the feature the average of the channel filter outputs (normalized data) within each division. Let $I$ be the speech section length of the resampled data and $J$ the number of divisions; then $I / J = i$ gives the number of data $i$ per division. If there is a remainder $r$, it is corrected by adding one datum to each division, starting from the first, until no remainder is left. For example, if $r = 3$, the first three divisions contain $i + 1$ data and the rest contain $i$. The average $M_j(n)$ is computed from the normalized values $\mathrm{NORM}_n(k)$ as

$$M_j(n) = \frac{1}{\Delta i} \sum_{k=k_1}^{k_2} \mathrm{NORM}_n(k)$$

where $j$ is the division number ($j = 1, 2, \ldots, J$), $n$ is the channel filter number, $k$ is the number of the resampled datum, $k_1$ and $k_2$ are the first and last data of division $j$, and $\Delta i = k_2 - k_1 + 1$; for division $j$, $\Delta i = i$ if $r = 0$ and $\Delta i = i + 1$ if $r \ne 0$. This process is shown as a flowchart in FIG. 4 and as a block diagram in FIG. 5.

In FIG. 5, 120 is a division unit calculation unit, 121 is a resampled data reference address control unit, 122 is an adder, 123 is a sum storage register, 124 is an average calculation unit, 20 is the resampled data storage unit, and 40 is the feature storage unit.

In registration mode, the stored features are sent to registered feature storage unit 50 and kept there (hereinafter, registered features). In identification mode, the features (hereinafter, input features) are compared in turn with the registered features, and the phrase corresponding to the registered feature with a small dissimilarity is the identification result. The operation in identification mode is as follows. Let the input features be $A(u, v)$ and the registered features be $B^w(u, v)$, where $u$ is the number assigned to the filter outputs, $v$ is the number assigned to the divisions, and $w$ is the number assigned to the registered words. Because uttered speech generally stretches and shrinks non-linearly, simply matching corresponding divisions may not give a good match. In the conventional method, the dissimilarity $n^w(v_C)$ of a division $v_C$ was calculated by [Formula]. In the present invention, features newly calculated for division $v_C$ that take the overlap with adjacent divisions into account (hereinafter, overlapping features) are used, and the minimum dissimilarity among all combinations (4 × 4 = 16) is taken as the dissimilarity $m^w(v_C)$ of division $v_C$. That is,

$$m^w(v_C) = \min\{\, d_{(C,C)}, d_{(C,LC)}, d_{(C,RC)}, d_{(C,LCR)}, d_{(LC,C)}, d_{(LC,LC)}, d_{(LC,RC)}, d_{(LC,LCR)}, d_{(RC,C)}, d_{(RC,LC)}, d_{(RC,RC)}, d_{(RC,LCR)}, d_{(LCR,C)}, d_{(LCR,LC)}, d_{(LCR,RC)}, d_{(LCR,LCR)} \,\}$$

where each term, written here as $d_{(X,Y)}$, is the dissimilarity of division $v_C$ computed with the features $F_X$ and $F_Y$ shown in FIG. 6 and Table 1. For example, [Formula]. Accordingly, the overall dissimilarity $M^w$ between the input features and the features of registered phrase $w$ is [Formula]. In FIG. 6, $v_C$, $v_L$ and $v_R$ are the numbers assigned to the divisions obtained by dividing the speech section into $n$ equal parts.
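
The slope coding and stationarity-based resampling performed by resampling circuit 10 (units 108 to 112) can be sketched as follows. This is an illustration under assumptions, not the circuit itself: the names `slope_codes`, `resample`, `delta_d` and `keep_every` are hypothetical, and resetting the match counter whenever a mismatch occurs is this sketch's reading of "consecutive matches".

```python
from typing import List, Sequence, Tuple

Code = Tuple[int, int]
# 2-bit codes from the description; S_X marks a silent sample.
S_PLUS, S_ZERO, S_MINUS, S_X = (0, 1), (0, 0), (1, 0), (1, 1)


def slope_codes(norm: Sequence[float], delta_d: float) -> Tuple[Code, ...]:
    """Three-valued polarity of the differences between adjacent filters."""
    if all(v == 0 for v in norm):
        # Silent sample: emit a code the difference calculation never yields.
        return tuple(S_X for _ in range(len(norm) - 1))
    codes = []
    for a, b in zip(norm, norm[1:]):
        d = a - b  # D_n(k) = NORM_n(k) - NORM_{n+1}(k)
        if abs(d) < delta_d:
            codes.append(S_ZERO)
        else:
            codes.append(S_PLUS if d >= 0 else S_MINUS)
    return tuple(codes)


def resample(frames: Sequence[Sequence[float]], delta_d: float,
             keep_every: int) -> List[List[float]]:
    """Keep every non-stationary sample and one in `keep_every` stationary ones."""
    kept: List[List[float]] = []
    prev = None  # previous polarity register 110
    run = 0      # match counting unit 112
    for frame in frames:
        cur = slope_codes(frame, delta_d)
        if prev is None:
            prev = tuple(S_X for _ in cur)  # initial register state: all S_X
        if cur != prev:
            kept.append(list(frame))  # non-stationary point: always stored
            run = 0                   # assumption: a mismatch restarts the count
        else:
            run += 1
            if run == keep_every:
                kept.append(list(frame))
                run = 0
        prev = cur
    return kept
```

With `keep_every = 2`, for example, a long vowel whose spectral slope pattern stays the same contributes only every second sample, while every sample at which the pattern changes is kept. Silent samples all carry the same `S_X` codes, so runs of silence are thinned in the same way.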

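The equal division with remainder correction and the per-division averaging done by feature calculation unit 30 can be sketched as follows. The function name `division_features` and its parameter names are hypothetical; the `ValueError` for too few samples is an addition of this sketch.

```python
from typing import List, Sequence


def division_features(frames: Sequence[Sequence[float]],
                      num_div: int) -> List[List[float]]:
    """Split resampled samples into num_div divisions and average each filter.

    The first r divisions receive one extra sample, where r is the
    remainder of len(frames) / num_div.
    """
    if num_div <= 0 or len(frames) < num_div:
        raise ValueError("need at least one sample per division")
    size, rem = divmod(len(frames), num_div)  # I / J = i, remainder r
    n_filters = len(frames[0])
    features: List[List[float]] = []
    k1 = 0
    for j in range(num_div):
        count = size + 1 if j < rem else size  # delta-i for division j
        block = frames[k1:k1 + count]
        features.append([sum(f[n] for f in block) / count
                         for n in range(n_filters)])
        k1 += count
    return features


if __name__ == "__main__":
    data = [[1, 10], [3, 30], [5, 50], [7, 70], [9, 90]]
    print(division_features(data, 2))
```

With five samples and two divisions, the first division averages three samples and the second averages two, matching the rule that the remainder is handed out one sample at a time from the first division.
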
[Table]

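The minimum-over-16-combinations dissimilarity $m^w(v_C)$ and its total over all divisions can be sketched as follows. Several points are assumptions, because the formulas and Table 1 are not reproduced in the text: the distance between two features is taken as a sum of absolute differences (suggested by absolute value operation unit 132 in FIG. 7), each overlapping feature is approximated as the unweighted mean of the division features involved, the total is taken as the sum over divisions, and variants that need a missing neighbour are simply skipped at the two ends. All function names are hypothetical.

```python
from itertools import product
from typing import Dict, List, Sequence

Feature = List[float]


def overlap_features(feats: Sequence[Feature], v: int) -> Dict[str, Feature]:
    """Variants C, LC, RC, LCR for division v (means over neighbouring divisions)."""
    def mean(idx: Sequence[int]) -> Feature:
        return [sum(feats[i][u] for i in idx) / len(idx)
                for u in range(len(feats[0]))]

    out = {"C": list(feats[v])}
    has_left, has_right = v > 0, v < len(feats) - 1
    if has_left:
        out["LC"] = mean([v - 1, v])
    if has_right:
        out["RC"] = mean([v, v + 1])
    if has_left and has_right:
        out["LCR"] = mean([v - 1, v, v + 1])
    return out


def division_dissimilarity(a: Sequence[Feature], b: Sequence[Feature],
                           v: int) -> float:
    """Smallest distance over all (input variant, registered variant) pairs."""
    fa, fb = overlap_features(a, v), overlap_features(b, v)
    return min(sum(abs(x - y) for x, y in zip(p, q))
               for p, q in product(fa.values(), fb.values()))


def total_dissimilarity(a: Sequence[Feature], b: Sequence[Feature]) -> float:
    """Sum of the per-division minima over all divisions."""
    return sum(division_dissimilarity(a, b, v) for v in range(len(a)))
```

Letting a division borrow from its neighbours allows a locally stretched or shrunk utterance to still find a close match, which plain division-by-division comparison cannot do.
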
[Table]

The dissimilarity to the input features is calculated for every registered feature. If exactly one registered feature falls at or below a predetermined threshold, the phrase corresponding to that registered feature is output as the recognition result. If none does, the target phrase is deemed absent and an unrecognizable (reject) result is output. If two or more do, second matching unit 70 operates next.

FIG. 7 is a detailed block diagram of first matching unit 60. 40 is the feature storage unit, 50 is the registered feature storage unit, 130 is an overlapping feature calculation unit for the input features, 131 is an overlapping feature calculation unit for the registered features, 132 is an absolute value operation unit, 133 and 137 are adders, 134, 136 and 138 are registers, 135 and 139 are comparators, and 140 is a first matching result storage unit. The first comparator 135 finds the minimum dissimilarity of a division, adder 137 computes the total dissimilarity over all divisions, and comparator 139 compares the total with a predetermined threshold; if the total is at or below the threshold, the number $w$ is stored in storage unit 140. A control unit (not shown) applies the decision conditions described above to the results for all registered words and outputs the result.

Second matching unit 70 calculates, pair by pair and division by division, the dissimilarity between the registered features corresponding to the candidate numbers $w$ stored in first matching result storage unit 140 (equivalent to $m^w(v_C)$ in the expression above). Starting from the largest dissimilarity, it selects $p$ division numbers ($p$ being a predetermined constant) together with the forms of the overlapping features used there. For those divisions only, and with the selected overlapping feature forms, it calculates the dissimilarity between the input features and each of the two registered features; the candidate whose dissimilarity turns out larger is deleted from first matching result storage unit 140. All pairs of candidates are examined in this way. If a candidate finally remains in storage unit 140, its number is the recognition result; if none remains, the result is unrecognizable. For example, with three candidate numbers α, β and γ, the three pairs (α, β), (α, γ) and (β, γ) are formed. First, for the registered features corresponding to α and β, the dissimilarity equivalent to $m^w(v_C)$ is examined for each division. For the divisions with a large dissimilarity, the dissimilarity between the input features and registered feature α and between the input features and registered feature β is calculated using the same overlapping feature forms, and the candidate with the larger value drops out. The same processing is applied to (α, γ) and (β, γ).

FIG. 8 is a detailed block diagram of the second matching unit. 40 is the feature storage unit, 50 is the registered feature storage unit, 150 is the per-division dissimilarity calculation unit of FIG. 7, 151 is a result storage unit, 152 is a second matching part selection unit, 153 is an adder, 154 and 155 are registers, and 156 is a comparator. The dissimilarity between registered features is calculated for each division, second matching part selection unit 152 picks the predetermined number of divisions with a large dissimilarity, and the result is transferred to the control unit. The control unit (not shown) takes out the input features and registered features according to this information, and dissimilarity calculation unit 150 (the lower one in FIG. 8) performs the calculation. The sum for one member of the pair is stored in register 155 and compared in comparator 156 with the sum for the other member, and the result is transferred to the control unit.

As described above, in the first embodiment, resampling the stationary parts lets the vowel data and the consonant data contribute to recognition to a similar degree (vowel parts generally last much longer than consonant parts), so well-balanced features can be extracted. Vowel parts also stretch and shrink greatly in time, but resampling suppresses this effect to some extent and the overlapping features deal with the rest; because the resampled data are normalized, differences in utterance power are handled as well. Furthermore, similar phrases can be recognized by discriminating them only on the parts where the registered features differ greatly (the dissimilar parts).

In the first embodiment the overlapping features are calculated for both the input features and the registered features, but calculating them for one side only, with no overlap on the other side, still copes adequately with the temporal stretching and shrinking of speech.

Sufficient effect is also obtained without using the overlapping feature $F_{LCR}$, which combines the division of interest with the divisions on both sides of it. For convenience of explanation, the sample data are first taken in and then resampled and normalized, but each sample may instead be processed as it is input.

Clearly, the overlapping features also need not be recalculated each time; they may be calculated once at the outset and stored as registered features.

In the second matching unit, the first embodiment uses a predetermined number of divisions, but the total dissimilarity may instead be calculated only over the divisions whose dissimilarity is large. Making the second matching unit multi-stage, with the number of divisions reduced step by step, is also effective, although it takes more processing time. Because the present invention has a resampling circuit, a normalization circuit and a partial matching circuit, it achieves sufficiently accurate recognition and can be used in speech recognition devices.
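
The decision rule applied to the first matching result (exactly one candidate at or below the threshold: output it; none: reject; two or more: hand over to the second matching unit) can be sketched as follows. The function name `first_stage` and the string labels it returns are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def first_stage(totals: Dict[str, float],
                threshold: float) -> Tuple[str, Optional[str], List[str]]:
    """Classify the first matching result.

    totals maps each registered word to its total dissimilarity.
    Returns (decision, recognized word or None, candidate list).
    """
    candidates = [w for w, m in totals.items() if m <= threshold]
    if not candidates:
        return "reject", None, []
    if len(candidates) == 1:
        return "accept", candidates[0], candidates
    return "second_stage", None, candidates


if __name__ == "__main__":
    print(first_stage({"one": 12.0, "two": 40.0}, threshold=20.0))
    print(first_stage({"one": 12.0, "two": 15.0}, threshold=20.0))
```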

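The pairwise elimination performed by second matching unit 70 can be sketched as follows. This is a simplified illustration: it compares plain division features and leaves out the overlapping feature forms, uses a sum of absolute differences as the distance (an assumption), and returns no result unless exactly one candidate survives (the patent does not say what happens on ties). The names `second_stage`, `dist` and `p` follow the description only loosely and are otherwise hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Optional, Sequence

Feature = List[float]


def dist(p: Feature, q: Feature) -> float:
    """Sum of absolute differences between two division features."""
    return sum(abs(x - y) for x, y in zip(p, q))


def second_stage(inp: Sequence[Feature],
                 registered: Dict[str, Sequence[Feature]],
                 candidates: List[str], p: int) -> Optional[str]:
    """Drop, for every candidate pair, the one farther from the input
    on the p divisions where the two registered patterns differ most."""
    alive = set(candidates)
    for a, b in combinations(candidates, 2):
        ra, rb = registered[a], registered[b]
        # Divisions ordered by how strongly the two registered patterns differ.
        order = sorted(range(len(inp)),
                       key=lambda v: dist(ra[v], rb[v]), reverse=True)
        picked = order[:p]
        da = sum(dist(inp[v], ra[v]) for v in picked)
        db = sum(dist(inp[v], rb[v]) for v in picked)
        if da > db:
            alive.discard(a)
        elif db > da:
            alive.discard(b)
    return next(iter(alive)) if len(alive) == 1 else None
```

Restricting the comparison to the divisions where two similar words actually differ keeps the parts they share from drowning out the small part that tells them apart.
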
[Brief description of the drawings]

FIG. 1 is a block diagram of a conventional speech recognition device, FIG. 2 is a block diagram of an embodiment of the present invention, FIG. 3 is a detailed block diagram of the resampling circuit, FIG. 4 is a detailed flowchart of the feature calculation unit, FIG. 5 is a block diagram of the same unit, FIG. 6 is an explanatory diagram of the overlapping features, FIG. 7 is a block diagram of the first matching unit, and FIG. 8 is a block diagram of the second matching unit.

1: microphone, 2: filter analysis unit, 3: power detection unit, 4: sample data storage memory, 5: speech section extraction unit, 6: feature extraction unit, 7: speech feature registration memory, 8: identification unit, 10: resampling circuit, 20: resampled data storage unit, 30: feature calculation unit, 40: feature storage unit, 50: registered feature storage unit, 60: first matching unit, 70: second matching unit, 100: one-sample data storage unit, 101: maximum value detection unit, 102: maximum value register, 103: comparator, 104: adder circuit, 105: sum register, 106: normalization unit, 107: normalized data storage unit, 108: difference polarity calculation unit, 109: current polarity register, 110: previous polarity register, 111: match detection unit, 112: match counting unit, 120: division unit calculation unit, 121: resampled data reference address control unit, 122: adder, 123: sum storage register, 124: average calculation unit, 130, 131: overlapping feature calculation units, 132: absolute value operation unit, 133, 137: adders, 135, 139: comparators, 134, 136, 138: registers, 140: first matching result storage unit, 150: per-division dissimilarity calculation unit, 151: result storage unit, 152: second matching part selection unit, 153: adder, 154, 155: registers, 156: comparator.

Claims

[Claims]

1. A speech recognition device comprising: sampling filter means that divides an input speech signal into a plurality of frequency components, samples them at fixed time intervals, and outputs them as primary speech data; silence detection means that compares the maximum value of the primary speech data at each sampling time point with a constant value to distinguish sound time points from silent time points; normalization means that normalizes the primary speech data at sound time points by the speech power; slope code generation means that, based on the output of the normalization means, classifies the difference values between primary speech data adjacent on the frequency axis and generates a slope identification code group according to that classification; stationarity detection means that distinguishes stationary time points from non-stationary time points by determining whether slope identification code groups adjacent on the time axis match each other completely; and resampling means that outputs secondary speech data in which the primary speech data of a certain number of consecutive silent time points are represented by those of one sampling time point, the primary speech data of each non-stationary time point are kept for that sampling time point, and the primary speech data of a certain number of consecutive stationary time points are represented by those of one sampling time point; wherein the speech section is divided equally into a specific number of sections and, for each divided section, the average of the secondary speech data belonging to that section is calculated as a first type feature, the average of the secondary speech data belonging to that section and the immediately preceding section is calculated as a second type feature, and the average of the secondary speech data belonging to that section and the immediately following section is calculated as a third type feature, and the dissimilarity between an input speech phrase and a registered speech phrase is calculated by selectively using these three types of features.