JP2008145996A

JP2008145996A - Speech recognition by template matching using discrete wavelet conversion

Info

Publication number: JP2008145996A
Application number: JP2006357183A
Authority: JP
Inventors: Shinji Karasawa; 信司唐澤
Original assignee: Individual
Current assignee: Individual
Priority date: 2006-12-11
Filing date: 2006-12-11
Publication date: 2008-06-26

Abstract

PROBLEM TO BE SOLVED: To provide a technique for recognizing speech using the discrete wavelet transform, which converts data by resolution, and template matching, which has a sharp selecting function.

SOLUTION: Phonemes are recognized by template matching in which a sample and the speech waveform to be recognized are each cut from a peak value over a time range of about the pitch of vocal cord vibration, normalized by the maximum amplitude, and converted into discrete wavelet coefficients, which are used as the feature vector. Syllables are recognized by template matching in which a short-syllable sample and the speech to be recognized are sampled over the same time range, and the per-scale sums of the absolute values of the discrete wavelet coefficients of the amplitude-normalized waveforms are used as feature vectors. Phonemes in continuous speech are segmented using the fact that the ratio of the per-scale sums of absolute values of the principal discrete wavelet coefficients approaches 1 in phoneme transition regions.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a technique for recognizing speech by performing template matching at various resolutions, using the discrete wavelet transform, which converts data into a form arranged by resolution.

At present, keyboard input serves as the human interface for computers, which have become widespread through miniaturization, higher performance, and lower prices. Keyboard input, however, is less natural to humans than voice input. In addition, with the recent spread of mobile phones, there is demand for speech recognition technology that is specialized to an individual and usable without restrictions of time or place. The current mainstream of speech recognition, however, is aimed at mass-produced systems that recognize the speech of unspecified speakers through heavy use of statistical processing, and the challenge there has been to extract universal features from individually distinctive voices.

Speaker-specific speech recognition can be realized by template matching that uses the individual's own voice as the templates. Template matching can generally be given sharp selectivity, but that high selectivity reduces the matching tolerance. If the data is instead converted by the discrete wavelet transform, which is used for image compression in JPEG 2000 and elsewhere, into data arranged by resolution, template matching can be performed at the resolution best suited to extracting the target information.

Speech recognition performed directly by template matching has conventionally been considered impractical. The wavelet transform has been applied to speech recognition before, but those attempts took a statistical viewpoint, for example substituting the wavelet transform for the Fourier transform, and their results have not been highly regarded. The present invention performs speech recognition using the discrete wavelet transform of speech as preprocessing for template matching.

The template-matching speech recognition algorithm of the present invention is based on a model of the brain's neural network. That is, the brain is organized from nerve cells that perform template matching, acting as very sharply selective filters of many kinds and multiple resolutions. Speech is two-dimensional data whose frequency components vary over time, and when the activity of the multiple components accompanying speech is transferred through the neural network as impulses, pattern-like data appears. When that impulse pattern matches the pattern present when the connections were formed, the nerve cell fires an impulse again. The impulse is a unit of activity: sensory cells become active, nerve cells become active, and muscle cells are activated in turn, and the meaning of each unit of activity is carried by the outside world and by the activity of each cell itself. In living organisms information is processed by transferring these impulse units of activity, and speech recognition can likewise be quantized into impulse-like units of activity and processed digitally.

The task in recognition processing is to collect the information that is needed and remove what is not, so as to improve discrimination while keeping the processing economical and fast. In template-matching recognition, the choice of feature vector used in the templates determines the behavior. Performing speech recognition by template matching therefore requires guidelines for designing the algorithm: which speech features to extract and how, including how to normalize the pattern data being compared.

Actual speech varies considerably from one utterance to the next. If template matching is applied to detailed features over a wide range, the number of templates grows and processing becomes difficult. Therefore, as shown in FIG. 1, processing is divided into feature extraction over a narrow time range (phonemes), feature extraction over a medium time range (syllables), and phoneme segmentation. Data collected from sample speech by these several methods is compared, by template matching, with data collected in the same way from newly input speech.

Recognition processing is matched to the characteristics of speech production. For speech on the order of 10 milliseconds, the pitch period of vocal cord vibration, phonemes are analyzed using the wavelet coefficients as the components of the template feature vector. For speech waveforms spanning a short-syllable utterance period of about 200 milliseconds, the unit of movement of the vocal organs, recognition uses template matching whose feature vector is the per-scale component amount of the wavelet coefficients, which corresponds to the amount of variation in each period band.

To convert the signal into data arranged by resolution and, within each resolution, in order of position, the Haar discrete wavelet transform is used; its output is arranged by scale and in order of position. As shown in FIG. 2, removing the wavelet coefficients of the high-resolution scales step by step and applying the inverse transform yields progressively lower-resolution data, as with a still image. The Haar wavelet transform uses as its mother wavelet a pair of positive and negative rectangles that is zero outside the time slot. A wavelet coefficient is obtained by inverting the sign of the second half of the data in the time slot and summing, so the transform is fast to compute. The number of data points processed by this transform must, however, be a power of two.
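The transform described in this paragraph can be sketched in Python as follows. This is a minimal illustration and not the inventor's VBA program: the function names `haar_dwt` and `haar_idwt` are hypothetical, and the division by 2 at each step (averaging and half-differencing) is an assumed normalization that the specification does not state. The `keep` argument zeroes the finer scales before inversion, which mimics the stepwise lowering of resolution shown in FIG. 2.

```python
def haar_dwt(samples):
    """Return (overall mean, detail coefficients grouped by scale, coarsest first)."""
    n = len(samples)
    if n == 0 or n & (n - 1):
        raise ValueError("number of samples must be a power of two")
    approx = list(samples)
    scales = []  # collected finest scale first, reversed before returning
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Detail: first half minus second half of each slot (sign-inverted sum).
        scales.append([(a - b) / 2 for a, b in pairs])
        # Approximation: average of each slot, passed on to the next coarser scale.
        approx = [(a + b) / 2 for a, b in pairs]
    scales.reverse()
    return approx[0], scales


def haar_idwt(mean, scales, keep=None):
    """Invert haar_dwt; if keep is given, only the keep coarsest scales are used."""
    approx = [mean]
    for level, details in enumerate(scales):
        if keep is not None and level >= keep:
            details = [0.0] * len(details)  # drop this (finer) scale
        approx = [v for a, d in zip(approx, details) for v in (a + d, a - d)]
    return approx


if __name__ == "__main__":
    mean, scales = haar_dwt([4, 2, 6, 8])
    print(mean, scales)
    print(haar_idwt(mean, scales, keep=1))
```

With all scales kept, `haar_idwt` restores the input exactly; with fewer scales it returns a blockier version of the same waveform.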

In the decision step of template matching, the degree of agreement is evaluated by what this specification calls the Hamming distance, namely the sum of the absolute values of the differences (more commonly called the L1 or city-block distance). It takes less time to compute than the Euclidean distance, and differences between candidates show up more clearly in it.
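As a rough illustration of this decision step, the sketch below compares the sum of absolute differences with the Euclidean distance and picks the nearest template. The names `abs_diff_distance`, `euclidean_distance`, and `best_match`, and the dictionary layout of the templates, are assumptions made for the example and do not come from the specification.

```python
import math


def abs_diff_distance(u, v):
    """Sum of absolute differences between two feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean_distance(u, v):
    """Euclidean distance, shown only for comparison."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def best_match(feature, templates):
    """Return the label of the template closest to feature."""
    return min(templates, key=lambda label: abs_diff_distance(feature, templates[label]))


if __name__ == "__main__":
    templates = {"a": [1.0, 0.0], "i": [0.0, 1.0]}
    print(best_match([0.9, 0.1], templates))
```

The absolute-difference form needs no multiplication or square root, which is consistent with the shorter computation time the specification mentions.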

Speech as actually uttered requires more templates than there are phonetic symbols. Shorter templates can be shared more widely, so recognition uses templates that are as short as possible.

For phoneme recognition, template matching is applied to speech waveforms about one pitch period of vocal cord vibration long. The waveform is cut out starting at a peak value, with a length no longer than the pitch period as the processing unit, and phonemes are recognized by template matching with the wavelet coefficients as the feature vector components.
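The phoneme-level feature extraction in this paragraph might look like the following sketch. Several details are assumptions rather than statements from the specification: the peak is taken as the largest-magnitude sample that still leaves room for a full frame, the default frame of 64 samples corresponds to 6.4 milliseconds only if the sampling interval is 0.1 millisecond, the overall mean is omitted from the feature vector, and the names `haar_details` and `phoneme_feature` are hypothetical.

```python
def haar_details(frame):
    """Flat list of Haar detail coefficients, coarsest scale first."""
    n = len(frame)
    if n < 2 or n & (n - 1):
        raise ValueError("frame length must be a power of two (at least 2)")
    approx = list(frame)
    by_scale = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        by_scale.append([(a - b) / 2 for a, b in pairs])
        approx = [(a + b) / 2 for a, b in pairs]
    return [c for scale in reversed(by_scale) for c in scale]


def phoneme_feature(signal, frame_len=64):
    """Cut frame_len samples starting at the peak, normalize, and transform."""
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    # Assumed peak rule: largest |sample| among positions that fit a whole frame.
    start = max(range(len(signal) - frame_len + 1), key=lambda i: abs(signal[i]))
    frame = signal[start:start + frame_len]
    top = max(abs(x) for x in frame)
    if top == 0:
        raise ValueError("frame is silent and cannot be normalized")
    # Normalize so the maximum amplitude in the frame is 1.
    return haar_details([x / top for x in frame])


if __name__ == "__main__":
    print(phoneme_feature([0, 1, 8, 4, 2, 6, 0, 0], frame_len=4))
```

Because of the amplitude normalization, the same waveform recorded at a different loudness yields the same feature vector.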

If the processing unit is extended to cover a whole syllable, it includes components that stretch and shrink with each utterance, and a very large number of templates becomes necessary. The mora, the unit of utterance in Japanese (as in "i-ro-ha ..."), is uttered in about 200 msec. Short syllables are therefore recognized by template matching whose feature vector components are the per-scale sums of the absolute values of the wavelet coefficients of the waveform of a whole, quickly uttered short syllable; these sums correspond to the component amount in each period band.

As preprocessing for syllable feature extraction, samples are collected in short-syllable units by uttering each syllable quickly and separately, and the amplitude is normalized so that the maximum amplitude within the collected section is 1. The data to be compared is taken from the input speech with the same frame length as the sample and normalized in the same way before its feature vector is computed.
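The syllable-level feature, the per-scale sum of absolute wavelet coefficients of an amplitude-normalized frame, can be sketched as below. The function name `scale_component_amounts` is hypothetical, and the halving at each Haar step is an assumed normalization. A frame of 2,048 samples gives an 11-component feature vector, since 2,048 is 2 to the 11th power.

```python
def scale_component_amounts(frame):
    """Per-scale sums of |Haar detail coefficients|, coarsest scale first."""
    n = len(frame)
    if n < 2 or n & (n - 1):
        raise ValueError("frame length must be a power of two (at least 2)")
    top = max(abs(x) for x in frame)
    if top == 0:
        raise ValueError("frame is silent and cannot be normalized")
    approx = [x / top for x in frame]  # maximum amplitude becomes 1
    amounts = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # One number per scale: the summed magnitude of that scale's details.
        amounts.append(sum(abs((a - b) / 2) for a, b in pairs))
        approx = [(a + b) / 2 for a, b in pairs]
    amounts.reverse()
    return amounts


if __name__ == "__main__":
    print(scale_component_amounts([8, 4, 2, 6]))
```

Summing magnitudes within each scale discards the position of each coefficient, which is what makes this feature less sensitive to how a syllable stretches or shrinks between utterances.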

Phonemes are hard to recognize in the regions of continuous speech where one phoneme changes into another. The per-scale component amounts of the discrete wavelet transform are therefore computed, and phoneme transitions are detected from their ratio. FIG. 3 shows the waveform of "a-i-u-e-o" uttered continuously, with "i" uttered briefly. FIG. 4 shows the ratios of the principal per-scale component amounts of the discrete wavelet transform of this speech. In FIG. 4, the ratio of the principal per-scale component amounts approaches 1 in the regions where the phoneme changes. This method can be used to study the segmentation of syllables in continuous speech.
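A possible reading of this segmentation rule is sketched below. The specification does not say which scales count as principal or how close to 1 the ratio must be, so the indices `scale_a` and `scale_b`, the `tolerance` of 0.2, and the function name `transition_frames` are all assumptions for illustration. The input is a list of per-scale component amounts, one list per analysis frame.

```python
def transition_frames(amounts_per_frame, scale_a, scale_b, tolerance=0.2):
    """Return indices of frames whose chosen scale ratio lies within tolerance of 1."""
    flagged = []
    for index, amounts in enumerate(amounts_per_frame):
        a, b = amounts[scale_a], amounts[scale_b]
        if a == 0 or b == 0:
            continue  # ratio undefined or meaningless for an empty scale
        if abs(a / b - 1.0) <= tolerance:
            flagged.append(index)
    return flagged


if __name__ == "__main__":
    frames = [[3.0, 1.0], [1.1, 1.0], [1.0, 3.0]]
    print(transition_frames(frames, scale_a=0, scale_b=1))
```

In steady vowel regions one scale tends to dominate, so the ratio stays far from 1; frames flagged here would be candidate boundaries to check against the phoneme-level matching results.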

If the phonological symbol string obtained by recognizing speech is converted into a registration number, and a system for expanding it again is constructed, speech information can be compressed for transmission or storage. A digital signal can be narrowed down to one line by an AND circuit and expanded into multiple components by an OR circuit, so the pattern of the collected digital data can be converted by digital circuits. The inventor's writable bidirectional logic circuit (Japanese Patent Application No. 2004-217828) is convenient for this purpose. In that circuit, the output encoding is written from the reverse direction as a decoder, and the resulting set of conduction points is used as the encoder.

Current high-performance personal computers and workstations have 64-bit registers, so the technical environment for processing spoken language is already in place. An 8-bit digital signal can distinguish 256 phonetic symbols. If a word is expressed with 8 phonetic symbols, its phonetic data occupies 8 × 8 = 64 bits. If words are input in 64-bit units and 8,192 words are written, the matrix has 8,192 × 64 = 524,288 points. Identifying each of the 8,192 words by a registration number requires 13 bits. The output registration number can be assigned by a registration number counter. If the output is expressed with nine characters from a 128-character, 7-bit set, the character output is 63 bits.
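The figures in this paragraph can be checked with a few lines of arithmetic. The constant and variable names below are chosen for the example only.

```python
PHONETIC_SYMBOL_BITS = 8   # bits per phonetic symbol
SYMBOLS_PER_WORD = 8       # phonetic symbols per word
REGISTERED_WORDS = 8192    # number of words written into the matrix

distinct_symbols = 2 ** PHONETIC_SYMBOL_BITS                 # symbols 8 bits can distinguish
word_bits = PHONETIC_SYMBOL_BITS * SYMBOLS_PER_WORD          # input width per word
input_matrix_points = REGISTERED_WORDS * word_bits           # words x input bits
registration_bits = (REGISTERED_WORDS - 1).bit_length()      # bits to number every word
character_output_bits = 9 * 7                                # nine 7-bit characters

print(distinct_symbols, word_bits, input_matrix_points,
      registration_bits, character_output_bits)
```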

Inside an information processing device, information can be processed in compressed form by applying pattern matching hierarchically, without converting speech information into text. That is, phoneme identification information is converted into word-level registration numbers, and a string of word-level registration numbers is in turn converted into a sentence-level registration number. In the reverse direction, a sentence-level registration number is restored to the original string of word-level registration numbers, and each word-level registration number is converted into a phonological symbol string. If one word is specified by 13 bits, a combination of nine words (13 × 9 = 117 bits) is input as one sentence, and 8,192 sentences are registered, the input-side matrix has 8,192 × 117 = 958,464 points. The 8,192 sentences can be distinguished by 13-bit numbers, so the output-side matrix has 8,192 × 13 = 106,496 elements. A circuit of this scale can be realized as a semiconductor integrated circuit.
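The two-level conversion between symbol strings and registration numbers can be illustrated in software as follows. This is a dictionary-based sketch under assumed names (`Registry`, `encode_sentence`, `decode_sentence`); the specification describes a hardware matrix, not this data structure. Registration numbers are assigned by a running counter, in line with the registration number counter mentioned in the text.

```python
class Registry:
    """Assigns consecutive registration numbers to sequences and maps them back."""

    def __init__(self):
        self._number_of = {}
        self._sequence_of = []

    def register(self, sequence):
        key = tuple(sequence)
        if key not in self._number_of:
            # The counter value is simply the number of entries registered so far.
            self._number_of[key] = len(self._sequence_of)
            self._sequence_of.append(key)
        return self._number_of[key]

    def lookup(self, number):
        return self._sequence_of[number]


def encode_sentence(sentence, words, sentences):
    """sentence: list of words, each a list of phoneme symbols -> one number."""
    word_numbers = [words.register(word) for word in sentence]
    return sentences.register(word_numbers)


def decode_sentence(number, words, sentences):
    """Expand a sentence number back into its words' phoneme symbol lists."""
    return [list(words.lookup(w)) for w in sentences.lookup(number)]


if __name__ == "__main__":
    words, sentences = Registry(), Registry()
    n = encode_sentence([["k", "o"], ["n", "i"]], words, sentences)
    print(n, decode_sentence(n, words, sentences))
```

A whole sentence is then stored or transmitted as a single small integer, and the symbol strings are recovered only when needed.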

The present invention was studied on a notebook computer fitted with an AD conversion card, with the data before and after processing displayed in Microsoft Office Excel. The software is written in Visual Basic for Applications (VBA) and can be used as embedded software in the input section of a personal computer or robot.

To process a large volume of template matching at high speed, the templates can be written into hardware and compared in parallel. The system of the present invention is best built as a programmable semiconductor integrated circuit. In such a circuit, the presence of a unit of activity is transferred as the presence or absence of charge, and the charge is transferred by CCD or dynamic MOS IC circuits. Circuits that realize data conversion by transferring units of activity include the inventor's impulse electronic device (Japanese Patent No. 3496065) and the inventor's writable bidirectional logic circuit (Japanese Patent Application No. 2004-217828).

The method of carrying out the present invention is described below through examples of speaker-dependent speech recognition by template matching using the discrete wavelet transform.

Speech uttered in the same way matches, so the samples need only be waveform segments taken in the same way. FIG. 5 shows the Hamming distances from template matching against five waveforms, each 12.8 msec long, cut from the continuously uttered "a-i-u-e-o" waveform of FIG. 3 itself. FIG. 5 shows that for continuous speech the degree of agreement with each template also varies continuously.

A shorter waveform segment means less data per processing unit and a shorter processing time. Fewer templates likewise shorten the processing time on a computer. To that end, the sample templates used for matching are made short and low in resolution. FIG. 6 shows the Hamming distances from template matching with 15 feature vectors taken from 6.4-millisecond segments cut from the continuously uttered "a-i-u-e-o" waveform itself. FIG. 6 shows that the vowels can be distinguished even under these low-selectivity matching conditions.

A large number of templates lengthens the processing time, so it is desirable to collect vowel samples that can be used in common. The pitch of vocal cord vibration differs with the vowel and with the manner of utterance, and when the pitch changes the waveform changes too. FIG. 7 shows how the pitch varies when "ū-u, ē-e, ō-o" is uttered.

As an example in which recognition processing time is greatly reduced, speaker-dependent phoneme recognition by template matching with the discrete wavelet transform is shown below, using as common samples waveform segments of 6.4 milliseconds cut, starting at the peak value, from speech waveforms with a pitch of 9.5 milliseconds. The example also serves to explain guidelines for building this recognition method.

FIG. 8 shows the template matching distances between "a-i-u-e-o" uttered quickly and continuously and vowel samples taken from the utterance "ā-a, ī-i, ū-u, ē-e, ō-o". The five vowel samples were each cut as 6.4-millisecond segments from waveforms with a 9.5-millisecond pitch in that utterance. The "i" in the input speech was uttered briefly and is not recognized by the "i" sample taken from "ī-i".

FIG. 9 shows the template matching distances between 6.4-millisecond segments cut from the peak value of the utterance "ka-ki-ku-ke-ko" and the same vowel samples used for FIG. 8. The vowel of "ko" is recognized as the "u" sample. It would presumably be recognized correctly if the utterance of "ko" were ended sooner. Even a single phonetic symbol thus requires multiple templates. The K sample waveform was obtained by cutting 6.4 milliseconds from the leading portion of the "ki" waveform that is characteristic of the consonant. The leading-region waveform taken as the feature of K has no vocal cord vibration, and a consonant depends on the utterance movement as a whole, so its features should be extracted from a speech segment widened to the short syllable.

FIG. 10 shows the template matching distances between 6.4-millisecond segments cut from the utterance "sa-shi-su-se-so" and the same vowel samples as in FIG. 7. The vowel of "shi" is recognized as several different vowels. The utterance of "し" is considered to be "shi" rather than "si". The S sample is a 6.4-millisecond waveform cut from the leading portion of the "shi" waveform that is characteristic of the consonant. As with K, the leading-region waveform taken as the feature of S has no vocal cord vibration, and the consonant depends on the utterance movement as a whole.

In template matching with per-scale component amounts as the feature vector, the components correspond to the amount of variation in each period band, so low-amplitude portions have little influence. Template matching was therefore tried on syllable waveform samples from the utterance "a, ka, sa, ta, na", using a section of 409.6 milliseconds that fully covers each syllable. As FIG. 11 shows, shifting the cut-out position within the same speech by only 0.8 milliseconds already affects the result.

FIG. 12 shows template matching against normally uttered syllables, using briefly uttered short-syllable units of 204.8 milliseconds as samples and the per-scale component amounts over that range as the feature vector. The syllable is recognized in the central part and the vowel in the final part. When matching against speech samples of short-syllable length, the frame need not be shifted as frequently as in FIG. 12.
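The frame-shifting procedure behind FIG. 12 can be sketched as a sliding window that classifies each frame by its nearest template. The defaults of 2,048 samples per frame and a hop of 64 samples correspond to 204.8 milliseconds and 6.4 milliseconds only under an assumed sampling interval of 0.1 millisecond, which the specification does not state; the function names and the treatment of silent frames as all-zero features are likewise assumptions.

```python
def scale_component_amounts(frame):
    """Per-scale sums of |Haar detail coefficients| of the normalized frame."""
    n = len(frame)
    if n < 2 or n & (n - 1):
        raise ValueError("frame length must be a power of two (at least 2)")
    top = max(abs(x) for x in frame)
    if top == 0:
        return [0.0] * (n.bit_length() - 1)  # silent frame: no content at any scale
    approx = [x / top for x in frame]
    amounts = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        amounts.append(sum(abs((a - b) / 2) for a, b in pairs))
        approx = [(a + b) / 2 for a, b in pairs]
    amounts.reverse()
    return amounts


def abs_diff_distance(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def match_frames(signal, templates, frame_len=2048, hop=64):
    """Return (start index, best label) for each frame position."""
    results = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        feature = scale_component_amounts(signal[start:start + frame_len])
        label = min(templates, key=lambda name: abs_diff_distance(feature, templates[name]))
        results.append((start, label))
    return results


if __name__ == "__main__":
    templates = {
        "smooth": scale_component_amounts([1, 1, -1, -1]),
        "rough": scale_component_amounts([1, -1, 1, -1]),
    }
    print(match_frames([2, 2, -2, -2, 3, -3, 3, -3], templates, frame_len=4, hop=4))
```

Increasing `hop` reduces the number of comparisons, in keeping with the remark that the frame need not be shifted as often as in FIG. 12.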

The gist of the present invention has been described above. The present invention is not limited to the conditions of the examples shown in the drawings, and changes and additions that do not depart from its gist are included in it. Because speech and the uses of speech recognition are both diverse, a practical implementation should be constructed concretely using the matters described in this specification as guidelines.

Examples of applications of the speech-input system of the present invention include speech information compression, voice entry of orders in restaurants and similar settings, voice typewriters, speech correction devices, automatic translation telephones, industrial robots, and nursing-care robots.

FIG. 1 is a diagram showing the configuration of a speech recognition system that combines the discrete wavelet transform with template matching and makes its decision by comparing data collected in multiple ways from the input speech with the corresponding data of the samples.
FIG. 2 is a diagram showing the multiresolution property of the discrete wavelet transform through waveforms obtained by inverse transformation after removing the wavelet coefficients of the high-resolution scales step by step.
FIG. 3 is a diagram showing the waveform of "a-i-u-e-o" uttered continuously with a brief "i", used in the examples of speech analysis and recognition with the discrete wavelet transform.
FIG. 4 is a diagram showing that phoneme transitions can be detected from the ratios of the per-scale component amounts of the discrete wavelet transform of the speech shown in FIG. 3.
FIG. 5 is a diagram showing the Hamming distances from template matching against five sample waveforms of 12.8 msec cut from the speech waveform shown in FIG. 3.
FIG. 6 is a diagram showing the template matching distances for the speech waveform shown in FIG. 3 against five low-resolution waveforms of 6.4 msec cut from the same speech.
FIG. 7 is a diagram showing the pitch variation when "ū-u, ē-e, ō-o" is uttered, which was referred to when 9.5 msec was adopted as the common pitch.
FIG. 8 is a diagram showing the template matching distances for the speech shown in FIG. 3 against five low-resolution waveforms of 6.4 msec cut from waveforms with a pitch of 9.5 msec, some of which are shown in FIG. 6.
FIG. 9 is a diagram showing the template matching distances for the utterance "ka-ki-ku-ke-ko" against five low-resolution waveforms of 6.4 msec cut from waveforms with a pitch of 9.5 msec.
FIG. 10 is a diagram showing the template matching distances for the utterance "sa-shi-su-se-so" against five low-resolution waveforms of 6.4 msec cut from waveforms with a pitch of 9.5 msec.
FIG. 11 is a diagram showing the effect of a 0.8 msec time shift on the template matching distances obtained with the per-scale component amounts of the same syllable [409.6 msec] as the feature vector.
FIG. 12 is a diagram showing that syllables can be recognized by the Hamming distance of template matching with the per-scale component amounts of short-syllable units as the feature vector, shifting a 204.8 msec short-syllable frame every 6.4 msec.

Claims

1. A method of recognizing phoneme-level speech, in which speech is cut out from a peak value over a time range of about the pitch of vocal cord vibration, the waveform in that range is normalized by its maximum amplitude and converted into discrete wavelet coefficients, and a distance measure to a sample is obtained by template matching with the wavelet coefficients as the feature vector.

2. A method of recognizing short-syllable-level speech, in which a speech waveform segment is collected over a time range of about one short syllable, the waveform in that range is normalized by its maximum amplitude and converted into discrete wavelet coefficients, and a distance measure to a sample is obtained by template matching with the per-scale sums of the absolute values of the discrete wavelet coefficients as the feature vector.

3. A method of segmenting phoneme transitions in continuous speech, using the fact that the ratio between scales of the sums of the absolute values of the discrete wavelet coefficients of the principal scales constituting the speech approaches 1 in phoneme transition regions.

4. A method of recognizing speech that includes two or more of claims 1, 2, and 3, in which data representing speech features is collected in multiple ways and template matching distance measures to the sample data are obtained for the multiply collected data.

5. A software system or electronic circuit device having a function of transmitting, storing, or operating on speech, characterized by any of claims 1, 2, 3, and 4.