JP2006343544A

JP2006343544A - Voice recognition method

Info

Publication number: JP2006343544A
Application number: JP2005169217A
Authority: JP
Inventors: Takashi Nakayama; 隆中山
Original assignee: Miyazaki Prefecture
Current assignee: Miyazaki Prefecture
Priority date: 2005-06-09
Filing date: 2005-06-09
Publication date: 2006-12-21
Anticipated expiration: 2025-06-09
Also published as: JP4890792B2

Abstract

<P>PROBLEM TO BE SOLVED: To discriminate voices by a defined reference irrespective of the level of audio signal. <P>SOLUTION: A ratio of amplitude or power of a fundamental wave and each higher harmonic component to the total sum of amplitude or power of the fundamental wave and each higher harmonic component included in a voice frequency region is acquired as a contribution ratio, and the phonemes of a consonant and a vowel are specified by the manner in the presence of the contribution ratio not affected by the level of the audio signal. <P>COPYRIGHT: (C)2007,JPO&INPIT

Description

本発明は、話者の音声から、簡便な処理装置を用いて言語を認識することができる音声認識方法に関する。さらに詳しくは、音声信号の大小に拘わらず同じ基準で分析することができる音声認識方法に関する。 The present invention relates to a speech recognition method capable of recognizing a language from a speaker's voice using a simple processing device. More specifically, the present invention relates to a speech recognition method that can perform analysis based on the same reference regardless of the size of a speech signal.

従来、音声認識方法としては、音声波形から母音領域と子音領域を分別し、分別された母音領域の波形と子音領域の波形から当該母音と子音を特定して認識できるようにする方法が提案されている（例えば、特許文献１参照）。 Conventionally, as a speech recognition method, a method has been proposed in which a vowel region and a consonant region are separated from a speech waveform, and the vowel and consonant can be identified and recognized from the waveform of the separated vowel region and the waveform of the consonant region. (For example, refer to Patent Document 1).

また、上記母音と子音の特定方法として、分別された母音領域について、音声波形の音声信号レベルが電圧ゼロボルトを通過してから正電圧領域を推移して再び電圧ゼロボルトを通過するまでの時間を検知して当該母音を特定し、分別された子音領域について、音声波形の音声信号レベルが電圧ゼロボルトを通過または電圧ゼロボルト近傍から上昇後、正電圧領域を推移して再び電圧ゼロボルトを通過または電圧ゼロボルト近傍に達するまでの時間を検知して当該子音を特定する方法も提案されている（例えば、特許文献２参照）。 In addition, as a method for identifying the vowel and consonant, for the separated vowel region, the time from when the speech signal level of the speech waveform passes through the voltage zero volt until it passes through the positive voltage region and again passes the voltage zero volt is detected. The vowel is identified, and for the separated consonant region, the voice signal level of the speech waveform passes through the voltage zero volt or rises from near the voltage zero volt, then moves through the positive voltage region and again passes through the voltage zero volt or near the voltage zero volt. There has also been proposed a method for identifying the consonant by detecting the time required to reach (see, for example, Patent Document 2).

特開平９−１０１７９７号公報Japanese Patent Laid-Open No. 9-101797 特開２００１−２６５３７９号公報JP 2001-265379 A

上記従来の音声認識方法は、いずれも母音と子音を分けて識別しようとするものであるが、マイクロホンなどから採取した音声波形そのものを基準に識別するものとなっている。このため、特に声の大小（音声信号の大小）の影響を受けやすく、日常会話などの条件が不定な環境下では正確な識別が行いにくい問題がある。 All of the above conventional speech recognition methods attempt to identify vowels and consonants separately, but identify them based on the speech waveform itself collected from a microphone or the like. For this reason, there is a problem that accurate discrimination is difficult in an environment in which conditions such as daily conversation are uncertain because it is easily affected by the magnitude of the voice (the magnitude of the audio signal).

本願発明は、上記従来の音声認識法の問題点に鑑みてなされたもので、音声信号の大小に拘わらず一定の基準で音声を識別できるようにすることを目的とする。 The present invention has been made in view of the problems of the above-described conventional speech recognition method, and an object thereof is to make it possible to identify speech on a constant basis regardless of the size of the speech signal.

本発明は、上記目的のために、音声信号からサンプリングされＡ／Ｄ変換された音声データ群を周波数分析し、これによって得られる振幅スペクトルまたはパワースペクトルから、音声周波数領域に含まれる基本波および各高調波成分の振幅またはパワーの合計に対する、基本波および各高調波成分のそれぞれの振幅またはパワーの比率を寄与率として求め、この寄与率の現れ方から、子音と母音の音素を特定することを特徴とする音声認識方法を提供するものである。 For the above purpose, the present invention performs frequency analysis on audio data groups sampled from an audio signal and A / D-converted, and from the amplitude spectrum or power spectrum obtained thereby, the fundamental wave included in the audio frequency domain and each The ratio of the amplitude or power of each of the fundamental wave and each harmonic component to the sum of the amplitude or power of the harmonic components is obtained as a contribution rate, and the phoneme of the consonant and the vowel is specified from the appearance of this contribution rate. A characteristic speech recognition method is provided.

また、上記本発明は、音声データ群を子音領域と母音領域に区分し、子音領域の音声データ群と母音領域の音声データ群をそれぞれ周波数分析して寄与率を求め、各音声データ群における寄与率の現れ方から、子音と母音の音素を特定すること、
寄与率として、音声周波数領域に含まれる基本波および各高調波成分の振幅の合計に対する基本波および各高調波成分のそれぞれの振幅の比率を用いること、
音声データ群に対し、Ｎ個の音声データの分析区間毎に順次周波数分析を施し、各分析区間毎に寄与率を求めること、
音声データ群に対して窓関数処理を施してから周波数分析を施すこと、
窓関数処理にハミング窓を用い、周波数分析に高速フーリエ解析を用いること、
をその好ましい態様として含むものである。 In the present invention, the speech data group is divided into a consonant region and a vowel region, and the contribution rate is obtained by frequency analysis of the speech data group of the consonant region and the speech data group of the vowel region, and the contribution in each speech data group Identifying the phonemes of consonants and vowels from the way the rate appears,
As a contribution rate, using the ratio of the amplitude of each of the fundamental wave and each harmonic component to the sum of the amplitudes of the fundamental wave and each harmonic component included in the audio frequency domain,
The frequency analysis is sequentially performed for each analysis section of the N pieces of voice data for the voice data group, and a contribution rate is obtained for each analysis section.
Performing frequency analysis after performing window function processing on the audio data group,
Use a Hamming window for window function processing, use fast Fourier analysis for frequency analysis,
Is included as a preferred embodiment thereof.

本発明の音声認識方法は、寄与率を用いて子音と母音の特定を行うものとなっている。 The speech recognition method of the present invention specifies consonants and vowels using the contribution rate.

ところで、本発明における寄与率は、音声周波数領域に含まれる基本波および各高調波成分の振幅の合計に対する基本波および各高調波成分のそれぞれの振幅の比率、または、音声周波数領域に含まれる基本波および各高調波成分のパワーの合計に対する基本波および各高調波成分のそれぞれのパワーの比率である。そして、寄与率は、上記のような比率であることから、音声信号の大小に影響を受けることのない値であり、本発明は、この寄与率を基準に音声認識を行うものであることから、音声信号の大小に拘わらず、確度の高い識別を行うことができるものである。 By the way, the contribution rate in the present invention is the ratio of the amplitude of each of the fundamental wave and each harmonic component to the sum of the amplitudes of the fundamental wave and each harmonic component contained in the speech frequency region, or the fundamental contained in the speech frequency region. The ratio of the power of each of the fundamental wave and each harmonic component to the total power of the wave and each harmonic component. Since the contribution rate is the ratio as described above, it is a value that is not affected by the magnitude of the audio signal, and the present invention performs speech recognition based on this contribution rate. Therefore, it is possible to identify with high accuracy regardless of the size of the audio signal.

本発明に係る音声識別方法の基本的な手順を説明する。なお、ここでの説明は、説明の便宜上、被験者が日本語における五十音の一音を発してこれをサンプリングした場合を例にして説明する。 The basic procedure of the speech identification method according to the present invention will be described. In addition, description here demonstrates as an example the case where a test subject has sampled and emitted one Japanese syllabary sound for convenience of explanation.

まず、本発明に係る音声認識方法の一例を図１に基づいて説明する。 First, an example of the speech recognition method according to the present invention will be described with reference to FIG.

音声信号は、例えばマイクロホンなどによりアナログ信号として採取し、必要に応じて増幅したりフィルター処理を加えた後、サンプリングし、Ａ／Ｄ変換して、一旦メモリーに音声データ群として記録する。 The audio signal is collected as an analog signal using, for example, a microphone, amplified or filtered as necessary, sampled, A / D converted, and temporarily recorded as an audio data group in a memory.

音声を認識する上で分析が必要な周波数は、言語によっても多少異なるが、例えば日本語においては、５〜５．５ｋＨｚ程度までは必要であろうと考えられる。また、連続信号に含まれる周波数成分を正しくサンプリングデータとして得るには、サンプリング周波数が連続信号の持つ周波数の上限の２倍以上でなければならないとされていることから、サンプリング周波数は１０ｋＨｚ以上であることが好ましい。後述する具体例では５０ｋＨｚでサンプリングを行っているが、現実的にはこれほどの高い周波数とする必要はない。また、予めサンプリング周波数の１／２を超える周波数成分をフィルター（ローパスフィルター）でカットしておくことが好ましい。 The frequency that needs to be analyzed for recognizing speech differs somewhat depending on the language, but for example, in Japanese, it may be necessary to have a frequency of about 5 to 5.5 kHz. In addition, in order to correctly obtain the frequency component included in the continuous signal as sampling data, the sampling frequency must be at least twice the upper limit of the frequency of the continuous signal, so the sampling frequency is 10 kHz or more. It is preferable. In a specific example to be described later, sampling is performed at 50 kHz, but it is not necessary to set the frequency as high as practical. Moreover, it is preferable to cut in advance a frequency component exceeding 1/2 of the sampling frequency with a filter (low-pass filter).

メモリーに格納された音声データは、通常、最初と最後に存在する無信号領域（無音領域）を除いて取り出されて周波数分析が施されるが、周波数分析誤差をできるだけ少なく抑えるため、その前処理として、窓関数処理を施すことが好ましい。 The audio data stored in the memory is usually extracted and subjected to frequency analysis except for the no-signal area (silence area) that exists at the beginning and end, but in order to minimize the frequency analysis error, the pre-processing is performed. It is preferable to perform window function processing.

窓関数処理を行う場合の窓関数としては、ハニング窓、ハミング窓、ブラックマン窓、矩形窓などがあり、いずれを用いることも可能であるが、音声はランダム波形であることから、音声解析で最も一般的に用いられているハミング窓が好ましい。 As window functions when performing window function processing, there are Hanning window, Hamming window, Blackman window, rectangular window, etc., and any of them can be used. The most commonly used Hamming window is preferred.

ハミング窓を用いた場合、元の音声データ値をｄ、データ番号をｎ、周波数分析に用いるデータ数をＮとすると、変換後のデータＸは以下の通りとなる。 When the Hamming window is used, if the original audio data value is d, the data number is n, and the number of data used for frequency analysis is N, the converted data X is as follows.

Ｘ＝ｄ×〔０．５４−０．４６×ｃｏｓ｛２×π×ｎ／（Ｎ−１）｝〕
ｎ＝０〜（Ｎ−１） X = d × [0.54-0.46 × cos {2 × π × n / (N−1)}]
n = 0 to (N-1)

周波数分析に用いるデータ数Ｎは、これが少ない場合周波数分解能は低下するが、分析区間内での時間分解能が大きく現れる。逆に、データ数Ｎが多いと、分析区間内での時間分解能は小さくなるが、周波数分解能が向上する。周波数分解能が過剰に低下すると、後述する寄与率が音声波形に含まれる周波数成分を正しく反映しにくくなり、寄与率の現れ方の特徴を掴みにくくなることから、４０〜１００Ｈｚの分解能が得られるようにサンプリング周波数に応じてデータ数を調整することが好ましい。なお、高速フーリエ解析の場合、データ数は２の整数乗となる。 When the number N of data used for frequency analysis is small, the frequency resolution decreases, but the time resolution within the analysis interval appears greatly. Conversely, when the number of data N is large, the time resolution in the analysis interval is small, but the frequency resolution is improved. If the frequency resolution is excessively reduced, it will be difficult for the contribution rate described later to correctly reflect the frequency component contained in the speech waveform, and it will be difficult to grasp the characteristics of how the contribution rate appears, so that a resolution of 40 to 100 Hz can be obtained. It is preferable to adjust the number of data according to the sampling frequency. In the case of fast Fourier analysis, the number of data is an integer power of 2.

音声データ群について、必要に応じて上記窓関数処理を施してから周波数分析を行い、振幅スペクトルおよび／またはパワースペクトルを求める。この周波数分析にはフーリエ解析、特に処理時間が短い高速フーリエ解析が好ましい。 The audio data group is subjected to the window function processing as necessary and then subjected to frequency analysis to obtain an amplitude spectrum and / or a power spectrum. For this frequency analysis, Fourier analysis, particularly fast Fourier analysis with a short processing time is preferable.

フーリエ解析（高速フーリエ解析）を用いた場合、１回の解析で、前記音声データ群中のＮ個（例えば５１２個、１０２４個などの２の整数乗の個数）の音声データについて、ｍ＝１（ｍは次数）の時の基本周波数と基本周波数の整数倍（基本周波数の次数倍）の周波数（高調波）とについて、それぞれ対応する正弦波成分の係数ａ_mと余弦波成分の係数ｂ_mが得られる。そして、これらの係数を用い、以下のようにして振幅スペクトルＸ_mとパワースペクトルＸ_m ²を求めることができる。なお、ｍ＝０は直流成分に対応する。 When Fourier analysis (fast Fourier analysis) is used, m = 1 for N pieces of voice data (for example, the number of integer powers of 2 such as 512, 1024, etc.) in the voice data group in one analysis. (m is the order) the fundamental frequency and the frequency (harmonic) of integer multiples of the fundamental frequency (the next multiple of the fundamental frequency), the coefficient b _m of the coefficient a _m and the cosine wave component of the corresponding sine wave component when the Is obtained. Then, using these coefficients, the amplitude spectrum X _m and the power spectrum X _m ² can be obtained as follows. Note that m = 0 corresponds to a DC component.

Ｘ_m＝√（ａ_m ²＋ｂ_m ²）
Ｘ_m ²＝ａ_m ²＋ｂ_m ² _{_{^{X m = √ (a m 2}}} + b m 2)
X _m ² = a _m ² + b _m ²

本発明における寄与率は、ｍ＝１の基本周波数からｍ＝２以上の各高調波成分の振幅スペクトルＸ_mの合計に対する各振幅スペクトルＸ_mの比率Ｃ、または、ｍ＝１の基本周波数からｍ＝２以上の各高調波成分のパワースペクトルＸ_m ²の合計に対する各パワースペクトルＸ_m ²の比率Ｃ’として求めることができる。ＣまたはＣ’は、比として求めても百分率で求めても良い。比として求める場合、下記式となる。百分率で求める場合、それぞれ１００を乗じた値となる。 The contribution ratio in the present invention is the ratio C of each amplitude spectrum X _m to the sum of the amplitude spectra X _m of each harmonic component of m = 2 or more from the fundamental frequency of m = 1, or m from the fundamental frequency of m = 1. = The ratio C ′ of each power spectrum X _m ² to the sum of the power spectra X _m ² of each harmonic component equal to or greater than 2 can be obtained. C or C ′ may be obtained as a ratio or as a percentage. When calculating | requiring as a ratio, it becomes a following formula. When obtained as a percentage, each value is multiplied by 100.

Ｃ＝（１／ΣＸ_m）×（Ｘ_m）
Ｃ’＝（１／ΣＸ_m ²）×（Ｘ_m ²） C = (1 / ΣX _m ) × (X _m )
C ′ = (1 / ΣX _m ² ) × (X _m ² )

なお、ｍの上限は、周波数分析を行う際の周波数分解能によって異なるが、音声認識を行うに必要な周波数まで分析できる次数までで足る。具体的には、サンプリング周波数が５０ｋＨｚ、データ数Ｎが１０２４個であるとすると、周波数分析で得られる次数は、（１０２４／２）―１＝５１１であるが、周波数分解能＝５００００÷１０２４≒４８Ｈｚであることと、前記のように日本語の音声認識では５．５ｋＨｚ程度までの分析が必要であると考えると、ｍ＝５５００÷４８≒１１４となる。 The upper limit of m depends on the frequency resolution when performing frequency analysis, but is sufficient up to the order that can be analyzed up to the frequency necessary for speech recognition. Specifically, if the sampling frequency is 50 kHz and the number of data N is 1024, the order obtained by frequency analysis is (1024/2) −1 = 511, but the frequency resolution = 50000 ÷ 1024≈48 Hz. And, as described above, if Japanese speech recognition needs to be analyzed up to about 5.5 kHz, m = 5500 ÷ 48≈114.

本発明で用いる寄与率は、上記ＣとＣ’のいずれでも良いが、Ｃ’の場合変動がほぼ二乗で現れることから、振幅スペクトルよりも周波数成分の大小が強調されやすくなるため、Ｃを用いることが好ましい。 The contribution ratio used in the present invention may be either C or C ′. However, in the case of C ′, the fluctuation appears almost squared, so that the magnitude of the frequency component is more easily emphasized than the amplitude spectrum. It is preferable.

周波数分析は、例えば、音声データ群の適宜の領域のＮ個の音声データについて１回行うだけとすることもできる。しかし、通常、音声データ群がＮ個を超える音声データの集まりとなるようにデータ数Ｎやサンプル数を定めることから、Ｎ個の音声データを１つの分析区間（１フレーム）とし、各分析区間を所定音声データ数ずつずらせながら、音声データ群全体を複数回に分けて周波数分析することが好ましい。このようにして音声データ群全体を分析することで精度を向上させることができる。この場合、各分析区間毎に寄与率を求めることになる。分析区間番号をｊとすると、前記寄与率Ｃ，Ｃ’は下記のように表すことができる。 For example, the frequency analysis may be performed only once for N pieces of sound data in an appropriate region of the sound data group. However, since the number of data N and the number of samples are usually determined so that the sound data group is a collection of more than N sound data, N sound data is defined as one analysis section (one frame), and each analysis section It is preferable that the entire audio data group is divided into a plurality of times and the frequency analysis is performed while shifting the predetermined number of audio data. Thus, the accuracy can be improved by analyzing the entire audio data group. In this case, the contribution rate is obtained for each analysis section. When the analysis section number is j, the contribution rates C and C ′ can be expressed as follows.

Ｃ_j＝（１／ΣＸ_jm）×（Ｘ_jm）
Ｃ_j’＝（１／Σ（Ｘ_jm ²）×（Ｘ_jm ²） C _j = (1 / ΣX _jm ) × (X _jm )
C _j '= (1 / Σ (X _jm ² ) × (X _jm ² )

上記周波数分析により、通常、ｍ次までの寄与率が各分析区間毎に求められる。そして、求められた寄与率の状態と、予め定められた判定基準とを対比することにより、音声データ群から識別できる母音の音素と子音の音素が特定される。例えば、子音の音素が「ｋ」で母音の音素が「ａ」と特定された場合、「カ」との識別結果となる。また、その音声データ群から「ａ」だけしか特定されない場合、「ア」との識別結果となる。 According to the frequency analysis, a contribution rate up to the m-th order is usually obtained for each analysis section. Then, the vowel phonemes and consonant phonemes that can be identified from the speech data group are identified by comparing the state of the determined contribution rate with a predetermined criterion. For example, when the phoneme of the consonant is “k” and the phoneme of the vowel is specified as “a”, the identification result is “K”. Further, when only “a” is specified from the audio data group, an identification result “a” is obtained.

次に、本発明に係る音声識別方法の他の例を図２に基づいて説明する。 Next, another example of the speech identification method according to the present invention will be described with reference to FIG.

音声信号のサンプリングおよびＡ／Ｄ変換は図１の例と同様である。 The sampling of the audio signal and the A / D conversion are the same as in the example of FIG.

本例においては、メモリーに格納された音声データ群を、窓関数処理に先立って、母音領域の音声データ群と子音領域の音声データ群に区分する。この母音領域と子音領域の区分けは、例えば次のようにして行うことができる。 In this example, the sound data group stored in the memory is divided into a sound data group in the vowel area and a sound data group in the consonant area prior to the window function processing. The division between the vowel area and the consonant area can be performed as follows, for example.

１音声データ群の信号領域の先頭位置から所定個数の音声データを次々に比較し、音声データ中で最大のピーク（最大ピークＰ_max）の値とその位置（最大ピークＰ_maxは音声データ群の中間部に存在する）を求める。 1. A predetermined number of audio data are sequentially compared from the beginning position of the signal area of the audio data group, and the value of the maximum peak (maximum peak P _max ) in the audio data and its position (the maximum peak P _max is the In the middle part).

２適宜の音声データ数の区間を設定し、最大ピークＰ_maxの位置から音声データ群の先頭に向かって、順次区間内で最も大きいピーク（区間ピークＰ_n）を求める。 2. An appropriate number of audio data sections are set, and the largest peak (interval peak P _n ) is sequentially obtained from the position of the maximum peak P _max toward the beginning of the audio data group.

３母音領域においては急激なピークの低下はないことから、最大ピークＰ_maxと、これに隣接する区間における区間ピークＰ₁、さらにＰ₁の区間に隣接する次の区間における区間ピークＰ₂のように次々に対比し、例えば区間ピークＰ₁が最大ピークＰ_maxの６０％以上である場合は母音領域であると判別することができ、また区間ピークＰ_nが一つ前の区間ピークＰ_n-1の６０％以上である場合には母音領域の続きであると判別することができる。 3 Since there is no sudden peak drop in the vowel region, the maximum peak P _max , the section peak P ₁ in the section adjacent to this, and the section peak P ₂ in the next section adjacent to the section of P ₁ For example, when the section peak P ₁ is 60% or more of the maximum peak P _max , it can be determined that it is a vowel area, and the section peak P _n is the previous section peak P _{n−. When} it is 60% or more of ₁ , it can be determined that it is a continuation of the vowel region.

４上記対比を行って、区間ピークＰ_nが一つ前の区間ピークＰ_n-1に比して大きく低下する位置を求め、これが先頭位置であれば全体が母音領域と判別でき、これが音声データ群の中間位置であれば、子音領域と母音領域の境界であると判別できる。また、ピーク値が急激に低下した位置が先頭位置ではない場合でも、先頭位置から当該位置までのデータ数が極端に少ないときには、母音の立ち上がり領域であると判断することができる。 4 performs the comparison, determine the position to be lowered significantly in comparison with the interval peak P _n-1 immediately preceding the section peak P _n, which can be determined overall if the head position is a vowel area, which is audio data If it is the middle position of the group, it can be determined that it is the boundary between the consonant region and the vowel region. Even when the position at which the peak value has sharply decreased is not the head position, if the number of data from the head position to the position is extremely small, it can be determined that the region is a vowel rising region.

５また、最大ピークＰ_maxの位置から音声データ群の最後尾に向かって同様のピーク値の対比を行うと、母音領域の最後尾の位置を検知することができる。複数音を連続して発生した場合、この位置を検知することで、音間の境界を検知することができる。 5. Further, when the same peak value is compared from the position of the maximum peak P _max toward the tail of the voice data group, the position of the tail of the vowel area can be detected. When a plurality of sounds are generated continuously, the boundary between the sounds can be detected by detecting this position.

上記の母音領域と子音領域の区分け方法はその一例で、本発明における母音領域と子音領域の区分け方法としては、従来公知のいずれの方法でも適用することができる。例えば背景技術で挙げた特許文献１の方法で行うこともできる。また、複数の区分け方法を併用することもできる。 The above vowel region and consonant region segmentation method is one example, and any conventionally known method can be applied as the vowel region and consonant region segmentation method in the present invention. For example, it can also carry out by the method of patent document 1 quoted by background art. In addition, a plurality of sorting methods can be used in combination.

音声データ群を母音領域の音声データ群と子音領域の音声データ群に区分した後、必要に応じてそれぞれに前述と同様の窓関数処理を施し、さらに前述と同様にして周波数分析を行い、振幅スペクトルおよび／またはパワースペクトルを求める。 After dividing the voice data group into a voice data group in the vowel area and a voice data group in the consonant area, each is subjected to window function processing similar to that described above as necessary, and frequency analysis is performed in the same manner as described above to determine the amplitude. A spectrum and / or power spectrum is determined.

上記周波数分析は、前記母音領域の音声データ群と、子音領域の音声データ群のそれぞれについて施され、それぞれｍ次までの寄与率が各分析区間毎に求められる。そして、求められた寄与率の状態と、予め定められた判定基準と対比されて、母音と子音が特定される。例えば、子音が「ｋ」で母音が「ａ」と特定された場合、「カ」との識別結果となる。また、母音領域のみであって、その音声データ群から「ａ」と特定された場合、「ア」との識別結果となる。特に本例の場合、子音領域の音声データ群から求められた寄与率は子音の音素を特定するための判定基準のみの対比とし、母音領域の音声データ群から求められた寄与率は母音の音素を特定するための判定基準のみの対比とすることができ、予め母音領域と子音領域を区分けしておくことで、対比を簡略化することができる。 The frequency analysis is performed for each of the speech data group in the vowel region and the speech data group in the consonant region, and a contribution rate up to m-th order is obtained for each analysis section. Then, the vowel and the consonant are specified by comparing the obtained contribution rate state with a predetermined criterion. For example, when the consonant is “k” and the vowel is specified as “a”, the identification result is “K”. In addition, if only the vowel area is identified as “a” from the voice data group, the identification result is “a”. In particular, in the case of this example, the contribution rate obtained from the consonant speech data group is a comparison of only the criteria for identifying the consonant phoneme, and the contribution rate obtained from the vowel speech data group is the vowel phoneme. The comparison can be made only by the determination criterion for specifying the vowel, and the comparison can be simplified by dividing the vowel area and the consonant area in advance.

子音と母音の音素を特定するための判定基準は、予めできるだけ多数の被験者から五十音の寄与率を求め、各被験者の五十音それぞれの音素についての寄与率の現れ方を整理しておくことで用意することができる。具体的には、どのような周波数領域にどのような大きさの寄与率が何個現れるか、最大の寄与率を生じる周波数領域、特定の周波数領域の寄与率と他の特定の周波数領域の寄与率との大小関係などを五十音の音素についてデータベース化しておくことで用意することができる。 The criterion for identifying phonemes of consonants and vowels is to determine the contribution rate of the fifty sounds from as many subjects as possible in advance, and organize how the contribution ratios of the respective phonemes of each subject appear. Can be prepared. Specifically, how many contribution ratios appear in what frequency domain, frequency domain that produces the maximum contribution ratio, contribution ratio of a specific frequency domain and other specific frequency domain It can be prepared by creating a database of the phoneme of the Japanese syllabary, such as the magnitude relationship with the rate.

判定基準との対比により、複数の音素が該当する結果が得られる場合などにおいては、例えば音素に優先順位を定めておいて、その順番で特定したり、原波形を参照することでいずれかを選択することが可能である。 In the case where a result corresponding to a plurality of phonemes is obtained by comparison with the judgment criterion, for example, a priority order is set for the phonemes, and either is specified by specifying the order or referring to the original waveform. It is possible to select.

判定基準を作成する場合や、未知の音声を識別する場合に、ニューラルネットワークなどを導入することにより、より認識精度を高めることが可能である。また、コンピューター以外にも、適当な電子回路を用いることにより、目的を達成することが可能である。 The recognition accuracy can be further improved by introducing a neural network or the like when creating a criterion or identifying an unknown voice. In addition to the computer, the object can be achieved by using an appropriate electronic circuit.

次に、実際に寄与率を求めた例について説明する。 Next, an example in which the contribution rate is actually obtained will be described.

―「ア」について―
図１に示す手順で音素の判定を行った。 ―About “A” ―
Phonemes were determined according to the procedure shown in FIG.

まず、被験者に単音で「ア」を発声してもらい、その音声をマイクロホンで採取し、サンプリングし、Ａ／Ｄ変換して、１番から順次時系列でデータ番号を付してパーソナルコンピューターのメモリーに格納した。サンプリング周波数は５０ｋＨｚで、Ａ／Ｄで変換を行う際に、ローパスフィルターで２５ｋＨｚを超える周波数成分をカットした。 First, ask the test subject to say “a” with a single sound, collect the sound with a microphone, sample it, A / D convert it, and add the data number in chronological order starting from No. 1. Stored. The sampling frequency was 50 kHz, and frequency components exceeding 25 kHz were cut with a low-pass filter when A / D conversion was performed.

採取した音声波形を図３に示す。 The collected voice waveform is shown in FIG.

メモリーに格納した音声データ群の音声信号領域（無信号領域を除いた領域）を取り出し、ハミング窓関数による窓関数処理を施し、高速フーリエ変換を施した。高速フーリエ変換のデータ数Ｎは１０２４、周波数分析次数ｍは１１４までとした。なお、今回の場合は、（１０２４／２）−１＝５１１次までのスペクトルが求められているが、１１５次から５１１次までのスペクトルは全て無視できる値であった（０に近かった）。 The audio signal area (area excluding the no-signal area) of the audio data group stored in the memory was taken out, subjected to window function processing using a Hamming window function, and subjected to fast Fourier transform. The number of fast Fourier transform data N is 1024, and the frequency analysis order m is 114. In this case, spectra from (1024/2) -1 = 511th order are obtained, but all the spectra from the 115th order to the 511th order are negligible values (close to 0).

百分率で求めた寄与率を表１〜表１８に示す。 Tables 1 to 18 show the contribution ratios obtained as percentages.

表１は、データ番号が３１４〜１３３８までの１０２４の音声データを１分析区間（１フレーム）として高速フーリエ変換して求めた寄与率を示し、表２は、データ番号が７１４〜１７３８までの１０２４の音声データ１分析区間としてを高速フーリエ変換して求めた寄与率を示す。表１のデータ番号が３１４からであるのに対し、表２のデータ番号が７１４であるのは、各分析区間の間を４００の音声データ分だけずらせながら分析を行ったことを示す。１０２４の音声データを１フレームとし、各フレーム間を４００の音声データ分だけずらせているのは以後の他の表においても同様である。 Table 1 shows contribution ratios obtained by performing fast Fourier transform on 1024 audio data with data numbers 314 to 1338 as one analysis section (one frame), and Table 2 shows 1024 with data numbers 714 to 1738. The contribution rate calculated | required by carrying out the fast Fourier transform as the audio | voice data 1 analysis area is shown. The data number in Table 1 is from 314, while the data number in Table 2 is 714 indicates that the analysis was performed while shifting each analysis section by 400 audio data. The same applies to the other tables hereinafter, in which 1024 audio data are defined as one frame and each frame is shifted by 400 audio data.

また、各表の末尾に示される「判定」の欄の記載は、特定された母音または子音の音素を示し、「判定基準」の欄の記載は、後述する表３１１〜３２２に示される「音素」の欄にカッコ書きで示される符号に対応する。「判定」と「判定基準」の欄が空欄である場合は、判定には使用されなかったデータ（後述する判定基準には該当しなかったデータ）であったことを示す。これらは以後の他の表においても同様である。 The description in the “determination” column at the end of each table indicates the phoneme of the specified vowel or consonant, and the description in the “determination criteria” column indicates “phoneme” shown in tables 311 to 322 described later. Corresponding to the reference numerals shown in parentheses in the "" column. If the “determination” and “determination criteria” fields are blank, it indicates that the data was not used for the determination (data that did not correspond to the determination criteria described later). The same applies to other tables below.

―「イ」について―
被験者に単音で「イ」を発生してもらい、以下「ア」の測定と同様にして寄与率を求めた。 ―About “I” ―
The subject was asked to “a” with a single sound, and the contribution rate was obtained in the same manner as the measurement of “a”.

採取した音声波形を図４に示すと共に、百分率で求めた寄与率を表１９〜表４３に示す。なお、表１のデータ番号が３１４から始まっているのに対し、表１９のデータ番号が２１からとなっているのは、表１においては３１３までが無信号状態（無音状態）であったために処理対象外とされ、表１９においてはそれが２０までであったことによる。以後の他の音の表におけるデータ番号のズレも同様である。 The collected speech waveforms are shown in FIG. 4, and the contribution rates obtained as a percentage are shown in Table 19 to Table 43. The data number in Table 1 starts from 314, whereas the data number in Table 19 starts from 21 because in Table 1, up to 313 was a no-signal state (silent state). It is excluded from the processing target, and in Table 19, it is up to 20. The same applies to the deviation of the data numbers in the other sound tables thereafter.

―「ウ」について―
被験者に単音で「ウ」を発音してもらい、以下「ア」の測定と同様にして寄与率を求めた。 -About "U"-
The subject was asked to pronounce “U” with a single note, and the contribution rate was obtained in the same manner as the measurement of “A”.

採取した音声波形を図５に示すと共に、百分率で求めた寄与率を表４４〜表６８に示す。 The collected speech waveforms are shown in FIG. 5, and the contribution ratios obtained as a percentage are shown in Table 44 to Table 68.

―「エ」について―
被験者に単音で「エ」を発音してもらい、以下「ア」の測定と同様にして寄与率を求めた。 ―About “E” ―
The subject was asked to pronounce “e” with a single note, and the contribution rate was obtained in the same manner as the measurement of “a”.

採取した音声波形を図６に示すと共に、百分率で求めた寄与率を表６９〜表９３に示す。 The collected speech waveforms are shown in FIG. 6 and the contribution rates obtained as percentages are shown in Tables 69 to 93.

―「オ」について―
被験者に単音で「オ」を発音してもらい、以下「ア」の測定と同様にして寄与率を求めた。 ―About “O” ―
The subject was asked to pronounce “o” with a single note, and the contribution rate was obtained in the same manner as the measurement of “a”.

採取した音声波形を図７に示すと共に、百分率で求めた寄与率を表９４〜表１２３に示す。 The collected speech waveforms are shown in FIG. 7, and the contribution ratios obtained as percentages are shown in Tables 94 to 123.

―「カ」行について―
被験者に単音で「カ」を発音してもらい、以下「ア」の測定と同様にして寄与率を求めた。 ―About “K” Line―
The subject was asked to pronounce “ka” with a single note, and the contribution rate was obtained in the same manner as the measurement of “a”.

採取した音声波形を図８に示すと共に、百分率で求めた寄与率を表１２４〜表１４４に示す。 The collected voice waveforms are shown in FIG. 8, and the contribution ratios obtained as percentages are shown in Tables 124 to 144.

なお、「キ」、「ク」、「ケ」、「コ」については、子音の音素判別自体は「カ」と同様であることから省略する。 Note that “ki”, “ku”, “ke”, and “ko” are omitted because the phoneme discrimination of the consonant is the same as “ka”.

―「サ」行について―
被験者に単音で「サ」を発音してもらい、以下「ア」の測定と同様にして寄与率を求めた。 ―About the “sa” line―
The subject was asked to pronounce “sa” with a single note, and the contribution rate was obtained in the same manner as the measurement of “a”.

採取した音声波形を図９に示すと共に、百分率で求めた寄与率を表１４５〜表１７３に示す。 The collected voice waveforms are shown in FIG. 9 and the contribution ratios obtained as percentages are shown in Tables 145 to 173.

なお、「シ」、「ス」、「セ」、「ソ」については、子音の音素判別自体は「サ」と同様であることから省略する。 Note that “shi”, “su”, “se”, and “so” are omitted because the phoneme discrimination of the consonant is the same as “sa”.

―「タ」行について―
被験者に単音で「タ」を発音してもらい、以下「ア」の測定と同様にして寄与率を求めた。 ―About “Ta” line―
The subject was asked to pronounce “ta” with a single note, and the contribution rate was obtained in the same manner as the measurement of “a”.

採取した音声波形を図１０に示すと共に、百分率で求めた寄与率を表１７４〜表１９４に示す。 The collected voice waveforms are shown in FIG. 10, and the contribution ratios obtained as percentages are shown in Tables 174 to 194.

なお、「チ」、「ツ」、「テ」、「ト」については、子音の音素判別自体は「タ」と同様であることから省略する。 Note that “chi”, “tsu”, “te”, and “g” are omitted because the phoneme discrimination of the consonant is the same as “ta”.

―「ナ」行について―
被験者に単音で「ナ」を発音してもらい、以下「ア」の測定と同様にして寄与率を求めた。 ―About “Na” Line―
The subject was asked to pronounce “na” with a single note, and the contribution rate was obtained in the same manner as the measurement of “a”.

採取した音声波形を図１１に示すと共に、百分率で求めた寄与率を表１９５〜表２２３に示す。 The collected speech waveforms are shown in FIG. 11, and the contribution ratios obtained as percentages are shown in Tables 195 to 223.

なお、「ニ」、「ヌ」、「ネ」、「ノ」については、子音の音素判別自体は「ナ」と同様であることから省略する。 Note that “ni”, “nu”, “ne”, and “no” are omitted because the phoneme discrimination of the consonant is the same as “na”.

―「ハ」行について―
被験者に単音で「ハ」を発音してもらい、以下「ア」の測定と同様にして寄与率を求めた。 ―About “C” line―
The subject was asked to pronounce “ha” with a single note, and the contribution rate was obtained in the same manner as the measurement of “a”.

採取した音声波形を図１２に示すと共に、百分率で求めた寄与率を表２２４〜表２５０に示す。 The collected speech waveforms are shown in FIG. 12, and the contribution rates obtained as percentages are shown in Tables 224 to 250.

なお、「ヒ」、「フ」、「ヘ」、「ホ」については、子音の音素判別自体は「ハ」と同様であることから省略する。 Note that “hi”, “fu”, “he”, and “ho” are omitted because the phoneme discrimination of the consonant is the same as “ha”.

―「マ」行について―
被験者に単音で「マ」を発音してもらい、以下「ア」の測定と同様にして寄与率を求めた。 -About "Ma" line-
The subject was asked to pronounce “ma” with a single note, and the contribution rate was obtained in the same manner as the measurement of “a”.

採取した音声波形を図１３に示すと共に、百分率で求めた寄与率を表２５１〜表２８０に示す。 The collected speech waveforms are shown in FIG. 13 and the contribution rates obtained as percentages are shown in Tables 251 to 280.

なお、「ミ」、「ム」、「メ」、「モ」については、子音の音素判別自体は「マ」と同様であることから省略する。 Note that “mi”, “mu”, “me”, and “mo” are omitted because the phoneme discrimination of the consonant is the same as “ma”.

―「ヤ」行について―
「ヤ」、「ユ」、「ヨ」については、「ｉａ」、「ｉｕ」、「ｉｏ」に準ずると考えられることから省略する。 ―About “Ya” line―
“Ya”, “Yu”, and “Yo” are omitted because they are considered to be equivalent to “ia”, “iu”, and “io”.

―「ラ」行について―
被験者に単音で「ラ」を発音してもらい、以下「ア」の測定と同様にして寄与率を求めた。 ―About “La” line―
The subject was asked to pronounce “ra” with a single note, and the contribution rate was obtained in the same manner as the measurement of “a”.

採取した音声波形を図１５に示すと共に、百分率で求めた寄与率を表２８１〜表３１０に示す。 The collected speech waveforms are shown in FIG. 15 and the contribution rates obtained as percentages are shown in Tables 281 to 310.

なお、「リ」、「ル」、「レ」、「ロ」については、子音の音素判別自体は「ラ」と同様であることから省略する。 Note that “li”, “le”, “le”, and “ro” are omitted because the phoneme discrimination of the consonant is the same as “la”.

―「ワ」行について―
「ワ」、「ヲ」については、「ｕａ」、「ｕｏ」に準ずると考えられることから省略する。 ―About “Wa” line―
“Wa” and “Wo” are omitted because they are considered to be equivalent to “ua” and “uo”.

―「ン」について―
「ン」については「ｕｎ」または「n」若しくは「m」に準ずると考えられることから省略する。 ―About “N” ―
“N” is omitted because it is considered to be equivalent to “un”, “n” or “m”.

―判定基準について―
男女複数の被験者から五十音を測定した結果得られた判定基準の一例を表３１１〜３２２に示す。 ―About judgment criteria―
Examples of determination criteria obtained as a result of measuring the Japanese syllabary from a plurality of male and female subjects are shown in Tables 311 to 322.

この表３１１〜３２２においては、表示を簡略化するため、１次高調波（４９Ｈｚ）と２次高調波（９８Ｈｚ）の寄与率を足し合わせた値を９８Ｈｚの寄与率とし、３次高調波（１４７Ｈｚ）と４次高調波（１９６Ｈｚ）の寄与率を足し合わせた値を１９６Ｈｚの寄与率とし、以下同様にして、ｍ−１次高調波の寄与率とｍ次高調波の寄与率を足し合わせた値をｍ次の周波数における寄与率として表したものとなっている（ただし、ここでのｍは２以上の整数）。しかし、判定基準は、ｍ−１次高調波の寄与率とｍ次高調波の寄与率を足し合わせた値をｍ次の周波数における寄与率として表したものを基準としなければならないものではなく、各分析区間における１次からｍ次までの寄与率をそのまま表したものを基準とすることもできる。 In Tables 311 to 322, in order to simplify the display, the value obtained by adding the contribution ratios of the first harmonic (49 Hz) and the second harmonic (98 Hz) is defined as the contribution ratio of 98 Hz and the third harmonic ( 147 Hz) and the contribution ratio of the fourth harmonic (196 Hz) are added as the contribution ratio of 196 Hz, and in the same manner, the contribution ratio of the m-1st harmonic and the contribution ratio of the mth harmonic are added together. The value is expressed as a contribution rate at the m-th order frequency (where m is an integer of 2 or more). However, the criterion should not be based on the value obtained by adding the contribution ratio of the m-1st harmonic and the contribution ratio of the mth harmonic as the contribution ratio at the mth order frequency. It is also possible to use the one that directly represents the contribution rate from the first order to the m-th order in each analysis section.

なお、表３１１〜３２２において、「周波数」の項目における上段と下段の数字は、９８Ｈｚに乗ずべき数字を意味し、上段の数字は十の位を指し、下段の数字は一の位を指す。また、「区間」の欄に示されるＡ，Ｂ，Ｃ，…などの符号は、「周波数」の欄に矢印で示される領域を意味するが、以下の説明の便宜上付したもので、各表に付されている符号が同じ周波数領域を意味するものではない。 In Tables 311-322, the upper and lower numbers in the “frequency” item mean numbers that should not be multiplied by 98 Hz, the upper number indicates the tens place, and the lower number indicates the first place. Further, symbols such as A, B, C,... Shown in the “section” column mean areas indicated by arrows in the “frequency” column, but are attached for convenience of the following description. The symbols attached to do not mean the same frequency region.

以下、表３１１〜３２２を補足説明する。 Hereinafter, Tables 311 to 322 will be supplementarily described.

（１）「ａ」の判定基準について
表３１１に示されるように、Ａ−１とＡ−２の２つの判定基準のいずれか一方を満たすときに「ａ」と判定することができる。 (1) Determination criteria for “a” As shown in Table 311, “a” can be determined when either one of the two determination criteria A-1 and A-2 is satisfied.

Ａ−１は、以下の条件を総て満たすときに「ａ」と判定するものである。
・区間Ａ（１×９８〜４×９８Ｈｚ）にあるスペクトルには、寄与率の大きさが１０以上のものが存在しないこと。
・区間Ｂ（５×９８〜９×９８Ｈｚ）にあるスペクトルには、寄与率の大きさが１０以上のものが２個未満であること。
・区間Ｃ（８×９８〜１５×９８Ｈｚ）にあるスペクトルには、寄与率の大きさが３以上のものが３個を超えて存在すること。
・区間Ｄ（１３×９８〜２５×９８Ｈｚ）にあるスペクトルには、寄与率の大きさが３以上のものが０個でないこと。 A-1 is determined as “a” when all of the following conditions are satisfied.
-The spectrum in the section A (1 × 98 to 4 × 98 Hz) should not have a contribution ratio of 10 or more.
-The spectrum in the section B (5 × 98 to 9 × 98 Hz) should have less than 2 contribution factors with a magnitude of 10 or more.
-In the spectrum in the section C (8 × 98 to 15 × 98 Hz), there must be more than three with a contribution ratio of 3 or more.
-The spectrum in the section D (13 × 98 to 25 × 98 Hz) should not have 0 contribution factor of 3 or more.

Ａ−２は、以下の条件を総て満たすときに「ａ」と判定するものである。
・区間Ａ（１×９８〜４×９８Ｈｚ）にあるスペクトルには、寄与率の大きさが１０以上のものが存在しないこと。
・区間Ｂ（２×９８〜７×９８Ｈｚ）にあるスペクトルには、寄与率の大きさが３以上のものが１個を超えて存在すること。
・区間Ｃ（５×９８〜９×９８Ｈｚ）にあるスペクトルには、寄与率の大きさが１０以上のものが２個未満であること。
・区間Ｄ（９×９８〜１５×９８Ｈｚ）にあるスペクトルには、寄与率の大きさが１０以上のものが２個を超えて存在すること。
・区間Ｅ（１３×９８〜２５×９８Ｈｚ）にあるスペクトルには、寄与率の大きさが３以上のものが０個でないこと。 A-2 is determined as “a” when all of the following conditions are satisfied.
-The spectrum in the section A (1 × 98 to 4 × 98 Hz) should not have a contribution ratio of 10 or more.
-In the spectrum in the section B (2 × 98 to 7 × 98 Hz), there should be more than one with a contribution ratio of 3 or more.
-The spectrum in the section C (5 × 98 to 9 × 98 Hz) should have less than 2 contributions with a magnitude of 10 or more.
-In the spectrum in the section D (9 × 98 to 15 × 98 Hz), there must be more than two having a contribution ratio of 10 or more.
-The spectrum in the section E (13 × 98 to 25 × 98 Hz) must not have 0 contribution factor of 3 or more.

（２）「ｉ」の判定基準について
表３１２に示されるように、Ｉ−１とＩ−２の２つの判定基準のいずれかを満たすときに「ｉ」と判定することができる。 (2) Determination criteria for “i” As shown in Table 312, it is possible to determine “i” when either of the two determination criteria I-1 and I-2 is satisfied.

Ｉ−１の表の見方は前記「ａ」の判定基準を示す表３１１に準ずる。 The way of reading the table of I-1 is based on the table 311 showing the determination criteria of the above “a”.

Ｉ−２は、以下の条件を総て満たすときに「ｉ」と判定するものである。
・区間Ａ（２×９８〜４×９８Ｈｚ）にあるスペクトルには、寄与率の大きさが９以上のものが０個でないこと。
・区間Ｂ（１１×９８〜１５×９８Ｈｚ）にあるスペクトルには、寄与率の大きさが２．５以上のものが０個であること。
・区間Ｃ（１７×９８〜２６×９８Ｈｚ）にあるスペクトルには、寄与率の大きさが２．５以上のものが６個未満であること。
・区間Ｄ（１７×９８〜２０×９８Ｈｚ）にあるスペクトルには、寄与率の大きさが１．５以上のものが０個であること。
・区間Ｅ１（２８×９８〜４１×９８Ｈｚ）にあるスペクトルには、寄与率の大きさが
０．５以上のものが８個以上あること、または区間E２（２８×９８〜４１×９８Ｈｚ）にあるスペクトルには、寄与率の大きさが１以上のものが３個以上あること、若しくは
区間Ｆ（２８×９８〜４１×９８Ｈｚ）にあるスペクトルには、寄与率の大きさが０．５以上のものが３個以上あり、かつ区間Ｇ（２８×９８〜４１×９８Ｈｚ）にあるスペクトルには、寄与率の大きさが１以上のものが０個でないこと。
・区間Ｈ（３５×９８〜４６×９８Ｈｚ）にあるスペクトルには、寄与率の大きさが
２．５以上のものが０個であること。
・区間１×９８〜１０×９８Ｈｚにおいては、寄与率の大きさが３以上のものは７×９８
Ｈｚ以上には存在しないこと。 I-2 is determined as “i” when all of the following conditions are satisfied.
-In the spectrum in the section A (2 × 98 to 4 × 98 Hz), there should be no zero having a contribution ratio of 9 or more.
-The spectrum in the section B (11 × 98 to 15 × 98 Hz) has zero contribution factor of 2.5 or more.
-The spectrum in the section C (17 × 98 to 26 × 98 Hz) should have less than six with a contribution ratio of 2.5 or more.
-The spectrum in the section D (17 × 98 to 20 × 98 Hz) should have zero contribution factor of 1.5 or more.
-The spectrum in the section E1 (28 × 98 to 41 × 98 Hz) has eight or more contribution factors having a magnitude of 0.5 or more, or the section E2 (28 × 98 to 41 × 98 Hz). A certain spectrum has three or more contribution factors having a magnitude of 1 or more, or a spectrum in a section F (28 × 98 to 41 × 98 Hz) has a contribution rate of 0.5 or more. The spectrum in the section G (28 × 98 to 41 × 98 Hz) must have no more than one with a contribution ratio of 1 or more.
-The spectrum in the section H (35 × 98 to 46 × 98 Hz) has zero contribution factor of 2.5 or more.
・ In sections 1 × 98 to 10 × 98 Hz, those with a contribution ratio of 3 or more are 7 × 98
Does not exist above Hz.

（３）「ｕ」、「ｅ」、「ｏ」、「ｓ」、「ｔ」の判定基準について
「ｕ」は表３１３、「ｅ」は表３１４、「ｏ」は表３１５、「ｓ」は表３１７、「ｔ」は表３１８に示される判定基準によって判定することができる。「ｕ」、「ｅ」、「ｏ」、「ｓ」および「ｔ」のＴ−１の表の見方は上記「ａ」の判定基準を示す表３１１に準ずる。「ｔ」のＴ−１の表の見方は次に述べるＫ−２の見方に準ずる。 (3) “u”, “e”, “o”, “s”, “t” Determination Criteria “u” is Table 313, “e” is Table 314, “o” is Table 315, “s” Table 317, “t” can be determined according to the determination criteria shown in Table 318. The reading of the T-1 table of “u”, “e”, “o”, “s”, and “t” is in accordance with Table 311 showing the determination criteria of “a”. The view of the T-1 table of “t” is in accordance with the view of K-2 described below.

（４）「ｋ」の判定基準について
表３１６に示されるように、Ｋ−１とＫ−２とＫ−３の３つの判定基準のいずれか一つを満たすときに「ｋ」と判定することができる。 (4) About the judgment criterion of “k” As shown in Table 316, it is judged as “k” when any one of the three judgment criteria of K-1, K-2, and K-3 is satisfied. Can do.

Ｋ−１とＫ−３の表の見方は前記「ａ」の判定基準を示す表３１１に準ずる。 The way of viewing the tables of K-1 and K-3 is in accordance with Table 311 showing the determination criteria of “a”.

Ｋ−２は、以下の条件を総て満たすときに「ｋ」と判定するものである。
・区間Ａ（１×９８〜５×９８Ｈｚ）にあるスペクトルには、寄与率の大きさが６以上のものが０個であること。
・区間Ｂ（１６×９８〜２０×９８Ｈｚ）にあるスペクトルには、寄与率の大きさが２．５以上のものが０個であること。
・区間Ｃ１（３６×９８〜４０×９８Ｈｚ）にあるスペクトルには、寄与率の大きさが２以上のものが１個以上あること、または、区間Ｃ２（４６×９８〜５５×９８Ｈｚ）にあるスペクトルには、寄与率の大きさが２以上のものが１個以上あること。
・区間Ｄ（４１×９８〜４５×９８Ｈｚ）にあるスペクトルには、寄与率の大きさが３以上のものが１個以上あること。 K-2 is determined as “k” when all of the following conditions are satisfied.
-The spectrum in the section A (1 × 98 to 5 × 98 Hz) must have zero contribution ratio of 6 or more.
-The spectrum in the section B (16 × 98 to 20 × 98 Hz) has zero contribution factor of 2.5 or more.
-The spectrum in the section C1 (36 × 98 to 40 × 98 Hz) has one or more contribution magnitudes of 2 or more, or is in the section C2 (46 × 98 to 55 × 98 Hz). There must be at least one spectrum with a contribution ratio of 2 or more.
-The spectrum in the section D (41 × 98 to 45 × 98 Hz) must have at least one contribution factor of 3 or more.

（５）「ｎ」の判定基準について
表３１９に示されるように、以下の条件を総て満たすときに「ｎ」と判定することができる。
・区間Ａ（１×９８〜６×９８Ｈｚ）にあるスペクトルには、寄与率の大きさが３０以上のものが０個であること。
・区間Ｂ（１×９８〜６×９８Ｈｚ）にあるスペクトルには、寄与率の大きさが１０以上のものが１個を超えること。
・区間Ｃ（１×９８〜６×９８Ｈｚ）にあるスペクトルには、寄与率の大きさが５以上のものが２個を超えること。
・区間Ｄ（７×９８〜９×９８Ｈｚ）にあるスペクトルの最大寄与率をｐ０とし、区間Ｅ（１０×９８〜１５×９８Ｈｚ）にあるスペクトルの最大寄与率をｐ１とし、区間Ｆ（１６×９８〜２１×９８Ｈｚ）にあるスペクトルの最大寄与率をｐ２とし、区間Ｇ（２２×９８〜３０×９８Ｈｚ）にあるスペクトルの最大寄与率をｐ３としたときに、ｐ０、ｐ２、ｐ３のうちの最低１個がｐ１よりも大きく、かつ、ｐ０、ｐ２、ｐ３のうちの最低１個の寄与率が２以上であること。
・区間Ｈ（３１×９８〜５５×９８Ｈｚ）にあるスペクトルには、寄与率の大きさが２以上のものが０個であること。 (5) Determination criteria for “n” As shown in Table 319, “n” can be determined when all of the following conditions are satisfied.
-The spectrum in the section A (1 × 98 to 6 × 98 Hz) must have zero contribution factor of 30 or more.
-The spectrum in the section B (1 × 98 to 6 × 98 Hz) must have more than one contribution factor of 10 or more.
-The spectrum in the section C (1 × 98 to 6 × 98 Hz) should have more than 2 contributions with a contribution ratio of 5 or more.
The maximum contribution ratio of the spectrum in the section D (7 × 98 to 9 × 98 Hz) is p0, the maximum contribution ratio of the spectrum in the section E (10 × 98 to 15 × 98 Hz) is p1, and the section F (16 × The maximum contribution ratio of the spectrum in the range 98 to 21 × 98 Hz is p2, and the maximum contribution ratio of the spectrum in the section G (22 × 98 to 30 × 98 Hz) is p3. At least one is larger than p1, and the contribution of at least one of p0, p2, and p3 is 2 or more.
-The spectrum in the section H (31 × 98 to 55 × 98 Hz) should have zero contribution factor of 2 or more.

（６）「ｈ」の判定基準について
表３２０に示されるように、Ｈ−１〜Ｈ−４の４つの判定基準のいずれか一つを満たすときに「ｈ」と判定することができる。 (6) Determination criteria for “h” As shown in Table 320, “h” can be determined when any one of the four determination criteria H-1 to H-4 is satisfied.

Ｈ−２の表の見方は前記表３２０のＫ−２に準じ、Ｈ−３の表の見方は前記「ａ」の判定基準を示す表３１１に準ずる。 The way of reading the table of H-2 is based on K-2 of the above table 320, and the way of reading the table of H-3 is based on the table 311 showing the judgment criteria of the above “a”.

Ｈ−１は、以下の条件を総て満たすときに「ｈ」と判定するものである。
・区間Ａ１（１×９８〜５×９８Ｈｚ）にあるスペクトルには、寄与率の大きさが７以上のものが０個でないこと、または、区間Ａ２（２１×９８〜２６×９８Ｈｚ）にあるスペクトルには、寄与率の大きさが３以上のものが０個でないこと。
・区間Ｂ（６×９８〜１０×９８Ｈｚ）にあるスペクトルには、寄与率の大きさが３以上のものが０個でないこと。
・区間Ｃ（１１×９８〜１５×９８Ｈｚ）にあるスペクトルには、寄与率の大きさが３以上のものが０個でないこと。
・区間Ｄ（１６×９８〜２０×９８Ｈｚ）にあるスペクトルには、寄与率の大きさが３以上のものが０個でないこと。
・区間Ｅ（６×９８〜３０×９８Ｈｚ）にスペクトルの最大寄与率ｐ０が存在し、かつ、このｐ０の大きさが８以上であること。 H-1 is determined as “h” when all of the following conditions are satisfied.
-The spectrum in the section A1 (1 × 98 to 5 × 98 Hz) has zero contribution factor of 7 or more, or the spectrum in the section A2 (21 × 98 to 26 × 98 Hz) Is that there are no zero contributions of 3 or more.
-The spectrum in the section B (6 × 98 to 10 × 98 Hz) must not have zero contribution factor of 3 or more.
-The spectrum in the section C (11 × 98 to 15 × 98 Hz) must not have 0 contribution factor of 3 or more.
-The spectrum in the section D (16 × 98 to 20 × 98 Hz) must not have 0 contribution factor of 3 or more.
-The maximum contribution p0 of the spectrum exists in the section E (6 × 98 to 30 × 98 Hz), and the size of this p0 is 8 or more.

Ｈ−４は、以下の条件を総て満たすときに「ｈ」と判定するものである。
・区間Ａ（１×９８〜５×９８Ｈｚ）にあるスペクトには、寄与率の大きさが２０以上のものが０個であること。
・区間Ｃ（１×９８〜２６×９８Ｈｚ）にスペクトルの最大寄与率ｐ０が存在し、かつ、このｐ０の大きさが８以上であること。
・上記最大寄与率ｐ０が属する区間を除く区間Ｂ１〜Ｂ８のいずれか２区間以上で、寄与率の大きさが４以上のものが１個以上存在すること。 H-4 is determined as “h” when all of the following conditions are satisfied.
-The spectrum in the section A (1 × 98 to 5 × 98 Hz) has zero contribution ratio of 20 or more.
-The maximum contribution p0 of the spectrum exists in the section C (1 × 98 to 26 × 98 Hz), and the size of this p0 is 8 or more.
-There must be at least one of two or more of the sections B1 to B8 excluding the section to which the maximum contribution rate p0 belongs and having a contribution rate of 4 or more.

（７）「ｍ」の判定基準について
表３２１に示されるように、Ｍ−１とＭ−２の２つの判定基準のいずれか一方を満たすときに「ｍ」と判定することができる。 (7) Judgment criteria of “m” As shown in Table 321, it can be judged as “m” when one of the two judgment criteria of M-1 and M-2 is satisfied.

Ｍ−１は、以下の条件を総て満たすときに「ｍ」と判定するものである。
・区間Ａ（１×９８〜６×９８Ｈｚ）にあるスペクトルには、寄与率の大きさが１０以上のものが１個を超えること。
・区間Ｂ（１×９８〜６×９８Ｈｚ）にあるスペクトルには、寄与率の大きさが５以上のものが２個を超えること。
・区間Ｃ（７×９８〜１０×９８Ｈｚ）にあるスペクトルの最大寄与率をｐ０とし、区間Ｄ（１１×９８〜１５×９８Ｈｚ）にあるスペクトルの最大寄与率をｐ１とし、区間Ｅ（１６×９８〜２１×９８Ｈｚ）にあるスペクトルの最大寄与率をｐ２とし、区間Ｆ（２２×９８〜３０×９８Ｈｚ）にあるスペクトルの最大寄与率をｐ３としたときに、ｐ１は、ｐ０、ｐ２、ｐ３のいずれよりも大きく、かつ、ｐ１は２以上であること。
・区間Ｇ（３１×９８〜５５×９８Ｈｚ）にあるスペクトルには、寄与率の大きさが４以上のものが０個であること。 M-1 is determined to be “m” when all of the following conditions are satisfied.
-The spectrum in the section A (1 × 98 to 6 × 98 Hz) must have more than one contribution factor of 10 or more.
-The spectrum in the section B (1 × 98 to 6 × 98 Hz) must have more than 2 contribution factors with a magnitude of 5 or more.
The maximum contribution ratio of the spectrum in the section C (7 × 98 to 10 × 98 Hz) is p0, the maximum contribution ratio of the spectrum in the section D (11 × 98 to 15 × 98 Hz) is p1, and the section E (16 × P1 is p0, p2, and p3, where p2 is the maximum contribution ratio of the spectrum in the range 98 to 21 × 98 Hz) and p3 is the maximum contribution ratio of the spectrum in the section F (22 × 98 to 30 × 98 Hz). And p1 is 2 or more.
-The spectrum in the section G (31 × 98 to 55 × 98 Hz) has zero contribution factor of 4 or more.

Ｍ−２の表の見方は前記「ａ」の判定基準を示す表３１１に準ずる。 The way to read the table of M-2 is based on the table 311 showing the determination criteria of the “a”.

（８）「ｒ」の判定基準について
「ｒ」は表３２２に示される判定基準によって判定することができる。この表の見方は上記表３２１のＭ−１に準ずる。 (8) Determination criteria for “r” “r” can be determined according to the determination criteria shown in Table 322. The way of reading this table conforms to M-1 in Table 321 above.

本発明に係る音声認識方法の一例を示すブロック線図である。It is a block diagram which shows an example of the speech recognition method which concerns on this invention. 本発明に係る音声認識方法の他の例を示すブロック線図である。It is a block diagram which shows the other example of the speech recognition method which concerns on this invention. 「ア」の音声波形を示す図である。It is a figure which shows the audio | voice waveform of "a". 「イ」の音声波形を示す図である。It is a figure which shows the audio | voice waveform of "I". 「ウ」の音声波形を示す図である。It is a figure which shows the audio | voice waveform of "U". 「エ」の音声波形を示す図である。It is a figure which shows the audio | voice waveform of "d". 「オ」の音声波形を示す図である。It is a figure which shows the audio | voice waveform of "o". 「カ」の音声波形を示す図である。It is a figure which shows the audio | voice waveform of "K". 「サ」の音声波形を示す図である。It is a figure which shows the audio | voice waveform of "sa". 「タ」の音声波形を示す図である。It is a figure which shows the audio | voice waveform of "ta". 「ナ」の音声波形を示す図である。It is a figure which shows the audio | voice waveform of "na". 「ハ」の音声波形を示す図である。It is a figure which shows the audio | voice waveform of "c". 「マ」の音声波形を示す図である。It is a figure which shows the audio | voice waveform of "ma". 「ラ」の音声波形を示す図である。It is a figure which shows the audio | voice waveform of "La".

Claims

A frequency analysis is performed on a group of audio data sampled from the audio signal and A / D converted, and the amplitude spectrum or power spectrum obtained thereby is used to calculate the amplitude or power of the fundamental wave and each harmonic component included in the audio frequency domain. A speech recognition method characterized in that the ratio of the amplitude or power of each of the fundamental wave and each harmonic component is obtained as a contribution rate, and phonemes of consonants and vowels are specified from the appearance of the contribution rate.

The speech data group is divided into a consonant region and a vowel region, and the contribution rate is obtained by frequency analysis of the speech data group of the consonant region and the speech data group of the vowel region, and from the appearance of the contribution rate in each speech data group, the consonant 2. The speech recognition method according to claim 1, wherein phonemes of vowels are specified.

The ratio of the amplitude of each of the fundamental wave and each harmonic component to the sum of the amplitudes of the fundamental wave and each harmonic component included in the audio frequency region is used as the contribution rate. Speech recognition method.

The voice according to any one of claims 1 to 3, wherein the voice data group is sequentially subjected to frequency analysis for each analysis section of N pieces of voice data, and a contribution rate is obtained for each analysis section. Recognition method.

5. The speech recognition method according to claim 1, wherein a frequency analysis is performed after a window function process is performed on the speech data group.

The speech recognition method according to claim 1, wherein a Hamming window is used for window function processing, and fast Fourier analysis is used for frequency analysis.