JP2005070377A

JP2005070377A - Device and method for speech recognition, and speech recognition processing program

Info

Publication number: JP2005070377A
Application number: JP2003299498A
Authority: JP
Inventors: Koichi Nakagome; 浩一中込; Shigeru Kafuku; 滋加福
Original assignee: Casio Computer Co Ltd
Current assignee: Casio Computer Co Ltd
Priority date: 2003-08-25
Filing date: 2003-08-25
Publication date: 2005-03-17
Anticipated expiration: 2023-08-25
Also published as: JP4479191B2

Abstract

PROBLEM TO BE SOLVED: To further improve the speech recognition rate by making the window length of the power feature parameter, which best represents the abruptly changing portions of consonants, shorter than the window length of the frequency-axis feature parameter, which best represents the sustained portions of vowels, so that recognition captures sustained variation through the frequency-axis feature parameter and momentary variation through the power feature parameter even though the feature vectors are produced at the same period.

SOLUTION: The upper part shows a fixed-length speech section to be analyzed in the input speech waveform; the horizontal axis is time and the vertical axis is the amplitude (energy) of the speech waveform. The middle part shows four time windows F(i) (i = 1 to 4) for the frequency-axis feature parameter, shifted at equal intervals of shift length FS within the analysis frame, and the time windows P(i) for the power feature parameter. These time windows have fixed window lengths fL and pL, respectively, and the window length pL of the time window P(i) for the power feature parameter is shorter than the window length fL of the time window F(i) for the frequency-axis feature parameter.

COPYRIGHT: (C) 2005, JPO & NCIPI

Description

The present invention relates to a speech recognition apparatus, a speech recognition method, and a speech recognition processing program.

In recent years, research on speech recognition, in which a machine recognizes human speech, has been actively pursued, and some of it has reached practical use. A common approach compares the input speech with standard patterns prepared in advance and selects and outputs the pattern with the highest similarity.

The standard pattern differs depending on the speech recognition method employed. For example, in the method called DTW (Dynamic Time Warping; nonlinear time-axis matching), which uses DP (Dynamic Programming), a time series of typical speech feature values is used as the standard pattern.
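As a rough illustration of DTW matching, the following Python sketch aligns two feature sequences by dynamic programming and picks the closest template. It is not taken from the patent: the function names `dtw_distance` and `recognize_by_dtw`, the use of scalar features, and the absolute-difference local cost are all assumptions made for brevity.

```python
def dtw_distance(a, b):
    """Cumulative DTW cost between two sequences of scalar features."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] holds the best cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # stretch b
                cost[i][j - 1],      # stretch a
                cost[i - 1][j - 1],  # advance both
            )
    return cost[n][m]


def recognize_by_dtw(features, templates):
    """Return the label of the template (standard pattern) closest to features."""
    return min(templates, key=lambda label: dtw_distance(features, templates[label]))
```

With templates such as `{"up": [1, 2, 3], "down": [3, 2, 1]}`, an input that lingers on some values, for example `[1, 1, 2, 3, 3]`, still aligns to the `"up"` template because the warping absorbs the difference in timing.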

Speech recognition using an HMM (Hidden Markov Model) makes full use of statistical techniques. Each word contained in the speech data is represented by a relatively small number of states, roughly at the phoneme level, and the standard pattern for each word consists of the state transition probabilities and the probabilities with which each state outputs an input feature value, given as parameters. At present, the HMM is widely used as the mainstream speech recognition method.

In such speech recognition methods, the information needed for recognition, that is, the speech feature values, is extracted from the speech signal uttered by a person and compared with the standard patterns. The accuracy of this feature extraction determines the overall performance of speech recognition (processing speed and recognition rate).

The conventional method of extracting speech feature values from input speech sets time windows of a fixed window length, such as Hamming windows, within the analysis frame, shifting them in time by a fixed interval. The speech samples of the waveform within the analysis frame are cut out by each time window in turn, and the speech feature values are extracted from the samples in each time window.

Such speech feature values are obtained by combining two kinds of features: a feature obtained by transforming the predetermined number of speech samples cut out by each time window onto the frequency axis (hereinafter, the frequency-axis feature parameter), and a feature obtained by taking the sum of squares of the speech amplitudes quantized by linear PCM (Pulse Code Modulation), or the logarithm of that sum (hereinafter, the power feature parameter).

For example, a feature value may consist of 12 frequency-axis feature parameter components (12 dimensions) and 1 power feature parameter component (1 dimension), together with the differences between each of these and the corresponding component of the immediately preceding time window, that is, a further 12 frequency-axis components and 1 power component, for a total of 26 components forming a 26-dimensional vector.
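The 26-dimensional layout described above can be sketched as follows. The helper name `append_deltas` and the ordering of the components (static values first, differences second) are assumptions for illustration; the text only fixes the component counts.

```python
def append_deltas(current, previous):
    """Concatenate static features with their difference from the previous window.

    current, previous: equal-length sequences, e.g. 12 frequency-axis
    components followed by 1 power component (13 values each).
    """
    if len(current) != len(previous):
        raise ValueError("current and previous must have the same length")
    deltas = [c - p for c, p in zip(current, previous)]
    return list(current) + deltas
```

Passing two 13-component vectors yields a 26-component vector, matching the 12 + 1 + 12 + 1 breakdown in the example.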

As described above, the speech feature values of the input speech are extracted from the speech samples cut out by each time window, with the time windows shifted at equal intervals within the analysis frame so that they partially overlap. If the overlap between time windows is set short, the number of time windows decreases, so fewer speech samples are cut out from the input speech. The number of feature extraction operations and the number of subsequent recognition operations are both kept low, which raises the processing speed; however, the sampling becomes coarser, the statistics deteriorate, and the speech recognition rate falls.

Conversely, if the overlap between time windows is set long, the number of time windows increases, the sampling becomes finer, and more speech samples are cut out from the input speech. The statistics improve and accurate speech recognition becomes possible, but the number of feature extraction operations and the number of subsequent recognition operations both increase, lowering the processing speed.
That is, raising the processing speed lowers the recognition rate, and raising the recognition rate lowers the processing speed, so it has been difficult to improve both at once.
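The trade-off can be made concrete by counting how many windows fit in a frame for a given shift. This is a minimal sketch; the function name `count_windows` and the sample counts in the usage note are illustrative assumptions, not values from the patent.

```python
def count_windows(frame_len, window_len, shift):
    """Number of full windows of window_len that fit in frame_len when
    successive windows start shift samples apart."""
    if shift <= 0:
        raise ValueError("shift must be positive")
    if frame_len < window_len:
        return 0
    return (frame_len - window_len) // shift + 1
```

For a 4000-sample frame and 400-sample windows, a 160-sample shift (long overlap) gives 23 windows, while a 320-sample shift (short overlap) gives 12. Roughly halving the number of feature extraction and recognition operations is what buys the speed, at the cost of coarser sampling.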

To solve this problem, it has been proposed to make the period at which the frequency-axis feature parameter is extracted different from the period at which the power feature parameter is extracted, thereby optimizing the recognition processing speed and the recognition rate so that speech recognition is performed as efficiently as possible (see, for example, Patent Document 1).
Patent Document 1: JP 2000-356790 A

Human hearing is known to be more sensitive to sudden sounds than to steady sounds. Compared with steady sounds, even slight differences between momentary sounds can be clearly distinguished. In other words, humans discriminate sudden speech sounds more sensitively and finely when recognizing what was uttered. This could be handled simply by making the window length or the time-window shift width finer, but that increases the amount of processing. Moreover, if the time window is shortened, the power feature parameter, which best represents the abruptly changing consonant portions of speech, can still be extracted, but there is a risk that the frequency-axis feature parameter, which best represents the sustained vowel portions, can no longer be extracted.

Accordingly, an object of the present invention is to enable sudden sounds to be discriminated and recognized in the same way as humans do, without increasing the amount of processing.

To achieve the above object, the present invention sets time windows of a predetermined length at a predetermined period on the speech to be analyzed, extracts, with each time window as the unit of processing, a feature value consisting of a frequency-axis feature parameter relating to the frequency of the speech and a power feature parameter relating to the amplitude of the speech, and recognizes the speech to be analyzed on the basis of the extracted feature value. It is characterized in that the feature value is extracted with the time window used only for the power feature parameter made shorter than the time window used only for the frequency-axis feature parameter.

The length of the time window for extracting only the frequency-axis feature parameter is preferably equal to or longer than the period of the fundamental pitch component in the vowel portions of the input speech.
It is also preferable to perform speech recognition by computing a difference feature parameter that represents the difference between temporally adjacent feature parameters among the feature parameters generated sequentially at the predetermined period.
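The pitch-period condition on the frequency-axis window translates directly into a minimum window length in samples. The sketch below assumes a hypothetical helper `min_window_samples`; the sampling rate and pitch values in the usage note are illustrative and do not come from the patent.

```python
import math


def min_window_samples(pitch_hz, sample_rate_hz):
    """Smallest window length (in samples) that spans one full pitch period."""
    if pitch_hz <= 0 or sample_rate_hz <= 0:
        raise ValueError("pitch and sample rate must be positive")
    return math.ceil(sample_rate_hz / pitch_hz)
```

For example, at an assumed 16 kHz sampling rate a 100 Hz fundamental has a 10 ms period, so the frequency-axis window would need at least 160 samples, whereas the power window is free to be shorter.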

The invention may further include band-pass filter means that passes only the 3 kHz to 8 kHz band of the speech to be analyzed. In that case, the power feature parameter of the input speech that has passed through the band-pass filter means is extracted sequentially at the predetermined period, and the difference between temporally adjacent extracted power feature parameters is used by the speech recognition means.

With the above configuration, the window length of the power feature parameter, which best represents the abruptly changing portions of consonants, is made shorter than the window length of the frequency-axis feature parameter, which best represents the sustained portions of vowels. As a result, although the feature vectors are produced at the same period, recognition can capture sustained variation through the frequency-axis feature parameter and instantaneous variation through the power feature parameter, which further improves the speech recognition rate.

FIG. 1 is a block diagram showing the internal configuration of a speech recognition apparatus 1 that uses HMM models according to an embodiment of the present invention.
As shown in FIG. 1, the speech recognition apparatus 1 comprises a time window position setting unit 11, a speech feature extraction unit 12, a comparison unit 13, and standard patterns (HMM models 14-1 to 14-n) stored in advance in a storage device (not shown).

FIG. 2 shows how the time window position setting unit 11 sets time windows on the input speech waveform. As an example, four time windows are cut out from the speech section to be analyzed (the analysis frame).
The upper part of FIG. 2 shows a fixed-length speech section of the input speech waveform that is to be analyzed; the horizontal axis is time and the vertical axis is the amplitude (energy) of the speech waveform. The middle part of FIG. 2 shows four time windows F(i) (i = 1 to 4) for the frequency-axis feature parameter, shifted at equal intervals of shift length FS within the analysis frame, and the time windows P(i) for the power feature parameter. These time windows have fixed window lengths fL and pL, respectively, and, as is clear from FIG. 2, the window length pL of the time window P(i) is shorter than the window length fL of the time window F(i).

Adjacent time windows F(i) for the frequency-axis feature parameter partially overlap. In this embodiment the time window P(i) for the power feature parameter has the same length as the period FS, and its start position is delayed by an offset relative to the start of the time window F(i). The offset ensures that the start position of P(i) is always at the same relative time position with respect to the start position of F(i); it may be 0.
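The window arrangement of FIG. 2 can be sketched as a list of sample ranges. The parameter names follow the text (fL, pL, FS, offset), but the function `window_layout`, the half-open index convention, and the numbers in the usage note are assumptions chosen only to reproduce a four-window example.

```python
def window_layout(frame_len, fL, pL, FS, offset=0):
    """Return [((f_start, f_end), (p_start, p_end)), ...] in samples.

    F(i) windows have length fL and start every FS samples.
    P(i) windows have length pL and start `offset` samples after F(i).
    Ranges are half-open: start inclusive, end exclusive.
    """
    if pL >= fL:
        raise ValueError("the power window must be shorter than the frequency-axis window")
    layout = []
    start = 0
    while start + fL <= frame_len:
        p_start = start + offset
        layout.append(((start, start + fL), (p_start, p_start + pL)))
        start += FS
    return layout
```

With `frame_len=880, fL=400, pL=160, FS=160, offset=120`, this yields four overlapping F(i) windows, each paired with a shorter P(i) window that begins 120 samples after its F(i), so the relative position of P(i) within F(i) never changes.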

The window lengths fL and pL, the offset, and the time window setting period FS shown in FIG. 2 are set in the time window position setting unit 11 shown in FIG. 1.
Following the window lengths fL and pL, the offset, and the time window setting period FS, the time window position setting unit 11 sets the time windows F(i) and P(i) one after another within the analysis frame and outputs to the speech feature extraction unit 12 a start control signal that starts extraction of the speech feature for the frequency-axis parameter.

After the time corresponding to the offset has elapsed, it outputs to the speech feature extraction unit 12 a start control signal that starts extraction of the speech feature for the power parameter. After a further interval equal to the window length pL, it outputs an end control signal that ends extraction of the speech feature for the power parameter.
Then, once the window length fL has elapsed since extraction for the frequency-axis parameter started, it outputs an end control signal that ends that extraction. This series of operations is repeated every period FS until the analysis frame ends.

On the basis of the start and end control signals for the time window F(i) received from the time window position setting unit 11, the speech feature extraction unit 12 cuts the input speech out with the time window F(i) in the analysis frame and extracts the frequency-axis feature parameter f(i) (for example, a D-dimensional vector) from the speech data d(n) in F(i). It computes the power feature parameter p(i) (a one-dimensional vector) from the speech data d(n) in the time window P(i).

The power feature parameter p(i) is a feature relating to the amplitude of the speech. It is a one-dimensional vector that can be obtained with relatively little computation, for example by calculating the sum of squares of the speech data d(n) or its logarithm. The frequency-axis feature parameter f(i) is a feature relating to the frequency of the speech, such as the cepstrum or mel-cepstrum. It is a D-dimensional vector made up of multiple (for example, D) results computed from the speech data d(n) by an FT (Fourier transform), logarithmic transformation, mel-scale transformation, and so on.
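The two kinds of parameter can be sketched in plain Python as follows. This is an illustration under assumptions, not the patent's implementation: the function names are hypothetical, a plain (non-mel) real cepstrum stands in for f(i), a naive O(N²) DFT is used to avoid external libraries, and a small constant guards the logarithms.

```python
import cmath
import math

_EPS = 1e-10  # assumed guard against log(0)


def log_power(samples):
    """Power parameter p(i): logarithm of the sum of squared samples."""
    return math.log(sum(x * x for x in samples) + _EPS)


def real_cepstrum(samples, num_coeffs):
    """Frequency-axis parameter f(i): first num_coeffs real-cepstrum values.

    Steps: Hamming window -> DFT -> log magnitude -> inverse DFT.
    Requires at least two samples.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    windowed = [
        s * (0.54 - 0.46 * math.cos(2 * math.pi * t / (n - 1)))
        for t, s in enumerate(samples)
    ]
    log_mag = []
    for k in range(n):
        xk = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        log_mag.append(math.log(abs(xk) + _EPS))
    # The log-magnitude spectrum of a real signal is real and symmetric,
    # so its inverse DFT reduces to a cosine sum.
    return [
        sum(log_mag[k] * math.cos(2 * math.pi * k * q / n) for k in range(n)) / n
        for q in range(num_coeffs)
    ]
```

In the scheme described here, `log_power` would be applied to the samples in the short window P(i) and `real_cepstrum` to the samples in the longer window F(i). A mel-cepstrum would add a mel-scale filter bank between the magnitude and logarithm steps, which this sketch omits.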

As shown in FIG. 3, the speech feature extraction unit 12 also computes the difference Δf(i) between temporally adjacent frequency-axis feature parameters (a vector with the same dimension as f(i), for example D dimensions) and the difference Δp(i) between adjacent power feature parameters (a one-dimensional vector), and adds Δf(i) and Δp(i) to the unit speech feature. In this embodiment, ΔΔp(i), the difference of the differences Δp(i), is also computed and added to the unit speech feature.

Here, Δp(i) represents the dynamic characteristics of the power feature parameter. The time window F(i) for the frequency-axis feature parameter must span the fundamental pitch component that characterizes vowels, whereas the power feature parameter has its own dedicated, shorter window length pL. Δp(i) therefore represents well the consonant portions, which change steeply in a short time, and good recognition results are obtained.

The comparison unit 13 compares and matches the unit speech feature, made up of the various parameters extracted within the frame, against the standard patterns stored in advance in the storage device (not shown), recognizes the input speech, and outputs the recognition result. In this embodiment, speech recognition is performed by a statistical technique based on HMMs.

The speech recognition technique based on the HMM statistical approach is as follows. Using the features of the input speech extracted by the speech feature extraction unit 12, recognition is performed against standard patterns given in advance, called HMM models 14-1 to 14-n. Each HMM model represents a word contained in the speech data by a relatively small number of states, roughly at the phoneme level, and gives as parameters, for each word, the state transition probabilities and the probabilities with which each state outputs an input feature. The comparison unit 13 computes the likelihood (probability) with which each of the HMM models 14-1 to 14-n outputs the given speech feature, and outputs the word corresponding to the HMM model with the highest probability as the recognition result.
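The selection of the most likely word model can be sketched with the forward algorithm. To keep the example short it assumes discrete observation symbols rather than continuous feature vectors, and the names `forward_likelihood` and `recognize` as well as the `(initial, transition, emission)` tuple layout are hypothetical; the patent does not specify how the likelihood is computed.

```python
def forward_likelihood(obs, init, trans, emit):
    """P(obs | model) for a discrete-output HMM via the forward algorithm.

    init[s]      : probability of starting in state s
    trans[r][s]  : probability of moving from state r to state s
    emit[s][o]   : probability that state s outputs symbol o
    """
    states = range(len(init))
    alpha = [init[s] * emit[s][obs[0]] for s in states]
    for o in obs[1:]:
        alpha = [
            sum(alpha[r] * trans[r][s] for r in states) * emit[s][o]
            for s in states
        ]
    return sum(alpha)


def recognize(obs, models):
    """Return the word whose model gives the observation the highest likelihood."""
    return max(models, key=lambda word: forward_likelihood(obs, *models[word]))
```

Here `models` maps each word to its parameter tuple, playing the role of the stored HMM models; `recognize` corresponds to the comparison unit choosing the model with the maximum probability. A practical system would work in the log domain to avoid underflow on long observation sequences.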

Next, the operation is described. FIG. 4 is a flowchart of the speech recognition processing of the speech recognition apparatus 1.
First, when speech is input to the speech recognition apparatus 1, an A/D converter (not shown) in the input stage samples it at a predetermined sampling interval and PCM-encodes it by the linear PCM method, converting it into speech data d(n), which is output to the time window position setting unit 11 and the speech feature extraction unit 12.

When the speech data d(n) is input to the time window position setting unit 11, start and end control signals specifying the time windows F(i) and P(i), such as Hamming windows, are output from the time window position setting unit 11 to the speech feature extraction unit 12 on the basis of the window lengths fL and pL and the offset (step S40). Next, on the basis of the control signals output in step S40, the speech feature extraction unit 12 cuts out the speech data d(n) in the time window F(i), and cuts out P(i) starting at the position reached after the time corresponding to the offset has elapsed from the start of F(i) (step S41).

The speech feature extraction unit 12 computes the power feature parameter p(i) by taking the sum of squares, or its logarithm, of the speech data d(n) in the time window P(i) cut out in step S41. It also extracts the frequency-axis feature parameter f(i) by transforming the speech data d(n) in F(i) onto the frequency axis with an FT or the like (step S42). The speech feature of the time window F(i) is stored together with the extracted power feature parameter p(i) (step S43).

Next, it is determined whether the speech section has ended at the portion cut out by this time window F(i) (step S44). If it has not ended, the process moves to step S45, shifts from the time windows F(i) and P(i) to F(i+1) and P(i+1), and returns to step S41.

If it is determined in step S44 that the speech section has ended at the time window F(i), the speech features stored for each time window F(i) are arranged in time series in step S46.

Next, for the speech features arranged in time series in step S46, the speech feature extraction unit 12 computes the differences (Δ and ΔΔ) by the following formula to generate a feature vector sequence (θ is the number of preceding and following frames taken into account).
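The formula itself does not appear in this text. As a stand-in, the sketch below uses the regression-style delta that is common in speech front ends, with θ as the number of frames considered on each side. Treat it as an assumption: the function name `delta`, the edge handling by repeating the first and last frames, and the formula in the comment are not confirmed by the patent.

```python
def delta(seq, theta=2):
    """Regression-style difference of a scalar time series.

    Assumed formula (not from the patent text):
        d[t] = sum_{k=1..theta} k * (c[t+k] - c[t-k]) / (2 * sum_{k=1..theta} k^2)
    Frames beyond either end are replaced by the first or last frame.
    """
    if theta < 1:
        raise ValueError("theta must be at least 1")
    n = len(seq)
    denom = 2 * sum(k * k for k in range(1, theta + 1))
    out = []
    for t in range(n):
        acc = 0.0
        for k in range(1, theta + 1):
            ahead = seq[min(t + k, n - 1)]
            behind = seq[max(t - k, 0)]
            acc += k * (ahead - behind)
        out.append(acc / denom)
    return out
```

Under this assumption, Δp would be `delta(p_series)` and ΔΔp would be `delta(delta(p_series))`; for a vector feature such as f(i), the same function would be applied to each component separately.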

The feature vector sequence (f(i), Δp(i), Δf(i), ΔΔp(i)) obtained in this way is compared and matched in the comparison unit 13 against the standard patterns stored in advance (HMM model 1 to HMM model n) (step S47). The result obtained in step S47 is output to an output stage (not shown) of the speech recognition apparatus 1 (step S48), and the series of speech recognition processing ends.

Thus, in the speech recognition apparatus of this embodiment, when time windows are set on the analysis frame and speech features are extracted, the window length of the time window P(i) used to extract the power feature parameter is set shorter than that of the time window used to extract the frequency-axis feature parameter, even though both are set at the same period.
Consequently, the frequency-axis feature parameter is extracted with a window length equal to or longer than the period of the fundamental pitch component in vowel portions, while the dynamic feature (Δp(i)) of the power feature parameter at the same time is obtained with finer responsiveness, which improves the speech recognition rate.

In this embodiment, (f(i), Δp(i), Δf(i), ΔΔp(i)) is used as the feature vector sequence. Alternatively, a band-pass filter that passes only the 3 kHz to 8 kHz band, which contains much of the consonant information in human speech, may be provided; a short-time power-only feature vector Δtp(i) is extracted with the short window length pL from the speech data passed through this filter, and (f(i), Δp(i), Δf(i), ΔΔp(i), Δtp(i)) is used as the feature vector sequence.
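One way to approximate the band-limited short-time power is to sum the spectral energy of a short window over the 3 kHz to 8 kHz bins. This sketch swaps the band-pass filter for an ideal frequency-domain selection using a naive DFT, which is an assumption made to stay within the standard library; the name `band_power` and the 16 kHz sampling rate in the usage note are likewise illustrative.

```python
import cmath
import math


def band_power(samples, sample_rate_hz, lo_hz=3000.0, hi_hz=8000.0):
    """Sum of squared DFT magnitudes for bins between lo_hz and hi_hz.

    Only non-negative frequencies up to the Nyquist limit are examined.
    """
    n = len(samples)
    total = 0.0
    for k in range(n // 2 + 1):
        freq = k * sample_rate_hz / n
        if lo_hz <= freq <= hi_hz:
            xk = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            total += abs(xk) ** 2
    return total
```

At an assumed 16 kHz sampling rate, a 4 kHz tone contributes strongly to `band_power` while a 1 kHz tone contributes almost nothing. Δtp(i) would then be the difference between the values obtained for successive short windows of length pL.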

FIG. 1 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention.
FIG. 2 shows how the time window position setting unit sets time windows on the input speech waveform.
FIG. 3 shows the feature parameters used in the speech recognition apparatus according to the embodiment of the present invention.
FIG. 4 is a flowchart showing the operation of the speech recognition apparatus according to the embodiment of the present invention.

Explanation of symbols

1 Speech recognition apparatus
11 Time window position setting unit
12 Speech feature extraction unit
13 Comparison unit
14-1 to 14-n HMM models

Claims

A speech recognition apparatus comprising: a feature extraction unit that sets time windows of a predetermined length at a predetermined period on speech to be analyzed and extracts, with each time window as a unit of processing, a feature value consisting of a frequency-axis feature parameter relating to the frequency of the speech and a power feature parameter relating to the amplitude of the speech; and a speech recognition unit that recognizes the speech to be analyzed on the basis of the feature value extracted by the feature extraction unit,
wherein the feature extraction unit extracts the feature value with the time window for extracting only the power feature parameter made shorter than the time window for extracting only the frequency-axis feature parameter.

The speech recognition apparatus according to claim 1, wherein the length of the time window for extracting only the frequency-axis feature parameter is equal to or longer than the period of the fundamental pitch component in the vowel portions of the input speech.

The speech recognition apparatus according to claim 1, wherein the feature extraction unit computes a difference feature parameter representing the difference between temporally adjacent feature parameters among the feature parameters generated sequentially at the predetermined period, and supplies the difference feature parameter to the speech recognition unit.

The speech recognition apparatus according to claim 1, wherein the feature extraction unit further includes a band-pass filter unit that passes only the 3 kHz to 8 kHz band of the speech to be analyzed, sequentially extracts, at the predetermined period, the power feature parameter of the input speech that has passed through the band-pass filter unit, and supplies the difference between temporally adjacent extracted power feature parameters to the speech recognition unit.

A speech recognition method that sets time windows of a predetermined length at a predetermined period on speech to be analyzed, extracts, with each time window as a unit of processing, a feature value consisting of a frequency-axis feature parameter relating to the frequency of the speech and a power feature parameter relating to the amplitude of the speech, and recognizes the speech to be analyzed on the basis of the extracted feature value,
wherein the feature value is extracted with the time window for extracting only the power feature parameter made shorter than the time window for extracting only the frequency-axis feature parameter.

A speech recognition processing program comprising: a feature extraction step of setting time windows of a predetermined length at a predetermined period on speech to be analyzed and extracting, with each time window as a unit of processing, a feature value consisting of a frequency-axis feature parameter relating to the frequency of the speech and a power feature parameter relating to the amplitude of the speech; and a speech recognition step of recognizing the speech to be analyzed on the basis of the extracted feature value,
wherein the feature value is extracted with the time window for extracting only the power feature parameter made shorter than the time window for extracting only the frequency-axis feature parameter.