JP4760179B2

JP4760179B2 - Voice feature amount calculation apparatus and program

Info

Publication number: JP4760179B2
Application number: JP2005207775A
Authority: JP
Inventors: 靖雄吉岡
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2005-07-15
Filing date: 2005-07-15
Publication date: 2011-08-31
Anticipated expiration: 2025-07-15
Also published as: JP2007025296A

Description

The present invention relates to a technique for generating signals for use in a speech recognition apparatus.

Various speech recognition apparatuses that perform speech recognition automatically have been proposed. In general, a speech recognition apparatus recognizes the words uttered by a speaker on the basis of the similarity between speech feature amounts stored in advance for various words and the feature amount of the speech uttered by the speaker.

Various methods have been proposed for calculating the speech feature amounts used in speech recognition apparatuses. In one such method, the speech spectrum is filtered by filters provided for each of a plurality of frequency bands, and the resulting filter bank output values are transformed by a discrete cosine transform or an inverse discrete Fourier transform to obtain a coefficient sequence indicating the feature amount of the speech. MFCC (Mel-Frequency Cepstrum Coefficient) is a widely used example of a coefficient sequence calculated in this way. The mechanism of speech recognition according to the prior art is described below, taking as an example the case where MFCCs obtained by the discrete cosine transform are used.

FIG. 7 is a block diagram showing the configuration of a speech recognition system 9 according to the prior art. The speech recognition system 9 includes a speech signal generation device 90 that converts a speaker's voice into a speech signal, a speech feature amount calculation device 91 that calculates MFCCs from the speech signal generated by the speech signal generation device 90, and a speech recognition device 92 that performs speech recognition using the MFCCs calculated by the speech feature amount calculation device 91.

The speech signal generation device 90 includes a speech signal generation unit 901 that picks up speech and converts it into a speech signal, and an utterance section extraction unit 902 that cuts out, as an utterance section, a section of the speech signal generated by the speech signal generation unit 901 whose amplitude is, for example, equal to or greater than a predetermined threshold. The speech signal of the utterance section cut out by the utterance section extraction unit 902 is divided into frames of, for example, 40 milliseconds, and is then output from the speech signal generation device 90 to the speech feature amount calculation device 91.
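
The extraction and framing steps described above can be sketched as follows. This is a minimal illustration, not the device's actual processing: the function names are hypothetical, the utterance section is simplified to the span between the first and last sample that reaches the threshold, and the frames are assumed to be non-overlapping with any trailing partial frame dropped.

```python
def extract_utterance(samples, threshold):
    """Return the span from the first to the last sample whose magnitude reaches the threshold."""
    active = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not active:
        return []
    return samples[active[0]:active[-1] + 1]


def split_into_frames(samples, sample_rate, frame_ms=40):
    """Cut the signal into consecutive, non-overlapping frames of frame_ms milliseconds.

    Assumption: a trailing partial frame is discarded.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]
```

At a sampling rate of 8000 Hz, for instance, a 40-millisecond frame holds 320 samples.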

The speech feature amount calculation device 91 includes five processing units. A windowing processing unit 911 multiplies the frame-by-frame speech signal received from the speech signal generation device 90 by a time window function such as a Hamming window while sliding it along the time axis, thereby generating a speech signal in which the high-frequency noise caused by frame division is reduced. An FFT processing unit 912 applies an FFT (Fast Fourier Transform) to the windowed frame-by-frame speech signal to calculate its spectrum. A mel-scale band filter processing unit 913 filters the spectrum calculated by the FFT processing unit 912 with a group of filters called mel-scale band filters (described later) to calculate index values indicating the power of the frequency components in each of a plurality of frequency bands. A logarithmic value calculation unit 914 calculates the logarithm of each index value calculated by the mel-scale band filter processing unit 913. A discrete cosine transform processing unit 915 calculates the MFCCs by applying a discrete cosine transform (described later) to the set of logarithmic values calculated by the logarithmic value calculation unit 914.

The mel-scale band filters are a set of filters defined for a plurality of center frequencies arranged at equal intervals on the mel frequency axis, which is obtained by converting the linear frequency axis according to equation (1). The multiplier of each filter varies linearly so that it is 1 at the filter's own center frequency and 0 (zero) at the center frequencies of the adjacent filters.

FIG. 8 is a graph showing the mel-scale band filters. As shown in FIG. 8, the filter 95 with center frequency f_k (Hz), for example, has a triangular shape: its multiplier is 1 at the center frequency f_k (Hz) and 0 at the center frequency f_{k-1} (Hz) of the adjacent filter on the low-frequency side and at the center frequency f_{k+1} (Hz) of the adjacent filter on the high-frequency side. Each triangle in FIG. 8 is called a filter bank.

The linear frequency axis is converted to the mel frequency axis in order to express distances between frequencies in a way that matches human hearing, which is less sensitive to pitch changes in the high-frequency band than to pitch changes in the low-frequency band.

The mel-scale band filter processing unit 913 multiplies the spectrum calculated by the FFT processing unit 912 by each filter bank of the mel-scale band filters and sums the products, thereby calculating a filter bank output value r_k (where k is the filter bank number) as an index value of the spectral power contained in the frequency band covered by each filter bank. Hereinafter, the number of filter banks is denoted by L.
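
The triangular filters and the filter bank output values r_k can be sketched as below. Equation (1) is not reproduced in this text, so the mel conversion here uses a commonly cited formula purely as an assumption. The function names are hypothetical, and so is the choice of 0 Hz and the Nyquist frequency as the outer edges of the first and last filters.

```python
import math


def hz_to_mel(f_hz):
    # Assumed mel formula; the patent's own equation (1) may differ.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    # Inverse of the assumed formula above.
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def center_frequencies(num_filters, f_max_hz):
    """Return num_filters + 2 frequencies (Hz) equally spaced on the mel axis.

    Entries 1..num_filters are the filter centers; entries 0 and -1 are the outer edges.
    """
    step = hz_to_mel(f_max_hz) / (num_filters + 1)
    return [mel_to_hz(step * k) for k in range(num_filters + 2)]


def triangular_weight(f_hz, f_prev, f_center, f_next):
    """Multiplier 1 at f_center, falling linearly to 0 at the neighbouring centers."""
    if f_prev < f_hz <= f_center:
        return (f_hz - f_prev) / (f_center - f_prev)
    if f_center < f_hz < f_next:
        return (f_next - f_hz) / (f_next - f_center)
    return 0.0


def filter_bank_outputs(power_spectrum, sample_rate, num_filters):
    """Compute r_k: the weighted sum of spectrum bins under each triangular filter."""
    n_bins = len(power_spectrum)
    nyquist = sample_rate / 2.0
    bin_hz = [nyquist * b / (n_bins - 1) for b in range(n_bins)]
    edges = center_frequencies(num_filters, nyquist)
    outputs = []
    for k in range(1, num_filters + 1):
        r_k = sum(p * triangular_weight(f, edges[k - 1], edges[k], edges[k + 1])
                  for p, f in zip(power_spectrum, bin_hz))
        outputs.append(r_k)
    return outputs
```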

The logarithmic value calculation unit 914 calculates the logarithmic value R_k of each filter bank output value r_k calculated by the mel-scale band filter processing unit 913. The discrete cosine transform processing unit 915 applies a discrete cosine transform according to equation (2) to the logarithmic values R_k calculated by the logarithmic value calculation unit 914, thereby calculating the MFCCs as a coefficient sequence. In equation (2), C_i denotes the i-th order coefficient of the MFCCs.
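
The logarithm and discrete cosine transform steps might look like the sketch below. Equation (2) is not reproduced in this text; the code assumes the widely used DCT-II form C_i = sqrt(2/L) * sum over k = 1..L of R_k * cos((k - 1/2) * i * pi / L), which may differ from the patent's equation in scaling. The function names and the small floor applied before the logarithm are also assumptions.

```python
import math


def log_filter_bank(outputs, floor=1e-10):
    """R_k = log(r_k); the floor (an assumption) guards against log(0)."""
    return [math.log(max(r, floor)) for r in outputs]


def mfcc_from_log_values(log_values, num_coeffs):
    """Apply the assumed DCT-II to R_1..R_L and return C_1..C_num_coeffs."""
    L = len(log_values)
    scale = math.sqrt(2.0 / L)
    coeffs = []
    for i in range(1, num_coeffs + 1):
        total = sum(R * math.cos((k - 0.5) * i * math.pi / L)
                    for k, R in enumerate(log_values, start=1))
        coeffs.append(scale * total)
    return coeffs
```

With this form, a completely flat set of log values yields coefficients C_1 onward that are all zero, since those coefficients describe only the shape of the spectral envelope and not its overall level.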

Here, valid values are obtained for i up to about one half of L. For example, if the number of filter banks is 12, C_1, C_2, ..., C_6 are obtained as valid MFCCs. The speech feature amount calculation device 91 outputs the C_i group calculated as described above to the speech recognition device 92.

The speech recognition device 92 has two operation modes: a learning mode and a recognition mode. In the learning mode, the speech recognition device 92 receives from the speech feature amount calculation device 91 a C_i group for each frame of the speech signal representing the speaker's voice, and sequentially stores the per-frame C_i groups for the speech signal of one utterance in a database 921 in association with the word pronounced by the speaker. Accordingly, for the word "ohayou" ("good morning"), for example, the database 921 stores in time series, as the coefficient sequence groups indicating the corresponding feature amount, as many C_i groups as there are frames in the speech signal of "ohayou". Hereinafter, a time-series collection of C_i groups is called a "C_i group sequence". By pronouncing various words one after another and entering each pronounced word into the speech recognition device 92 with an operating means such as a keyboard (not shown) connected to the speech recognition device 92, the speaker can have the C_i group sequence corresponding to each word stored in the database 921.

The speech recognition device 92 also includes a DP matching unit 922 and a determination unit 923, which perform the processing in the recognition mode. The DP matching unit 922 uses the DP (Dynamic Programming) matching method to calculate a distance indicating the similarity between the C_i group sequence received from the speech feature amount calculation device 91 and each C_i group sequence stored in the database 921. The determination unit 923 determines which of the C_i group sequences stored in the database 921 yields the shortest distance. The determination unit 923 then identifies the word stored in the database 921 in association with that C_i group sequence as the word pronounced by the speaker, and either transmits data indicating the identified word to another device or notifies the user of it as a message.
The above is the mechanism by which the speech recognition system 9 according to the prior art performs speech recognition.
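
The recognition step can be sketched with a basic dynamic-programming alignment. The text does not specify the local distance or the allowed alignment steps, so the Euclidean frame distance, the three-way step pattern, and the function names below are all assumptions. The database is modelled as a plain dictionary from word to stored C_i group sequence.

```python
import math


def frame_distance(u, v):
    """Euclidean distance between two coefficient vectors (an assumed local cost)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dp_matching_distance(seq_a, seq_b):
    """Accumulated cost of the best monotonic alignment between two vector sequences."""
    inf = float("inf")
    rows, cols = len(seq_a), len(seq_b)
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[rows][cols]


def recognize(observed, database):
    """Return the word whose stored sequence has the shortest distance to observed."""
    return min(database, key=lambda word: dp_matching_distance(observed, database[word]))
```

Because the alignment may repeat frames, an utterance spoken more slowly than the stored one can still reach a small distance.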

In the speech recognition system 9, speech recognition is performed with the expected accuracy if the acoustic environment in which the speaker is located is low in noise, whether the speech recognition device 92 is in the learning mode or the recognition mode. In general, however, the acoustic environment in which the speaker speaks contains a non-negligible level of environmental noise. The MFCCs generated by the speech recognition system 9 therefore represent the feature amount of a sound in which the speaker's voice is mixed with environmental noise. As a result, the speech recognition system 9 does not always perform speech recognition with the expected accuracy.

To solve the above problem, a noise reduction process such as spectral subtraction could be applied to the speech signal. Spectral subtraction is a technique that extracts the spectrum of the speech signal by subtracting the spectrum of a sound signal representing the environmental noise from the spectrum of a sound signal representing the speech mixed with the environmental noise. For example, Patent Document 1 discloses a technique for detecting speech sections in a sound signal using spectral subtraction.
Patent Document 1: JP 2000-47696 A
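
For orientation, the core of spectral subtraction can be sketched in a few lines. This is a generic textbook-style illustration, not the method of Patent Document 1: the function names, the averaging of noise-only frames, and the flooring of negative results are assumptions.

```python
def estimate_noise_spectrum(noise_frames):
    """Average the power spectra of frames assumed to contain only environmental noise."""
    count = len(noise_frames)
    return [sum(frame[b] for frame in noise_frames) / count
            for b in range(len(noise_frames[0]))]


def spectral_subtraction(noisy_power, noise_power, floor=0.0):
    """Subtract the noise spectrum bin by bin, clamping each result at floor."""
    return [max(s - n, floor) for s, n in zip(noisy_power, noise_power)]
```

In practice this has to run on every spectrum bin of every frame, on top of estimating and updating the noise spectrum.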

Many noise reduction processes such as spectral subtraction are highly effective but computationally expensive, and can be difficult to implement in devices with severe resource constraints, such as portable terminal devices.

In view of the above, an object of the present invention is to provide means that, in a system for calculating a coefficient sequence indicating a feature amount of speech, make it possible to reduce the influence of environmental noise contained in the coefficient sequence by simple, low-load processing.

To achieve the above object, the present invention provides a speech feature amount calculation device comprising: spectrum calculation means for calculating, from a speech signal, the spectrum of the speech signal; filter means for applying, to the spectrum calculated by the spectrum calculation means, filter processing corresponding to each of a plurality of predetermined frequency bands, thereby calculating, for each of the plurality of frequency bands, an index value indicating the power of the frequency components of the speech signal within that frequency band; logarithmic value calculation means for calculating the logarithmic value of each of the plurality of index values calculated by the filter means for the plurality of frequency bands; and conversion means for calculating a numerical sequence indicating the feature amount of the speech represented by the speech signal by converting each of the plurality of index values according to equation (3). Where m is the minimum and M the maximum of the plurality of logarithmic values calculated by the logarithmic value calculation means for the plurality of frequency bands, equation (3) is defined in terms of a constant p, a constant q (where m ≤ p < q ≤ M) and a constant n (where n > 1), and satisfies the following conditions on the output value y for an input value x: (a) y ≥ m when x = m; (b) y ≤ M when x = M; (c) in the range m ≤ x ≤ M, the rate of change of y with respect to x is always 0 or more; and (d) in the range p ≤ x ≤ q, the rate of change of the rate of change of y with respect to x is always positive.
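
Equation (3) itself is not reproduced in this text. As a hedged illustration only, one simple function with a parameter n > 1 that meets conditions (a) to (d) is a power law applied to the input after rescaling it to the interval from 0 to 1. The sketch below uses that hypothetical stand-in; the function names are assumptions, and the patent's actual equation (3) may differ.

```python
def power_law_transform(x, m, M, n):
    """Hypothetical stand-in for equation (3): rescale to [0, 1], raise to n, rescale back."""
    if M == m:
        return x  # degenerate frame: every log value is identical
    t = (x - m) / (M - m)
    return m + (M - m) * t ** n


def transform_log_values(log_values, n=3.0):
    """Apply the mapping to every R_k using the min and max of the given log values."""
    m, M = min(log_values), max(log_values)
    return [power_law_transform(R, m, M, n) for R in log_values]
```

With this stand-in, the minimum and maximum map to themselves, the output never decreases as the input grows, and the curve is convex over the whole range, so any p and q with m ≤ p < q ≤ M satisfy condition (d).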

With such a speech feature amount calculation device, small logarithmic values, which belong to filter banks containing relatively more environmental-noise components than speech components, are converted to still smaller values, while large logarithmic values, which belong to filter banks containing few environmental-noise components, are converted to relatively larger values. At the same time, the converted logarithmic values never fall outside the range between the minimum and maximum of the logarithmic values before conversion, so the speech components contained in the small logarithmic values are not underestimated. As a result, a numerical sequence is calculated that yields desirable recognition results when used, for example, for speech recognition.

The present invention also provides a speech feature amount calculation device comprising: spectrum calculation means for calculating, from a speech signal, the spectrum of the speech signal; filter means for applying, to the spectrum calculated by the spectrum calculation means, filter processing corresponding to each of a plurality of predetermined frequency bands, thereby calculating, for each of the plurality of frequency bands, an index value indicating the power of the frequency components of the speech signal within that frequency band; logarithmic value calculation means for calculating the logarithmic value of each of the plurality of index values calculated by the filter means for the plurality of frequency bands; and conversion means for calculating a numerical sequence indicating the feature amount of the speech represented by the speech signal by converting each of the plurality of index values according to equation (4). Where m is the minimum and M the maximum of the plurality of logarithmic values calculated by the logarithmic value calculation means for the plurality of frequency bands, equation (4) is defined in terms of a constant p, a constant q (where m ≤ p < q ≤ M), a constant a (where a > 0) and a constant c (where c > 0), and satisfies the following conditions on the output value y for an input value x: (a) y ≥ m when x = m; (b) y ≤ M when x = M; (c) in the range m ≤ x ≤ M, the rate of change of y with respect to x is always 0 or more; and (d) in the range p ≤ x ≤ q, the rate of change of the rate of change of y with respect to x is always positive.

A preferred example of the filter used by the filter means is a mel-scale band filter. In that case, the speech feature amount calculation device may be configured to include coefficient sequence calculation means for calculating a mel-frequency cepstrum coefficient sequence by applying a discrete cosine transform to the numerical sequence calculated by the conversion means.

The present invention also provides a program that causes a computer to execute the processing performed by the speech feature amount calculation device described above.

[Embodiment]
FIG. 1 is a block diagram showing the configuration of a speech recognition system 1 according to an embodiment of the present invention. The speech recognition system 1 has much in common with the prior-art speech recognition system 9 described above, so only the differences are described below. In FIG. 1, components common to the speech recognition system 1 and the speech recognition system 9 are given the same reference numerals as in FIG. 7.

The speech recognition system 1 includes a speech feature amount calculation device 11 in place of the speech feature amount calculation device 91 of the speech recognition system 9. The speech recognition system 1 also includes a keyboard 12 with which the user gives instructions to the speech feature amount calculation device 11. The keyboard 12 has a plurality of keys and outputs a signal corresponding to the key pressed by the user to the speech feature amount calculation device 11. A mouse or the like may be used instead of the keyboard 12.

In addition to the components of the speech feature amount calculation device 91, the speech feature amount calculation device 11 includes a conversion unit 101 inserted between the logarithmic value calculation unit 914 and the discrete cosine transform processing unit 915. The conversion unit 101 receives the logarithmic value R_k of the filter bank output value r_k (where k is the filter bank number) calculated by the logarithmic value calculation unit 914, and substitutes the received logarithmic value R_k into equation (3) above as the input value x to calculate the output value y corresponding to R_k. Hereinafter, the output value y corresponding to the logarithmic value R_k is called the modified logarithmic value γ_k.

The speech feature amount calculation device 11 also includes a designation unit 102 that designates a parameter to the conversion unit 101 according to the signal output from the keyboard 12 in response to a user operation. Here, the parameter designated by the designation unit 102 is the constant n in equation (3).

Instead of the logarithmic value group R_1, R_2, ..., R_L (where L is the total number of filter banks) generated by the logarithmic value calculation unit 914, the discrete cosine transform processing unit 915 of the speech feature amount calculation device 11 receives the modified logarithmic value group γ_1, γ_2, ..., γ_L calculated by the conversion unit 101, and applies the discrete cosine transform of equation (2) above to it to calculate the C_i group, that is, the MFCCs. Because the MFCCs in the speech recognition system 1 are generated from a group of values whose characteristics differ from those of the logarithmic values used in the prior art, they have characteristics different from those of the prior-art MFCCs.

FIG. 2 is a graph of the function expressed by equation (3), with the input value x on the horizontal axis and the output value y on the vertical axis. In FIG. 2, graphs 15, 16 and 17 show the cases n = 1.5, n = 3.0 and n = 4.5, respectively, and the modified logarithmic value γ_k for a logarithmic value R_k is illustrated for the case n = 3.0 as an example.

As shown in FIG. 2, the conversion unit 101 calculates the modified logarithmic value γ_k as the output value y by substituting the logarithmic value R_k into equation (3) as the input value x. The modified logarithmic value γ_k calculated in this way has the following features.
(A) The ordering of the input values is always preserved in the ordering of the output values.
(B) In the region where the input value is large (just to the left of x = M), the output value remains almost as large as the input value.
(C) In the region where the input value is small (just to the right of x = m) or moderate, there is a wide range over which the reduction from input value to output value is larger than the reduction in the region where the input value is large (just to the left of x = M).
(D) The output value always falls within the range between the minimum and maximum of the input values.
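
These four features can be checked numerically for the three values of n shown in FIG. 2. The sketch below again uses a hypothetical power-law mapping as a stand-in, since equation (3) is not reproduced in this text. The grid, the choice of the midpoint as the "moderate" input, and the 10% threshold used for feature (B) are arbitrary illustrative assumptions.

```python
def power_law_transform(x, m, M, n):
    # Hypothetical mapping with the general shape described for FIG. 2;
    # not the patent's equation (3).
    t = (x - m) / (M - m)
    return m + (M - m) * t ** n


def check_features(n, m=0.0, M=10.0, steps=100):
    """Evaluate the mapping on a grid and report features (A) to (D) as booleans."""
    xs = [m + (M - m) * i / steps for i in range(steps + 1)]
    ys = [power_law_transform(x, m, M, n) for x in xs]
    drop = [x - y for x, y in zip(xs, ys)]  # reduction from input to output
    return {
        "A_order_kept": all(a <= b for a, b in zip(ys, ys[1:])),
        "B_large_inputs_nearly_kept": drop[-2] < 0.1 * (M - m),
        "C_mid_drop_exceeds_large_drop": drop[steps // 2] > drop[-2],
        "D_within_range": all(m <= y <= M for y in ys),
    }


for n in (1.5, 3.0, 4.5):
    print(n, check_features(n))
```

Under these assumptions, larger n pushes moderate inputs down more strongly while inputs near M stay close to their original values.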

The components of the feature amount of environmental noise are contained more heavily in the smaller values of the logarithmic value group R_1, R_2, ..., R_L. This is because the spectral power of environmental noise is generally smaller than the spectral power of speech over the entire frequency band. Therefore, in the modified logarithmic value group γ_1, γ_2, ..., γ_L, which has features (B) and (C) above, the environmental-noise components are weighted less than in the logarithmic value group R_1, R_2, ..., R_L, while the speech components are not reduced by much. As a result, the C_i group (the MFCCs) calculated from the modified logarithmic value group γ_1, γ_2, ..., γ_L contains fewer environmental-noise components than the C_i group (the MFCCs) calculated from the logarithmic value group R_1, R_2, ..., R_L.

Incidentally, a modified logarithmic value group γ_1, γ_2, ..., γ_L with features (B) and (C) above could also be generated by a conversion according to, for example, equation (5).

According to equation (5), however, the smallest value in the logarithmic value group R_1, R_2, ..., R_L is converted to 0, and values close to the minimum are converted to values that, although not 0, are very small. Consequently, not only the environmental-noise components but also the speech components in frequency bands where the spectral power is small are underestimated. MFCCs calculated from a modified logarithmic value group obtained by a conversion such as equation (5) may therefore fail to give desirable speech recognition results.

In contrast, the modified logarithmic value group γ_1, γ_2, ..., γ_L calculated by the conversion unit 101 has feature (D) above and therefore does not suffer from this drawback.

In the speech recognition system 1, the user can change the parameter n by giving an instruction to the speech feature amount calculation device 11 with the keyboard 12. The user can thus easily select the function that seems most suitable from functions with different characteristics, such as those illustrated in FIG. 2, and have the speech feature amount calculation device 11 generate MFCCs with different characteristics. This makes it possible to generate MFCCs better suited to the environmental noise conditions. The parameter used in the learning mode is stored, for example, in the conversion unit 101, and the same parameter is used in the recognition mode.

As described above, the speech recognition system 1 according to the embodiment of the present invention calculates MFCCs with the desirable property that they contain few environmental-noise components while the speech components are not underestimated. As a result, more accurate speech recognition results are obtained than with the prior-art speech recognition system 9. Compared with the prior-art speech feature amount calculation device 91, the speech feature amount calculation device 11 merely adds the processing of substituting each logarithmic value R_k into the function of equation (3) to calculate the modified logarithmic value γ_k. The speech feature amount calculation device 11 can therefore be realized even on a device with limited resources.

[Modification]
In the embodiment described above, the conversion unit 101 converts each logarithmic value R_k into a modified logarithmic value γ_k using equation (3). The conversion is not limited to that function, however: any function that satisfies the following conditions can be used by the conversion unit 101.
Let m be the minimum and M the maximum of the logarithmic values R₁, R₂, ..., R_L. For a constant p and a constant q (where m ≦ p < q ≦ M), the output value y for an input value x satisfies:
(a) y ≧ m when x = m.
(b) y ≦ M when x = M.
(c) In the range m ≦ x ≦ M, the rate of change of y with respect to x is always 0 or more.
(d) In the range p ≦ x ≦ q, the rate of change of the rate of change of y with respect to x is always positive (that is, the slope is strictly increasing).
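Conditions (a) through (d) can be checked numerically for any candidate conversion function. The following is a minimal sketch, assuming a uniform sampling grid and finite differences; the names `satisfies_conditions` and `power_warp` are hypothetical, and the power-law curve is only an illustrative candidate, not necessarily the function of equation (3). Condition (d) is read here as requiring a positive second difference on the range p ≦ x ≦ q.

```python
from typing import Callable


def power_warp(x: float, m: float, M: float, n: float = 2.0) -> float:
    """Illustrative candidate (an assumption): power law rescaled to [m, M]."""
    t = (x - m) / (M - m)
    return m + (M - m) * t ** n


def satisfies_conditions(f: Callable[[float], float], m: float, M: float,
                         p: float, q: float, steps: int = 1000,
                         tol: float = 1e-9) -> bool:
    """Coarse numerical check of conditions (a)-(d) on a uniform grid."""
    h = (M - m) / steps
    xs = [m + i * h for i in range(steps + 1)]
    ys = [f(x) for x in xs]
    # (a) y >= m at x = m, and (b) y <= M at x = M
    if ys[0] < m - tol or ys[-1] > M + tol:
        return False
    # (c) the slope is never negative on [m, M]
    if any(ys[i + 1] - ys[i] < -tol for i in range(steps)):
        return False
    # (d) the second difference is positive wherever the stencil lies in [p, q]
    for i in range(1, steps):
        if p - tol <= xs[i - 1] and xs[i + 1] <= q + tol:
            if ys[i + 1] - 2.0 * ys[i] + ys[i - 1] <= 0.0:
                return False
    return True


# A convex power law (n = 2) passes; a concave one (n = 0.5) fails condition (d).
print(satisfies_conditions(lambda x: power_warp(x, 2.0, 10.0, 2.0), 2.0, 10.0, 2.0, 10.0))
print(satisfies_conditions(lambda x: power_warp(x, 2.0, 10.0, 0.5), 2.0, 10.0, 2.0, 10.0))
```

Because the check samples the function on a grid, it can only reject a candidate or fail to find a violation; it does not prove that the conditions hold everywhere.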

The modified logarithmic values γ₁, γ₂, ..., γ_L calculated by a conversion using a function that satisfies the above conditions have the four features (イ) through (ニ) described above. One example of such a function is equation (4) above. Equation (4) is a logistic curve modified using the minimum value m and the maximum value M. FIGS. 3 and 4 are graphs of the function expressed by equation (4), plotted with the input value x on the horizontal axis and the output value y on the vertical axis. In FIG. 3, graphs 21, 22, and 23 show how the shape changes when the constant a is fixed at a = 10 and the constant c is set to c = 20, c = 100, and c = 400, respectively. In FIG. 4, graphs 24, 25, and 26 show how the shape changes when the constant c is fixed at c = 100 and the constant a is set to a = 20, a = 10, and a = 7, respectively. Thus, when the conversion unit 101 performs the conversion according to equation (4), the user can select a function that yields a more desirable conversion result by using the keyboard 12 to specify the parameters a and c to the speech feature quantity calculation device 11.
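To make the role of the two parameters concrete, the sketch below uses a generic logistic curve rescaled to the interval from m to M. This is a stand-in chosen for illustration: the exact form of equation (4) may differ, and the names `logistic_warp` and `inflection_point`, as well as the way a and c enter the formula, are assumptions.

```python
import math


def logistic_warp(x: float, m: float, M: float,
                  a: float = 10.0, c: float = 100.0) -> float:
    """Generic logistic curve rescaled to [m, M] (assumed form, M > m)."""
    t = (x - m) / (M - m)  # normalize the input to [0, 1]
    return m + (M - m) / (1.0 + c * math.exp(-a * t))


def inflection_point(m: float, M: float,
                     a: float = 10.0, c: float = 100.0) -> float:
    """Input value at which this curve stops being convex.

    For the assumed form the slope increases while t < ln(c) / a, so the
    convex range of condition (d) can extend from m up to this point.
    """
    return m + (M - m) * math.log(c) / a


for c in (20.0, 100.0, 400.0):  # a fixed at 10, c varied
    print(c, [round(logistic_warp(x / 4, 0.0, 1.0, 10.0, c), 4) for x in range(5)])
```

In this assumed form, a larger c shifts the rising part of the curve toward larger inputs, and a larger a makes the rise steeper, so small logarithmic values are suppressed more strongly.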

Furthermore, instead of converting input values to output values using a function such as equation (3) or equation (4), the conversion unit 101 may store in advance a conversion table such as the one shown in FIG. 5 and perform an equivalent conversion according to that table. The input values x and output values y contained in the conversion table are pairs of numerical values that satisfy conditions (a) through (d) above. The pairs in the conversion table are prepared assuming, for example, a minimum input value m = 0 and a maximum input value M = 1. Hereinafter, a conversion table prepared with m = 0 and M = 1 is referred to as a "reference conversion table". FIG. 6 is a graph plotting the input values x and output values y of the conversion table shown in FIG. 5.

Because the reference conversion table is for the case m = 0 and M = 1, the conversion unit 101 does not use it as is. Instead, it rescales the reference conversion table according to the minimum value m and the maximum value M of the received logarithmic values R₁, R₂, ..., R_L. Specifically, the conversion unit 101 multiplies each input value x and each output value y of the reference conversion table by (M − m), adds m to each result, and uses the resulting table to convert the logarithmic values R_k into the modified logarithmic values γ_k. Because the input values x and output values y in the conversion table are discrete, the conversion unit 101 interpolates between entries as needed, for example by linear interpolation, when calculating the output value y for an input value x.
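The table-based conversion can be sketched as follows: rescale the reference conversion table by (M − m), shift it by m, and interpolate linearly between entries. The table values below are hypothetical (samples of y = x², not the values of FIG. 5), and the function names are illustrative assumptions.

```python
from bisect import bisect_right

# Hypothetical reference conversion table for m = 0, M = 1.
# The (x, y) pairs are illustrative samples of y = x**2, not those of FIG. 5.
REFERENCE_TABLE = [(0.0, 0.0), (0.25, 0.0625), (0.5, 0.25), (0.75, 0.5625), (1.0, 1.0)]


def rescale_table(table, m, M):
    """Multiply every x and y by (M - m), then add m to each."""
    span = M - m
    return [(x * span + m, y * span + m) for x, y in table]


def lookup(table, x):
    """Return y for x, interpolating linearly between neighboring entries."""
    xs = [px for px, _ in table]
    if x <= xs[0]:
        return table[0][1]
    if x >= xs[-1]:
        return table[-1][1]
    i = bisect_right(xs, x)  # index of the first entry with px > x
    (x0, y0), (x1, y1) = table[i - 1], table[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def convert(log_values, reference=REFERENCE_TABLE):
    """Convert logarithmic values R_k into modified logarithmic values."""
    m, M = min(log_values), max(log_values)
    if M == m:  # degenerate case: nothing to rescale
        return list(log_values)
    table = rescale_table(reference, m, M)
    return [lookup(table, r) for r in log_values]


print(convert([2.0, 6.0, 10.0]))  # [2.0, 4.0, 10.0]
```

Storing only the table for m = 0 and M = 1 keeps memory use small, since the same entries serve every frame regardless of the range of its logarithmic values.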

In the embodiment described above, FFT processing is used to calculate the spectrum of the speech signal, but another method, such as discrete Fourier transform processing, may be used instead. Likewise, the embodiment uses the mel scale, but another frequency axis, such as the Bark scale, may be used instead. Furthermore, another type of filter group that produces filter bank outputs may be used in place of the mel-scale band filters.

In the embodiment described above, the MFCCs are calculated using the discrete cosine transform. Instead of the discrete cosine transform, however, another type of orthogonal transform, such as the inverse discrete Fourier transform, may be used to calculate the coefficient sequence indicating the feature amount of the speech.
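As a point of reference for this final step, the sketch below applies a commonly used DCT-II form to a sequence of L band values (for example, the modified logarithmic values) to obtain cepstral coefficients. The normalization and indexing are assumptions and may differ from the formula used in the embodiment; `dct_coefficients` is a hypothetical name.

```python
import math


def dct_coefficients(values, num_coeffs):
    """DCT-II of the band values, returning coefficients 1..num_coeffs.

    Assumed form: c_i = sqrt(2 / L) * sum_k v_k * cos(pi * i * (k + 0.5) / L),
    with k = 0..L-1. The 0th (constant) coefficient is omitted.
    """
    L = len(values)
    scale = math.sqrt(2.0 / L)
    return [
        scale * sum(v * math.cos(math.pi * i * (k + 0.5) / L)
                    for k, v in enumerate(values))
        for i in range(1, num_coeffs + 1)
    ]


# A flat input carries no spectral shape, so these coefficients are ~0.
print(dct_coefficients([3.0] * 8, 4))
```

Swapping in a different orthogonal transform only changes this function; the filtering, logarithm, and conversion stages before it are unaffected.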

The speech feature quantity calculation device 11 may be implemented in dedicated hardware, or by causing a general-purpose computer capable of inputting and outputting sound signals to execute processing according to an application program. When the speech feature quantity calculation device 11 is implemented on a general-purpose computer, each of its components is realized as a function of that computer: the CPU (Central Processing Unit) of the computer and a DSP (Digital Signal Processor) operating under the control of the CPU concurrently execute the processing defined by the modules of the application program.

FIG. 1 is a block diagram showing the configuration of the speech recognition system according to the embodiment of the present invention.
FIG. 2 is a graph of a function used by the conversion unit according to the embodiment of the present invention.
FIG. 3 is a graph of a function used by the conversion unit according to the embodiment of the present invention.
FIG. 4 is a graph of a function used by the conversion unit according to the embodiment of the present invention.
FIG. 5 is a conversion table used by the conversion unit according to the embodiment of the present invention.
FIG. 6 is a graph of the conversion table used by the conversion unit according to the embodiment of the present invention.
FIG. 7 is a block diagram showing the configuration of a speech recognition system according to the prior art.
FIG. 8 is a graph showing mel-scale band filters.

Explanation of symbols

1, 9: speech recognition system; 11, 91: speech feature quantity calculation device; 12: keyboard; 90: speech signal generation device; 92: speech recognition device; 101: conversion unit; 102: designation unit; 901: speech signal generation unit; 902: utterance section extraction unit; 911: windowing processing unit; 912: FFT processing unit; 913: mel-scale band filter processing unit; 914: logarithmic value calculation unit; 915: discrete cosine transform processing unit; 921: database; 922: DP matching unit; 923: determination unit.

Claims

An audio feature amount calculating apparatus comprising:
spectrum calculating means for calculating, from an audio signal, a spectrum of the audio signal;
filter means for calculating, for each of a plurality of predetermined frequency bands, an index value indicating the power of the frequency components of the audio signal within that frequency band, by applying to the spectrum calculated by the spectrum calculating means a filtering process corresponding to each of the plurality of frequency bands;
logarithmic value calculation means for calculating, for each of the plurality of frequency bands, the logarithmic value of each of the plurality of index values calculated by the filter means; and
conversion means for calculating a numerical sequence indicating the feature amount of the voice represented by the audio signal, by converting each of the plurality of index values according to the following equation (1), which, where m is the minimum and M is the maximum of the plurality of logarithmic values calculated by the logarithmic value calculation means for the plurality of frequency bands, and with respect to a constant p, a constant q (where m ≦ p < q ≦ M), and a constant n (where n > 1), satisfies the conditions that the output value y for an input value x is such that:
(a) y ≧ m when x = m,
(b) y ≦ M when x = M,
(c) in the range m ≦ x ≦ M, the rate of change of y with respect to x is always 0 or more, and
(d) in the range p ≦ x ≦ q, the rate of change of the rate of change of y with respect to x is always positive.

An audio feature amount calculating apparatus comprising:
spectrum calculating means for calculating, from an audio signal, a spectrum of the audio signal;
filter means for calculating, for each of a plurality of predetermined frequency bands, an index value indicating the power of the frequency components of the audio signal within that frequency band, by applying to the spectrum calculated by the spectrum calculating means a filtering process corresponding to each of the plurality of frequency bands;
logarithmic value calculation means for calculating, for each of the plurality of frequency bands, the logarithmic value of each of the plurality of index values calculated by the filter means; and
conversion means for calculating a numerical sequence indicating the feature amount of the voice represented by the audio signal, by converting each of the plurality of index values according to the following equation (2), which, where m is the minimum and M is the maximum of the plurality of logarithmic values calculated by the logarithmic value calculation means for the plurality of frequency bands, and with respect to a constant p, a constant q (where m ≦ p < q ≦ M), a constant a (where a > 0), and a constant c (where c > 0), satisfies the conditions that the output value y for an input value x is such that:
(a) y ≧ m when x = m,
(b) y ≦ M when x = M,
(c) in the range m ≦ x ≦ M, the rate of change of y with respect to x is always 0 or more, and
(d) in the range p ≦ x ≦ q, the rate of change of the rate of change of y with respect to x is always positive.

The audio feature amount calculating apparatus according to claim 1 or 2, wherein the filter means calculates the index values using mel-scale band filters.

The audio feature amount calculating apparatus according to claim 3, further comprising coefficient sequence calculation means for calculating a mel-frequency cepstral coefficient sequence by applying a discrete cosine transform to the numerical sequence calculated by the conversion means.

A program for causing a computer to execute:
a process of calculating, from an audio signal, a spectrum of the audio signal;
a process of calculating, for each of a plurality of predetermined frequency bands, an index value indicating the power of the frequency components of the audio signal within that frequency band, by applying to the spectrum a filtering process corresponding to each of the plurality of frequency bands;
a process of calculating, for each of the plurality of frequency bands, the logarithmic value of each of the plurality of index values; and
a process of calculating a numerical sequence indicating the feature amount of the voice represented by the audio signal, by converting each of the plurality of index values according to the following equation (1), which, where m is the minimum and M is the maximum of the plurality of logarithmic values for the plurality of frequency bands, and with respect to a constant p, a constant q (where m ≦ p < q ≦ M), and a constant n (where n > 1), satisfies the conditions that the output value y for an input value x is such that:
(a) y ≧ m when x = m,
(b) y ≦ M when x = M,
(c) in the range m ≦ x ≦ M, the rate of change of y with respect to x is always 0 or more, and
(d) in the range p ≦ x ≦ q, the rate of change of the rate of change of y with respect to x is always positive.

A program for causing a computer to execute:
a process of calculating, from an audio signal, a spectrum of the audio signal;
a process of calculating, for each of a plurality of predetermined frequency bands, an index value indicating the power of the frequency components of the audio signal within that frequency band, by applying to the spectrum a filtering process corresponding to each of the plurality of frequency bands;
a process of calculating, for each of the plurality of frequency bands, the logarithmic value of each of the plurality of index values; and
a process of calculating a numerical sequence indicating the feature amount of the voice represented by the audio signal, by converting each of the plurality of index values according to the following equation (2), which, where m is the minimum and M is the maximum of the plurality of logarithmic values for the plurality of frequency bands, and with respect to a constant p, a constant q (where m ≦ p < q ≦ M), a constant a (where a > 0), and a constant c (where c > 0), satisfies the conditions that the output value y for an input value x is such that:
(a) y ≧ m when x = m,
(b) y ≦ M when x = M,
(c) in the range m ≦ x ≦ M, the rate of change of y with respect to x is always 0 or more, and
(d) in the range p ≦ x ≦ q, the rate of change of the rate of change of y with respect to x is always positive.