JPH0558559B2 - Google Patents

Info

Publication number
JPH0558559B2
JPH0558559B2 JP63130784A JP13078488A
Authority
JP
Japan
Prior art keywords
pattern
vector
axis
spatiotemporal
grid point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
JP63130784A
Other languages
Japanese (ja)
Other versions
JPH01158496A (en)
Inventor
Ryuichi Oka
Hiroshi Matsumura
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Institute of Advanced Industrial Science and Technology AIST
Sanyo Electric Co Ltd
Sanyo Denki Co Ltd
Original Assignee
Agency of Industrial Science and Technology
Sanyo Electric Co Ltd
Sanyo Denki Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Agency of Industrial Science and Technology, Sanyo Electric Co Ltd, Sanyo Denki Co Ltd filed Critical Agency of Industrial Science and Technology
Priority to JP63130784A priority Critical patent/JPH01158496A/en
Publication of JPH01158496A publication Critical patent/JPH01158496A/en
Publication of JPH0558559B2 publication Critical patent/JPH0558559B2/ja
Granted legal-status Critical Current

Abstract

PURPOSE: To obtain a high recognition rate by using a vector field pattern, applying direction-specific defocusing (blurring) processing, and using the result for speech recognition.

CONSTITUTION: A spatiotemporal pattern is converted by spatial differentiation into a vector field pattern having a magnitude and a direction at each grid point in the space, and for the vectors of the vector field pattern the direction parameter is quantized into N values (N: an integer). The vectors are separated according to their quantized values to generate N two-dimensional patterns in which the magnitude of each vector is the value at each grid point, and blurring processing is applied to each direction-specific two-dimensional pattern, direction by direction, with respect to the time axis and/or the spatial axis to extract a pattern as the speech feature. Consequently, high recognition rates are obtained even in speech recognition for large vocabularies and for unspecified speakers.

Description

[Detailed Description of the Invention]

[Field of Industrial Application]

The present invention relates to a speech feature extraction method for use in speech recognition and the like. More specifically, it provides a novel method that uses vector field patterns and applies direction-specific blurring (also called defocusing) processing, thereby achieving a high recognition rate when used for speech recognition.

[Prior Art]

In speech recognition, generally, a standard speech pattern obtained by extracting features from each word to be recognized is prepared for every word; a feature pattern extracted in the same way from the speech input as the recognition target is matched against the plurality of standard patterns; the standard pattern with the highest similarity is found; and the word associated with that standard pattern is judged to have been input. Conventionally, the feature pattern used was the spatiotemporal pattern itself of a scalar field, obtained by analyzing the speech signal, with the time axis as the horizontal axis and a spatial axis as the vertical axis. A typical spatiotemporal pattern of such a scalar field is the spectrum, with frequency as the spatial axis; in addition, various other spatiotemporal patterns have been used, such as the cepstrum with quefrency as the spatial axis, PARCOR coefficients, LSP coefficients, and the vocal tract cross-sectional area function.

One of the problems to be solved in the field of speech recognition is handling multiple or unspecified speakers; this has been addressed by preparing many standard patterns for a single word in order to improve the recognition rate. Furthermore, even the same speaker may speak at different speeds, and the DP (dynamic programming) matching method, which can absorb time-axis fluctuations, was developed to cope with such cases.

[Problems to be Solved by the Invention]

Conventional methods that use the spatiotemporal pattern of a scalar field itself as the feature do not necessarily achieve a sufficient recognition rate for large vocabularies or unspecified speakers; even preparing many standard patterns for a single word as described above, or using the DP matching method, did not provide a fundamental solution.

Consequently, the practical application of speech recognition systems for large vocabularies or unspecified speakers has stalled. One of the present inventors therefore proposed, in Japanese Patent Application Laid-Open No. 60-59394 and in "On the Comparative Effectiveness of the Spectral Vector Field and the Spectrum in Speech Recognition," Transactions of the IECE of Japan (D), Vol. J69-D, No. 1, p. 1704 (1986), a method of spatially differentiating the spectrum of a scalar field, which is a time-frequency spatiotemporal pattern, to obtain a spectral vector field pattern, and of using this pattern as the speech feature.

Earlier research using the partial derivatives of the spectrum at space-time points as features was carried out by T. B. Martin and is disclosed in "Practical applications of voice input to machines," Proc. IEEE, 64-4 (1976). However, T. B. Martin computed ∂f(t,x)/∂t and ∂f(t,x)/∂x from the spatiotemporal pattern f(t,x), constructed from them functions that discriminate 32 phonetic categories for each frame, and used the results, expressed as 32 binary values, for word-level linear matching; this differs from the above method of creating a spectral vector field from a spectral scalar field.

The main object of the present invention is to provide a speech feature extraction method that advances the above approach one step further from an engineering standpoint and improves it for practical use.

Another object of the present invention is to provide a speech feature extraction method that achieves a high recognition rate even in speech recognition for large vocabularies and for unspecified speakers.

[Means for Solving the Problems]

The basic feature of the present invention resides in a speech feature extraction method in which a speech signal is analyzed to obtain a spatiotemporal pattern of a scalar field defined by a time axis and a spatial axis, and speech features are extracted using the spatiotemporal pattern, wherein: the spatiotemporal pattern is converted by spatial differentiation into a vector field pattern having a magnitude and a direction at each grid point in the space; for the vectors of the vector field pattern, the direction parameter is quantized into N values (N: an integer); the vectors are separated according to their quantized values, and N direction-specific two-dimensional patterns are created in which the magnitude of each vector is the value at each grid point; and a pattern obtained by applying blurring processing to the direction-specific two-dimensional patterns, direction by direction, with respect to the time axis only or with respect to both the time axis and the spatial axis, is extracted as the speech feature.
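To make this flow concrete, the following is a minimal sketch in Python/NumPy; the function name, the choice of N = 8 direction bins, and the use of a simple gradient as the spatial differentiation are illustrative assumptions, not the patent's exact implementation.

```python
import numpy as np

def directional_patterns(f, n_dirs=8):
    """Convert a scalar spatiotemporal pattern f[t, x] into N direction-specific
    2-D patterns (illustrative sketch, not the patent's exact procedure).

    f      : 2-D array, rows = time frames t, columns = spatial channels x.
    n_dirs : N, the number of quantization levels for the vector direction.
    Returns: array of shape (N, T, L); patterns[k][t, x] holds the vector
             magnitude where the quantized direction equals k, and 0 elsewhere.
    """
    # Differentiation along the time axis and the space axis gives a vector
    # (df/dt, df/dx) at every grid point of the pattern.
    dfdt = np.gradient(f, axis=0)
    dfdx = np.gradient(f, axis=1)

    magnitude = np.hypot(dfdt, dfdx)    # |vector| at each grid point
    direction = np.arctan2(dfdx, dfdt)  # direction parameter in (-pi, pi]

    # Quantize the direction parameter into N values.
    bins = np.floor((direction + np.pi) / (2 * np.pi) * n_dirs).astype(int)
    bins = np.clip(bins, 0, n_dirs - 1)

    # Separate by quantized direction: N two-dimensional patterns whose grid
    # values are the vector magnitudes.
    patterns = np.zeros((n_dirs,) + f.shape)
    for k in range(n_dirs):
        patterns[k][bins == k] = magnitude[bins == k]
    return patterns
```

Keeping the N directional patterns separate is what allows the subsequent blurring to accumulate features per direction without mixing vectors of different orientations.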

When extracting the features of voices of only one sex, male or female, this blurring processing need be performed only with respect to the time axis.

When extracting the features of voices of both sexes, blurring processing is also performed with respect to the spatial axis, but the blurring along the time axis is emphasized more strongly than the blurring along the spatial axis.

Furthermore, these blurring processes are carried out by a mask operation using a mask pattern having predetermined weight values.
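As one hedged illustration of such a mask operation, the sketch below convolves each direction-specific pattern with a rectangular mask extending two grid points in each time direction and one grid point in each space direction; the weight of 1 along the time axis and the smaller spatial weight of 0.5 are assumed values chosen to emphasize time-axis blurring, not weights specified by the patent.

```python
import numpy as np
from scipy.ndimage import convolve

# Mask pattern: 5 grid points along time (center +/- 2 frames), 3 along space
# (center +/- 1 channel). Weights of "1" along the time axis and a smaller
# weight along the spatial axis emphasize the time-axis blurring.
mask = np.array([[0.5, 1.0, 0.5],
                 [0.5, 1.0, 0.5],
                 [0.5, 1.0, 0.5],   # rows = time offsets,
                 [0.5, 1.0, 0.5],   # columns = space offsets
                 [0.5, 1.0, 0.5]])

def blur_directional_patterns(patterns, mask=mask):
    """Apply the mask operation independently to each direction-specific
    2-D pattern (sketch; edge handling via 'nearest' is an assumption)."""
    return np.stack([convolve(p, mask, mode='nearest') for p in patterns])
```

For single-sex feature extraction the spatial column weights would simply be set to zero, reducing the mask to a one-dimensional blur along the time axis.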

[Operation]

From the spatiotemporal pattern of a scalar field defined by the time axis and the spatial axis, the input speech signal has its vector direction parameters quantized and is converted into a plurality of direction-specific two-dimensional patterns, separated by quantized direction. These direction-specific two-dimensional patterns are then subjected to blurring processing, and the directional pattern features are accumulated. This yields emphasis and stabilization of the speech features.

This accumulation performs a kind of structuring of the space-time point (t, x). That is, when the N directional patterns are considered together, this structuring attaches up to N vectors to each space-time point (t, x) (see Fig. 6). Its effect on speech recognition lies in the formation of features that better represent phonetic character and in their stable representation; the phonetic features are assumed to correspond to spectral changes over certain spatiotemporal intervals.

These features are first extracted microscopically in the spectral vector field; then, vectors lying in different direction intervals are regarded as independent features, and they are accumulated independently at each space-time point. When the integration is performed within the blurring mask pattern independently for each direction, more macroscopic features (speech features formed by a wide spatiotemporal region) are captured while the structural nature of the features is preserved. Moreover, since this accumulation of features is performed at every space-time point (t, x), the speech features are not macroscopic features formed only at particular spatiotemporal points; rather, they are formed stably, with slight variations, over a wide region (especially in time).

Therefore, owing to the emphasis and stabilization provided by this blurring processing, phoneme discrimination and speaker normalization can be performed with higher accuracy than before.

[Embodiments]

The present invention will be described in detail below with reference to the drawings showing embodiments thereof.

Fig. 1 is a block diagram showing the configuration of an apparatus for implementing the method of the present invention.

In this embodiment, an analysis unit 2 performs spectral analysis of the speech signal, and a spectrum whose spatial axis is the frequency axis is used as the spatiotemporal pattern of the scalar field.

Speech input for creating standard patterns, or speech input to be recognized, is performed by a speech input unit 1 consisting of a speech detector such as a microphone and an A/D converter. The resulting speech signal is fed to the analysis unit 2, which is formed by connecting in parallel a plurality of channels (for example, 10 to 30) of band-pass filters, each with a different pass band. The analysis yields a spatiotemporal pattern, which is segmented into words, the recognition units, by a word interval extraction unit 3 and supplied to a feature extraction unit 4. A conventionally known device may be used as the word interval extraction unit 3.

In the following description, the band-pass filter bank described above is used as the analysis unit 2 that divides the speech signal into frequency bands, but a fast Fourier transform unit may be used instead.
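In the spirit of that FFT alternative, the analysis stage could be approximated as below: a short-time power spectrum pooled into a small number of bands, standing in for the 10-30 filter channels. The sampling rate, frame length, channel count, and log compression are illustrative assumptions, not parameters given in the patent.

```python
import numpy as np

def band_spectrogram(signal, fs=16000, n_channels=20,
                     frame_len=256, hop=128):
    """Approximate the band-pass filter bank of analysis unit 2 with a
    short-time FFT whose power spectrum is pooled into n_channels bands
    (illustrative parameters; the patent specifies 10-30 band-pass filters)."""
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        spec = np.abs(np.fft.rfft(signal[start:start + frame_len] * window)) ** 2
        # Pool the FFT bins into n_channels equal-width frequency bands.
        bands = np.array_split(spec[1:], n_channels)  # drop the DC bin
        frames.append([band.mean() for band in bands])
    return np.log(np.asarray(frames) + 1e-10)  # f[t, x]: time x channel
```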

The method of the present invention is characterized by the feature extraction unit described next. The input pattern to the feature extraction unit 4 is a spatiotemporal pattern whose horizontal axis is time and whose vertical axis is frequency; the spatiotemporal pattern shown in Fig. 2, extracted by the word interval extraction unit 3, is expressed as f(t, x) (where t is a number indicating the sampling time, and x is the channel number of a band-pass filter, i.e., a number identifying a frequency band; 1 ≤ t ≤ T, 1 ≤ x ≤ L, where T and L are the maximum values of t and x, respectively).

The output of the word interval extraction unit 3 is input to a normalization unit 41 of the feature extraction unit 4, which performs linear normalization of the time axis. This is done to absorb, to some extent, differences in word length, input speech duration, and so on; the time axis is reduced from T frames to M frames (for example, about 16 to 32 frames). Specifically, when M ≤ T, the normalized spatiotemporal pattern F(t, x) is obtained by equation (1) below.

F(t, x) = (T/M) … (1)
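Equation (1) is truncated in the available text. As a hedged stand-in, a linear time normalization from T frames to M frames could look like the sketch below; averaging each output frame over its mapped input span is an assumption, not necessarily the patent's equation (1).

```python
import numpy as np

def normalize_time(f, M=24):
    """Linearly normalize f[t, x] from T frames to M frames (sketch; the
    patent's exact equation (1) is truncated in the source text)."""
    T = f.shape[0]
    out = np.zeros((M, f.shape[1]))
    for m in range(M):
        lo = int(np.floor(m * T / M))
        hi = max(lo + 1, int(np.floor((m + 1) * T / M)))
        out[m] = f[lo:hi].mean(axis=0)  # average the input frames mapped to m
    return out
```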

Claims (1)

1. A speech feature extraction method in which a speech signal is analyzed to obtain a spatiotemporal pattern of a scalar field defined by a time axis and a spatial axis and speech features are extracted using the spatiotemporal pattern, wherein, for extracting the features of voices of only one sex, male or female: the spatiotemporal pattern is converted by spatial differentiation into a vector field pattern having a magnitude and a direction at each grid point in the space; for the vectors of the vector field pattern, the direction parameter is quantized into N values (N: an integer); the vectors are separated according to their quantized values, and N direction-specific two-dimensional patterns are created in which the magnitude of each vector is the value at each grid point; and a pattern obtained by applying blurring processing to the direction-specific two-dimensional patterns, direction by direction, with respect to the time axis only is extracted as the speech feature.

2. A speech feature extraction method in which a speech signal is analyzed to obtain a spatiotemporal pattern of a scalar field defined by a time axis and a spatial axis and speech features are extracted using the spatiotemporal pattern, wherein: the spatiotemporal pattern is converted by spatial differentiation into a vector field pattern having a magnitude and a direction at each grid point in the space; for the vectors of the vector field pattern, the direction parameter is quantized into N values (N: an integer); the vectors are separated according to their quantized values, and N direction-specific two-dimensional patterns are created in which the magnitude of each vector is the value at each grid point; a pattern obtained by applying blurring processing to the direction-specific two-dimensional patterns, direction by direction, with respect to the time axis and the spatial axis is extracted as the speech feature; and the blurring processing emphasizes the blurring with respect to the time axis more strongly than the blurring with respect to the spatial axis.

3. A speech feature extraction method in which a speech signal is analyzed to obtain a spatiotemporal pattern of a scalar field defined by a time axis and a spatial axis and speech features are extracted using the spatiotemporal pattern, wherein: the spatiotemporal pattern is converted by spatial differentiation into a vector field pattern having a magnitude and a direction at each grid point in the space; for the vectors of the vector field pattern, the direction parameter is quantized into N values (N: an integer); the vectors are separated according to their quantized values, and N direction-specific two-dimensional patterns are created in which the magnitude of each vector is the value at each grid point; a pattern obtained by applying blurring processing to the direction-specific two-dimensional patterns, direction by direction, with respect to the time axis and the spatial axis is extracted as the speech feature; the blurring processing emphasizes the blurring with respect to the time axis more strongly than the blurring with respect to the spatial axis; and the blurring with respect to the spatial axis when extracting the features of voices of both sexes is emphasized more strongly than the blurring with respect to the spatial axis when extracting the features of voices of only one sex.

4. A speech feature extraction method in which a speech signal is analyzed to obtain a spatiotemporal pattern of a scalar field defined by a time axis and a spatial axis and speech features are extracted using the spatiotemporal pattern, wherein: the spatiotemporal pattern is converted by spatial differentiation into a vector field pattern having a magnitude and a direction at each grid point in the space; for the vectors of the vector field pattern, the direction parameter is quantized into N values (N: an integer); the vectors are separated according to their quantized values, and N direction-specific two-dimensional patterns are created in which the magnitude of each vector is the value at each grid point; a pattern obtained by applying blurring processing to the direction-specific two-dimensional patterns, direction by direction, with respect to at least the time axis is extracted as the speech feature; and the blurring processing is a process of performing a mask operation, for each grid point of each direction-specific two-dimensional pattern, with a mask pattern that has a center point corresponding to that grid point, that extends at least two grid points from the center point in each direction along the time axis, and that has predetermined weight values.

5. The speech feature extraction method according to claim 4, wherein the weight values are all "1".

6. A speech feature extraction method in which a speech signal is analyzed to obtain a spatiotemporal pattern of a scalar field defined by a time axis and a spatial axis and speech features are extracted using the spatiotemporal pattern, wherein: the spatiotemporal pattern is converted by spatial differentiation into a vector field pattern having a magnitude and a direction at each grid point in the space; for the vectors of the vector field pattern, the direction parameter is quantized into N values (N: an integer); the vectors are separated according to their quantized values, and N direction-specific two-dimensional patterns are created in which the magnitude of each vector is the value at each grid point; a pattern obtained by applying blurring processing to the direction-specific two-dimensional patterns, direction by direction, with respect to the time axis and/or the spatial axis is extracted as the speech feature; and the blurring processing is a process of performing a mask operation, for each grid point of each direction-specific two-dimensional pattern, with a mask pattern that has a center point corresponding to that grid point, that extends at least two grid points from the center point in each direction along the time axis and at least one grid point from the center point in each direction along the spatial axis, and that has predetermined weight values.

7. The speech feature extraction method according to claim 6, wherein the spread of the mask pattern in the time-axis direction is larger than its spread in the spatial-axis direction.

8. The speech feature extraction method according to claim 6, wherein the weight values of the mask pattern at the center point and in the time-axis direction are all "1", and the weight values in the spatial-axis direction are smaller than "1".
JP63130784A 1987-09-30 1988-05-27 System for extracting characteristic of voice Granted JPH01158496A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP63130784A JPH01158496A (en) 1987-09-30 1988-05-27 System for extracting characteristic of voice

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
JP62-248915 1987-09-30
JP24891587 1987-09-30
JP63130784A JPH01158496A (en) 1987-09-30 1988-05-27 System for extracting characteristic of voice

Publications (2)

Publication Number Publication Date
JPH01158496A JPH01158496A (en) 1989-06-21
JPH0558559B2 true JPH0558559B2 (en) 1993-08-26

Family

ID=17185316

Family Applications (1)

Application Number Title Priority Date Filing Date
JP63130784A Granted JPH01158496A (en) 1987-09-30 1988-05-27 System for extracting characteristic of voice

Country Status (1)

Country Link
JP (1) JPH01158496A (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2613108B2 (en) * 1989-09-27 1997-05-21 Director-General of the Agency of Industrial Science and Technology Voice recognition method
DE19729671A1 (en) * 1997-07-11 1999-01-14 Alsthom Cge Alcatel Electrical circuit arrangement arranged in a housing
JP4930608B2 (en) * 2010-02-05 2012-05-16 JVC Kenwood Corp Acoustic signal analysis apparatus, acoustic signal analysis method, and acoustic signal analysis program

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0330159A (en) * 1989-06-27 1991-02-08 Alps Electric Co Ltd Magnetic disk device

Also Published As

Publication number Publication date
JPH01158496A (en) 1989-06-21

Similar Documents

Publication Publication Date Title
Hermansky et al. Multi-resolution RASTA filtering for TANDEM-based ASR
EP1041540B1 (en) Hierarchial subband linear predictive cepstral features for HMM-based speech recognition
JPH036517B2 (en)
Abdollahi et al. Speaker-independent isolated digit recognition using an aer silicon cochlea
CN113160852A (en) Voice emotion recognition method, device, equipment and storage medium
Khan et al. Speaker separation using visually-derived binary masks
Riazati Seresht et al. Spectro-temporal power spectrum features for noise robust ASR
Biswas et al. Hindi vowel classification using GFCC and formant analysis in sensor mismatch condition
Kulkarni et al. A review of speech signal enhancement techniques
EP0292929B1 (en) Method of feature extraction and recognition of voice and recognition apparatus
JP2003005790A (en) Method and device for voice separation of compound voice data, method and device for specifying speaker, computer program, and recording medium
Chavan et al. Speech recognition in noisy environment, issues and challenges: A review
Deiv et al. Automatic gender identification for hindi speech recognition
Singh et al. Novel feature extraction algorithm using DWT and temporal statistical techniques for word dependent speaker’s recognition
JPH0558559B2 (en)
Mertins et al. Vocal tract length invariant features for automatic speech recognition
Ruinskiy et al. Spectral and textural feature-based system for automatic detection of fricatives and affricates
Wang et al. Speech enhancement based on noise classification and deep neural network
JPH0330159B2 (en)
Khan et al. Speaker separation using visual speech features and single-channel audio.
Lee et al. Exploiting principal component analysis in modulation spectrum enhancement for robust speech recognition
Chandrasekaram New Feature Vector based on GFCC for Language Recognition
Muhsina et al. Signal enhancement of source separation techniques
Shahrul Azmi et al. Noise robustness of Spectrum Delta (SpD) features in Malay vowel recognition
Guntur Feature extraction algorithms for speaker recognition system and fuzzy logic

Legal Events

Date Code Title Description
R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250

R250 Receipt of annual fees

Free format text: JAPANESE INTERMEDIATE CODE: R250

EXPY Cancellation because of completion of term
FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20080826

Year of fee payment: 15