JP2000285104A

JP2000285104A - Method and device for signal processing

Info

Publication number: JP2000285104A
Application number: JP2000015517A
Authority: JP
Inventors: Toshio Irino; 俊夫入野; D Paterson Roy; ロイ・ディ・パターソン
Original assignee: ATR NINGEN JOHO TSUSHIN KENKYU; Medical Research Council; ATR Advanced Telecommunications Research Institute International
Current assignee: ATR NINGEN JOHO TSUSHIN KENKYU; Medical Research Council; ATR Advanced Telecommunications Research Institute International
Priority date: 1999-01-28
Filing date: 2000-01-25
Publication date: 2000-10-13
Anticipated expiration: 2020-01-25
Also published as: JP3174777B2

Abstract

PROBLEM TO BE SOLVED: To provide signal processing that extracts, from a signal to be analyzed, a common feature that does not depend on the physical size of the source, and that processes that feature. SOLUTION: The method includes a wavelet transform step in which a computer performs a wavelet transform of an input signal, and a characteristic extraction step in which the computer extracts a characteristic of the signal by performing a Mellin transform on the output of the wavelet transform step in synchronization with the input signal. In the device, a stabilizing wavelet processing part 2 performs wavelet transform processing of an input signal 1. A Mellin transform processing part 3 performs a Mellin transform of the stabilized wavelet-processed signal output from the part 2. A signal processing part 4 performs prescribed signal processing on the output of the part 3 and outputs a result 5.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to an improvement in the analysis of time-series data, which has conventionally been performed by statistical methods such as autoregressive models or by the Fourier transform. The invention can be applied, for example, to musical sound recognition, speaker identification by voice, speech recognition, analysis of architectural acoustics, and the signal analysis, coding, signal separation, and signal enhancement of speech or music. The invention is not limited to acoustic signals. It is also widely applicable to the analysis of mechanical vibrations such as machine sounds and seismic waves, to the analysis of biological signals such as brain waves, heartbeat sounds, ultrasonic echoes, and nerve cell signals, and to the analysis of sensor signals for collecting general time-series data.

[0002]

2. Description of the Related Art

Conventionally, the basis of signal processing, and of information processing in general, has been to obtain a spectrogram, that is, a "time-frequency representation." Whether a fast digital transform (for example, the fast Fourier transform) or linear predictive analysis is used, what is obtained is a vector that corresponds directly to the spectrum, the frequency representation at a given instant. Holding such vectors as a time series amounts to using a representation equivalent to a spectrogram. These representations derive from the spectral representation of signals that began with the Fourier transform. For example, the representation most commonly used for the features of speech signals is probably the sound spectrogram. A sound spectrogram presents the temporal change of the speech spectrum in an easily viewed form using a grayscale, contour, or color display.

[0003] The spectral representation came into wide use because it matched the needs of information processing for speech and similar signals well: it expresses the features of a signal better than the waveform itself does, the human auditory system is considered not very sensitive to the relative phase relationships of a signal composed of multiple sinusoids, and efficient methods for computing it are well established.

[0004] In the past, performance in all kinds of signal processing has been pushed to its limit by viewing everything exclusively through the spectral representation described above. There is a sense, however, that the limit of such improvement is already near. For example, a speech recognizer generally must be trained in advance on speech from many people. Yet if a child's voice is input to a recognizer trained on the voices of many adult men and women, it will hardly be recognized at all. The basic reason is that the vocal tract and vocal cords differ in physical size between adults and children, so the spectral structure and pitch period of their speech differ, and consequently the feature vectors extracted from their speech differ.

[0005] One way to address this problem is to train the speech recognizer on many children's voices, or to provide a recognizer prepared specifically for children together with a device for discriminating between adults and children. However, no large-scale database of children's voices currently exists, so such a child-specific recognizer cannot easily be prepared. Moreover, even if such a database were built at considerable effort, solutions of this kind are not very efficient.

[0006]

PROBLEMS TO BE SOLVED BY THE INVENTION

To solve this problem at its root, a representation is indispensable in which the physical size of the vocal tract and vocal cords is normalized automatically, something that is difficult to do with a spectrogram. Although only speech recognition has been given as an example here, the need to extract acoustic features that are invariant to the physical size of the sound source arises in many situations, for instance in the analysis of sounds produced by musical instruments and in the analysis of engine sounds. A solution to this problem is needed in a wide range of fields beyond acoustic signals: the analysis of mechanical vibrations such as machine sounds and seismic waves; the analysis of biological signals such as brain waves, heartbeat sounds, ultrasonic echoes, and nerve cell signals; and the analysis of sensor signals for collecting general time-series data.

[0007] It is therefore an object of the present invention to provide a signal processing method, and a device using that method, that goes beyond the inherent limits of the spectral representation discussed in the examples above by using a representation that does not depend on the physical size of the vibration source.

[0008]

MEANS FOR SOLVING THE PROBLEMS

The signal processing method according to claim 1 includes a wavelet transform step of performing, on a computer, a wavelet transform of an input signal, and a characteristic extraction step of extracting a characteristic of the signal by performing, on the computer, a Mellin transform of the output of the wavelet transform step in synchronization with the input signal.

[0009] In the signal processing method according to claim 2, in addition to the features of claim 1, the characteristic extraction step includes: a step of converting a representation corresponding to the running spectrum obtained by the wavelet transform step into a time-interval/log-frequency representation by stabilizing it in time in synchronization with the signal while preserving the fine structure of the response waveform; and a step of performing, in the time-interval/log-frequency representation, processing corresponding to the Mellin transform along lines on which the product or the ratio of time interval and frequency is constant.

[0010] The signal processing method according to claim 3 includes: a step of obtaining, using a computer, a log-time-interval/log-frequency representation by logarithmically transforming the time-interval axis of a stabilized time-interval/log-frequency representation of an input signal in which the origin is specified; and a step of, further using the computer, converting the log-time-interval/log-frequency representation into a new representation having the product of time interval and frequency on the horizontal axis and log frequency on the vertical axis, and extracting a characteristic of the signal by performing an integral transform on that representation along its vertical or horizontal axis.

[0011] The signal processing method according to claim 4 includes, in addition to the features of claim 3, a step of expressing the representation space obtained by the integral transform as a time series of representation vectors, each taken at a given instant.

[0012] The signal processing method according to claim 5 includes, in addition to the features of any one of claims 1 to 4, a step of supplying to the Mellin transform step the output of a frequency analysis, performed taking auditory characteristics into account, of a signal that has been converted into a format processable by a computer.

[0013] The signal processing device according to claim 6 includes wavelet transform means for performing a wavelet transform of an input signal that has been converted into a predetermined format processable by a computer, and characteristic extraction means for extracting a characteristic of the signal by performing a Mellin transform of the output of the wavelet transform means in synchronization with the input signal.

[0014] In the signal processing device according to claim 7, in addition to the features of claim 6, the characteristic extraction means includes: means for converting a representation corresponding to the running spectrum obtained by the wavelet transform means into a time-interval/log-frequency representation by stabilizing it in time in synchronization with the signal while preserving the fine structure of the response waveform; and means for performing, in the time-interval/log-frequency representation, processing corresponding to the Mellin transform along lines on which the product or the ratio of time interval and frequency is constant.

[0015] The signal processing device according to claim 8 includes: means for obtaining a log-time-interval/log-frequency representation by logarithmically transforming the time-interval axis of a stabilized time-interval/log-frequency representation, in which the origin is specified, of an input signal that has been converted into a format processable by a computer; and means for converting the log-time-interval/log-frequency representation into a new representation having the product of time interval and frequency on the horizontal axis and log frequency on the vertical axis, and for extracting a characteristic of the signal by performing an integral transform on that representation along its vertical or horizontal axis.

[0016] The signal processing device according to claim 9 includes, in addition to the features of claim 8, means for expressing the representation space obtained by the integral transform as a time series of representation vectors, each taken at a given instant.

[0017] The signal processing device according to claim 10 includes, in addition to the features of any one of claims 6 to 9, means for supplying to the Mellin transform the output of a frequency analysis, performed taking auditory characteristics into account, of a signal that has been converted into a format processable by a computer.

[0018] The signal processing device according to claim 11 includes: a wavelet bank consisting of a plurality of wavelet filters, each connected to receive the input signal, which perform transforms with wavelets that share the same wavelet kernel function and have mutually distinct frequencies; auditory figure extraction means, connected to receive the output of the wavelet bank, for extracting an auditory figure from that output; size-shape image generation means for generating a size-shape image of the input signal from the auditory figure extracted by the auditory figure extraction means; and feature extraction means for extracting features of the input signal from the size-shape image.

[0019] In the signal processing device according to claim 12, in addition to the features of claim 11, the feature extraction means includes Mellin image generation means for generating a Mellin image by performing a Fourier transform on the size-shape image along the impulse response line of each wavelet filter.

[0020] In the signal processing device according to claim 13, in addition to the features of claim 12, the auditory figure extraction means includes: time strobe integration means for generating a stabilized auditory image by detecting the periodicity contained in the output of the wavelet filter bank and performing time strobe integration on the output of each channel of the wavelet filter bank; and stabilized auditory image extraction means for extracting, as the auditory figure, one cycle at a predetermined position in the stabilized auditory image obtained by the time strobe integration, based on the periodicity detected by the time strobe integration means.

[0021] In the signal processing device according to claim 14, in addition to the features of claim 13, the stabilized auditory image extraction means includes means for extracting the first cycle of the stabilized auditory image as the auditory figure.

[0022] In the signal processing device according to claim 15, in addition to the features of claim 13, the stabilized auditory image extraction means includes means for extracting the second cycle of the stabilized auditory image as the auditory figure.

[0023] The signal processing device according to claim 16 further includes, in addition to the features of claim 11, auditory nerve firing pattern conversion means for converting the output of the filter bank so that it resembles the neural activity in the auditory nerve, and for supplying the converted output to the auditory figure extraction means.

[0024]

DESCRIPTION OF THE PREFERRED EMBODIMENTS

[Basic Matters Underlying the Invention]

First, to clarify the problem addressed by the present invention, and in particular by the embodiments described below, the Mellin transform and the physics of acoustic tubes are described.

1. The Mellin transform

The Mellin transform is an integral transform of the same kind as the Fourier transform and is defined by the equation shown in Appendix A, attached at the end of the description of the embodiments (Moriguchi, Udagawa, and Hitotsumatsu, "Mathematical Formulas II," Iwanami Shoten, 1957; Titchmarsh, "Introduction to the Theory of Fourier Integrals," Oxford U.P., London, 2nd ed.). As equation (A2) of Appendix A also shows, an important property of the Mellin transform is that even when the response of the analyzed signal is dilated or compressed in time while keeping its shape, the absolute value of the distribution obtained by the transform is unchanged except for a constant factor. The present invention uses this property to perform signal processing suited, for example, to recognizing speech despite the differences in spectral structure and pitch period that arise from differences in vocal tract size.

2. Physics of the acoustic tube

Consider a lossless acoustic tube. The solution for a wave propagating in the tube can be obtained by approximating the wave as a plane wave. The analytical solutions for a tube of uniform bore and for a horn-shaped tube are well known enough to appear in elementary physics textbooks. Even when the cross-sectional area of the tube varies, the wave propagating in the tube can be solved numerically by approximating the cross-sectional area function with many small cylinders. Solving the vocal tract by such an approximation is taught in textbooks on speech production models (for example, Nakata, "Speech," Corona Publishing, revised edition, 1995).

[0025] Now consider the impulse response at one end of such a tube when the other end is driven by an impulse. The important point is that when the size of the tube is scaled up or down proportionally, the impulse response waveform is dilated or compressed along the time axis. In other words, the physical size of the tube is directly related to its impulse response.
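As a rough illustration of this scaling (a sketch, not part of the disclosure), take the textbook case of a uniform lossless tube closed at one end and open at the other, whose resonances lie at odd multiples of c/(4L). The speed of sound of 340 m/s and the two tube lengths are assumed values chosen only for illustration, and `tube_resonances` is a hypothetical helper name; a real vocal tract is not a uniform tube.

```python
import numpy as np

SPEED_OF_SOUND = 340.0  # m/s, assumed illustrative value


def tube_resonances(length_m, n_modes=3):
    """Resonance frequencies (Hz) of a uniform lossless tube closed at one end.

    Quarter-wavelength resonator: f_n = (2n - 1) * c / (4 * L).
    """
    n = np.arange(1, n_modes + 1)
    return (2 * n - 1) * SPEED_OF_SOUND / (4.0 * length_m)


long_tube = tube_resonances(0.17)   # hypothetical adult-sized tube
short_tube = tube_resonances(0.12)  # hypothetical child-sized tube

# Every resonance moves by the same factor, the ratio of the two lengths,
# which is what a uniform compression of the impulse response in time implies.
print(short_tube / long_tube)
```

Because all resonances shift by one common factor, the shorter tube's response has the same shape as the longer tube's, only compressed in time.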

[0026] A phoneme uttered by an adult and the same phoneme uttered by a child sound the same to a listener even though the two acoustic tubes differ greatly in size. Phonetics textbooks and English textbooks show charts relating each vowel to its place of articulation, but those charts carry nothing like a scale. The same chart serves adults and children alike, regardless of the size of their articulatory organs. That is, regardless of differences in articulator size, the same phoneme can be uttered if the articulatory posture is made geometrically similar. Put another way, even if the physical size of the vocal tract differs, the same phoneme can be uttered by keeping the vocal tract cross-sectional area function similar in shape.

[0027] When vocal tract cross-sectional area functions are physically similar in shape and differ only in overall length, the vocal tract impulse responses are dilated or compressed versions of one another in time. Relative to an adult's voice, a child's voice therefore corresponds to driving, with voice pulses, an acoustic tube whose impulse response is compressed on the time axis. Individual differences exist, of course, so this is an idealization, but compression of the impulse response on the time axis should be a good first-order approximation of the characteristics of children's speech from a physical standpoint. This analogy is justified not only because it is plausible for speech but also by observation of phenomena other than speech: the violin, cello, and double bass, which differ in size, produce similar sounds as instruments of the same violin family, and engines of the same shape but different sizes produce similar sounds.

3. Statement of the problem

If an internal representation that is invariant to such dilation and compression of the vocal tract impulse response on the time axis could be produced directly, it would no longer be necessary to normalize by performing spectral analysis and computing the scaling from higher-order formants, which are difficult to extract, and the same phoneme could be treated as the same whether spoken by an adult or a child. Invariance to dilation and compression of the waveform on the time axis is precisely the property of the Mellin representation obtained through the Mellin transform described above. The Mellin transform and the Mellin representation are thus seen to have an importance, in the analysis of signals such as speech, that is essentially different from that of analyses derived from the conventional spectral representation.
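The scale property can be checked numerically. The following is a minimal sketch assuming the standard definition M(s) = ∫₀^∞ f(t) t^(s−1) dt evaluated at s = 1/2 + jc; the patent's own definition is equation (A1) of Appendix A, which is not reproduced in this section and may differ in detail. The function name `mellin`, the toy response, and the grid limits are illustrative choices. Under this definition, replacing f(t) by f(at) multiplies M(s) by a^(−s), so the magnitude changes only by the constant factor a^(−1/2).

```python
import numpy as np


def mellin(f, c, sigma=0.5, u_min=-12.0, u_max=8.0, n=20001):
    """Numerically evaluate M(s) = integral_0^inf f(t) t^(s-1) dt at s = sigma + j*c.

    The substitution t = exp(u) turns the integral into
    integral f(exp(u)) exp(s*u) du, approximated here by a Riemann sum.
    """
    u = np.linspace(u_min, u_max, n)
    du = u[1] - u[0]
    s = sigma + 1j * np.atleast_1d(np.asarray(c, dtype=float))
    kernel = np.exp(s[:, None] * u[None, :])          # shape (len(c), n)
    return (kernel * f(np.exp(u))[None, :]).sum(axis=1) * du


def response(t):
    """Toy damped oscillation standing in for an impulse response (assumed)."""
    return t**2 * np.exp(-t) * np.cos(4.0 * t)


c = np.linspace(0.0, 4.0, 9)
reference = np.abs(mellin(response, c))
for a in (0.5, 2.0):                                  # dilation and compression
    scaled = np.abs(mellin(lambda t: response(a * t), c))
    # The ratio is the same constant, a**-0.5, at every value of c.
    print(a, scaled / reference)
```

The shape of the magnitude distribution over c is unchanged by the time scaling; only its overall level moves.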

[0028] In the past, however, the Mellin transform has seen little practical use in signal processing. The reason, as explained below, is that the Mellin transform is "shift varying" and was therefore harder to handle than transforms such as the Fourier transform, whose amplitude is "shift invariant." As equation (A1) of Appendix A shows, the Mellin transform requires that the starting point of the integration (hereinafter called the "origin of analysis") be fixed, and the result changes if this origin moves. This is the property of being shift varying. In the Fourier transform, by contrast, the integration runs over (−∞, ∞), so no such problem of a moving integration range arises. This is the property of being shift invariant.
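The contrast between the two transforms can be illustrated with sampled data. In this sketch (illustrative names and values; the Mellin integral uses the standard form with s = 1/2 + jc, which is an assumption, since Appendix A is not reproduced here), the same toy response is delayed by 1.0 time unit while the origin of analysis stays at t = 0.

```python
import numpy as np

dt = 0.01
t = np.arange(1, 4001) * dt              # 0.01 ... 40.0; origin of analysis at t = 0
x = t**2 * np.exp(-t)                    # toy response that starts at the origin
k = 100                                  # delay of 100 samples (1.0 time unit)
x_shift = np.concatenate([np.zeros(k), x[:-k]])


def fourier_mag(sig):
    """Magnitude of the discrete Fourier transform."""
    return np.abs(np.fft.rfft(sig))


def mellin_mag(sig, c, sigma=0.5):
    """Magnitude of a Riemann-sum Mellin transform taken from the fixed origin."""
    s = sigma + 1j * np.asarray(c, dtype=float)
    kernel = t[None, :] ** (s[:, None] - 1.0)
    return np.abs((kernel * sig[None, :]).sum(axis=1) * dt)


c = np.linspace(0.0, 4.0, 9)
# Largest change in the Fourier magnitude caused by the delay (negligible).
print(np.max(np.abs(fourier_mag(x) - fourier_mag(x_shift))))
# Ratio of Mellin magnitudes after and before the delay (clearly not 1).
print(mellin_mag(x_shift, c) / mellin_mag(x, c))
```

The delay leaves the Fourier magnitude essentially untouched but changes the Mellin magnitude, which is why the origin of analysis has to be fixed before the Mellin transform is applied.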

[0029] As for prior work on the Mellin transform, Umesh et al. have proposed a warping of the frequency axis alone, based on the properties of the Mellin transform (Umesh, Cohen, and Nelson, "Frequency-warping and speaker-normalization," IEEE Int. Conf. Acoust., Speech Signal Processing (ICASSP-97), 1997; Umesh, Cohen, and Nelson, "Improved scale-cepstral analysis in speech," IEEE Int. Conf. Acoust., Speech Signal Processing (ICASSP-98), 1998). Altes has proposed a combination of the Fourier transform and the Mellin transform (Altes, "The Fourier-Mellin transform and mammalian hearing," J. Acoust. Soc. Am., 63, pp. 174-183, 1978). An application of the Mellin transform to speech recognition has also been proposed (Chen, Xu, and Huang, "A novel robust feature of speech signal based on the Mellin transform for speaker-independent speech recognition," ICASSP '98, 1998).

[0030] All of these, however, apply the Mellin transform along the frequency axis using frequency-amplitude information, and none considers phase information, that is, temporal information. None of these papers therefore touches on the problem of fixing the origin of analysis in order to overcome shift variance, and none seeks a representation of sound that preserves a stable temporal fine structure. Because timbre information is thought to reside mainly in this temporal fine structure, a method is desired that normalizes the physical size of the sound source while preserving this information.

[0031] To break through the limits of signal processing in present speech recognizers and similar devices, it is necessary to carry out the computation for signal processing accurately by exploiting the Mellin transform, whose excellent properties go to the essence of speech and acoustic vibration, while overcoming its drawback of being shift varying. The object of the present invention, and in particular of the methods and devices of the embodiments described below, is to make the Mellin transform computable by deriving a temporally stable representation, and thereby to obtain the Mellin representation.

[0032] [Principle of the Invention]

To clarify the principles of the configuration and operation of the present invention, and in particular of the embodiments described below, the basic idea of the invention is now described.

1. Overview of the invention

To overcome the shift-varying drawback of the Mellin transform described above, the Mellin transform must be performed on a representation that has a stable origin at every point in time. Referring to FIG. 1, a general device for implementing the solution of the present invention includes: a stabilized wavelet processing unit 2 for performing, on an input signal 1, the stabilized wavelet transform processing described later; a Mellin transform processing unit 3 for performing a Mellin transform on the stabilized wavelet-processed input signal output from the stabilized wavelet processing unit 2; and a signal processing unit 4 for performing signal processing, such as speech recognition or speech coding, on the output of the Mellin transform processing unit 3 and outputting a result 5. The stabilized wavelet transform processing performed in the stabilized wavelet processing unit 2 passes the input signal through a wavelet filter bank to perform time-frequency analysis and also fixes the origin of analysis. Because the stabilized wavelet processing unit 2 fixes the origin of analysis, the Mellin transform processing unit 3 can perform the Mellin transform on its output.

【００３３】この装置では、入力信号１は、安定化ウェ
ーブレット処理部２によって安定化ウェーブレット変換
されて、さらにその出力に対して安定化ウェーブレット
処理部２で定められた解析の原点を積分の起点としてメ
リン変換３が行なわれ、メリン表現が得られる。得られ
たメリン表現は、音源の寸法や波形の周期性の変動に関
して正規化された音声信号の特徴表現である。この表現
は、従来の音声分析で主として利用されていたスペクト
ルや線形予測係数と同様に、ベクトルとしても表すこと
もできる。したがって、このメリン表現を、従来から用
いられてきたありとあらゆる信号処理に対する入力とし
て与えることができ、それらに対応する結果５が得られ
る。たとえば音声認識装置においては、メリン表現され
た多数の特徴ベクトルを予め準備しておき、入力された
特徴ベクトルとの間で従来と全く同様のマッチングを行
なうことにより音声認識を行なうことが可能となり、そ
のためのハードウェアも従来と同様でよい。２．ウェーブレット変換図２を参照して、本発明における安定化ウェーブレット
変換を計算するための安定化ウェーブレット処理部２
は、入力信号６（請求項１の入力信号１と同じであり、
通常は周期性を有することが想定されている。）に対し
てウェーブレット変換を行なうためのフィルタバンクか
らなるウェーブレット変換処理部７と、ウェーブレット
変換処理部７の出力の振幅を対数圧縮または指数圧縮に
より圧縮するための振幅圧縮部８と、振幅圧縮部８の出
力を受けて、周期性を表わす事象を検出して検出出力を
発生するための事象検出処理部９と、事象検出処理部９
の出力に応答して、前述した通り解析の原点を定めるよ
うに振幅圧縮部８の出力波形の時間間隔を安定化させて
安定化ウェーブレット変換出力１１として出力するため
の時間間隔安定化処理部１０とを含む。In this apparatus, the input signal 1 is subjected to the stabilized wavelet transform by the stabilized wavelet processing unit 2, and the Mellin transform 3 is then applied to the output, using the origin of the analysis determined by the stabilized wavelet processing unit 2 as the starting point of the integration, to obtain a Mellin representation. The resulting Mellin representation is a feature representation of the speech signal that is normalized with respect to variations in the size of the sound source and in the periodicity of the waveform. Like the spectrum and the linear prediction coefficients that have mainly been used in conventional speech analysis, this representation can also be expressed as a vector. The Mellin representation can therefore be supplied as input to any conventional signal processing, and a corresponding result 5 is obtained. In a speech recognition device, for example, a large number of feature vectors in the Mellin representation are prepared in advance, and speech is recognized by matching them against the input feature vector in exactly the same way as before; the hardware can also be the same as before.
2. Wavelet Transform
Referring to FIG. 2, the stabilized wavelet processing unit 2 for computing the stabilized wavelet transform according to the present invention includes: a wavelet transform processing unit 7, consisting of a filter bank, for applying a wavelet transform to an input signal 6 (the same as the input signal 1 of claim 1; it is normally assumed to be periodic); an amplitude compression unit 8 for compressing the amplitude of the output of the wavelet transform processing unit 7 by logarithmic or exponential compression; an event detection processing unit 9 for receiving the output of the amplitude compression unit 8, detecting events that indicate periodicity, and generating a detection output; and a time interval stabilization processing unit 10 that, in response to the output of the event detection processing unit 9, stabilizes the time intervals of the output waveform of the amplitude compression unit 8 so as to fix the origin of the analysis as described above, and outputs the result as a stabilized wavelet transform output 11.

【００３４】ウェーブレット変換処理部７で行なわれる
ウェーブレット変換を定義する式は実施の形態の説明の
最後に添付した付録Ｂの式Ｂ１〜Ｂ７に示す。ウェーブ
レット変換は、フーリエ変換における基底関数である正
弦波に替えて、ウェーブレット核（「マザーウェーブレ
ット」とも呼ばれる。）と呼ばれる、波形の小片を定め
る関数を用いる。そしてこのウェーブレット核を時間軸
上で拡大、縮小した（互いに周波数が異なる）波形が、
解析対象となる波形にどの程度の大きさで含まれるかを
調べることにより、解析対象の波形を時間と周波数との
二次元に分けて解析することができる。The equations defining the wavelet transform performed by the wavelet transform processing unit 7 are given as Equations B1 to B7 in Appendix B at the end of the description of the embodiments. Instead of the sine wave that serves as the basis function of the Fourier transform, the wavelet transform uses a function, called the wavelet kernel (also called the "mother wavelet"), that defines a short piece of waveform. By examining how strongly versions of this kernel dilated or compressed along the time axis (and therefore differing in frequency) are present in the waveform under analysis, that waveform can be analyzed in the two dimensions of time and frequency.

【００３５】フーリエ変換では正弦波を用いている。正
弦波は時間軸上で（−∞，∞）の範囲に一様に広がった
周期関数である。そのため、フーリエ変換では入力信号
のある一部にどの周波数の信号がどの程度存在している
か、という局所的な情報を得ることはできない。それに
対してウェーブレット変換では、どの位置に、どの周波
数のウェーブレットが、どの程度の大きさで含まれてい
るかという局所的な情報を知ることができる。このた
め、ウェーブレット変換によって入力信号を時間と周波
数との二次元から解析できる。The Fourier transform uses sine waves. A sine wave is a periodic function that extends uniformly over (−∞, ∞) on the time axis. The Fourier transform therefore cannot provide local information about which frequencies are present, and how strongly, in a particular part of the input signal. The wavelet transform, by contrast, provides local information about which wavelet frequency is present at which position and with what magnitude. The wavelet transform thus allows the input signal to be analyzed in the two dimensions of time and frequency.

【００３６】またウェーブレット変換では、目的に応じ
てウェーブレット核を変え、応用ごとに適切な波形のウ
ェーブレット核を用いることができることが知られてい
る。たとえば、Daubechiesのウェーブレット、メキシカ
ンハット、フレンチハット、Shannonのウェーブレッ
ト、Haarのウェーブレット、Gaborのウェーブレット、M
eyerのウェーブレットなどが知られている。以下に述べ
る実施の形態では、特定のウェーブレットを用いている
が、応用に応じて上記した、およびここにあげていない
種々のウェーブレットを用いることが可能である。It is also known that, in the wavelet transform, the wavelet kernel can be changed according to the purpose, so that a kernel with a waveform suited to each application can be used. Known examples include the Daubechies wavelet, the Mexican hat, the French hat, the Shannon wavelet, the Haar wavelet, the Gabor wavelet, and the Meyer wavelet. The embodiments described below use one specific wavelet, but the wavelets listed above, and others not listed here, can be used depending on the application.

【００３７】多くの場合周期性を持つ（式Ｂ１）入力信
号１は、ウェーブレット変換処理部７によりウェーブレ
ット変換され解析される（Combes et al.（Eds.）,"Wav
elets", Springer-Verlag,Berlin,1989）。ウェーブレ
ット核としては、例えば所定周波数で周波数変調され、
ガンマ分布を包絡線として持つガンマチャープ関数（式
Ｂ２）を選ぶことができる。このガンマチャープ関数
は、メリン変換において、最小不確定性の意味で最適な
関数であることが知られている（Irino and Patterso
n,"A time-domain, level-dependent auditory filter:
The gammachirp,"J. Acoust. Soc. Am., 101,pp.412-4
19, 1997）。なお、ウェーブレット核は上記したガンマ
チャープ関数に限定されるわけではなく、既に述べたよ
うに解析においてどの特徴を重視するかに応じて適切な
関数により定められる波形を用いることができる。The input signal 1, which is periodic in many cases (Equation B1), is wavelet-transformed and analyzed by the wavelet transform processing unit 7 (Combes et al. (Eds.), "Wavelets", Springer-Verlag, Berlin, 1989). As the wavelet kernel, one can choose, for example, the gammachirp function (Equation B2), which is frequency-modulated at a predetermined frequency and has a gamma distribution as its envelope. This gammachirp function is known to be optimal for the Mellin transform in the sense of minimum uncertainty (Irino and Patterson, "A time-domain, level-dependent auditory filter: The gammachirp," J. Acoust. Soc. Am., 101, pp. 412-419, 1997). The wavelet kernel is not limited to the gammachirp function; as already noted, a waveform defined by any function appropriate to the features emphasized in the analysis can be used.

【００３８】ウェーブレット核を時間軸上で伸縮したウ
ェーブレットフィルタ（式Ｂ３）の組を用いることによ
りウェーブレット変換処理部７のフィルタバンクを実現
できる。ここでは、最大周波数と帯域幅とが比例する定
Ｑ型で、対数周波数軸上で等間隔に配置したフィルタバ
ンクの各フィルタと信号との間で畳み込み積分を行なう
（式Ｂ４）。The filter bank of the wavelet transform processing unit 7 can be realized with a set of wavelet filters (Equation B3) obtained by dilating and compressing the wavelet kernel along the time axis. Here, the filters are of the constant-Q type, in which the maximum frequency is proportional to the bandwidth, and are spaced equally on a logarithmic frequency axis; the signal is convolved with each filter of the bank (Equation B4).
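The layout of such a constant-Q bank can be sketched as follows. This is only an illustration under stated assumptions: the function names, the frequency range, and the value of Q are hypothetical, and Equations B3 and B4 of Appendix B (including any normalization of the dilated kernel) are not reproduced here.

```python
def constant_q_bank(f_low, f_high, n_filters, q):
    """Return (maximum_frequency, bandwidth) pairs for a constant-Q bank.

    Frequencies are spaced equally on a logarithmic axis between f_low and
    f_high; every bandwidth is the filter's frequency divided by q.
    All parameter names are illustrative assumptions.
    """
    if n_filters < 2:
        raise ValueError("need at least two filters")
    # Equal spacing on a log axis means a constant ratio between neighbours.
    step = (f_high / f_low) ** (1.0 / (n_filters - 1))
    bank = []
    for k in range(n_filters):
        freq = f_low * step ** k
        bank.append((freq, freq / q))
    return bank


def dilation_factors(bank, reference_frequency):
    """Time-compression factor of each filter relative to a kernel tuned
    to reference_frequency (a filter at twice the frequency is the kernel
    compressed in time by a factor of two)."""
    return [freq / reference_frequency for freq, _ in bank]
```

With `constant_q_bank(100.0, 6400.0, 7, 4.0)`, neighbouring filters are one octave apart and each bandwidth is one quarter of the filter frequency, so every filter is the same kernel at a different time scale.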

【００３９】仮に、外界の信号が、時間的に圧縮または
伸長されても、ウェーブレット変換はその出力波形には
歪みを与えない。単にその信号の出力がより高い、また
はより低い最大周波数のフィルタの位置に移動するだけ
である。これは、ウェーブレットフィルタ自体が元のウ
ェーブレット核関数を時間軸上で拡大・縮小したもの
で、いずれも同じフィルタ形状を有するからである。Even if a signal from the outside world is compressed or expanded in time, the wavelet transform does not distort the output waveform. The output simply moves to filters with a higher or lower maximum frequency. This is because each wavelet filter is the original wavelet kernel function dilated or compressed along the time axis, so all filters have the same shape.

【００４０】得られた各フィルタ出力の振幅値に対して
は、図２の振幅圧縮部８で対数圧縮（式Ｂ５）または指
数圧縮（式Ｂ６）が行なわれる。この時、目的に応じ、
波形の正負の部分の両方を残す場合と、半波整流して正
部分のみを残す場合とのふた通りが考えられる。以下に
示す各例では、半波整流した場合を示す。正負の両部分
を残す場合も、後の処理は基本的の以下の説明と同じで
ある。３．メリン変換の前提と安定化ウェーブレット変換既に延べ、式Ａ１からわかるように、メリン変換は必ず
解析の原点を特定することが必要で、原点がずれると表
現も変わってしまう「シフト変動（shift-varying）」
な変換である。メリン変換がシフト変動である、という
点が、シフト不変なフーリエ変換に対して不利な点で、
これがメリン変換がいままであまり用いられてこなかっ
た理由でもある。しかし、上記のような、物理的大きさ
の変動に対して耐性があるという音声信号処理にとって
魅力ある性質をもっている。したがって、解析の原点を
確実かつ安定に決定できれば、シフト変動であるという
メリン変換の欠点を克服でき、メリン変換を音声信号処
理に有効に利用することが可能となる。本発明はそのた
めの一つの解決策を与える。The amplitude of each filter output is then subjected to logarithmic compression (Equation B5) or exponential compression (Equation B6) by the amplitude compression unit 8 of FIG. 2. Depending on the purpose, either both the positive and negative parts of the waveform are kept, or the waveform is half-wave rectified and only the positive part is kept. The examples below all use half-wave rectification. When both parts are kept, the subsequent processing is essentially the same as described below.
3. Premise of the Mellin Transform and the Stabilized Wavelet Transform
As already stated, and as Equation A1 shows, the Mellin transform always requires the origin of the analysis to be specified; it is a "shift-varying" transform whose result changes if the origin is displaced. Being shift-varying puts the Mellin transform at a disadvantage relative to the shift-invariant Fourier transform, and this is one reason it has been little used so far. It nevertheless has a property that is attractive for speech signal processing: as described above, it is robust to variations in physical size. If the origin of the analysis can be determined reliably and stably, the shift-varying drawback can therefore be overcome and the Mellin transform can be used effectively for speech signal processing. The present invention provides one such solution.
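The contrast between robustness to scaling and sensitivity to the origin can be checked numerically. The sketch below is not the formulation of Equation A1, which is not reproduced in this section; it assumes the standard Mellin transform M(p) = ∫ f(t) t^(p−1) dt with p = 1/2 + jc, and the test pulse, the grid, and all function names are illustrative choices.

```python
import cmath
import math


def mellin(signal, c, sigma=0.5, u_min=-25.0, u_max=6.0, du=0.005):
    """Approximate M(p) = integral_0^inf signal(t) * t**(p-1) dt at p = sigma + j*c.

    The substitution t = exp(u) turns the integral into
    integral signal(exp(u)) * exp(p*u) du, evaluated here as a Riemann sum.
    """
    p = complex(sigma, c)
    n = int(round((u_max - u_min) / du))
    total = 0j
    for k in range(n + 1):
        u = u_min + k * du
        total += signal(math.exp(u)) * cmath.exp(p * u)
    return total * du


def profile(signal, cs):
    """Transform magnitude at each c, normalised by its value at c = 0."""
    ref = abs(mellin(signal, 0.0))
    return [abs(mellin(signal, c)) / ref for c in cs]


def pulse(t):
    """Gamma-shaped test pulse whose origin is at t = 0 (an arbitrary choice)."""
    return t * math.exp(-t) if t > 0.0 else 0.0


def dilated(t):
    return pulse(2.0 * t)   # same pulse, compressed in time by a factor of 2


def shifted(t):
    return pulse(t - 1.0)   # same pulse, origin displaced by 1


cs = [0.5, 1.0, 2.0]
base = profile(pulse, cs)
# Largest change in the normalised magnitude profile caused by each operation.
scale_error = max(abs(a - b) for a, b in zip(base, profile(dilated, cs)))
shift_error = max(abs(a - b) for a, b in zip(base, profile(shifted, cs)))
```

Under these assumptions `scale_error` is at the level of numerical round-off, because time scaling only multiplies the magnitude by a constant, whereas `shift_error` is clearly non-zero. This is the behaviour that makes a stable origin a precondition for using the transform.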

【００４１】信号は常に時間的に流れているので、ウェ
ーブレット変換を行なった後の「ウェーブレットスペク
トル」も時間的に流れる「ランニングスペクトル」に相
当する。そのためウェーブレットスペクトルのみからで
は解析の原点を決められない。この解析の原点を事象検
出処理部９で決定する。以下、事象検出処理部９で行な
う処理の詳細について説明する。Because a signal always flows in time, the "wavelet spectrum" obtained by the wavelet transform is likewise a "running spectrum" that flows in time. The origin of the analysis therefore cannot be determined from the wavelet spectrum alone. It is determined by the event detection processing unit 9, whose processing is described in detail below.

【００４２】周期信号（式Ｂ２）や疑似周期信号の場
合、各ウェーブレットフィルタ出力は、１周期に１つの
最大値を持つ。本願発明は、音源情報はそうした最大値
を固定して見た時の波形として表現されている点に着目
する。そのために本願発明では、フィルタ出力の周期性
を事象検出処理部９によって検出し、そこを原点にして
メリン変換を取ることにより振幅圧縮部８の出力信号の
時間間隔を安定化させる。For a periodic signal (Equation B2) or a quasi-periodic signal, each wavelet filter output has one maximum per period. The present invention focuses on the fact that the sound source information is expressed as the waveform seen when these maxima are held fixed. The event detection processing unit 9 therefore detects the periodicity of the filter output, and the Mellin transform is taken with the detected point as the origin, which stabilizes the time intervals of the output signal of the amplitude compression unit 8.

【００４３】最大値検出の方法については既に報告がさ
れている（Irino and Patterson, "Temporal asymmerty
in the auditory sytem, "J.Acoust. Soc. Am., 99, p
p.2316-2331, 1996; Patterson and Irino," Modeling
temporal asymmerty in theauditory sytem," J.Acous
t. Soc. Am., 104, pp.2967-2979, 1998 ）。それ以外
にもピッチ周期検出に関しては過去から多くの報告があ
る（たとえばHess, "Pitch Determination of Speech S
ignals," Springer-Verlag, NY, 1983）。Methods for detecting these maxima have already been reported (Irino and Patterson, "Temporal asymmetry in the auditory system," J. Acoust. Soc. Am., 99, pp. 2316-2331, 1996; Patterson and Irino, "Modeling temporal asymmetry in the auditory system," J. Acoust. Soc. Am., 104, pp. 2967-2979, 1998). Many other reports on pitch period detection have also been published over the years (e.g., Hess, "Pitch Determination of Speech Signals," Springer-Verlag, NY, 1983).

【００４４】本願発明では、各チャンネルにおける最大
値の時点を、図２の時間間隔安定化処理部１０で行なわ
れる時間積分の開始時点とする。時間間隔安定化処理部
１０が行なう時間積分では、ある開始時点から次の開始
時点までを１周期として各ウェーブレットフィルタ出力
をコピーして、イメージバッファの対応するチャンネル
の既に存在する１周期分の表現に一点一点加えあわせる
ことによって新たな表現を生成する。この操作をストロ
ーブ時間積分（Patterson, Allerhand and Giguere, "T
ime-domain modelling of peripheral auditory proces
sing: a modular architecture and a software platfo
rm", J.Acoust. Soc. Am., 98,1890-1894, 1995; Patte
rson and Holdsworth, "Apparatus and methods for th
e generation of stabilised images from waveforms,"
United Kingdom Patent: 2232801 （1993）, United S
tates Patent: 5,422,977 （1995）, European Patent:
0473664 （1995））と呼び、ここまでの操作全体を安
定化ウェーブレット変換と呼ぶ。In the present invention, the time of the maximum in each channel is taken as the start time of the temporal integration performed by the time interval stabilization processing unit 10 of FIG. 2. In this temporal integration, the span from one start time to the next is treated as one period; each wavelet filter output over that span is copied and added, point by point, to the one-period representation already held in the corresponding channel of an image buffer, producing a new representation. This operation is called strobed temporal integration (Patterson, Allerhand and Giguere, "Time-domain modelling of peripheral auditory processing: a modular architecture and a software platform", J. Acoust. Soc. Am., 98, 1890-1894, 1995; Patterson and Holdsworth, "Apparatus and methods for the generation of stabilised images from waveforms," United Kingdom Patent: 2232801 (1993), United States Patent: 5,422,977 (1995), European Patent: 0473664 (1995)), and the whole sequence of operations up to this point is called the stabilized wavelet transform.
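The point-by-point accumulation described above can be sketched for a single channel as follows. The function name, the list-based buffer, and the assumption that strobe times are already available as sample indices are illustrative and are not taken from the cited references.

```python
def strobed_integration(channel, strobes, width):
    """Accumulate the segment following each strobe into one image row.

    channel: filter output samples for one channel.
    strobes: increasing sample indices of the detected maxima.
    width:   number of time-interval bins kept in the image buffer.
    Each segment runs from one strobe to the next and is added point by
    point, so that every strobe lands on bin 0 of the buffer.
    """
    image = [0.0] * width
    for start, stop in zip(strobes, strobes[1:]):
        for lag in range(min(width, stop - start)):
            image[lag] += channel[start + lag]
    return image
```

For a channel that repeats the pattern 3, 1, 0, 0 after two leading zeros, with strobes at indices 2, 6, 10 and 14, three identical periods are summed and the buffer holds 9, 3, 0, 0: the pattern no longer moves with time, and its origin is bin 0.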

【００４５】安定化ウェーブレット変換によって、次周
期の各ウェーブレット出力、次々周期のウェーブレット
出力、さらに先の周期のウェーブレットフィルタ出力を
構成する各点の値はイメージバッファ内の同じ位置に加
算されるため、信号の流れが止まり安定な表現となる。
また、この表現では、横軸として一つ前のピークからの
時間間隔がとられるため、原点は常に零である。With the stabilized wavelet transform, the values of the points making up the wavelet filter output of the next period, the period after that, and later periods are all added at the same positions in the image buffer, so the flow of the signal stops and the representation becomes stable. In addition, because the horizontal axis of this representation is the time interval from the preceding peak, the origin is always zero.

【００４６】周期信号（式Ｂ２）や疑似周期信号の安定
化ウェーブレット変換（式Ｂ７）は、その微細構造に音
源情報を保存していて周期的に繰返したパターンにな
る。ここで、安定化ウェーブレット変換により得られる
安定化された時間間隔パターンの１周期分を音源情報図
形（式Ｂ８）または聴覚図形と呼ぶことにする。この音
源情報図形は安定で、開始点が常に決まっているので、
シフト変動性の問題を回避して、この上でメリン変換を
取ることができる。すなわち、安定化ウェーブレット変
換は、メリン変換が音源情報を解析するのに必要な条件
を準備したことになる。４．メリン変換の計算メリン変換は、量子力学で使われるオペレータで表現で
きることが知られている（Cohen,"The scale transfor
m," IEEE Trans. Acoust. Speech and Signal Processi
ng, 1993; Irino, "An optimal auditory filter," IEE
E Workshop on Applications of Signal Processing to
Audio and Acoustics, 1995; Irino, "A'gammachirp'
function as as optimal auditory filter with the Me
llin transform," IEEE Int. Conf. Acoust., Speech S
ignal Processing （ICASSP-96）, 1996）。その場合、
メリン変換は、Gaborが用いた時間オペレータと周波数
オペレータ（Gabor,"Theory of communication," J. IE
E （London）,93,42-457,1946）との積をとった形式に
なっている。すなわち、時間と周波数との積がメリン変
換にとって重要な概念である。メリン変換を定義する式
を、実施の形態の最後に添付した付録Ｂの式Ｂ８〜Ｂ１
２に示す。The stabilized wavelet transform (Equation B7) of a periodic signal (Equation B2) or a quasi-periodic signal preserves the sound source information in its fine structure and forms a periodically repeating pattern. One period of the stabilized time-interval pattern obtained by the stabilized wavelet transform is here called the sound source information figure (Equation B8), or auditory figure. Because this figure is stable and its starting point is always fixed, the Mellin transform can be taken on it without the shift-variance problem. In other words, the stabilized wavelet transform provides the conditions the Mellin transform needs in order to analyze sound source information.
4. Computing the Mellin Transform
The Mellin transform is known to be expressible in terms of the operators used in quantum mechanics (Cohen, "The scale transform," IEEE Trans. Acoust. Speech and Signal Processing, 1993; Irino, "An optimal auditory filter," IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 1995; Irino, "A 'gammachirp' function as an optimal auditory filter with the Mellin transform," IEEE Int. Conf. Acoust., Speech Signal Processing (ICASSP-96), 1996). In that formulation, the Mellin transform takes the form of a product of the time operator and the frequency operator used by Gabor (Gabor, "Theory of communication," J. IEE (London), 93, 429-457, 1946). The product of time and frequency is thus a key concept for the Mellin transform. The equations defining the Mellin transform are given as Equations B8 to B12 in Appendix B at the end of the embodiments.

【００４７】本発明では、原理的には、音源情報図形
（式Ｂ８）に対して、時間と周波数との積が一定となる
等値線（式Ｂ９）に沿ってメリン変換（式Ｂ１０）を行
なう。ここで、メリン変換のパラメータＰは複素数（式
Ｂ１１）なので、式Ｂ１０は式Ｂ１２のように書き換え
ることができる。これにより、音源情報図形のメリン変
換として、横軸を時間間隔と周波数との積、縦軸をメリ
ン変換核の複素変数とした２次元表現を得ることができ
る。この表現をメリンイメージと呼ぶことにする。In principle, the present invention applies the Mellin transform (Equation B10) to the sound source information figure (Equation B8) along contours on which the product of time and frequency is constant (Equation B9). Because the parameter P of the Mellin transform is complex (Equation B11), Equation B10 can be rewritten as Equation B12. This yields, as the Mellin transform of the sound source information figure, a two-dimensional representation whose horizontal axis is the product of time interval and frequency and whose vertical axis is the complex variable of the Mellin transform kernel. This representation is called the Mellin image.

【００４８】この表現の上では、音源情報は正規化され
ていて音源の周期性や物理的大きさの拡大・縮小に対し
て不変の表現になっている。したがって、従来から提案
されている信号処理手法にしたがった信号処理部４に対
してこの正規化音源情報を与えることにより、より優れ
た信号処理が実現できる。In this representation the sound source information is normalized and is invariant to the periodicity of the source and to enlargement or reduction of its physical size. Supplying this normalized sound source information to a signal processing unit 4 that follows a previously proposed signal processing method therefore enables better signal processing.

【００４９】図３のフローチャートに以上の処理の流れ
を示す。メリン変換の計算に関しては、さらに詳しく第
１の実施の形態において述べる。図３を参照して、波形
入力を受けると、これらをウェーブレット変換のフィル
タバンクを通すことによりウェーブレット変換の計算が
行なわれる。The flowchart of FIG. 3 shows the flow of the above processing. The computation of the Mellin transform is described in more detail in the first embodiment. Referring to FIG. 3, when a waveform is input, it is passed through the wavelet transform filter bank to compute the wavelet transform.

【００５０】ウェーブレット変換の出力から信号周期情
報を抽出し、この情報をもとにウェーブレット変換の出
力を安定化させ、一つ前のピークからの時間間隔−対数
周波数表現の計算を行なうことにより、音源情報図形を
得る。Signal period information is extracted from the output of the wavelet transform, the output is stabilized on the basis of this information, and a representation of time interval from the preceding peak versus logarithmic frequency is computed, yielding the sound source information figure.

【００５１】こうして得られた音源情報図形上の、時間
間隔と周波数との積が一定となる線に沿ってメリン変換
の計算を行なう。こうして、音源の周期性および物理的
大きさの拡大または縮小に対して不変な表現であるメリ
ンイメージが得られる。５．メリンイメージの時系列前節では、ある一時点の安定化ウェーブレット変換から
メリンイメージを計算する方法を示した。信号は時々刻
々変化しており、それに対応した安定化ウェーブレット
変換から得た音源情報図形も変化する。そこで、ある間
隔ごとに音源情報図形を抽出し、それをもとにそれぞれ
メリンイメージを計算する。このメリンイメージの各々
から１つの特徴ベクトルを抽出することができる。する
と、スペクトログラムのように、横軸に時間をとり、縦
軸にメリンイメージベクトルの軸をとって、メリンイメ
ージベクトルを並べた表現を作ることができる。これ
は、スペクトログラムとは全く異なるものではあるが、
形式的には同じとなるので、従来スペクトログラムを用
いてきた信号処理手法にそのまま入力でき、様々な分野
に容易に応用することができる。The Mellin transform is then computed on the resulting sound source information figure along lines on which the product of time interval and frequency is constant. This yields the Mellin image, a representation that is invariant to the periodicity of the sound source and to enlargement or reduction of its physical size.
5. Time Series of Mellin Images
The previous section showed how to compute a Mellin image from the stabilized wavelet transform at a single point in time. The signal changes from moment to moment, and the sound source information figure obtained from the corresponding stabilized wavelet transform changes with it. Sound source information figures are therefore extracted at certain intervals, and a Mellin image is computed from each. One feature vector can be extracted from each Mellin image. As with a spectrogram, a representation can then be built by arranging the Mellin image vectors with time on the horizontal axis and the Mellin image vector axis on the vertical axis. Although entirely different from a spectrogram in content, it has the same form, so it can be fed directly into signal processing methods that have conventionally used spectrograms and is easily applied in various fields.

【００５２】［作用・効果］音源の物理的な大きさに依
存して、解析する波形が時間的に拡大・縮小しても、メ
リンイメージのスケール分布は不変である。これは、フ
ーリエスペクトルにはない性質である。また、同時にフ
ーリエスペクルとは表現は異なるものの、メリンイメー
ジベクトルによる表現は、解析の対象となる波形の拡大
・縮小以外の違いは明確に表わすことができる。音声の
場合は、異なる声道長の発声もメリンイメージベクトル
による表現では同様に扱うことができる。したがって逆
にメリンイメージベクトルによる表現を用いて音韻の違
いだけを強調することができる。たとえば、メリンイメ
ージベクトルによる表現を用いれば、大人のデータで学
習した音声認識装置をそのまま子供の認識に使うことが
できる可能性がある。これ以外にもメリンイメージベク
トルを用いた表現を適用することができる局面は多くあ
り、音声認識装置等の性能向上が期待できる。さらに、
メリンイメージベクトルによる表現を従来より用いられ
ているスペクトル分布と組み合わせて用いることによ
り、従来の性能を超えた音声信号処理を実現できる。ま
た、対象となる波形は、時系列データであれば何でもか
まわないので、音声や音楽といった音響信号ばかりでな
く、機械的振動、生体信号、および時系列的な計測デー
タのいずれにも本発明にかかる手法を応用することが可
能である。[Operation and Effect] Even if the waveform under analysis is expanded or compressed in time depending on the physical size of the sound source, the scale distribution of the Mellin image does not change. This is a property the Fourier spectrum lacks. At the same time, although the representation differs from the Fourier spectrum, the Mellin image vector representation clearly expresses differences in the analyzed waveform other than expansion and compression. For speech, utterances produced with different vocal tract lengths are treated alike in the Mellin image vector representation. Conversely, the representation can therefore be used to emphasize phonemic differences alone. For example, with the Mellin image vector representation, a speech recognition device trained on adult data might be usable as-is for recognizing children's speech. There are many other situations in which the representation can be applied, and improved performance of speech recognition devices and the like can be expected. Furthermore, using the Mellin image vector representation in combination with the conventionally used spectral distribution can achieve speech signal processing that exceeds conventional performance. Finally, since the waveform can be any time-series data, the method according to the present invention can be applied not only to acoustic signals such as speech and music but also to mechanical vibrations, biological signals, and time-series measurement data.

【００５３】以上において、本願発明の実施の形態の基
本的手法と、その背景とについて説明した。以下、本願
発明の実施の形態について詳細に説明する。第１の実施の形態図４を参照して、本発明の第１の実施の形態の音声認識
装置は、図１に示すものと同様、安定化ウェーブレット
処理部２と、メリン変換処理部３と、信号処理部４とを
含む。The basic method of the embodiments of the present invention and its background have been described above. The embodiments are described in detail below.
First Embodiment
Referring to FIG. 4, the speech recognition device of the first embodiment of the present invention includes, like the apparatus shown in FIG. 1, a stabilized wavelet processing unit 2, a Mellin transform processing unit 3, and a signal processing unit 4.

【００５４】安定化ウェーブレット処理部２は、音声信
号１２を入力として受け、音声信号１２に対してウェー
ブレット変換を行なって周波数分析を行なうための聴覚
フィルタバンク１３と、聴覚フィルタバンク１３の出力
に対して、聴神経での神経活性度に類似した出力を得る
ような変換を行なうための聴神経発火パターン変換部１
４と、時間積分を制御するために、ある近傍での最大値
を検出するための事象検出（ピッチ検出）回路１５と、
事象検出（ピッチ検出）回路１５の出力を合図（ストロ
ーブ）として、聴神経発火パターン変換部１４の出力す
る現在の一定区間を取出して前述した時間積分を行なっ
て安定化聴覚イメージを生成し出力するための安定化聴
覚イメージ処理部１６とを含む。これら各構成要素につ
いては後に詳述する。The stabilized wavelet processing unit 2 includes: an auditory filter bank 13 that receives a speech signal 12 as input and applies a wavelet transform to it for frequency analysis; an auditory nerve firing pattern conversion unit 14 that converts the output of the auditory filter bank 13 into an output resembling neural activity in the auditory nerve; an event detection (pitch detection) circuit 15 that detects the maximum within a local neighborhood in order to control temporal integration; and a stabilized auditory image processing unit 16 that, using the output of the event detection (pitch detection) circuit 15 as a trigger (strobe), takes the current fixed-length segment output by the auditory nerve firing pattern conversion unit 14 and performs the temporal integration described above to generate and output a stabilized auditory image. Each of these components is described in detail later.

【００５５】メリン変換処理部３は、安定化聴覚イメー
ジ処理部１６の出力する安定化聴覚イメージを変形し、
新しい表現である寸法−形状イメージを出力するための
寸法−形状イメージ処理部１７と、寸法−形状イメージ
処理部１７の出力する寸法−形状イメージからメリンイ
メージを計算し、メリンイメージベクトルに基づく表現
として出力するためのメリンイメージ処理部１８とを含
む。The Mellin transform processing unit 3 includes: a size-shape image processing unit 17 that transforms the stabilized auditory image output by the stabilized auditory image processing unit 16 and outputs a new representation, the size-shape image; and a Mellin image processing unit 18 that computes a Mellin image from the size-shape image output by the size-shape image processing unit 17 and outputs it as a representation based on Mellin image vectors.

【００５６】信号処理部４は、メリンイメージ処理部１
８の出力するメリンイメージベクトルに基づく表現を、
予め準備されたテンプレートとマッチングして音声認識
し音声認識結果２０を出力するための音声認識回路１９
を含む。The signal processing unit 4 includes a speech recognition circuit 19 that recognizes speech by matching the Mellin-image-vector representation output by the Mellin image processing unit 18 against templates prepared in advance, and outputs a speech recognition result 20.

【００５７】図４に示す装置において、入力される音声
信号１２は、メリン変換処理部３によって安定化聴覚イ
メージ（Stabilized Auditory Image, SAI）に変換され
る。この安定化聴覚イメージは、安定化ウェーブレット
変換２で得られる表現の聴覚版である。安定化聴覚イメ
ージは、寸法−形状イメージ処理部１７によって寸法−
形状イメージ１７に変換され、さらにメリンイメージ処
理部１８によってメリンイメージ１８に変換される。こ
の処理は、メリン変換３に相当する。なお、以下に述べ
る聴覚イメージモデルをもとにした安定化ウェーブレッ
ト−メリン変換を示す式等については実施の形態の説明
の最後に添付した付録Ｃに記載してある。１．安定化聴覚イメージの構成この節では、安定化ウェーブレット処理部２の各構成要
素の動作について述べる。入力される音声信号１２は、
聴覚フィルタバンク１３で周波数分析される。この実施
の形態の装置では、聴覚フィルタバンク１３の各々の聴
覚フィルタは、ガンマ分布関数の包絡線で周波数変調さ
れた搬送波を持つガンマチャープ（式Ｃ１）で近似でき
る。また、聴覚フィルタバンク１３はおおよそ５００Ｈ
ｚ以上では最大周波数と帯域幅が比例する定Ｑ型のフィ
ルタとなっている（式Ｃ２）。すなわち、聴覚フィルタ
バンクはガンマチャープ（式Ｃ１）を核関数としたウェ
ーブレット変換（式Ｃ３、式Ｃ４）になっていて、この
関数のパラメータは人間の聴覚フィルタを模擬するよう
に設定できる（Irino and Patterson,"A time-domain,
level-dependent auditory filter: The gammachirp,"
J. Acoust. Soc. Am., 101,pp.412-419, 1997）。聴覚
フィルタを並べた聴覚フィルタバンク１３はIIRフィル
タで構成できる（たとえば特開平１１−２４６９６号公
報、特開平１１−１１９７９７号公報を参照）。In the apparatus shown in FIG. 4, the input speech signal 12 is converted into a stabilized auditory image (SAI) by the Mellin transform processing unit 3. This stabilized auditory image is the auditory counterpart of the representation obtained with the stabilized wavelet transform 2. The stabilized auditory image is converted into a size-shape image 17 by the size-shape image processing unit 17, and then into a Mellin image 18 by the Mellin image processing unit 18. This processing corresponds to the Mellin transform 3. The equations expressing the stabilized wavelet-Mellin transform based on the auditory image model described below are given in Appendix C at the end of the description of the embodiments.
1. Construction of the Stabilized Auditory Image
This section describes the operation of each component of the stabilized wavelet processing unit 2. The input speech signal 12 is frequency-analyzed by the auditory filter bank 13. In the apparatus of this embodiment, each auditory filter of the auditory filter bank 13 can be approximated by a gammachirp (Equation C1), which has a frequency-modulated carrier with a gamma distribution function as its envelope. Above approximately 500 Hz, the auditory filter bank 13 consists of constant-Q filters whose maximum frequency is proportional to their bandwidth (Equation C2). The auditory filter bank is thus a wavelet transform (Equations C3 and C4) with the gammachirp (Equation C1) as its kernel function, and the parameters of this function can be set to simulate the human auditory filter (Irino and Patterson, "A time-domain, level-dependent auditory filter: The gammachirp," J. Acoust. Soc. Am., 101, pp. 412-419, 1997). The auditory filter bank 13, an array of auditory filters, can be built from IIR filters (see, for example, JP-A-11-24696 and JP-A-11-119797).

【００５８】聴覚フィルタバンク出力は、聴神経発火パ
ターン変換部１４によって聴神経発火パターン（Neural
Activity Pattern, NAP）に変換される。具体的には、
聴覚フィルタバンク１３の出力に対して半波整流が行な
われて、振幅が対数圧縮（式Ｃ５）または指数圧縮（式
Ｃ６）され、さらに適応処理により信号の立ち上がり部
分が強調されて、聴神経での神経活性度に類似した出力
を得る。The output of the auditory filter bank is converted into a neural activity pattern (NAP) by the auditory nerve firing pattern conversion unit 14. Specifically, the output of the auditory filter bank 13 is half-wave rectified, its amplitude is logarithmically compressed (Equation C5) or exponentially compressed (Equation C6), and the onsets of the signal are then emphasized by adaptive processing, giving an output that resembles neural activity in the auditory nerve.
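The first two stages of this conversion can be sketched as follows. Equation C5 is not reproduced in this section, so the compression used here, log(1 + x), is an assumed stand-in chosen because it maps zero to zero; the adaptive onset emphasis is omitted, and the function names are hypothetical.

```python
import math


def half_wave_rectify(samples):
    """Keep the positive part of the waveform and zero the rest."""
    return [x if x > 0.0 else 0.0 for x in samples]


def log_compress(samples):
    """Assumed logarithmic compression, log(1 + x); not the patent's Equation C5."""
    return [math.log1p(x) for x in samples]


def neural_activity(filter_output):
    """Rectify, then compress, one channel of filter bank output."""
    return log_compress(half_wave_rectify(filter_output))
```

Applied to the samples -2, 0, 1, 9, this sketch gives 0, 0, log 2, log 10: negative excursions are removed and large amplitudes are brought much closer to small ones.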

【００５９】 The event detection (pitch detection) circuit 15 monitors the activity of each channel, detects the maximum within a local neighborhood, and uses it to control temporal integration. The processing in the event detection (pitch detection) circuit 15 is performed, for example, as follows. First, the activity is smoothed to compute an envelope. The derivative of this envelope is then computed, and the time of the largest activity peak near the point where the derivative (the envelope slope) changes from positive to negative is taken as the time of the local maximum (Irino and Patterson, 1996, cited above). In signals with periodicity or quasi-periodicity, such as voiced speech and sustained musical instrument sounds, these local maxima occur regularly. Each local maximum serves as a trigger (strobe): the current fixed-length segment of the neural activity pattern is taken out and added to the corresponding channel of the buffer of the auditory image 16, aligned at the time of the local maximum. Repeating this for every segment performs the temporal integration. This form of integration is called Strobed Temporal Integration (STI).
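The event detection steps described above (smooth, differentiate, find the positive-to-negative slope change, take the largest nearby peak) can be sketched as follows. This is a minimal illustration, not the circuit of the patent: the moving-average smoother, the `smooth_len` parameter, and the function name `find_strobes` are assumptions introduced here.

```python
import numpy as np

def find_strobes(nap, smooth_len=32):
    """Return sample indices of local maxima (strobe points) in one NAP channel."""
    nap = np.asarray(nap, dtype=float)
    # Assumption: a moving average stands in for the envelope smoother.
    kernel = np.ones(smooth_len) / smooth_len
    envelope = np.convolve(nap, kernel, mode="same")
    slope = np.diff(envelope)
    # Samples where the envelope slope changes from positive to non-positive.
    turns = np.where((slope[:-1] > 0) & (slope[1:] <= 0))[0] + 1
    half = smooth_len // 2
    strobes = []
    for t in turns:
        t = int(t)
        lo, hi = max(0, t - half), min(len(nap), t + half + 1)
        # Take the largest activity peak near the envelope turning point.
        strobes.append(lo + int(np.argmax(nap[lo:hi])))
    return sorted(set(strobes))
```

For a clean pulse train the detected strobes coincide with the pulses; for real speech the smoothing length would need tuning to the expected pitch range.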

【００６０】 STI converts the time axis of the neural activity pattern (NAP) into a time-interval axis referenced to the immediately preceding local maximum (Equation C7). When strobed temporal integration is applied to every channel of the auditory filter bank 13, a stabilized auditory image 16 (Equation C7) is obtained that preserves the vertical (logarithmic frequency) axis of the auditory filter bank 13. The stabilized auditory image as a whole decays with a half-life of about 30 ms, so the image fades away naturally once the input signal stops.
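A rough sketch of strobed temporal integration for a single channel, including the 30 ms half-life decay, is given below. It is a sketch under stated assumptions: the segment is taken forward from each strobe, the image width `width_s` is a hypothetical parameter, and the decay is applied in one step per strobe rather than continuously.

```python
import numpy as np

def strobed_temporal_integration(nap, strobes, fs, width_s=0.035, half_life_s=0.030):
    """Accumulate NAP segments, aligned at strobe points, into one SAI channel."""
    nap = np.asarray(nap, dtype=float)
    width = int(round(width_s * fs))
    sai = np.zeros(width)
    last = None
    for s in strobes:
        if last is not None:
            # Decay the whole image by the time elapsed since the last strobe.
            sai *= 0.5 ** ((s - last) / (half_life_s * fs))
        segment = nap[s:s + width]
        sai[:len(segment)] += segment  # time interval 0 is the strobe instant
        last = s
    return sai
```

With a 10 ms pulse train sampled at 10 kHz, the accumulated image shows activity at time intervals of 0, 100, and 200 samples and nothing in between.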

【００６１】 Integrating the stabilized auditory image along the time direction gives a spectral marginal distribution. Because this spectral marginal distribution resembles the spectral vector of a conventional spectrogram, it can be used to build an auditory spectrogram and can also be applied to speech recognition (see, for example, Patterson et al. 1995, cited above).

2. Construction of the Size-Shape Image

This section describes in detail the processing performed by the size-shape image processing unit 17. The stabilized auditory image output from the stabilized auditory image processing unit 16 has a linear time-interval axis as its horizontal axis and a logarithmic frequency axis as its vertical axis. The size-shape image processing unit 17 transforms this representation to obtain a new representation, the size-shape image. This is an important step because it makes the Mellin image 18 of the next section easy to compute. FIG. 5 is a block diagram showing the details of the size-shape image processing unit 17 that performs this processing, and the flowchart of FIG. 6 shows the processing flow described below. The following description refers to FIG. 5 and FIG. 6 as needed.

【００６２】 Referring to FIG. 5, the size-shape image processing unit 17 includes: a filter delay correction unit 22 for correcting the filter delay contained in the stabilized auditory image 21; an activity calculation unit 23 for summing the auditory image vertically over all channels to compute the total activity along the time-interval axis; a periodicity detection unit 24 for detecting the periodicity of the auditory image from the magnitude of the activity computed by the activity calculation unit 23; an auditory figure extraction unit 25 for extracting an auditory figure, described later, from the auditory image by using the periodicity detected by the periodicity detection unit 24; a logarithmic time-interval conversion unit 26 for converting the horizontal axis of the auditory figure extracted by the auditory figure extraction unit 25 from a linear time-interval axis to a logarithmic time-interval axis; and an impulse response correction unit 27 that shifts the horizontal axis of each channel so that the straight impulse response lines observed in the auditory figure converted by the conversion unit 26 become parallel to the vertical axis.

【００６３】 FIG. 7 shows a stabilized auditory image 21 as an example of a stabilized auditory image obtained with the Auditory Image Model (AIM) (Patterson et al. 1995, cited above). FIG. 7 displays a little over two periods of the auditory image for a click train generated at 10 ms intervals, that is, at a rate of 100 Hz. The vertical axis labels each filter channel by its peak frequency in Hz and is a quasi-logarithmic frequency axis. The horizontal axis is the time interval, in milliseconds, measured from the local maximum at which the strobed temporal integration was started. Here the time-interval axis is linear.

【００６４】 Referring to FIG. 7, the regions of high activity along the three vertical lines are spaced at the period of the original waveform. The position 0 ms on the horizontal axis is where the activity at the local maximum is placed by the strobed temporal integration. For a periodic signal, the local maximum identifies each period; for an aperiodic signal, it identifies the starting point of a feature. In this way, strobed temporal integration identifies the starting point, or zero point, of the Mellin transform analysis.

【００６５】 For the Mellin transform, it is theoretically desirable that the wavelet filters making up the first-stage auditory filter bank 13 be aligned according to a reasonable criterion, for example that the onset of the auditory filter envelope (time t = 0 in Equation C1) coincide in all channels. Strobed temporal integration, however, cannot detect the onset of the auditory filter envelope itself. It strobes at the maximum of the response, so there is a delay relative to the envelope onset. This offset is visible in FIG. 7 as the activity lying on the curves to the left of each vertical concentration of activity. Correcting this filter delay is desirable because it makes the processing easier to follow.

【００６６】 The filter delay correction unit 22 performs this correction. It is enough to shift the activity of each channel to the right by one period, that is, by the reciprocal of the peak frequency of that channel's auditory filter (Equation C8). FIG. 8 shows the auditory image obtained by applying this correction to FIG. 7. After the correction, the vertically aligned positions are a good approximation of the starting point of the Mellin transform. As described later, it is known that omitting this correction has little effect on the output of the Mellin transform.
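The per-channel shift described above can be sketched as follows. The array layout (channels by time-interval samples), the zero padding on the left, and the name `correct_filter_delay` are assumptions made for illustration only.

```python
import numpy as np

def correct_filter_delay(sai, peak_freqs_hz, fs):
    """Shift each SAI channel right by one period of its peak frequency."""
    sai = np.asarray(sai, dtype=float)
    n = sai.shape[1]
    out = np.zeros_like(sai)
    for ch, fc in enumerate(peak_freqs_hz):
        shift = int(round(fs / fc))  # one period 1/fc, expressed in samples
        if shift < n:
            out[ch, shift:] = sai[ch, :n - shift]
    return out
```

Low-frequency channels are shifted further than high-frequency ones, which is what straightens the curved onset seen to the left of each vertical line in FIG. 7.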

【００６７】 As described above, the strobed temporal integration (STI) performed in the stabilized auditory image processing unit 16 stabilizes the time-interval patterns that a periodic sound produces repeatedly in the neural activity pattern (NAP). This creates places in the stabilized auditory image (SAI) where activity is concentrated vertically, as seen at time intervals 0, 10, and 20 in FIG. 7. As FIG. 7 makes clear, these vertical activity lines divide the auditory image into several similar segments spaced at the period of the original signal. One such segment is called the auditory figure (AF) corresponding to the source signal (Equation C9).

【００６８】 The activity calculation unit 23 sums this auditory image vertically over all channels to compute the total activity distributed along the time-interval axis. The periodicity detection unit 24 can determine the periodicity of the pattern from the magnitude of this activity. Using this periodicity information, the auditory figure extraction unit 25 can extract, from the filter-delay-corrected auditory image (FIG. 8, the result of the correction by the filter delay correction unit 22), an auditory figure corresponding to one period of the auditory image.
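As a minimal sketch of the summed-activity periodicity detection and auditory figure extraction, the code below picks the largest peak of the channel-summed activity beyond a small guard lag. The guard lag `min_lag` and the assumption that the first vertical activity line sits at column 0 are illustrative choices, not details given in the text.

```python
import numpy as np

def extract_auditory_figure(sai, min_lag=1):
    """Estimate the period from summed activity and cut out one auditory figure."""
    sai = np.asarray(sai, dtype=float)
    total = sai.sum(axis=0)  # total activity along the time-interval axis
    # Skip the concentration at time interval 0, then take the largest peak.
    period = min_lag + int(np.argmax(total[min_lag:]))
    return period, sai[:, :period]
```

The function returns the period in samples together with the first one-period segment; a later segment could be selected by offsetting the slice by a multiple of the period.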

【００６９】 The auditory figure extracted by the auditory figure extraction unit 25 has a linear time-interval axis as its horizontal axis. Converting this time interval to a logarithmic scale simplifies the subsequent processing. The logarithmic time-interval conversion unit 26 performs this conversion, that is, it converts the horizontal axis of the auditory figure into a logarithmic time-interval axis (Equation C10). As shown in FIG. 9, this conversion turns the group of curves in the auditory figure that correspond to the impulse responses of the auditory filters into a group of straight lines that, at 500 Hz and above, are nearly parallel and regularly spaced. FIG. 9 was obtained by converting the leftmost auditory figure in FIG. 8 to a logarithmic time-interval axis using spline interpolation.
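The conversion to a logarithmic time-interval axis amounts to resampling each channel at logarithmically spaced time intervals. The sketch below uses linear interpolation (`np.interp`) in place of the spline interpolation mentioned in the text, and the number of output points `n_points` is a hypothetical parameter.

```python
import numpy as np

def to_log_time_interval(figure, fs, n_points=64):
    """Resample an auditory figure onto a logarithmic time-interval axis."""
    figure = np.asarray(figure, dtype=float)
    n = figure.shape[1]
    t = np.arange(n) / fs  # linear time-interval axis in seconds
    # Log-spaced intervals; start at one sample because log(0) is undefined.
    log_t = np.geomspace(t[1], t[-1], n_points)
    out = np.vstack([np.interp(log_t, t, row) for row in figure])
    return log_t, out
```

Because the axis starts at one sample rather than zero, the activity at time interval 0 is not represented, which matches the remark in the text that the h = 0 line appears only when the logarithmic conversion is bypassed.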

【００７０】 Referring to FIG. 9, the straight impulse response lines all have a negative slope and are inclined in the same way as the diagonal of the auditory figure. This representation has logarithmic time interval on the horizontal axis and logarithmic frequency on the vertical axis, a form in which the Mellin transform can be computed easily.

【００７１】 To make the Mellin transform computation and the representation of the source information easier to understand, the impulse response lines of the logarithmic time-interval auditory figure in FIG. 9 (Equation C10) are corrected so that they become parallel to the vertical axis. Such a line is perpendicular to the horizontal axis and is referred to below as a "vertical line". The result is FIG. 10 (Equation C11). This correction is performed by the logarithmic time-interval conversion unit 26 and corresponds to shifting the logarithmic time-interval axis of each channel to the right by an amount proportional to the logarithm of the channel's peak frequency. The new horizontal axis in FIG. 10 is the logarithm of h, the product of the time interval and the channel peak frequency (Equation B9). The vertical axis is, as before, the peak frequency on a logarithmic scale.
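The reason a shift by the logarithm of the peak frequency aligns the impulse response lines can be written out in one line. The symbols below are chosen here for illustration: τ is the time interval and f_b the channel peak frequency, with h their product as in Equation B9.

```latex
h = \tau f_b
\quad\Longrightarrow\quad
\log h = \log \tau + \log f_b
```

A response peak that occurs k periods after the strobe in a given channel has τ = k / f_b, so after adding log f_b it lands at log k in every channel. Points with the same k therefore fall on one vertical line regardless of the channel.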

【００７２】 Referring to FIG. 10, the leftmost dotted vertical line marks the position in the auditory figure where h, the product of the time interval and the channel peak frequency, equals 1. FIG. 10 also shows broken vertical lines at h = 1 to 5, and activity is concentrated on every one of them. In other words, in the representation of FIG. 10 the impulse responses of all the wavelet filters are concentrated on the vertical lines where h is an integer, which shows that this representation does not depend on the dilation or contraction of the wavelet filter. To make this easier to see, the horizontal axis can be converted to a linear h axis, which gives FIG. 11.

【００７３】 In the example of FIG. 11, the activity was obtained directly from the auditory image of FIG. 8 without the logarithmic conversion, so the activity on the vertical line at h = 0 is also shown. To do this, it is enough to resample the activity of each channel of the auditory image in FIG. 8 at a sampling frequency proportional to the peak frequency of that channel, and to arrange the resulting sample points as they are in two dimensions.
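The direct resampling route can be sketched as follows: each channel is sampled at times h / f_c, which is the same as sampling at a rate proportional to the channel peak frequency. The grid of h values (`h_max`, `n_h`), the use of linear interpolation, and the function name are assumptions for illustration.

```python
import numpy as np

def size_shape_image(sai, peak_freqs_hz, fs, h_max=5.0, n_h=51):
    """Resample each SAI channel onto a common linear h axis (h = interval x frequency)."""
    sai = np.asarray(sai, dtype=float)
    t = np.arange(sai.shape[1]) / fs
    h = np.linspace(0.0, h_max, n_h)
    ssi = np.zeros((len(peak_freqs_hz), n_h))
    for ch, fc in enumerate(peak_freqs_hz):
        # Sampling at times h / fc is sampling at a rate proportional to fc.
        ssi[ch] = np.interp(h / fc, t, sai[ch], right=0.0)
    return h, ssi
```

If every channel contains a cosine at its own peak frequency, all rows of the result coincide and peak at integer h, which mirrors the concentration of activity on the integer-h vertical lines described for FIG. 10 and FIG. 11.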

【００７４】 As described in the previous section, the wavelet filter has the same representation in every channel of this image. Consequently, when sound sources are geometrically similar and the resulting waveform is dilated or contracted in the manner of a wavelet, a representation of the same shape is always obtained. In this representation, dilation or contraction of the waveform appears simply as a translation of the activity distribution along the vertical frequency axis. Because the representation carries information about both the size and the shape of the sound source, it is called the Size-Shape Image (SSI). As described later, it is particularly effective for representing the auditory figures of vowels. The processing flow described so far is the one shown in the flowchart of FIG. 6.

【００７５】 The auditory figures in the size-shape images of FIG. 10 and FIG. 11 were obtained by the above procedure from the leftmost auditory figure of the auditory image in FIG. 7. The leftmost auditory figure need not be used, however. The second auditory figure may be used instead, and the procedure works with an auditory figure (Equation C9) representing any one period of any signal.

【００７６】 For a simple click train such as this example, the result is the same whichever figure is chosen. When noise has been added to speech or musical sound, however, choosing the second auditory figure is actually better for extracting the signal component alone, because both the noise and the signal components are concentrated in the first auditory figure.

【００７７】 The marginal distribution of the size-shape image along its horizontal h axis is dominated by the impulse response of the wavelet filter, whose shape is the same in every channel, and is therefore called the Impulse Profile (Equation C12). The marginal distribution along the vertical axis is the auditory Spectral Profile (Equation C13). The Impulse Profile carries source information different from that of a conventional spectral vector. Each profile is a vector representing the size-shape image at one point in time, so if these vectors are computed at regular intervals (for example, every 5 to 30 ms) and arranged as a time series in the form of a spectrogram, they can be applied to speech recognition. This representation could be called a size-shape image spectrogram.

3. Construction of the Mellin Image

This section explains why and how the Mellin image processing unit 18 obtains the Mellin image from the size-shape image output by the size-shape image processing unit 17, and shows that this Mellin image corresponds to the Mellin image output from the Mellin transform processing unit 3 in FIG. 1.

【００７８】 In the size-shape image output by the size-shape image processing unit 17, the response of the auditory wavelet filter accounts for most of the distribution. When a sound other than a click train is input, the source information that appears to the right of these impulse response lines is represented only relatively weakly. What we want to extract is the source information itself, so the auditory filter information should be removed by some form of deconvolution. To this end, consider taking the Fourier transform of the vertical vector at each value of h in the size-shape image and representing each vector by the amplitudes of its spatial frequency components. As FIG. 10 shows, the auditory wavelet filter information in the size-shape image changes little from channel to channel, so it should be concentrated at very low spatial frequencies. In contrast, sound from sources other than click-like sounds forcibly excites the wavelet filters and causes different ringing at different frequencies, so it should appear at relatively high spatial frequencies. The source information can thus be separated from the information of the wavelet filter itself.

【００７９】 This computation can be realized by replacing the weighting function W(αf_b, h) in the Impulse Profile formula (Equation C12) with the weighted complex sinusoid defined on the logarithmic frequency axis and given by Equation C14. Introducing the spatial frequency c/2π as a parameter gives W(αf_b, h, c), and substituting this into Equation C12 yields the two-dimensional representation of Equation C15. The output M_I(h, c) obtained from Equation C15 is called the Mellin Image 18. Its horizontal axis is h, the same as in the size-shape image, and its vertical axis is the spatial frequency c/2π of the Fourier transform. A vertical translation in the size-shape image becomes a mere phase change after the Fourier transform, leaving the amplitude information unchanged. In addition, the periodicity of the source has already been removed in the size-shape image, and the h axis is invariant to size. The auditory figure represented by the Mellin image therefore expresses shape information about the sound source that does not depend on the size of the source or on the periodicity of its excitation.
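The key property used here, that a vertical translation changes only the phase of the Fourier transform, can be checked numerically. The sketch below is a simplification under stated assumptions: it uses a plain discrete Fourier transform along the channel axis (assumed equally spaced in logarithmic frequency) instead of the weighted kernel of Equation C14, and it models the translation as a circular shift, for which the invariance is exact.

```python
import numpy as np

def mellin_image(ssi):
    """Magnitude of the Fourier transform along the log-frequency (channel) axis."""
    ssi = np.asarray(ssi, dtype=float)
    # One column per h value; rows of the result index spatial frequency.
    return np.abs(np.fft.rfft(ssi, axis=0))

rng = np.random.default_rng(0)
ssi = rng.random((32, 8))             # 32 channels, 8 values of h
shifted = np.roll(ssi, 3, axis=0)     # circular shift along the channel axis
same = np.allclose(mellin_image(ssi), mellin_image(shifted))
```

For a real size-shape image the translation is not circular, so the invariance holds only approximately near the edges of the frequency range.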

【００８０】 FIG. 12 shows the Mellin image obtained from the size-shape image of the click train in FIG. 11. As FIG. 12 shows, in the Mellin image of this click train the activity is concentrated only at very low spatial frequencies, and there is almost no activity at high spatial frequencies. This reflects the fact that, as described above, a click produces nearly flat activity along the vertical lines of the size-shape image, except in the low-frequency channels. The size-shape image is by construction a representation in which the impulse response of the wavelet filter is normalized to the same shape in every channel, so in theory, when only a single click is input, an amplitude value exists only at zero spatial frequency.

4. Correspondence between the Mellin Image and the Mellin Transform

Before moving on to the analysis of damped oscillations and vowels, consider the relationship between two quantities: the Mellin image of this example, obtained as the output of the Mellin image processing unit 18 and expressed as an integral in the frequency domain (Equation C15), and the Mellin image output from the Mellin transform processing unit 3, given in the basic explanation as an integral in the time-interval domain (Equation B10). Taking the logarithm of the basic constraint that the product of the time interval and the peak frequency is constant (Equation B9) gives Equation C16, and differentiating it gives Equation C17. Substituting this relationship into Equation C15 and using Equations C10 and C11 gives Equation C18. Apart from a constant, this is an integral in the time-interval domain of the same form as Equation B10. This shows that the Mellin image of Equation C15 and the Mellin image of Equation B10 are the same.

5. Auditory Image, Size-Shape Image, and Mellin Image of a Damped Oscillation

FIG. 13 shows the auditory image of a repeating exponentially damped sinusoid. This damped sinusoid has an exponential envelope with a half-life of 2 ms, a sinusoidal carrier at 2 kHz, and a repetition rate of 100 Hz. A damped sinusoid with these parameters resembles a single-formant vowel. The repeated onsets produce a click-like response, seen as activity on vertical lines, in the frequency regions away from 2 kHz, and the spacing between two vertical activity lines indicates the periodicity of the signal. The auditory image in FIG. 13 shows that in the 2 kHz region the response is emphasized and extended by the resonance with its decaying envelope. This feature is common to natural sounds, including speech.
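For readers who want to reproduce a comparable test signal, the damped sinusoid with the stated parameters (2 ms half-life, 2 kHz carrier, 100 Hz repetition) can be generated as below. The sampling rate, duration, and the choice of a sine phase starting at zero at each onset are assumptions; the text does not specify them.

```python
import numpy as np

def damped_sinusoid_train(fs=48000, dur_s=0.05, carrier_hz=2000.0,
                          half_life_s=0.002, rep_hz=100.0):
    """Repeating exponentially damped sinusoid (a single-formant-like signal)."""
    n = int(round(dur_s * fs))
    period = int(round(fs / rep_hz))
    t = (np.arange(n) % period) / fs       # time since the most recent onset
    envelope = 0.5 ** (t / half_life_s)    # halves every half_life_s seconds
    return envelope * np.sin(2 * np.pi * carrier_hz * t)
```

Each 10 ms cycle restarts the envelope and the carrier phase, so the waveform is exactly periodic at the repetition rate.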

【００８１】 FIG. 14 shows the size-shape image of the auditory figure of this damped sinusoid. The activity away from 2 kHz differs little from that of the click train in FIG. 11. In the channels around 2 kHz, however, the activity extends to high values of h, and the slope of the columns of adjacent activity gradually increases as h increases. This shows that in channels other than the 2 kHz channel the instantaneous frequency is not the frequency of the wavelet filter, that is, not the carrier frequency of the filter in each channel.

【００８２】 FIG. 15 shows the Mellin image of this damped sinusoid. Because the onset is click-like, activity is concentrated at very low spatial frequencies, as in the case of the click train (FIG. 11). The activity related to the resonance in the 2 kHz region of the size-shape image adds further vertical bands of activity to the Mellin image, showing that there is a response over a wide range of spatial frequencies where h is large. The width of the band-shaped active region grows as h increases, which corresponds to the fact that the slope between adjacent activity observed in the fine structure increases with h. This is characteristic of a single-resonance, or single-formant, sound source.

【００８３】 For damped sinusoids with other parameter values, the band structure of the Mellin image changes little with the carrier frequency, the half-life of the envelope, or the repetition rate of the signal. In other words, differences in this band structure capture information about the shape of the source independently of its size and repetition rate. The strength and extent of the vertical bands increase gradually as the half-life of the damped sinusoid increases. The next section extends the example and applies the same analysis to vowels synthesized with a vocal tract area function.

6. Auditory Images, Size-Shape Images, and Mellin Images of Four Versions of the Vowel 'a'

To demonstrate the invariance of the size-shape image and the Mellin image to the size of the source, four versions of the synthetic vowel 'a' were created. These vowels were synthesized from a vocal tract model using the vocal tract area function of one male speaker (Yang C-S and Kasuya, H. (1995). "Dimension differences in the vocal tract shapes measured from MR images across boy, female and male subjects," J. Acoust. Soc. Jpn (E), 16, pp. 41-44.). We consider extracting the features of this vocal tract shape with the size-shape image and the Mellin image.

【００８４】 The first pair of the four vowels uses the vocal tract area function as it is, excited by glottal pulses at two different rates, 100 Hz and 160 Hz. FIG. 16 and FIG. 17 show their auditory images. The vocal tract resonances appear as extensions of the response in the resonant regions of the auditory image. These are the formants of phonetics. The second and third formants have center frequencies of approximately 1000 Hz and 2200 Hz. The vertical concentrations of activity are closer together in FIG. 17 than in FIG. 16, but the formant positions do not change with the glottal pulse rate.

【００８５】 The second pair of vowels was synthesized with the length of the vocal tract reduced to 2/3 while keeping the same vocal tract area function geometrically similar. The glottal pulse rates are 100 Hz and 160 Hz, as before. FIG. 18 and FIG. 19 show the auditory images of these vowels. Within this pair the second and third formants are at the same positions, but compared with FIG. 16 and FIG. 17 they have moved up by a factor of 3/2, to 1500 Hz and 3300 Hz respectively. This is because the vocal tract is shorter. The positions of the vertical activity lines are the same in FIG. 16 and FIG. 18, and in FIG. 17 and FIG. 19.
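The formant values quoted above follow from simple scaling, written out here as a worked check. The notation f for a formant of the original vocal tract and f' for the shortened one is introduced for illustration, as is the assumption that a geometrically similar tract of 2/3 the length scales every resonance by the reciprocal factor.

```latex
f' = \frac{f}{2/3} = \frac{3}{2} f:\qquad
1000\,\mathrm{Hz} \to 1500\,\mathrm{Hz},\quad
2200\,\mathrm{Hz} \to 3300\,\mathrm{Hz},\qquad
\log_2 f' - \log_2 f = \log_2 \tfrac{3}{2} \approx 0.585\ \text{octave}
```

On a logarithmic frequency axis every formant therefore moves up by the same distance, about 0.585 octave, which is why the change in size appears as a uniform vertical translation.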

【００８６】 FIG. 20 to FIG. 23 show the size-shape images of these four vowels in the same order as the auditory images. In these auditory figures, the distinction between the response to the glottal pulse at the left of the figure and the formants extending to the right is emphasized. The patterns for the original, long vocal tract (FIG. 20 and FIG. 21) are essentially the same. Only the position of the right-hand boundary of the auditory figure, which is determined by the repetition rate of the waveform, differs, and the range is narrower in FIG. 21, which has the higher pitch. Likewise, the size-shape images of the short vocal tract vowels (FIG. 22 and FIG. 23) have the same pattern and differ only in the position of the right-hand boundary.

【００８７】さらに、長い声道と短い声道での寸法−形
状イメージを比べると、下から４つのホルマントの応答
パターンがそれぞれ非常に類似していることがわかる。
異なるのは、長い声道の図２０と図２１とのパターンに
くらべて短い声道の図２２と図２３とのパターンは周波
数の上方に平行移動している点である。長い声道の図２
０と図２１の寸法−形状イメージで見える第５・第６ホ
ルマントは、図２２と図２３とでは上限周波数６０００
Ｈｚの上に同じ量だけ移動してしまって見えなくなって
いるが、図の周波数範囲を上方に広げれば見えるように
なる。Further, comparing the size-shape images of the long and short vocal tracts shows that the response patterns of the lowest four formants are very similar. The difference is that the patterns for the shorter vocal tract (FIGS. 22 and 23) are translated upward in frequency relative to those for the longer vocal tract (FIGS. 20 and 21). The fifth and sixth formants visible in the size-shape images of FIGS. 20 and 21 have moved up by the same amount in FIGS. 22 and 23, past the upper limit of 6000 Hz, and so are not visible; they would reappear if the frequency range of the figures were extended upward.

【００８８】これらの４母音のメリンイメージを図２４
〜図２７に聴覚イメージや寸法−形状イメージの順番ど
おりに示す。メリンイメージの縦軸はメリン係数ｃ／２
πで、これは寸法−形状イメージの垂直方向に対する空
間周波数に相当し、１００Ｈｚから６０００Ｈｚまでの
範囲での１周期が空間周波数１に対応する。あるｈの値
に対するメリンイメージの値は、寸法−形状イメージの
垂直方向に複素正弦波を用いて積分した後の絶対値で、
空間周波数と活性度の分布とに最も合致するものが大き
くなる。The Mellin images of these four vowels are shown in FIGS. 24 to 27, in the same order as the auditory images and size-shape images. The vertical axis of the Mellin image is the Mellin coefficient c/2π, which corresponds to the spatial frequency along the vertical direction of the size-shape image; one cycle over the range from 100 Hz to 6000 Hz corresponds to a spatial frequency of 1. The value of the Mellin image at a given h is the absolute value of the integral of the size-shape image along the vertical direction against a complex sinusoid, so it is large where the spatial frequency best matches the distribution of activity.
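The column-wise integration described above can be sketched as a discrete sum. This is only an illustration, assuming the size-shape image is sampled on channels spaced evenly in log frequency and normalized by the channel count; the name `mellin_column` and the toy data are not from the source.

```python
import cmath
import math

def mellin_column(column, spatial_freqs):
    """Magnitude of the projection of one size-shape column (fixed h) onto
    complex sinusoids along the log-frequency axis.
    spatial_freqs are in cycles per full frequency range (the c/2pi axis)."""
    n = len(column)
    result = []
    for q in spatial_freqs:
        acc = sum(a * cmath.exp(-2j * math.pi * q * k / n)
                  for k, a in enumerate(column))
        result.append(abs(acc) / n)
    return result

# Toy column: two "formant" bumps on 64 log-spaced channels.
column = [0.0] * 64
column[10], column[11], column[30] = 1.0, 0.5, 0.8

# The same pattern moved up by 5 channels (a pure shift in log frequency).
shifted = [0.0] * 5 + column[:-5]

freqs = [0, 1, 2, 4, 8, 16]
original_mag = mellin_column(column, freqs)
shifted_mag = mellin_column(shifted, freqs)

# A vertical shift changes only the phase, so the magnitudes agree.
print(all(math.isclose(a, b, abs_tol=1e-12)
          for a, b in zip(original_mag, shifted_mag)))   # True
```

Taking the absolute value discards the phase term that a vertical translation introduces, which is why patterns that differ only by an upward shift in frequency give the same magnitudes here.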

【００８９】図２０〜図２３を参照して、母音'a'の寸
法−形状イメージのｈの整数の５ぐらいまでは、声帯パ
ルスの応答が４サイクル／周波数範囲以下の低い空間周
波数に活性度が見られる。ｈが２以上になると、ホルマ
ントが寸法−形状イメージ中の別々の帯に値が大きい所
として現れる。ｈが２から８に増えると最も良く整合す
る周波数が６から１８程度と、値が大きいところが出て
くる。ｈが８以上では、寸法−形状イメージでみると一
つしかホルマントがなく、それによってメリンイメージ
に幅広い帯状活性領域ができることがわかる。これが、
これらの４母音'a'のメリンイメージを示す図２０〜図
２３での、共通特性でもっとも特徴的である。７．日本語の５母音'a,i,u,e,o'の寸法−形状イメージ
とメリンイメージ寸法−形状イメージとメリンイメージとにおいて、異な
る母音がどのように表現されるか示すために、日本語５
母音の組を解析した。同一の声道モデルで同一の男性話
者であるが、異なる声道断面積関数（上記Yang and Kas
uya, 1995）を使って異なる５母音を合成した。すべ
て、計測通りの声道断面積・声道長を用いて、１００Ｈ
ｚの声帯パルスで駆動することにより合成した。５母
音'a, e, i,o, u'についてこの順番で、聴覚イメージを
図２８〜図３２、寸法−形状イメージを図３３〜図３
７、メリンイメージを図３８〜図４２に、それぞれ示
す。Referring to FIGS. 20 to 23, in the size-shape images of the vowel 'a', up to about h = 5 the response to the vocal cord pulse produces activity at low spatial frequencies of 4 cycles per frequency range or less. For h of 2 or more, the formants appear as separate bands of high values in the size-shape image. As h increases from 2 to 8, high values appear at best-matching spatial frequencies of about 6 to 18. For h of 8 or more, only one formant remains in the size-shape image, which produces a broad band of activity in the Mellin image. This is the most distinctive feature common to the Mellin images of these four 'a' vowels (FIGS. 24 to 27).

7. Size-Shape Images and Mellin Images of the Five Japanese Vowels 'a, i, u, e, o'

To show how different vowels are represented in the size-shape image and the Mellin image, a set of the five Japanese vowels was analyzed. The five vowels were synthesized with the same vocal tract model and for the same male speaker, but with different vocal tract cross-sectional area functions (Yang and Kasuya, 1995, cited above). All were synthesized with the vocal tract cross-sectional areas and vocal tract lengths as measured, driven by 100 Hz vocal cord pulses. For the five vowels 'a, e, i, o, u', in this order, the auditory images are shown in FIGS. 28 to 32, the size-shape images in FIGS. 33 to 37, and the Mellin images in FIGS. 38 to 42.

【００９０】聴覚イメージと寸法−形状イメージとを比
べると、時間間隔軸の対数変換が、ホルマントの強調の
仕方を変化させていることがわかる。たとえば、母音'
a'（図２８）においては、第２ホルマントの共振の継続
長が第４ホルマントに対して３倍くらい長くなってい
る。しかし、寸法−形状イメージ（図３３）においては
時間周波数積の軸ｈに対して第２ホルマントの共振の継
続長が第４ホルマントに対して同程度からやや短くなっ
ている。このような表現の変換がなければ、メリン変換
を周波数軸に対し直接取っても高次のホルマントの役割
はほとんど見えなくなるであろう。寸法−形状イメージ
におけるチャンネルの補正が、ウェーブレットインパル
ス応答と音源の性質による応答とを分けるのに有効に働
いている。A comparison of the auditory images with the size-shape images shows that the logarithmic transformation of the time-interval axis changes how the formants are emphasized. For example, in the auditory image of the vowel 'a' (FIG. 28), the resonance of the second formant lasts about three times as long as that of the fourth formant. In the size-shape image (FIG. 33), however, along the time-interval-frequency product axis h the resonance of the second formant is about as long as, or slightly shorter than, that of the fourth formant. Without this change of representation, the role of the higher formants would be almost invisible even if the Mellin transform were taken directly along the frequency axis. The per-channel correction in the size-shape image works effectively to separate the wavelet impulse response from the response due to the properties of the sound source.
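The change of axis can be illustrated with the time-interval-frequency product h, taken here to be the time interval multiplied by the channel's center frequency, that is, the interval counted in cycles of that channel. The frequencies and durations below are made-up values chosen only to mirror the "three times longer" example; they are not read from the figures.

```python
def to_h(center_freq_hz, interval_s):
    """Time-interval x frequency product: the interval expressed in cycles
    of the channel's center frequency."""
    return center_freq_hz * interval_s

# Hypothetical resonance durations on the time-interval axis (seconds).
second_formant = {"freq_hz": 1000.0, "duration_s": 0.006}
fourth_formant = {"freq_hz": 3500.0, "duration_s": 0.002}

ratio_time = second_formant["duration_s"] / fourth_formant["duration_s"]
h2 = to_h(second_formant["freq_hz"], second_formant["duration_s"])
h4 = to_h(fourth_formant["freq_hz"], fourth_formant["duration_s"])

print(round(ratio_time, 2))        # 3.0  (second formant lasts 3x longer in time)
print(round(h2, 2), round(h4, 2))  # 6.0 7.0  (comparable extent along h)
```

With these assumed numbers, a resonance that is three times longer in time ends up slightly shorter along h, which is the qualitative effect described for FIG. 28 and FIG. 33.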

【００９１】まず、前節で説明した'a'（図３３と図３
８）と'e'（図３４と図３９）との寸法−形状イメージ
とメリンイメージとを比較する。'e'（図３４）の寸法
−形状イメージの中の高次ホルマントは'a'のものより
も集まっていて、高いｈ値まで伸びている。これによ
り、'e'メリンイメージは'a'メリンイメージと異なり、
空間周波数ｃ／２πが低い４のあたりと１２〜１６あた
りで値が大きく、さらにｈの高い所までその値が伸びて
いる。First, the size-shape images and Mellin images of 'a', described in the previous section (FIGS. 33 and 38), and of 'e' (FIGS. 34 and 39) are compared. The higher formants in the size-shape image of 'e' (FIG. 34) are more closely grouped than those of 'a' and extend to higher h values. As a result, unlike the Mellin image of 'a', the Mellin image of 'e' has large values at a low spatial frequency c/2π of around 4 and at around 12 to 16, and these values extend to higher h.

【００９２】母音'i'（図３５と図４０）では、'e'と同
様高次ホルマントが群をなしているがさらに集中してい
る。これが、ｈの２〜６でのｃ／２πが８あたりの値の
大きい所を生じさせている。ｈが４以上ではｃ／２πが
１５〜２０くらいに活性領域が移動している。さら
に、'i'の寸法−形状イメージでの共振領域の伸びから
もわかるように、１５以上の高いｈの値まで幅広い帯状
領域が広がっている。In the vowel 'i' (FIGS. 35 and 40), the higher formants form a group as in 'e' but are even more concentrated. This produces large values at c/2π of around 8 for h from 2 to 6. For h of 4 or more, the active region moves to c/2π of about 15 to 20. Furthermore, as the extent of the resonance region in the size-shape image of 'i' suggests, a broad band extends to high h values of 15 or more.

【００９３】'o'の寸法−形状イメージ（図３６）で
は、第１・第２ホルマントの組と残りの３ホルマントの
組との間（１２００Ｈｚ〜２８００Ｈｚ程度）に大きな
周波数の隔たりがある。これにより、図４１の'o'のメ
リンイメージではｃ／２πが４以下の活性度はあまり大
きくない。第１ホルマントがある範囲、すなわち図３６
でｈが５までの範囲で、ｃ／２πが５〜８くらいの所で
第１と第２ホルマントの間隔を反映している活性度があ
るが、第１ホルマントが消えるとｃ／２πが１２〜２０
くらいでの高次ホルマントの間隔を反映する活性度が主
になる。継続して続く高次ホルマントの群はｈが高い所
での低い空間周波数の拡散した活性度に反映して、他の
母音との違いを示している。In the size-shape image of 'o' (FIG. 36), there is a large frequency gap (about 1200 Hz to 2800 Hz) between the pair formed by the first and second formants and the group of the remaining three formants. As a result, in the Mellin image of 'o' in FIG. 41, the activity at c/2π of 4 or less is not very large. Over the range where the first formant is present, that is, up to h = 5 in FIG. 36, there is activity at c/2π of about 5 to 8 that reflects the spacing between the first and second formants; once the first formant dies away, activity at c/2π of about 12 to 20, reflecting the spacing of the higher formants, becomes dominant. The persistent group of higher formants is reflected in diffuse low-spatial-frequency activity at high h, which distinguishes 'o' from the other vowels.

【００９４】母音'u'（図３７と図４２）は、他の母音
と比べ単純で、ホルマントの共振帯域幅が広いために、
寸法−形状イメージやメリンイメージでのｈの値の大き
い所まで活性度が伸びていない。これが、この母音の特
徴を表しているのであろうが、それゆえｈやｃ／２πが
大きい所での区別しやすい特徴を失っている。ｈが２〜
５の範囲ではｃ／２πが７あたりで強い活性度があり、
ｈが４〜５の範囲では１３くらいにある。帯状領域はｈ
が１０以上にほとんど存在せず、他の母音では'a'に近
い。The vowel 'u' (FIGS. 37 and 42) is simpler than the other vowels. Because its formant resonances have wide bandwidths, the activity does not extend to large h values in either the size-shape image or the Mellin image. This is probably itself characteristic of this vowel, but it means the vowel lacks easily distinguished features at large h and c/2π. For h in the range 2 to 5 there is strong activity at c/2π of around 7, and for h in the range 4 to 5 at around 13. There is almost no band of activity for h of 10 or more; among the other vowels, 'u' is closest to 'a'.

【００９５】このように、各々の母音のメリンイメージ
は特徴的に異なり、これらの相違からそれぞれの違いを
容易に抽出できる。８．音声認識装置前節までで、音源が同じ形状ではほぼ同じになり、異な
る場合は特徴的に異なるという、メリンイメージの優れ
た特徴を示してきた。このようなメリンイメージの情報
を用いると、優れた音声認識装置を実現できる。たとえ
ば、メリンイメージの縦軸方向または横軸方向に向かっ
て活性度を加えあわせると、それぞれ１次元ベクトルの
周辺分布が得られる。これらのベクトルの両方または片
方を一列に並べて１次元ベクトルとすれば、聴覚イメー
ジのある一時点における特徴を表わす特徴ベクトルとな
る。As described above, the Mellin images of the individual vowels differ in characteristic ways, and the vowels can readily be distinguished from these differences.

8. Speech Recognition Device

The preceding sections have shown the excellent property of the Mellin image: it is nearly the same for sound sources of the same shape and characteristically different for sources of different shape. Using this information, an excellent speech recognition device can be realized. For example, summing the activity of the Mellin image along the vertical axis or along the horizontal axis yields, in each case, a marginal distribution in the form of a one-dimensional vector. Concatenating both of these vectors, or using just one of them, gives a one-dimensional feature vector that represents the auditory image at a given instant.
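The marginal-distribution feature vector described above can be sketched as follows. This is a minimal illustration that assumes the Mellin image is held as a list of rows (spatial frequency on the vertical axis, h on the horizontal axis); the function name and the toy image are hypothetical.

```python
def marginal_feature_vector(mellin_image, use_rows=True, use_cols=True):
    """mellin_image[i][j]: activity at spatial frequency index i (vertical axis)
    and h index j (horizontal axis)."""
    features = []
    if use_rows:
        # Sum along the horizontal (h) axis: one value per spatial frequency.
        features += [sum(row) for row in mellin_image]
    if use_cols:
        # Sum along the vertical (c/2pi) axis: one value per h.
        features += [sum(col) for col in zip(*mellin_image)]
    return features

toy_image = [[1, 2, 3],
             [4, 5, 6]]
print(marginal_feature_vector(toy_image))                  # [6, 15, 5, 7, 9]
print(marginal_feature_vector(toy_image, use_cols=False))  # [6, 15]
```

The two flags correspond to the text's option of using both marginals or only one of them.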

【００９６】この特徴ベクトルを聴覚イメージのたとえ
ば一定間隔ごと（たとえば、５〜３０ｍｓ程度ごと）に
計算して、順次縦軸に並べてスペクトログラムの形式に
すれば、メリンイメージスペクトログラムとでも呼べる
表現が得られる。前述の寸法−形状イメージスペクトロ
グラムと合わせても、現在広く使われている音声認識回
路１９（図４）にそのまま入力できる。各々の周辺分布
は一時点の音源情報を代表するベクトルで、従来の振幅
スペクトルより豊かな情報量を持っている。これによ
り、従来より優れた音声認識結果２０を得られる。これ
が本発明の最大の長所である。第２の実施の形態図４３は、声道の大きさの違う大人・子供にかかわらず
応用できる、他言語の練習または障害からのリハビリテ
ーション用の発声練習装置に本発明を適用した実施の形
態の装置を示す。この装置は、入力される音声を電気信
号に変換するためのマイクロホン２９と、マイクロホン
２９の出力する電気信号を増幅するための増幅器３０
と、増幅器３０によって増幅された電気信号をアナログ
／デジタル変換するためのＡ−Ｄ変換器３１と、Ａ−Ｄ
変換器３１から出力されるデジタル信号を受けて音声信
号処理を行なうためのプログラムを実行する汎用コンピ
ュータ３２と、汎用コンピュータ３２の出力に基づいて
音韻、単語文字、特徴量を表示するための音韻・単語文
字・特徴量表示装置３３と、汎用コンピュータ３２の出
力するデジタルの音声信号をアナログ信号に変換するた
めのＤ−Ａ変換器３４と、Ｄ−Ａ変換器３４によってア
ナログ信号に変換された音声信号を増幅するための増幅
器３５と、増幅器３５から与えられる音声信号を音声に
変換するためのスピーカまたはヘッドホン３６とを含
む。If this feature vector is computed from the auditory image at regular intervals (for example, about every 5 to 30 ms) and each vector is placed along the vertical axis, in time order, in the form of a spectrogram, a representation that might be called a Mellin image spectrogram is obtained. This representation, alone or combined with the size-shape image spectrogram described earlier, can be input directly to the speech recognition circuit 19 (FIG. 4) of a kind widely used today. Each marginal distribution is a vector representing the sound source information at one instant and carries richer information than the conventional amplitude spectrum. As a result, a speech recognition result 20 better than that of the conventional art can be obtained. This is the greatest advantage of the present invention.

Second Embodiment

FIG. 43 shows a device according to an embodiment in which the present invention is applied to a speech training device for foreign-language practice or for rehabilitation from a disability, usable by adults and children alike despite their different vocal tract sizes. This device includes a microphone 29 for converting input speech into an electric signal; an amplifier 30 for amplifying the electric signal output from the microphone 29; an A/D converter 31 for analog-to-digital conversion of the electric signal amplified by the amplifier 30; a general-purpose computer 32 that receives the digital signal output from the A/D converter 31 and executes a program for speech signal processing; a phoneme/word-character/feature display device 33 for displaying phonemes, word characters, and feature quantities based on the output of the general-purpose computer 32; a D/A converter 34 for converting the digital speech signal output by the general-purpose computer 32 into an analog signal; an amplifier 35 for amplifying the speech signal converted into an analog signal by the D/A converter 34; and a speaker or headphones 36 for converting the speech signal supplied from the amplifier 35 into sound.

【００９７】マイクロホン２９の出力する、音声を表わ
す電気信号は増幅器３０およびＡ−Ｄ変換器３１を通っ
て汎用コンピュータ３２に入力される。汎用コンピュー
タ３２は、後述するような処理をこの電気信号に対して
行ない、その結果を表わす信号を音韻・単語文字・特徴
量表示装置３３およびＤ−Ａ変換器３４に与える。汎用
コンピュータ３２の出力は、音韻・単語文字・特徴量表
示装置３３により視覚的に提示され、また、Ｄ−Ａ変換
器３４・増幅器３５を通してスピーカまたはヘッドホン
３６によって聴覚的に提示される。The electric signal representing speech output from the microphone 29 is input to the general-purpose computer 32 through the amplifier 30 and the A/D converter 31. The general-purpose computer 32 processes this signal as described below and supplies signals representing the result to the phoneme/word-character/feature display device 33 and the D/A converter 34. The output of the general-purpose computer 32 is presented visually by the display device 33 and audibly by the speaker or headphones 36 via the D/A converter 34 and the amplifier 35.

【００９８】この汎用コンピュータでは、図４４のフロ
ーチャートに従った処理が行なわれる。まず、既に説明
した安定化ウェーブレット変換が行なわれる。その情報
を用いて、ピッチ周波数・寸法−形状イメージ・メリン
イメージが並列的に計算される。The general-purpose computer performs processing according to the flowchart of FIG. 44. First, the stabilized wavelet transform already described is performed. Using its output, the pitch frequency, the size-shape image, and the Mellin image are computed in parallel.

【００９９】寸法−形状イメージの計算では、話者の声
道長に関する情報が計算され、メリンイメージでは声道
長を正規化した表現が算出される。それらをあらかじめ
蓄積されている標準テンプレートと比較することによ
り、話者がしゃべった音韻や文字列を判断してそれを視
覚提示情報として出力したり、話者の声道長やピッチ情
報に合わせた合成音として聴覚提示情報として出力した
りする。In computing the size-shape image, information on the speaker's vocal tract length is obtained, and in the Mellin image a representation normalized for vocal tract length is obtained. By comparing these with standard templates stored in advance, the device determines the phonemes or character strings the speaker uttered and outputs them as visual presentation information, or outputs as auditory presentation information a synthesized sound matched to the speaker's vocal tract length and pitch.
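The comparison with stored standard templates is not specified in detail in the text. One simple possibility, shown here purely as an assumed sketch, is nearest-template matching under Euclidean distance; the template values, labels, and function names are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(feature, templates):
    """Return (label, distance) of the stored template closest to feature."""
    return min(((label, euclidean(feature, t)) for label, t in templates.items()),
               key=lambda pair: pair[1])

# Hypothetical size-normalized templates, one per phoneme.
templates = {"a": [0.9, 0.1, 0.0], "i": [0.1, 0.2, 0.9]}

label, dist = classify([0.8, 0.2, 0.1], templates)
print(label)   # a
```

Because the features are assumed to be already normalized for vocal tract length, a single template per phoneme is used here rather than separate templates for each speaker size.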

【０１００】発声練習装置として用いるために、練習問
題の生成等の教示情報からも視覚・聴覚提示ができるよ
うになっている。これにより、標準テンプレートを大人
でも子供でもすべての場合に用意する必要がないにもか
かわらず正確な音韻判断ができるので、効率的な練習の
ための装置として有効である。第３の実施の形態図４５は、大きさの違う青果・果物・食物の品質の自動
選別器に本発明を応用した実施の形態である。この自動
選別器は、選別の対象となる物体に対して音波を照射す
るためのスピーカ３７、増幅器３８およびＤ−Ａ変換器
３９と、選別する品物から戻ってくる音波を受信するた
めのマイクロホン４０と、マイクロホン４０の出力を増
幅するための増幅器４１と、増幅器４１の出力をデジタ
ル信号に変換するためのＡ−Ｄ変換器４２と、Ａ−Ｄ変
換器４２から与えられる信号に対して後述する処理を行
なうためのコンピュータ４３と、コンピュータ４３から
出力される制御信号にしたがって品物の選別を行なうた
めの品質等級分別装置４４と、コンピュータ４３の出力
する情報を表示するための表示装置４５と、コンピュー
タ４３の出力にしたがって警告を発するためのアラーム
装置４６とを含む。For use as a speech training device, visual and auditory presentation can also be driven by teaching information such as generated practice exercises. Because accurate phoneme decisions can be made without preparing separate standard templates for adults, children, and every other case, the device is effective for efficient practice.

Third Embodiment

FIG. 45 shows an embodiment in which the present invention is applied to an automatic sorter that grades the quality of vegetables, fruit, and other foods of different sizes. This automatic sorter includes a speaker 37, an amplifier 38, and a D/A converter 39 for directing sound waves at the object to be sorted; a microphone 40 for receiving the sound waves returned from the item; an amplifier 41 for amplifying the output of the microphone 40; an A/D converter 42 for converting the output of the amplifier 41 into a digital signal; a computer 43 for performing the processing described below on the signal supplied from the A/D converter 42; a quality grading device 44 for sorting items according to control signals output from the computer 43; a display device 45 for displaying information output by the computer 43; and an alarm device 46 for issuing a warning according to the output of the computer 43.

【０１０１】コンピュータ４３で行なわれる処理を図４
６に示す。コンピュータ４３はスピーカ３７から品物に
向けて発射される音声のための送信信号の生成を行な
い、Ｄ−Ａ変換器３９に与える。コンピュータ４３はさ
らに、出力信号の生成パラメータと、スピーカ３７から
発生された音声に応答して品物により反射され、マイク
ロホン４０、増幅器４１およびＡ−Ｄ変換器４２を介し
て電気信号に変換されてコンピュータ４３に与えられた
受信信号とに基づいて、安定化ウェーブレット変換、寸
法−形状イメージ、メリンイメージの計算を実行して、
品物の大きさに依存しない、品物の内部状態に関する表
現を得る。コンピュータ４３は、得られた表現と、あら
かじめ蓄積してある標準テンプレートとを比較すること
により、品物の品質等級を決定して、その決定結果を出
力する。出力と標準テンプレートとのずれが所定の値よ
りも大きい場合には、コンピュータ４３は品物に欠陥が
あると判断して表示装置４５およびアラーム装置４６に
よる診断結果の出力を行なう。The processing performed by the computer 43 is shown in FIG. 46. The computer 43 generates a transmission signal for the sound to be emitted from the speaker 37 toward the item and supplies it to the D/A converter 39. The computer 43 then computes the stabilized wavelet transform, the size-shape image, and the Mellin image from the generation parameters of the output signal and from the received signal, that is, the sound reflected by the item in response to the sound from the speaker 37, converted into an electric signal via the microphone 40, the amplifier 41, and the A/D converter 42, and supplied to the computer 43. This yields a representation of the internal state of the item that does not depend on its size. The computer 43 determines the quality grade of the item by comparing this representation with standard templates stored in advance and outputs the result. If the deviation between the output and the standard templates is larger than a predetermined value, the computer 43 judges the item to be defective and outputs the diagnosis through the display device 45 and the alarm device 46.

【０１０２】この実施の形態の装置により、ばらつきが
ある品物の大きさに依存せず、その内部状態だけに依存
した有効な選別ができるようになる。このシステムは、
上記のような品物だけではなく、身体の診断、鉄や金属
製品、陶磁器等の製品の欠陥判断にも適用できる。第４の実施の形態この第４の実施の形態の装置は、基本的には第３の実施
の形態と同じ構成を有し、コンピュータで計算されたイ
メージを表示するための表示装置４５（モニタ等）をさ
らに含む。この表示装置４５により、大きさを正規化し
た表現を視覚的に提示する手段が得られ、人間が対象物
の特性を直接判断できるようになる。また、欠陥判断を
してアラームを鳴らす装置４６を設ければ、装置の欠陥
を自動診断できるようになる。これにより第３の実施の
形態だけではない、ソナー信号の処理一般に本発明を応
用することができる。With the device of this embodiment, items can be sorted effectively on the basis of their internal state alone, independent of variation in their size. The system can be applied not only to items such as those above but also to diagnosis of the body and to defect detection in iron and other metal products, ceramics, and similar products.

Fourth Embodiment

The device of the fourth embodiment has basically the same configuration as that of the third embodiment and further includes a display device 45 (a monitor or the like) for displaying the images computed by the computer. The display device 45 provides a means of visually presenting the size-normalized representation, so that a person can judge the characteristics of the object directly. If a device 46 that makes a defect judgment and sounds an alarm is also provided, defects can be diagnosed automatically. In this way the present invention can be applied not only as in the third embodiment but to sonar signal processing in general.

【０１０３】本発明の応用としては、他にもさまざまな
ものが考えられる。たとえば、本発明によって対象物の
大きさに依存しない表現が得られるため、建築の分野に
おいては、コンサートホールのミニチュアモデルで計測
を行なえば、建設後のコンサートホールの音響特性を予
測できる。建築構造物自体の音波による老朽化診断も挙
げられる。また、水中でのソナー信号の解析への応用も
可能となる。第５の実施の形態図４７は、様々な大きさのエンジンの故障診断に本発明
を適用した第５の実施の形態である。自動車・船舶等の
エンジンに取り付けた振動センサかマイクロホン４７の
出力信号を増幅器４１、Ａ−Ｄ変換器４２を通してコン
ピュータ５０に入力する。コンピュータ５０によって欠
陥や故障の判断が行なわれその情報の表示装置５１、ア
ラーム装置５２、エンジンの制御装置５３が制御され
る。また直接イメージ出力する装置５４も付けられる。Various other applications of the present invention are conceivable. For example, because the invention yields a representation that does not depend on the size of the object, in architecture the acoustic characteristics of a concert hall could be predicted before construction from measurements made in a miniature model of the hall. Sound-based diagnosis of aging in building structures themselves is another example. Application to the analysis of underwater sonar signals is also possible.

Fifth Embodiment

FIG. 47 shows a fifth embodiment, in which the present invention is applied to failure diagnosis of engines of various sizes. The output signal of a vibration sensor or microphone 47 attached to the engine of an automobile, ship, or the like is input to a computer 50 through an amplifier 41 and an A/D converter 42. The computer 50 judges defects and failures and, according to that information, controls a display device 51, an alarm device 52, and an engine control device 53. A device 54 that outputs the images directly is also provided.

【０１０４】このコンピュータ５０では、図４８で示さ
れる処理が行なわれている。図４８を参照して、入力さ
れた準周期的な信号に基づいて、安定化ウェーブレット
変換が行なわれ、その結果から寸法−形状イメージ、お
よびメリンイメージが計算される。これらイメージと、
あらかじめ蓄積してある標準テンプレートとを比較する
ことにより、エンジンの状態を診断して結果を出力す
る。この時、結果として欠陥の有無という２値的な信号
が得らるようにすれば、この信号で欠陥・故障表示装置
やアラーム装置を制御することができる。これに対し、
標準パターンとの距離尺度をあらかじめ決定しておい
て、どれくらい類似しているかの距離を計算して連続量
として出力することもできる。この情報はエンジンの回
転等の異常の度合いを示すことになるのでエンジンの制
御装置を制御する信号として用いることができる。ま
た、直接イメージを出力すれば人間が視覚的に故障判断
を行なうこともできる。The computer 50 performs the processing shown in FIG. 48. Referring to FIG. 48, a stabilized wavelet transform is applied to the input quasi-periodic signal, and the size-shape image and the Mellin image are computed from the result. The state of the engine is diagnosed by comparing these images with standard templates stored in advance, and the result is output. If the result is a binary signal indicating the presence or absence of a defect, this signal can control the defect/failure display device and the alarm device. Alternatively, a distance measure to the standard pattern can be defined in advance, and the distance, indicating how similar the images are to the standard, can be computed and output as a continuous quantity. Because this value indicates the degree of abnormality in, for example, the rotation of the engine, it can be used as a signal for controlling the engine control device. In addition, if the images are output directly, a person can judge failures visually.
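The two output styles described above, a binary defect flag and a continuous distance, can be sketched together. The distance measure is left open in the text; Euclidean distance, the threshold value, and the reference vector below are assumptions made only for illustration.

```python
import math

def diagnose(feature, reference, defect_threshold):
    """Return (distance, is_defective): a continuous distance to the reference
    pattern and a binary flag obtained by thresholding that distance."""
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(feature, reference)))
    return distance, distance > defect_threshold

reference = [1.0, 0.5, 0.2]           # hypothetical healthy-engine template
healthy = diagnose([1.0, 0.5, 0.2], reference, defect_threshold=0.3)
faulty = diagnose([0.4, 0.9, 0.2], reference, defect_threshold=0.3)

print(healthy)      # (0.0, False)
print(faulty[1])    # True
```

In this sketch the flag would drive the display and alarm devices, while the distance itself could serve as the continuous signal passed to the engine control device.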

【０１０５】エンジンの形状は同じでも、排気量は目的
に応じて変わる。同じエンジンの族ではたとえその寸法
が異なっていても本発明を用いると同じ表現を用いるこ
とによりその状態を判断できる。したがって本発明によ
るエンジン状態の判断装置は、種々の大きさのエンジン
について、有効に共通の故障原因などを判定することが
できる。Engines of the same shape may differ in displacement depending on their purpose. With the present invention, the state of engines in the same family can be judged using the same representation even if their dimensions differ. The engine state determination device according to the invention can therefore identify common causes of failure across engines of various sizes.

【０１０６】さらには、建築物に取り付けたセンサから
の出力を用いれば、建築物の欠陥診断にも応用でき、地
震波の信号を用いれば、震源の大きさに依存しない共通
の特徴をみつけることができる。また、本発明によれ
ば、人工物であるか自然物であるか、またはどのような
物理系により測定された信号かにかかわらず、信号源か
らの信号であれば何を入力としてもよい。例えば、心臓
拍動音や脳波信号等の生体信号をピックアップすれば、
その身体や頭の大きさに依存しない表現が得られるの
で、良好な診断結果を出すこともできる。Furthermore, using the output of sensors attached to a building, the invention can be applied to diagnosing defects in buildings, and using seismic wave signals, common features that do not depend on the size of the seismic source can be found. According to the invention, any signal from a signal source may be used as input, whether the source is artificial or natural and whatever physical system the signal is measured with. For example, picking up biological signals such as heart sounds or brain waves gives a representation that does not depend on the size of the body or head, so good diagnostic results can also be obtained.

【０１０７】以上のようにこの発明による安定化ウェー
ブレット−メリン変換によれば、基本的に音源の物理的
な大きさに依存しない信号表現（例えば音声の場合、男
性・女性・子供によって異なる声道長を正規化した表
現）、または、時系列データの場合には自己相似性（フ
ラクタル性）を正規化した表現が得られる。すなわち、
大きな部分を構成する一部分がもとの大きな部分と共通
の構成を持っている事象については、大きな部分とそれ
を構成する小さな部分との双方について同じ表現が得ら
れるということである。これは従来の自己回帰モデルや
スペクトル分析では行ないづらかったことで、従来の時
系列データ処理の限界を超えうる信号処理が可能とな
る。また、この過程で正規化できない要素は逆に分離で
きるので音声であれば個人認証等に有効に活用できる。
このように音源の物理的大きさや自己相似性の正規化が
必要となる信号処理に広く利用できる。As described above, the stabilized wavelet-Mellin transform according to the present invention yields a signal representation that is essentially independent of the physical size of the sound source (for speech, for example, a representation normalized for the vocal tract length, which differs among men, women, and children) or, for time-series data, a representation normalized for self-similarity (fractality). That is, for a phenomenon in which a part has the same structure as the larger whole it belongs to, the same representation is obtained for both the whole and the part. This was difficult with conventional autoregressive models and spectral analysis, so signal processing beyond the limits of conventional time-series processing becomes possible. Moreover, components that cannot be normalized in this process can conversely be separated out, so for speech they can be used effectively for speaker authentication and the like. The invention can thus be used widely in signal processing that requires normalization of the physical size of a source or of self-similarity.
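The size independence summarized above rests on a standard property of the Mellin transform. The short derivation below uses the textbook definition, which may differ in notation and normalization from the equations in the appendix of this document.

```latex
M_f(s) = \int_0^\infty f(t)\, t^{s-1}\, dt, \qquad s = \sigma + i c .

\text{For a dilated signal } g(t) = f(a t),\ a > 0,\ \text{substitute } u = a t :

M_g(s) = \int_0^\infty f(a t)\, t^{s-1}\, dt
       = a^{-s} \int_0^\infty f(u)\, u^{s-1}\, du
       = a^{-s} M_f(s) .

\lvert M_g(\sigma + i c) \rvert = a^{-\sigma}\, \lvert M_f(\sigma + i c) \rvert .
```

A change of size therefore multiplies the magnitude by a factor that is the same for every c (and equal to 1 when σ = 0), leaving the dependence on c unchanged.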

【０１０８】今回開示された実施の形態はすべての点で
例示であって制限的なものではないと考えられるべきで
ある。本発明の範囲は上記した説明ではなくて特許請求
の範囲によって示され、特許請求の範囲と均等の意味お
よび範囲内でのすべての変更が含まれることが意図され
る。The embodiments disclosed herein are to be considered in all respects illustrative and not restrictive. The scope of the present invention is defined by the claims rather than by the description above, and is intended to include all modifications within the meaning and scope equivalent to the claims.

【０１０９】以下は説明中で引用した付録である。The following is an appendix cited in the description.

【０１１０】[0110]

【数１】 (Equation 1)

【０１１１】[0111]

【数２】 (Equation 2)

【０１１２】[0112]

【数３】 (Equation 3)

【０１１３】[0113]

【数４】 (Equation 4)

[Brief description of the drawings]

【図１】この発明の原理を説明する概略ブロック図で
ある。FIG. 1 is a schematic block diagram illustrating the principle of the present invention.

【図２】図１の安定化ウェーブレット処理部２のブロ
ック図である。FIG. 2 is a block diagram of the stabilized wavelet processing unit 2 of FIG. 1.

【図３】図１および図２に関連するフローチャートで
ある。FIG. 3 is a flowchart related to FIGS. 1 and 2.

【図４】この発明の第１の実施の形態の音声認識装置
の概略ブロック図である。FIG. 4 is a schematic block diagram of a voice recognition device according to the first embodiment of the present invention.

【図５】図４の事象検出（ピッチ検出）回路１５およ
び安定化聴覚イメージ処理部１６のブロック図である。FIG. 5 is a block diagram of the event detection (pitch detection) circuit 15 and the stabilized auditory image processing unit 16 of FIG. 4.

【図６】図４および図５に関連するフローチャートで
ある。FIG. 6 is a flowchart related to FIGS. 4 and 5.

【図７】クリック系列音の安定化聴覚イメージの例を
示す図である。FIG. 7 is a diagram illustrating an example of a stabilized auditory image of a click sequence sound.

【図８】図７からフィルタの遅れに相当する分だけ補
正した安定化聴覚イメージを示す図である。FIG. 8 is a diagram showing the stabilized auditory image of FIG. 7 corrected by an amount corresponding to the filter delay.

【図９】図８の横軸の時間間隔軸を対数変換して表示
した安定化聴覚イメージを示す図である。FIG. 9 is a diagram showing the stabilized auditory image of FIG. 8 with the time-interval axis (horizontal axis) logarithmically transformed.

【図１０】すべてのチャンネルでウェーブレットフィ
ルタのインパルス応答が縦方向にそろうように補正した
安定化聴覚イメージを示す図である。FIG. 10 is a diagram showing a stabilized auditory image in which the impulse responses of the wavelet filters in all the channels are corrected so as to be aligned in the vertical direction.

【図１１】図１０に示される安定化聴覚イメージを、
横軸の時間間隔周波数積ｈが線形軸となるように変換し
て表した図である。FIG. 11 is a diagram showing the stabilized auditory image of FIG. 10 transformed so that the time-interval-frequency product h on the horizontal axis is linear.

【図１２】クリック系列音のメリンイメージを示す図
である。FIG. 12 is a diagram showing the Mellin image of a click sequence sound.

【図１３】指数減衰正弦波の聴覚イメージを示す図で
ある。FIG. 13 is a diagram showing an auditory image of an exponentially attenuated sine wave.

【図１４】指数減衰正弦波の寸法−形状イメージを示
す図である。FIG. 14 is a diagram showing a size-shape image of an exponentially attenuated sine wave.

【図１５】指数減衰正弦波のメリンイメージを示す図
である。FIG. 15 is a diagram showing the Mellin image of an exponentially damped sine wave.

【図１６】測定した男性話者の声道断面積関数を用い
て声道モデルより合成した日本語母音'a'の聴覚イメー
ジ（声帯パルスの繰返し周波数１００Ｈｚ）を示す図で
ある。FIG. 16 is a diagram showing an auditory image (repetition frequency of vocal cord pulses of 100 Hz) of a Japanese vowel 'a' synthesized from a vocal tract model using the measured male speaker's vocal tract cross-sectional area function.

【図１７】図１６と同じ条件だが、声帯パルスの繰返
し周波数１６０Ｈｚで合成した、日本語母音'a'の聴覚
イメージを示す図である。FIG. 17 is a diagram showing an auditory image of a Japanese vowel 'a' synthesized under the same conditions as in FIG. 16 but at a repetition frequency of vocal cord pulses of 160 Hz.

【図１８】図１６の声道断面積関数に対して声道長を
２／３に縮小して、声道モデルより合成した日本語母
音'a'の聴覚イメージ（声帯パルスの繰返し周波数１０
０Ｈｚ）を示す図である。FIG. 18 is a diagram showing an auditory image of the Japanese vowel 'a' (vocal cord pulse repetition frequency 100 Hz) synthesized from the vocal tract model with the vocal tract length reduced to 2/3 for the vocal tract cross-sectional area function of FIG. 16.

【図１９】図１８と同じ条件だが、声帯パルスの繰返
し周波数１６０Ｈｚで合成した、日本語母音'a'の聴覚
イメージを示す図である。FIG. 19 is a diagram showing an auditory image of a Japanese vowel 'a' synthesized under the same conditions as in FIG. 18 but at a repetition frequency of a vocal cord pulse of 160 Hz.

【図２０】図１６に対する寸法−形状イメージを示す
図である。FIG. 20 is a diagram showing the size-shape image for FIG. 16.

【図２１】図１７に対する寸法−形状イメージを示す
図である。FIG. 21 is a diagram showing the size-shape image for FIG. 17.

【図２２】図１８に対する寸法−形状イメージを示す
図である。FIG. 22 is a diagram showing the size-shape image for FIG. 18.

【図２３】図１９に対する寸法−形状イメージを示す
図である。FIG. 23 is a diagram showing the size-shape image for FIG. 19.

【図２４】図１６に対するメリンイメージを示す図で
ある。FIG. 24 is a diagram showing the Mellin image for FIG. 16.

【図２５】図１７に対するメリンイメージを示す図で
ある。FIG. 25 is a diagram showing the Mellin image for FIG. 17.

【図２６】図１８に対するメリンイメージを示す図で
ある。FIG. 26 is a diagram showing the Mellin image for FIG. 18.

【図２７】図１９に対するメリンイメージを示す図で
ある。FIG. 27 is a diagram showing the Mellin image for FIG. 19.

【図２８】測定した声道断面積関数を用いて声道モデ
ルより合成した日本語母音'a'の聴覚イメージ（声帯パ
ルスの繰返し周波数１００Ｈｚ。）を示す、図１６と同
一の図である。FIG. 28 is the same diagram as FIG. 16, showing an auditory image (repetition frequency of vocal cord pulse of 100 Hz) of Japanese vowel 'a' synthesized from a vocal tract model using the measured vocal tract cross-sectional area function.

【図２９】図２８と同じ男性話者で測定した'e'の声
道断面積関数を用いて声道モデルより合成した日本語母
音'e'の聴覚イメージ（声帯パルスの繰返し周波数１０
０Ｈｚ）を示す図である。FIG. 29 is a diagram showing an auditory image of the Japanese vowel 'e' (vocal cord pulse repetition frequency 100 Hz) synthesized from the vocal tract model using the vocal tract cross-sectional area function of 'e' measured for the same male speaker as in FIG. 28.

【図３０】図２８と同じ男性話者で測定した'i'の声
道断面積関数を用いて声道モデルより合成した日本語母
音'i'の聴覚イメージ（声帯パルスの繰返し周波数１０
０Ｈｚ）を示す図である。FIG. 30 is a diagram showing an auditory image of the Japanese vowel 'i' (vocal cord pulse repetition frequency 100 Hz) synthesized from the vocal tract model using the vocal tract cross-sectional area function of 'i' measured for the same male speaker as in FIG. 28.

【図３１】図２８と同じ男性話者で測定した'o'の声
道断面積関数を用いて声道モデルより合成した日本語母
音'o'の聴覚イメージ（声帯パルスの繰返し周波数１０
０Ｈｚ）を示す図である。FIG. 31 is a diagram showing an auditory image of the Japanese vowel 'o' (vocal cord pulse repetition frequency 100 Hz) synthesized from the vocal tract model using the vocal tract cross-sectional area function of 'o' measured for the same male speaker as in FIG. 28.

【図３２】図２８と同じ男性話者で測定した'u'の声
道断面積関数を用いて声道モデルより合成した日本語母
音'u'の聴覚イメージ（声帯パルスの繰返し周波数１０
０Ｈｚ）を示す図である。FIG. 32 is a diagram showing an auditory image of the Japanese vowel 'u' (vocal cord pulse repetition frequency 100 Hz) synthesized from the vocal tract model using the vocal tract cross-sectional area function of 'u' measured for the same male speaker as in FIG. 28.

【図３３】図２８に対する寸法−形状イメージを示す
図である。FIG. 33 is a diagram showing a size-shape image for FIG. 28.

【図３４】図２９に対する寸法−形状イメージを示す
図である。FIG. 34 is a diagram showing a size-shape image for FIG. 29.

【図３５】図３０に対する寸法−形状イメージを示す
図である。FIG. 35 is a diagram showing a size-shape image for FIG. 30.

【図３６】図３１に対する寸法−形状イメージを示す
図である。FIG. 36 is a diagram showing a size-shape image for FIG. 31.

【図３７】図３２に対する寸法−形状イメージを示す
図である。FIG. 37 is a diagram showing a size-shape image for FIG. 32.

【図３８】図２８に対するメリンイメージを示す図で
ある。FIG. 38 is a diagram showing a Mellin image for FIG. 28.

【図３９】図２９に対するメリンイメージを示す図で
ある。FIG. 39 is a diagram showing a Mellin image for FIG. 29.

【図４０】図３０に対するメリンイメージを示す図で
ある。FIG. 40 is a diagram showing a Mellin image for FIG. 30.

【図４１】図３１に対するメリンイメージを示す図で
ある。FIG. 41 is a diagram showing a Mellin image for FIG. 31.

【図４２】図３２に対するメリンイメージを示す図で
ある。FIG. 42 is a diagram showing a Mellin image for FIG. 32.

【図４３】第２の実施の形態の発声練習装置のブロッ
ク図である。FIG. 43 is a block diagram of a speech training device according to a second embodiment.

【図４４】第２の実施の形態の汎用コンピュータが行
なっている処理のフローチャートである。FIG. 44 is a flowchart of a process performed by the general-purpose computer according to the second embodiment.

【図４５】第３の実施の形態の品物品質等級分別装置
および第４の実施の形態のソナーシステムのブロック図
である。FIG. 45 is a block diagram of an article quality classifying apparatus according to a third embodiment and a sonar system according to a fourth embodiment.

【図４６】第３の実施の形態・第４の実施の形態の
コンピュータが行なっている処理のフローチャートであ
る。FIG. 46 is a flowchart of a process performed by a computer according to the third embodiment and the fourth embodiment.

【図４７】第５の実施の形態のエンジン故障診断装置
のブロック図である。FIG. 47 is a block diagram of an engine failure diagnosis device according to a fifth embodiment.

【図４８】第５の実施の形態のコンピュータが行なっ
ている処理のフローチャートである。FIG. 48 is a flowchart of a process performed by a computer according to the fifth embodiment.

[Explanation of symbols]

２安定化ウェーブレット変換処理部、３メリン変換
処理部、４信号処理部、７ウェーブレット変換部、
８振幅圧縮部、９事象検出処理部、１０時間間隔安
定化処理部、１３聴覚フィルタバンク、１４聴神経
発火パターン変換部、１５事象検出回路、１６安定
化聴覚イメージ処理部、１７寸法−形状イメージ処理
部、１８メリンイメージ処理部、１９音声認識回
路、２２フィルタ遅れ補正部、２５聴覚図形抽出
部、２６対数時間間隔表現への変換部、２７インパ
ルス応答分補正部。2 stabilized wavelet transform processing section, 3 Mellin transform processing section, 4 signal processing section, 7 wavelet transform section, 8 amplitude compression section, 9 event detection processing section, 10 time-interval stabilization processing section, 13 auditory filter bank, 14 auditory nerve firing pattern conversion section, 15 event detection circuit, 16 stabilized auditory image processing section, 17 size-shape image processing section, 18 Mellin image processing section, 19 speech recognition circuit, 22 filter delay correction section, 25 auditory figure extraction section, 26 section for conversion to logarithmic time-interval representation, 27 impulse response correction section.

フロントページの続き (72)発明者 入野俊夫 京都府相楽郡精華町大字乾谷小字三平谷５番地 株式会社エイ・ティ・アール人間情報通信研究所内 (72)発明者 ロイ・ディ・パターソン イギリス、ダブリュ・１・エヌ４・エイ・エル ロンドン、パーク・クレセント、20 メディカル・リサーチ・カウンシル内
Continued from the front page: (72) Inventor Toshio Irino, c/o ATR Human Information Processing Research Laboratories, 5 Sanpeidani, Inuidani, Seika-cho, Soraku-gun, Kyoto, Japan. (72) Inventor Roy D. Patterson, c/o Medical Research Council, 20 Park Crescent, London W1N 4AL, United Kingdom.

Claims

[Claims]

1. A signal processing method comprising: a wavelet transform step in which a computer performs a wavelet transform on an input signal; and a characteristic extraction step in which a computer extracts a characteristic of the signal by synchronizing the output of the wavelet transform step with the input signal and performing a Mellin transform on it.
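As an illustration of why a Mellin transform suits size-independent feature extraction, the following minimal sketch checks numerically that dilating a signal in time changes the magnitude of its Mellin transform only by a constant factor. It is not the claimed implementation: the test pulse `pulse`, the dilation factor `DILATION`, the real part `SIGMA` of the transform variable, and the integration grid are all assumptions chosen for illustration, and a change of source size is assumed to act as a pure time dilation.

```python
import cmath
import math


def mellin(f, s, u_min=-20.0, u_max=6.0, n=5200):
    """Approximate M(s) = integral over t > 0 of f(t) * t**(s - 1) dt.

    Substituting t = exp(u) gives the integral of f(exp(u)) * exp(s * u) du,
    which is evaluated here with the trapezoid rule on a uniform grid in u.
    """
    du = (u_max - u_min) / n
    total = 0j
    for k in range(n + 1):
        u = u_min + k * du
        weight = 0.5 if k in (0, n) else 1.0
        total += weight * f(math.exp(u)) * cmath.exp(s * u)
    return total * du


SIGMA = 0.5     # real part of s (illustrative assumption)
DILATION = 2.0  # time-dilation factor standing in for a change of size


def pulse(t):
    """Hypothetical test signal with a known Mellin transform, Gamma(s + 1)."""
    return t * math.exp(-t)


def dilated_pulse(t):
    """The same signal compressed in time by DILATION."""
    return pulse(DILATION * t)


# For f(a * t) the Mellin transform is a**(-s) * M(s), so the magnitude ratio
# should equal a**(-SIGMA) regardless of the imaginary part c of s.
ratios = {}
for c in (0.0, 1.0, 3.0):
    s = complex(SIGMA, c)
    ratios[c] = abs(mellin(dilated_pulse, s)) / abs(mellin(pulse, s))
    print(f"c = {c}: |M_dilated| / |M| = {ratios[c]:.6f}")
```

Because the ratio does not depend on `c`, the normalized magnitude profile over `c` is the same for both signals, which is the property a size-independent feature needs.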

2. The signal processing method according to claim 1, wherein the characteristic extraction step comprises: a step of temporally stabilizing, by signal synchronization and while maintaining the fine structure of the response waveform, the representation corresponding to the running spectrum obtained in the wavelet transform step, and converting it into a time interval-logarithmic frequency representation; and a step of performing processing corresponding to the Mellin transform along lines on which the product or ratio of the time interval and the frequency is constant in the time interval-logarithmic frequency representation.

3. A signal processing method comprising: a step of obtaining, from a stabilized time interval-logarithmic frequency representation of an input signal whose origin is specified, a logarithmic time interval-logarithmic frequency representation in which the time-interval axis is logarithmically transformed; and a step of using a computer to further convert the logarithmic time interval-logarithmic frequency representation into a new representation having the product of the time interval and the frequency on the horizontal axis and the logarithmic frequency on the vertical axis, and extracting a characteristic of the vibration by applying an integral transform to that representation in the vertical-axis or horizontal-axis direction.
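The change of axes described in this claim can be sketched as follows. Since log(time interval × frequency) = log(time interval) + log(frequency), lines of constant product are diagonals in the logarithmic time interval-logarithmic frequency plane, and the new representation straightens them into vertical columns. The toy functions `kernel`, `resonance`, and `auditory_figure`, the channel spacing, and the assumption that enlarging the source lowers every resonance by the same factor are all hypothetical and serve only to illustrate the remapping.

```python
import math
from functools import partial

CHANNELS_PER_OCTAVE = 8  # illustrative channel density on the log-frequency axis


def kernel(h):
    """Toy wavelet kernel; depends only on h = time interval * frequency."""
    return h * math.exp(-h) * math.cos(2.0 * math.pi * h)


def resonance(freq_hz):
    """Toy resonance profile: a single peak at 1 kHz on a log-frequency axis."""
    return math.exp(-0.5 * (math.log2(freq_hz / 1000.0) / 0.25) ** 2)


def auditory_figure(interval_s, freq_hz, size=1.0):
    """Toy stabilized representation A(tau, f) for a source enlarged by `size`.

    Assumption: enlarging the source divides every resonance frequency by `size`.
    """
    return resonance(size * freq_hz) * kernel(interval_s * freq_hz)


def size_shape_image(figure, h_values, freqs_hz):
    """Resample A(tau, f) at tau = h / f; rows are channels, columns are h."""
    return [[figure(h / f, f) for h in h_values] for f in freqs_hz]


# Log-spaced channels from 250 Hz to 4 kHz and a grid of h values.
freqs_hz = [250.0 * 2.0 ** (k / CHANNELS_PER_OCTAVE) for k in range(33)]
h_values = [0.25 * k for k in range(1, 21)]

reference = size_shape_image(auditory_figure, h_values, freqs_hz)
doubled = size_shape_image(partial(auditory_figure, size=2.0), h_values, freqs_hz)

# Doubling the size should move the whole pattern down by exactly one octave
# (CHANNELS_PER_OCTAVE rows) without changing its shape along h.
shift = CHANNELS_PER_OCTAVE
max_diff = max(
    abs(doubled[r][c] - reference[r + shift][c])
    for r in range(len(freqs_hz) - shift)
    for c in range(len(h_values))
)
print(max_diff)
```

In this toy setting a size change appears purely as a displacement along the logarithmic frequency axis, which is what makes a subsequent integral transform along that axis a natural choice.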

4. The signal processing method according to claim 3, further comprising a step of expressing the representation space obtained by the integral transform as a time series of representation vectors at given points in time.

5. The signal processing method according to any one of the preceding claims, further comprising a step of supplying, to the Mellin transform step, an output obtained by applying a frequency analysis that takes auditory characteristics into account to a signal converted into a format that can be processed by a computer.

6. A signal processing device comprising: wavelet transform means for performing a wavelet transform on an input signal converted into a predetermined format that can be processed by a computer; and characteristic extraction means for extracting a characteristic of the signal by synchronizing the output of the wavelet transform means with the input signal and performing a Mellin transform on it.

7. The signal processing device according to claim 6, wherein the characteristic extraction means comprises: means for temporally stabilizing, by signal synchronization and while maintaining the fine structure of the response waveform, the representation corresponding to the running spectrum obtained by the wavelet transform means, and converting it into a time interval-logarithmic frequency representation; and means for performing processing equivalent to the Mellin transform along lines on which the product or ratio of the time interval and the frequency is constant in the time interval-logarithmic frequency representation.

8. A signal processing device comprising: means for obtaining, from a stabilized time interval-logarithmic frequency representation, with a specified origin, of an input signal converted into a format that can be processed by a computer, a logarithmic time interval-logarithmic frequency representation in which the time-interval axis is logarithmically transformed; and means for further converting the logarithmic time interval-logarithmic frequency representation into a new representation having the product of the time interval and the frequency on the horizontal axis and the logarithmic frequency on the vertical axis, and extracting a characteristic of the vibration by applying an integral transform to that representation in the vertical-axis or horizontal-axis direction.

9. The signal processing device according to claim 8, further comprising means for expressing the representation space obtained by the integral transform as a time series of representation vectors at given points in time.

10. The signal processing device according to any one of the preceding claims, further comprising means for supplying, to the Mellin transform, an output obtained by applying a frequency analysis that takes auditory characteristics into account to a signal converted into a format that can be processed by a computer.

11. A signal processing device comprising: a wavelet filter bank comprising a plurality of wavelet filters, each connected to receive an input signal and each performing a transform with a wavelet that has the same wavelet kernel function but a different frequency; auditory figure extraction means for extracting an auditory figure from the output of the wavelet filter bank; and size-shape image generation means for generating a size-shape image of the input signal from the auditory figure extracted by the auditory figure extraction means.

12. The signal processing device according to claim 11, further comprising Mellin image generation means for generating a Mellin image by performing a Fourier transform on the size-shape image along the impulse response line of each of the wavelet filters.
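The reason a Fourier transform along such a line yields a size-independent pattern can be sketched as follows: if a change of size only displaces a profile along the logarithmic frequency axis, the magnitude of its Fourier transform along that axis stays the same. The Gaussian profile `resonance_column`, the 64-channel axis, and the 8-channel displacement are hypothetical choices for illustration; the profile is assumed to decay to effectively zero at both ends of the axis so that a plain displacement behaves like a circular one.

```python
import cmath
import math


def dft_magnitude(column):
    """Magnitude of the discrete Fourier transform of a real sequence."""
    n = len(column)
    return [
        abs(sum(v * cmath.exp(-2j * math.pi * k * m / n) for m, v in enumerate(column)))
        for k in range(n)
    ]


def resonance_column(center_channel, n_channels=64, width=2.0):
    """Toy profile along the log-frequency axis at one fixed value of h."""
    return [
        math.exp(-0.5 * ((m - center_channel) / width) ** 2)
        for m in range(n_channels)
    ]


# The same profile, displaced by 8 channels to mimic a change of source size.
reference = resonance_column(20)
scaled = resonance_column(28)

mag_reference = dft_magnitude(reference)
mag_scaled = dft_magnitude(scaled)

# The displacement changes only the phase of each Fourier coefficient.
max_diff = max(abs(a - b) for a, b in zip(mag_reference, mag_scaled))
print(max_diff)
```

Applying `dft_magnitude` column by column to a full size-shape image would give one such magnitude profile per value of h; how the patent normalizes or weights these columns is not specified in this section and is left out of the sketch.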

13. The signal processing device according to claim 12, wherein the auditory figure extraction means comprises: time strobe integration means for detecting a periodicity contained in the output of the wavelet filter bank and performing time strobe integration on the output of each channel of the wavelet filter bank, thereby generating a stabilized auditory image; and stabilized auditory image extraction means for extracting, as the auditory figure, a predetermined one cycle of the stabilized auditory image obtained by the time strobe integration, based on the periodicity detected by the time strobe integration means.
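A minimal single-channel sketch of this kind of processing is given below. It is a simplification under stated assumptions, not the patented circuit: strobes are taken to be local maxima above a fixed threshold, integration is a plain average of the segments that follow each strobe, and the names `find_strobes`, `strobed_integration`, `cycle`, and `PERIOD` are hypothetical.

```python
import math


def find_strobes(channel, threshold):
    """Indices of local maxima above `threshold` (a simple stand-in for event detection)."""
    return [
        i
        for i in range(1, len(channel) - 1)
        if channel[i] > threshold
        and channel[i] >= channel[i - 1]
        and channel[i] > channel[i + 1]
    ]


def strobed_integration(channel, strobes, n_intervals):
    """Average the segments starting at each strobe, indexed by time interval since the strobe."""
    image = [0.0] * n_intervals
    used = 0
    for s in strobes:
        if s + n_intervals <= len(channel):
            for tau in range(n_intervals):
                image[tau] += channel[s + tau]
            used += 1
    return [v / used for v in image] if used else image


PERIOD = 40  # samples per excitation cycle in this toy signal


def cycle(phase):
    """One cycle of a decaying, ringing channel response."""
    return math.exp(-phase / 8.0) * math.cos(2.0 * math.pi * phase / 10.0)


# A periodic channel output: the same response restarts every PERIOD samples.
channel = [cycle(i % PERIOD) for i in range(400)]

strobes = find_strobes(channel, threshold=0.5)
period_estimate = strobes[1] - strobes[0]  # detected periodicity, in samples
image = strobed_integration(channel, strobes, n_intervals=2 * PERIOD)

# Extract single cycles of the stabilized image as candidate auditory figures.
first_cycle = image[:period_estimate]
second_cycle = image[period_estimate:2 * period_estimate]
print(period_estimate, len(first_cycle), len(second_cycle))
```

Aligning every segment to a strobe is what keeps the fine structure of the response intact while the average removes the dependence on absolute time; either extracted cycle can then serve as the figure passed to later stages.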

14. The signal processing device according to claim 13, wherein the stabilized auditory image extraction means comprises means for extracting the first cycle of the stabilized auditory image as the auditory figure.

15. The signal processing device according to claim 13, wherein the stabilized auditory image extraction means comprises means for extracting the second cycle of the stabilized auditory image as the auditory figure.

16. The signal processing device according to claim 11, further comprising auditory nerve firing pattern conversion means for converting the output of the wavelet filter bank into an output resembling the neural activity of the auditory nerve and supplying it to the auditory figure extraction means.