JP2007068847A

JP2007068847A - Glottal closure region detecting apparatus and method

Info

Publication number: JP2007068847A
Application number: JP2005261008A
Authority: JP
Inventors: Tatsuya Kitamura; Hironori Takemoto; Seiji Adachi; Parham Mokhtari; Kiyoshi Honda
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2005-09-08
Filing date: 2005-09-08
Publication date: 2007-03-22
Anticipated expiration: 2025-09-08
Also published as: JP4568826B2

Abstract

PROBLEM TO BE SOLVED: To provide a glottal closure interval detecting apparatus that detects the glottal closure interval during natural phonation with a simple configuration.

SOLUTION: A switcher 202 receives the analog voice signal directly from a microphone 132 and also receives the signal obtained after that voice signal has passed through a variable-passband BPF (band-pass filter) 200. The output of the switcher 202 is converted into a digital signal by an A/D converter 204 and then stored in a buffer memory section 206. A CPU 120 detects the fourth-formant frequency region from the frequency spectrum supplied by a frequency analysis part 208 and, based on that detection, controls the passband of the BPF 200 so that it passes the fourth-formant frequency region.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a glottal closure interval detection device and a glottal closure interval detection method capable of detecting, from a voice signal, the intervals in which the glottis was closed while that voice was being produced.

The human vocal folds vibrate 100 or more times per second during phonation, and sometimes as many as 1,000 times. The individual vibrations therefore cannot be observed directly with the naked eye.

Methods for visualizing vocal fold vibration can be broadly divided into direct methods, which use instruments to visualize an image of the larynx, and indirect methods, which detect and record only the opening and closing movement of the glottis. Direct methods include high-speed laryngeal cinematography, laryngeal stroboscopy, photokymography, and solid-state image sensor methods. Indirect methods include photoelectric glottography, electroglottography (EGG), and ultrasonic glottography.

Of these, EGG places electrodes on the skin over the left and right thyroid cartilage plates, passes a high-frequency current between them, and detects and records the change in electrical impedance caused by the opening and closing of the glottis.

According to Non-Patent Document 1, as shown in FIG. 7, the movement of the vocal fold mucosa during phonation is by no means a simple lateral opening and closing movement; it is a three-dimensional movement accompanied by a wave in the vertical direction. In FIG. 7, 1 to 3 indicate the opening phase of the glottis, 3 to 7 the closing phase, and 7 to 10 the closed phase.

However, when considering phonation at the larynx, what matters most is the change in glottal area in the plane perpendicular to the airflow. In observing vocal fold vibration, the most important task is therefore to capture the glottal area waveform (the glottal area plotted as a function of time).

FIG. 8 shows such a glottal area waveform. In the glottal area waveform, the three phases described above (opening, closing, and closed) are distinguished in each vibration cycle. The time required for one vibration is called the fundamental period, and the number of vibrations per unit time is called the fundamental frequency. The fundamental period of speech coincides with the fundamental period of vocal fold vibration, so the fundamental frequency of speech equals the fundamental frequency of vocal fold vibration.

Meanwhile, in speech transmission and recognition it is extremely important to estimate the vocal tract transfer characteristic accurately, and linear prediction has conventionally been used as one method for that estimation. To obtain an accurate vocal tract transfer characteristic with ordinary linear prediction, however, the excitation source must be a single impulse or white noise. In reality this assumption does not hold, and the excitation source affects the estimated formant frequencies.

Several methods have been proposed to reduce the influence of the excitation source. One shortens the analysis window to one pitch period or less, estimates only the glottal closure period (that is, the free-oscillation interval), and analyzes that interval alone (see, for example, Non-Patent Document 2). Another is the "two-stage sample-selective linear prediction method" (see, for example, Non-Patent Document 3). In sample-selective linear prediction, which refers to residual information to select the speech samples that fit the linear prediction model, this method performs the selection with the global characteristics of the prediction error taken into account and does so in two stages, so that speech samples from the glottal open period are excluded from the samples to be predicted.

Non-Patent Document 1: Japan Society of Logopedics and Phoniatrics (ed.), "Koe no Kensaho" (Voice Examination Methods), 2nd ed., Ishiyaku Publishers, Inc., pp. 97-99, 1994.
Non-Patent Document 2: K. Steiglitz and B. Dickinson, "The use of time-domain selection for improved linear prediction", IEEE Trans. Acoust., Speech, Signal Process., ASSP-25, pp. 34-39 (1977).
Non-Patent Document 3: Yoshiaki Miyoshi, Kazuharu Yamato, Masuzo Yanagida, Osamu Kakusho, "Analysis of high-pitched speech by the two-stage sample-selective linear prediction method", IEICE Transactions A, Vol. J70-A, No. 8, pp. 1146-1156, August 1987.

However, with the former it is generally difficult to estimate the glottal closure interval of natural speech accurately. The latter excludes from the samples to be predicted those whose residual has an absolute value at or above a threshold, but that selection is not checked against observations of the actual state of the glottis.

If the glottal closure interval could be detected easily, it could be expected to serve as an index of clear phonation in voice training, and in speech-language therapy to support the diagnosis and rehabilitation of voices with glottal insufficiency. However, the methods of observing vocal fold vibration described above are either unsuited to observation during natural phonation or place a physical or psychological burden on the subject during measurement, which makes them unsuitable for uses such as rehabilitation.

The present invention has been made to solve the problems described above. Its object is to provide a glottal closure interval detection device and a glottal closure interval detection method that can detect the glottal closure interval during natural phonation with a simple configuration.

To achieve this object, according to one aspect of the present invention, a glottal closure interval detection device comprises: band extraction means for selectively extracting, from an input voice signal, the voice signal in the frequency band corresponding to the laryngeal cavity resonance; and computing means for determining the glottal closure interval based on the intensity of the extracted voice signal.

Preferably, the band extraction means includes band-pass filter means whose passband is variable. The glottal closure interval detection device further comprises passband setting means that performs frequency analysis of the input voice signal, identifies the frequency band corresponding to the laryngeal cavity resonance, and sets it as the passband of the band-pass filter means. The computing means determines that an interval of the voice signal is a glottal closure interval when the intensity of the extracted voice signal in that interval exceeds a set threshold.

Preferably, the passband setting means determines, based on the frequency spectrum of the voice signal, that the fourth formant is the frequency corresponding to the laryngeal cavity resonance.

According to another aspect of the present invention, a glottal closure interval detection method comprises the steps of: converting a subject's voice into a voice signal and selectively extracting the voice signal in the frequency band corresponding to the laryngeal cavity resonance; and determining the glottal closure interval based on the intensity of the extracted voice signal.

Preferably, the extracting step includes the steps of performing frequency analysis of the input voice signal to identify the frequency band corresponding to the laryngeal cavity resonance, and setting the identified frequency band as the passband of band-pass filter means. The determining step includes the step of determining that an interval of the voice signal is a glottal closure interval when the intensity of the output of the band-pass filter means in that interval exceeds a set threshold.

With the glottal closure interval detection device and the glottal closure interval detection method according to the present invention, the glottal closure interval can be detected with a simple device configuration, without any special equipment.

Moreover, with the glottal closure interval detection device and the glottal closure interval detection method according to the present invention, the glottal closure interval can be detected while the subject is phonating naturally, regardless of what is being uttered.

Embodiments of the present invention will be described below with reference to the drawings.
[Hardware configuration]
FIG. 1 is a block diagram showing an example of a glottal closure interval detection device 100 to which the glottal closure interval detection method of the present invention is applied.

Referring to FIG. 1, the glottal closure interval detection device 100 is basically configured by providing a personal computer with a voice processing interface.

Specifically, the glottal closure interval detection device 100 includes: a computer main body 102 equipped with an optical disc drive 108 for reading information from an optical disc such as a CD-ROM (Compact Disc Read-Only Memory) 118 and an FD drive 106 for reading and writing information on a flexible disk (hereinafter, FD) 116; a monitor 104 connected to the computer main body 102 as a display device; a keyboard 110 and a mouse 112 likewise connected to the computer main body 102 as input devices; a microphone 132 as a voice input device; and a speaker 134 as a voice output device.

In addition to the optical disc drive 108 and the FD drive 106, the computer main body 102 includes the following, each connected to a bus BS: a CPU (Central Processing Unit) 120 serving as the arithmetic processing unit; a memory 122 including ROM (Read Only Memory) and RAM (Random Access Memory); a direct-access storage device such as a hard disk 124; and a voice processing interface unit 128 for exchanging data with the microphone 132 or the speaker 134.

The CD-ROM 118 may be replaced by any other medium capable of recording information such as the program to be installed on the computer main body, for example a DVD-ROM (Digital Versatile Disc) or a memory card. In that case, the computer main body 102 is provided with a drive device capable of reading that medium.

The main part of the glottal closure interval detection device of the present invention consists of the computer hardware and the software, executed by the CPU 120, that controls the device. Such software is generally distributed on a storage medium such as the CD-ROM 118 or the FD 116, read from the medium by the optical disc drive 108, the FD drive 106, or the like, and first stored on the hard disk 124. Alternatively, when the device is connected to a network 310, the software is first copied to the hard disk 124 from a server on the network. It is then read from the hard disk 124 into the RAM in the memory 122 and executed by the CPU 120. When the device is connected to a network, the software may also be loaded directly into the RAM and executed without being stored on the hard disk 124.

The computer hardware shown in FIG. 1 and its operating principle are conventional. The most essential part of the present invention is therefore the software stored on a storage medium such as the FD 116, the CD-ROM 118, or the hard disk 124.

As a general trend, various program modules are provided as part of a computer's operating system, and an application program proceeds by calling these modules in a predetermined order when they are needed. In that case the software that implements the glottal closure interval detection device does not itself contain those modules, and the device is realized only when the software works together with the operating system on the computer. As long as a common platform is used, however, there is no need to distribute software that includes those modules. The software without the modules, the recording medium on which that software is recorded, and the data signal by which the software is distributed over a network can each be regarded as constituting an embodiment.

FIG. 2 is a functional block diagram illustrating the configuration of the voice processing interface unit 128 shown in FIG. 1 in more detail. FIG. 2 shows only the part concerned with input processing of the voice signal from the microphone 132.

Referring to FIG. 2, the switcher 202 receives the analog voice signal directly from the microphone 132 and also receives the signal obtained after the voice signal from the microphone 132 has passed through a band-pass filter (hereinafter, BPF) 200 with a variable passband. The CPU 120 controls both the passband of the BPF 200 and which of the two signals the switcher 202 selects.

The output of the switcher 202 is converted into a digital signal by the A/D converter 204 and then stored in the buffer memory unit 206. The frequency analysis unit 208 computes the frequency spectrum of the voice signal stored in the buffer memory unit 206 and outputs it to the CPU 120.

Based on the frequency spectrum, the CPU 120 detects the frequency region of the formant that corresponds to the laryngeal cavity resonance (the fourth formant), as described later, and controls the passband of the BPF 200 so that it passes the fourth-formant region. The results of the processing by the CPU 120 are displayed, for example, on the display device 104.

FIG. 2 has been described with an analog variable-band filter as the BPF 200. It is also possible, however, to place the BPF 200 and the switcher 202 after the A/D converter 204 and use a digital variable-band filter. Alternatively, the A/D converter 204 may convert the voice signal from the microphone 132 into a digital signal and store it directly in the buffer memory unit 206, with the CPU 120 performing the digital filtering as arithmetic processing on the voice signal data in the buffer memory unit 206.

The buffer memory unit 206 need not be provided inside the voice processing interface unit 128; for example, the memory 122 or the hard disk 124 may be used as the buffer memory. Likewise, the frequency analysis unit 208 need not be provided inside the voice processing interface unit 128; equivalent processing can be performed by computation on the CPU 120, such as a Fourier transform.

FIG. 3 shows the effect of the glottal opening area Ag on the vocal tract transfer characteristic, as computed with an equivalent circuit model.

In FIG. 3, the glottal opening area Ag is varied in three steps: 0.0 cm² (glottal closure), 0.1 cm², and 0.2 cm². The figure shows that opening the glottis makes the formant at 3.1 kHz (the fourth peak from the low-frequency side, that is, the fourth formant) disappear. This formant coincides with the formant produced by the laryngeal cavity (the laryngeal cavity resonance).

The laryngeal cavity resonance is therefore predicted to appear during the closed interval of the glottis and to disappear during the open interval. Accordingly, it should be possible to extract the glottal closure interval from speech by detecting this variation of the laryngeal cavity resonance within each pitch period.

FIG. 4 is a flowchart illustrating the operation of the glottal closure interval detection device 100 shown in FIGS. 1 and 2.

Referring to FIG. 4, first, under the control of the CPU 120, the switcher 202 passes the signal received directly from the microphone 132 to the A/D converter 204, and the digitized voice signal is stored in the buffer memory unit 206. The frequency analysis unit 208 performs frequency analysis on the data in the buffer memory unit 206 (S100).

The CPU 120 analyzes the frequency spectrum obtained from the frequency analysis and thereby identifies the frequency region of the subject's fourth formant (S102).
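As a minimal sketch of step S102, the function below picks the fourth local maximum, counted from the low-frequency side, of a spectral envelope. It assumes the envelope has already been smoothed enough that its local maxima are formants rather than individual harmonics (for example by LPC or cepstral smoothing). The patent does not specify how the CPU 120 analyzes the spectrum, so the function name `nth_peak_frequency` and the synthetic envelope are illustrative assumptions.

```python
import math


def nth_peak_frequency(freqs, envelope, n=4):
    """Return the frequency of the n-th local maximum of `envelope`,
    counted from the low-frequency side, or None if there are fewer peaks.

    Assumption: `envelope` is a smoothed spectral envelope, so each local
    maximum corresponds to a formant rather than to a single harmonic.
    """
    count = 0
    for i in range(1, len(envelope) - 1):
        if envelope[i - 1] < envelope[i] >= envelope[i + 1]:
            count += 1
            if count == n:
                return freqs[i]
    return None


# Synthetic envelope (hypothetical data): Gaussian bumps on a 50 Hz grid.
freqs = [50.0 * k for k in range(101)]          # 0 Hz .. 5000 Hz
centers = [500.0, 1500.0, 2500.0, 3100.0]       # made-up formant positions
envelope = [
    sum(math.exp(-((f - c) ** 2) / (2 * 100.0 ** 2)) for c in centers)
    for f in freqs
]

f4 = nth_peak_frequency(freqs, envelope, n=4)
```

Counting peaks from the low-frequency side mirrors the description of the fourth formant as the fourth peak in FIG. 3; on real speech a more robust formant tracker would likely be needed.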

Next, the CPU 120 adjusts the passband of the BPF 200 so that it passes the fourth-formant component of the voice signal (S104). Although not limited to this, once the peak position of the fourth formant is known, the passband may for example be set to extend a predetermined frequency range above and below that peak.
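A minimal sketch of the adjustment in step S104, assuming a symmetric margin around the detected peak. The function name `passband_around_peak` and the 500 Hz half-width are illustrative assumptions; the text only says that a predetermined frequency range above and below the peak may be passed.

```python
def passband_around_peak(peak_hz, half_width_hz, fs):
    """Return (low_hz, high_hz) centered on peak_hz, clipped to [0, fs / 2]."""
    if half_width_hz <= 0:
        raise ValueError("half_width_hz must be positive")
    low_hz = max(0.0, peak_hz - half_width_hz)
    high_hz = min(fs / 2.0, peak_hz + half_width_hz)
    return low_hz, high_hz


# Hypothetical values: a 3.3 kHz peak with a 500 Hz margin at 48 kHz sampling.
low_hz, high_hz = passband_around_peak(3300.0, 500.0, 48000.0)
```

Clipping to the Nyquist frequency keeps the result usable as a filter specification even when the peak lies near the edge of the analyzed band.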

After adjusting the passband of the BPF 200 in this way, the CPU 120 controls the switcher 202 so that the signal that has passed through the BPF 200 is stored in the buffer memory unit 206. From then on, for the same subject under the same input conditions, the glottal closure interval is detected from the intensity of the signal from the adjusted BPF 200 (S106). Because the intensity of the signal from the BPF 200 increases during the glottal closure interval, a threshold is set and the CPU 120 judges any interval in which the signal intensity exceeds the threshold to be a glottal closure interval. Although not limited to this, the threshold may be set manually by the user after viewing the measurement results output to the display device 104, or the CPU 120 may set it from the intensity of the signal that has passed through the BPF 200, for example as a predetermined fraction of the absolute value of its maximum intensity.
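The thresholding in step S106 can be sketched as follows, assuming the signal intensity is already available as a non-negative envelope sequence and that the threshold is a fixed fraction of its maximum. The function name `closed_intervals` and the default ratio of 0.5 are assumptions for illustration, not values from the patent.

```python
def closed_intervals(envelope, ratio=0.5):
    """Return (start, end) index pairs, end exclusive, for every run of
    samples whose envelope value exceeds ratio * max(envelope).

    Assumption: `envelope` is a non-negative intensity measure of the
    band-pass filter output; `ratio` is a hypothetical tuning parameter.
    """
    if not envelope:
        return []
    threshold = ratio * max(envelope)
    intervals = []
    start = None
    for i, value in enumerate(envelope):
        if value > threshold and start is None:
            start = i                      # a run above the threshold begins
        elif value <= threshold and start is not None:
            intervals.append((start, i))   # the run ended at the previous sample
            start = None
    if start is not None:                  # run continues to the end of the data
        intervals.append((start, len(envelope)))
    return intervals


# Toy envelope with two bursts (made-up numbers).
example = [0.0, 0.1, 0.9, 1.0, 0.8, 0.2, 0.1, 0.7, 0.9, 0.3]
intervals = closed_intervals(example, ratio=0.5)
```

Dividing an interval's start and end indices by the sampling frequency gives its position in seconds.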

(Experimental results)
FIGS. 5 and 6 show recordings, made in an anechoic room, of the Japanese vowels /a/ and /i/ sustained by one male and one female speaker in a sitting position. FIG. 5 shows the results for the male speaker and FIG. 6 those for the female speaker.

In the experiments shown in FIGS. 5 and 6, the EGG signal was recorded simultaneously with the speech. The DC component of the EGG signal was removed with a high-pass filter with a cutoff of 1.6 Hz. Both signals were recorded at a sampling frequency of 48 kHz with 16-bit quantization. Because there is a time difference between the speech and the EGG signal corresponding to the distance from the glottis to the microphone, the EGG signal was shifted by this time difference.
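The time alignment can be sketched as below. The text does not give the glottis-to-microphone distance or the speed of sound that was used, so the 0.3 m distance and the 343 m/s default are purely hypothetical, as are the function names. The sketch also assumes that the EGG signal is delayed to match the later-arriving speech.

```python
def delay_in_samples(distance_m, fs, speed_of_sound=343.0):
    """Samples needed for sound to travel distance_m, rounded to the nearest sample.

    Assumption: speed_of_sound in m/s is a nominal room-temperature value.
    """
    return int(round(distance_m / speed_of_sound * fs))


def delay_signal(signal, delay, pad=0.0):
    """Delay `signal` by `delay` samples, keeping its length and padding the start."""
    n = len(signal)
    delay = max(0, min(delay, n))
    return [pad] * delay + list(signal[:n - delay])


# Hypothetical geometry: 0.3 m from glottis to microphone, 48 kHz sampling.
shift = delay_in_samples(0.3, 48000)
aligned = delay_signal([1, 2, 3, 4, 5], 2)
```

At 48 kHz one sample corresponds to roughly 7 mm of acoustic path, so the assumed distance only needs to be known to within a few centimetres for the alignment to be accurate to a handful of samples.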

For the BPF 200, an FIR (Finite Impulse Response) band-pass filter was designed by the window method, that is, by applying a window function to the Fourier series of the ideal filter characteristic. A 101-point Hamming window was used as the window function.
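A minimal pure-Python sketch of this window-method design is given below: the impulse response of an ideal band-pass filter is truncated to 101 taps and weighted by a Hamming window. The function names, the causal convolution helper, and the gain check are illustrative assumptions; the band edges are the male speaker's 2.8 kHz to 3.8 kHz passband reported in the experiment.

```python
import cmath
import math


def design_bandpass_fir(low_hz, high_hz, fs, num_taps=101):
    """Window-method FIR design: ideal band-pass impulse response times a Hamming window."""
    mid = (num_taps - 1) / 2.0
    taps = []
    for n in range(num_taps):
        m = n - mid
        if m == 0:
            ideal = 2.0 * (high_hz - low_hz) / fs
        else:
            ideal = (math.sin(2 * math.pi * high_hz * m / fs)
                     - math.sin(2 * math.pi * low_hz * m / fs)) / (math.pi * m)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    return taps


def fir_filter(signal, taps):
    """Causal FIR convolution; the output has the same length as the input."""
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(taps):
            if i - k < 0:
                break
            acc += h * signal[i - k]
        out.append(acc)
    return out


def gain_at(taps, freq_hz, fs):
    """Magnitude of the filter's frequency response at freq_hz."""
    return abs(sum(h * cmath.exp(-2j * math.pi * freq_hz * n / fs)
                   for n, h in enumerate(taps)))


taps = design_bandpass_fir(2800.0, 3800.0, 48000.0)
in_band = gain_at(taps, 3300.0, 48000.0)    # near the band centre
out_band = gain_at(taps, 1000.0, 48000.0)   # well below the band
```

With only 101 taps at 48 kHz the transition bands are broad compared with the 1 kHz passband, so the band edges of such a filter are not sharp. The symmetric taps give linear phase with a delay of 50 samples, which matters when the output is compared sample by sample with another signal.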

From the spectrograms of the speech data, the passband of the band-pass filter was set to 2.8 kHz to 3.8 kHz for the male speaker and 3.8 kHz to 4.8 kHz for the female speaker. The speech data were passed through this band-pass filter, and the output was compared with the EGG signal.

FIGS. 5 and 6 show a 30 ms segment of the vowel waveform, the corresponding EGG signal, and the output of the band-pass filter (the fourth-formant signal), together with a second-formant signal for comparison. Because the EGG signal is proportional to the contact area of the vocal folds, the intervals in which its value is large are the glottal closure intervals.

FIGS. 5 and 6 show that the amplitude of the band-pass filter output becomes relatively large during the glottal closure interval. Like the simulation result in FIG. 3, this indicates that in real speech, too, the laryngeal cavity resonance (fourth formant) appears during the glottal closure interval of each pitch period and disappears during the glottal open interval. Within one cycle of vocal fold vibration, the glottis closes abruptly and opens gradually. The band-pass filter output behaves correspondingly: its amplitude rises sharply at the onset of glottal closure and then decays gradually.

Accordingly, the envelope of the band-pass filter output, that is, of the signal corresponding to the fourth formant, shows a clear on/off transition, and the glottal closure interval can be determined by thresholding. The second-formant signal, by contrast, decays smoothly, and no clear on/off transition can be detected.
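The text does not say how the envelope is computed. One simple possibility, shown here only as an assumption, is full-wave rectification followed by a sliding maximum whose window spans about one period of the band centre frequency (roughly 15 samples for 3.3 kHz at 48 kHz). The function name `moving_peak_envelope` is hypothetical.

```python
def moving_peak_envelope(signal, window):
    """Envelope estimate: maximum absolute value in a window centred on each sample.

    Assumption: `window` (in samples) is chosen to cover about one period of
    the band-pass centre frequency so that individual oscillations are bridged.
    """
    half = window // 2
    n = len(signal)
    rectified = [abs(x) for x in signal]
    return [max(rectified[max(0, i - half):min(n, i + half + 1)])
            for i in range(n)]


# Toy burst followed by silence (made-up numbers).
burst = [0.0, 1.0, -1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
env = moving_peak_envelope(burst, window=3)
```

A sliding maximum widens each burst by about half a window on either side, so interval boundaries derived from this envelope are offset by that amount; a Hilbert-transform envelope or a low-pass smoothed rectified signal would be alternative choices.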

As described above, the glottal closure interval can be detected by exploiting the fact that the laryngeal cavity resonance pattern of a vowel varies within one pitch period. The glottal closure interval detection method of the present invention can also be applied to back vowels and can record glottal opening and closing under more natural phonation conditions. The method can basically be implemented on a computer with audio input and output, given a microphone and a band-pass filter. Furthermore, unlike the other formants, the laryngeal cavity resonance appears in a nearly constant frequency band regardless of the vowel, so once the passband of the band-pass filter has been determined it can be used for any vowel.

In voice training and similar settings, the detected glottal closure interval can be used as an index of clear phonation. In speech-language therapy, it can be used to support the diagnosis and rehabilitation of voices with glottal insufficiency.

The description above has dealt only with detecting the glottal closure interval. In general, however, speech processing techniques assume that the glottis is closed during phonation, so speech features should be extracted only from the glottal closure interval. The method of the present invention can therefore also be applied by first detecting the glottal closure interval and then extracting features from it.

The embodiments disclosed here should be considered illustrative in all respects and not restrictive. The scope of the present invention is defined by the claims rather than by the description above, and is intended to include all modifications within the meaning and scope equivalent to the claims.

FIG. 1 is a block diagram showing an example of the glottal closure interval detection device 100 to which the glottal closure interval detection method of the present invention is applied.
FIG. 2 is a functional block diagram illustrating the configuration of the voice processing interface unit 128 in more detail.
FIG. 3 shows the effect of the glottal opening area Ag on the vocal tract transfer characteristic, as computed with an equivalent circuit model.
FIG. 4 is a flowchart illustrating the operation of the glottal closure interval detection device 100.
FIG. 5 shows recordings, made in an anechoic room, of the Japanese vowels /a/ and /i/ sustained by a male subject in a sitting position.
FIG. 6 shows recordings, made in an anechoic room, of the Japanese vowels /a/ and /i/ sustained by a female subject in a sitting position.
FIG. 7 is a conceptual diagram showing the movement of the vocal fold mucosa during phonation.
FIG. 8 shows a glottal area waveform.

Explanation of symbols

100 glottal closure interval detection device, 102 computer main body, 104 display device, 106 FD drive, 108 optical disc drive, 110 keyboard, 112 mouse, 116 flexible disk, 118 CD-ROM, 120 CPU, 122 memory, 124 hard disk, 128 voice processing interface unit, 132 microphone, 134 speaker, 200 BPF, 202 switcher, 204 A/D converter, 206 buffer memory unit.

Claims

1. A glottal closure interval detection device comprising:
band extraction means for selectively extracting, from an input voice signal, the voice signal in the frequency band corresponding to the laryngeal cavity resonance; and
computing means for determining a glottal closure interval based on the intensity of the extracted voice signal.

2. The glottal closure interval detection device according to claim 1, wherein
the band extraction means includes band-pass filter means whose passband is variable,
the device further comprises passband setting means for performing frequency analysis of the input voice signal, identifying the frequency band corresponding to the laryngeal cavity resonance, and setting that band as the passband of the band-pass filter means, and
the computing means determines that the corresponding interval of the voice signal is a glottal closure interval when the intensity of the extracted voice signal exceeds a set threshold.

3. The glottal closure interval detection device according to claim 2, wherein the passband setting means determines, based on the frequency spectrum of the voice signal, that the fourth formant is the frequency corresponding to the laryngeal cavity resonance.

4. A glottal closure interval detection method comprising the steps of:
converting a subject's voice into a voice signal and selectively extracting the voice signal in the frequency band corresponding to the laryngeal cavity resonance; and
determining a glottal closure interval based on the intensity of the extracted voice signal.

5. The glottal closure interval detection method according to claim 4, wherein
the extracting step includes the steps of:
performing frequency analysis of the input voice signal to identify the frequency band corresponding to the laryngeal cavity resonance; and
setting the identified frequency band as the passband of band-pass filter means, and
the determining step includes the step of determining that the corresponding interval of the voice signal is a glottal closure interval when the intensity of the output of the band-pass filter means exceeds a set threshold.