JPWO2015019835A1

JPWO2015019835A1 - Electric artificial laryngeal device

Info

Publication number: JPWO2015019835A1
Application number: JP2015530782A
Authority: JP
Inventors: 戸田　智基; 田中　宏; 中村　哲; サクリアニサクティ; グラムニュービッグ
Original assignee: Nara Institute of Science and Technology NUC
Current assignee: Nara Institute of Science and Technology NUC
Priority date: 2013-08-08
Filing date: 2014-07-22
Publication date: 2017-03-02
Also published as: WO2015019835A1

Abstract

Provided is an electric artificial laryngeal device capable of smoothly outputting a sound source sound suited to the sound uttered by a user. The electric artificial laryngeal device 1 includes: a sound collection unit 10 that collects the uttered sound produced when a sound source sound input to the vocal tract of a user P undergoes articulation processing, and generates an utterance signal; a signal processing unit 20 that generates a sound source signal corresponding to the utterance signal generated by the sound collection unit 10; and a sound source signal reproduction unit 30 that reproduces the sound source signal generated by the signal processing unit 20 and outputs a sound source sound.

Description

The present invention relates to an electric artificial laryngeal device that inputs a sound source sound from outside the body into the vocal tract (the space formed by the nasal cavity, oral cavity, tongue, and so on; the same applies hereinafter) of a person who cannot, or can only with difficulty, produce within his or her own body the sound that serves as a sound source (hereinafter, "sound source sound"). Such a person (hereinafter, "person with a laryngeal abnormality") is, for example, someone whose larynx including the vocal cords has been removed because of a disease such as laryngeal cancer, or someone whose vocal cords do not function normally.

A healthy person who does not have a laryngeal abnormality (hereinafter, "person with a normal larynx") produces a sound source sound by vibrating the vocal cords with air exhaled from the lungs through the trachea, inputs that sound into the vocal tract, and subjects it to articulation processing (modulating the sound source sound by resonating it in the vocal tract; the same applies hereinafter), thereby emitting a sound from the mouth (hereinafter, "uttered sound").

A person with a laryngeal abnormality, however, cannot produce uttered sounds in the same way as a person with a normal larynx. Even when the articulation function of the vocal tract is intact, producing a sound source sound inside the body and inputting it to the vocal tract is impossible or difficult.

For this reason, electric artificial laryngeal devices, which are held in close contact with the outside of the throat of a person with a laryngeal abnormality and vibrate to input a sound source sound into that person's vocal tract, are widely used. With such a device, the person can input a sound source sound into the vocal tract, and can therefore produce a desired uttered sound by the simple action of changing the shape of the vocal tract (for example, moving the mouth or tongue), just as a person with a normal larynx does.

However, the sound source sound emitted by an electric artificial laryngeal device is determined independently of the words and utterance content produced by the person with a laryngeal abnormality (that is, independently of the articulation processing described above). For example, the fundamental frequency (pitch) of the sound source sound may remain constant rather than vary over time. It is therefore extremely difficult for the person to add accent or intonation (for example, changes in tone caused by variation in the fundamental frequency or amplitude of the sound source sound) to the uttered sound. As a result, the uttered sound may be perceived as mechanical or may be hard to understand correctly, which is a problem.

These problems are described concretely with reference to FIGS. 5 and 6. FIG. 5 is a set of graphs showing various characteristics of a sound uttered by a person with a normal larynx. FIG. 6 is a set of graphs showing various characteristics of a sound uttered by a person with a laryngeal abnormality using an electric artificial laryngeal device. The graphs in FIGS. 5 and 6 show the signal waveform, fundamental frequency, aperiodic component, and spectrogram as characteristics of each uttered sound.

In FIGS. 5 and 6, the signal waveform graphs plot time on the horizontal axis and amplitude on the vertical axis. The fundamental frequency graphs plot time on the horizontal axis and frequency on the vertical axis. The aperiodic component graphs plot time on the horizontal axis and intensity on the vertical axis. The spectrograms plot time on the horizontal axis and frequency on the vertical axis, with darker (closer to black) colors indicating greater intensity.

Of the characteristics shown in FIGS. 5 and 6, the signal waveform represents the overall character of the uttered sound. The fundamental frequency mainly represents a characteristic of the sound source sound. The aperiodic component also mainly represents a characteristic of the sound source sound (specifically, timbre, such as how hoarse the uttered sound is). The spectrogram represents the characteristics of articulation processing in the vocal tract.

As shown in FIG. 5, the fundamental frequency of the sound uttered by a person with a normal larynx varies over time and is not constant. In other words, accent and intonation are present in the sound uttered by a person with a normal larynx.

In contrast, as shown in FIG. 6, the fundamental frequency of the sound uttered by a person with a laryngeal abnormality does not vary over time and stays constant. In other words, accent and intonation are absent from that uttered sound. The uttered sound is therefore perceived as mechanical or is hard to understand correctly.

To address this, Patent Document 1 proposes an electric artificial laryngeal device that controls the fundamental frequency and volume of the sound source sound according to myoelectric potentials, joint angles, and the like detected with sensors. Patent Document 2 proposes an electric artificial laryngeal device that can output sound source sounds in a plurality of patterns with different fundamental frequency variation, selected according to how the person with a laryngeal abnormality operates a switch.

Patent Document 1: Japanese Patent Laid-Open No. 7-433
Patent Document 2: Japanese Patent Laid-Open No. 11-69476

With the electric artificial laryngeal devices proposed in Patent Documents 1 and 2, outputting sound source sounds with different fundamental frequencies is itself possible. However, the device of Patent Document 1 controls the output sound source sound based on information that is not directly related to the uttered sound (biological information obtained from sensors attached to the outer surface of the body), so it may output a sound source sound that does not suit the sound the person wants to utter. The device of Patent Document 2 requires the person to control the sound source sound by manual operation, which makes the device cumbersome to operate and makes it difficult to produce the sound source sound and the uttered sound smoothly.

An object of the present invention is therefore to provide an electric artificial laryngeal device capable of smoothly outputting a sound source sound suited to the sound uttered by the user.

To achieve the above object, the present invention provides an electric artificial laryngeal device comprising: a sound collection unit that collects the uttered sound produced when a sound source sound input to a user's vocal tract undergoes articulation processing, and generates an utterance signal; a signal processing unit that generates a sound source signal corresponding to the utterance signal generated by the sound collection unit; and a sound source signal reproduction unit that reproduces the sound source signal generated by the signal processing unit and outputs a sound source sound to be input to the vocal tract.

With this electric artificial laryngeal device, a sound source sound corresponding to the sound the user has actually uttered can be output.

Furthermore, in the electric artificial laryngeal device having the above features, the signal processing unit preferably comprises: a speech feature quantity extraction unit that extracts, from the utterance signal generated by the sound collection unit, a speech feature quantity representing the characteristics of articulation processing in the user's vocal tract; a sound source feature quantity estimation unit that estimates, based on the speech feature quantity extracted by the speech feature quantity extraction unit, a sound source feature quantity representing the characteristics of a sound source sound corresponding to the articulation processing in the user's vocal tract; and a sound source signal generation unit that generates the sound source signal having the sound source feature quantity estimated by the sound source feature quantity estimation unit.

With this electric artificial laryngeal device, the sound source feature quantity estimation unit estimates the sound source feature quantity based on the speech feature quantity extracted from the utterance signal. The influence of variation in the sound source sound is thereby excluded, and the sound source feature quantity corresponding to the articulation processing in the vocal tract can be estimated accurately.

Furthermore, in the electric artificial laryngeal device having the above features, it is preferable that the signal processing unit further comprises a database storing a statistical model that represents the correspondence between the speech feature quantity and the sound source feature quantity, and that the sound source feature quantity estimation unit estimates the sound source feature quantity based on the statistical model stored in the database.

With this electric artificial laryngeal device, the sound source feature quantity estimation unit can estimate the sound source feature quantity simply and accurately by using a statistical model constructed in advance.

Furthermore, in the electric artificial laryngeal device having the above features, the statistical model is preferably constructed by associating a first speech feature quantity, extracted from a first utterance signal generated by collecting a first uttered sound produced by a person with a laryngeal abnormality for a given word, with a second sound source feature quantity, extracted from a second utterance signal generated by collecting a second uttered sound produced by a person with a normal larynx for the same word. Here, the first uttered sound is produced when a first sound source sound input to the vocal tract of the person with a laryngeal abnormality undergoes articulation processing; the first speech feature quantity represents the characteristics of articulation processing in that person's vocal tract; the second uttered sound is produced when a second sound source sound output by the vocal cords of the person with a normal larynx undergoes articulation processing in the vocal tract; and the second sound source feature quantity represents the characteristics of the second sound source sound.

With this electric artificial laryngeal device, the sound source feature quantity is estimated based on a statistical model constructed using the second sound source feature quantity, which represents the characteristics of the second sound source sound output by the vocal cords of a person with a normal larynx. The sound source sound output by the sound source signal reproduction unit can therefore be brought closer to the natural sound source sound that such vocal cords produce.

Furthermore, in the electric artificial laryngeal device having the above features, the statistical model is preferably one for which a first sound source feature quantity, which is extracted from the first utterance signal and represents the characteristics of the first sound source sound, falls within the distribution range of the second sound source feature quantity.

With this electric artificial laryngeal device, the statistical model is constructed with the first sound source feature quantity and the second sound source feature quantity matched in range, so the sound source feature quantity estimation unit can accurately estimate sound source feature quantities within that distribution range.

Furthermore, in the electric artificial laryngeal device having the above features, it is preferable that the sound source feature quantity represents the fundamental frequency of the sound source sound and that the second sound source feature quantity represents the fundamental frequency of the second sound source sound.

With this electric artificial laryngeal device, the fundamental frequency of the sound source sound output by the sound source signal reproduction unit can be made to correspond to the articulation processing in the vocal tract.

Furthermore, in the electric artificial laryngeal device having the above features, the statistical model is preferably constructed by correcting the shift in the time direction between the first utterance signal and the second utterance signal, based on the correspondence between the first speech feature quantity and a second speech feature quantity extracted from the second utterance signal, and then associating the first speech feature quantity with the second sound source feature quantity. Here, the second speech feature quantity represents the characteristics of articulation processing in the vocal tract of the person with a normal larynx.

With this electric artificial laryngeal device, even when the speaking rates of the person with a laryngeal abnormality and the person with a normal larynx differ, so that a time shift can arise between the first uttered sound and the second uttered sound, the first speech feature quantity and the second sound source feature quantity are associated after that shift is corrected. A statistical model capable of accurately estimating the sound source feature quantity can therefore be constructed.

With the electric artificial laryngeal device having the above features, a sound source sound corresponding to the sound the user has actually uttered can be output. A sound source sound suited to the user's uttered sound can therefore be output smoothly.

FIG. 1 is a block diagram showing a configuration example of an electric artificial laryngeal device according to an embodiment of the present invention.
FIG. 2 is a block diagram showing a configuration example of the signal processing unit included in the electric artificial laryngeal device shown in FIG. 1.
FIG. 3 is a graph showing an example of a method of constructing the statistical model.
FIG. 4 is a graph showing an example of a method of constructing the statistical model.
FIG. 5 is a set of graphs showing various characteristics of a sound uttered by a person with a normal larynx.
FIG. 6 is a set of graphs showing various characteristics of a sound uttered by a person with a laryngeal abnormality using an electric artificial laryngeal device.

First, an electric artificial laryngeal device according to an embodiment of the present invention is described with reference to the drawings. FIG. 1 is a block diagram showing a configuration example of the electric artificial laryngeal device according to the embodiment.

As shown in FIG. 1, the electric artificial laryngeal device 1 according to the embodiment of the present invention includes a sound collection unit 10, a signal processing unit 20, and a sound source signal reproduction unit 30. For convenience of explanation, FIG. 1 also shows a user P of the electric artificial laryngeal device, who is a person with a laryngeal abnormality.

The sound collection unit 10 consists of, for example, an air-conduction microphone or a body-conduction microphone. It collects the sound uttered by the user P and converts it into an electrical signal to generate an utterance signal. The sound collection unit 10 collects the uttered sound at a sampling frequency of, for example, 16 kHz. When a body-conduction microphone is used as the sound collection unit 10, a non-audible murmur (NAM) microphone may be used, for example. A NAM microphone is a flesh-conduction microphone that is pressed against the skin behind the auricle (toward the back of the head) and picks up sound propagating through the soft tissue of the head and neck.

The signal processing unit 20 includes an arithmetic processing device such as a CPU (Central Processing Unit) or a DSP (Digital Signal Processor), and generates a sound source signal corresponding to the utterance signal generated by the sound collection unit 10. The sound source signal generated by the signal processing unit 20 varies over time in response to the time-varying utterance signal. For example, its fundamental frequency can vary over time, like that of the sound source sound output by the vocal cords of a person with a normal larynx (see the fundamental frequency graph in FIG. 5).
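As a rough illustration of a sound source signal whose fundamental frequency varies over time, the following Python sketch builds a waveform from a frame-wise fundamental frequency contour by phase accumulation. The function name `synthesize_source`, the 5 ms frame interval, and the use of a plain sinusoid are assumptions made for this example only; the description above does not specify the waveform of the sound source signal.

```python
import numpy as np

def synthesize_source(f0_frames, fs=16000, shift_ms=5.0):
    """Build a waveform whose pitch follows a frame-wise F0 contour (in Hz).

    Hypothetical helper: a sinusoid is used only to keep the sketch short.
    """
    hop = int(fs * shift_ms / 1000)              # samples per frame (80 at 16 kHz)
    # Hold each frame's F0 for one frame interval.
    f0 = np.repeat(np.asarray(f0_frames, dtype=float), hop)
    # Integrate instantaneous frequency to obtain a continuous phase,
    # so that changes in F0 do not cause jumps in the waveform.
    phase = 2.0 * np.pi * np.cumsum(f0) / fs
    return np.sin(phase)

# Example: a contour rising from 100 Hz to 150 Hz over one second.
contour = np.linspace(100.0, 150.0, 200)
signal = synthesize_source(contour)
```

A contour that rises or falls from frame to frame yields a signal whose pitch follows it, which is the property the paragraph above attributes to the sound source signal.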

The sound source signal reproduction unit 30 reproduces the sound source signal generated by the signal processing unit 20 and outputs a sound source sound to be input to the vocal tract of the user P. For example, the sound source signal reproduction unit 30 includes a diaphragm and a drive device for the diaphragm, and outputs the sound source sound when the drive device vibrates the diaphragm according to the sound source signal. The diaphragm vibrates while pressed against the throat of the user P, so that the sound source sound is input to the vocal tract of the user P.

The user P then articulates the sound source sound by changing the shape of his or her vocal tract (for example, by moving the mouth or tongue) and produces an uttered sound. That uttered sound is in turn collected by the sound collection unit 10, and the series of operations described above is repeated. In this way, the electric artificial laryngeal device 1 continuously outputs a sound source sound corresponding to the sound uttered by the user P and continuously inputs it to the vocal tract of the user P.

As described above, the electric artificial laryngeal device 1 according to the embodiment of the present invention can output a sound source sound corresponding to the sound the user P has actually uttered. A sound source sound suited to the sound uttered by the user P can therefore be output smoothly.

In this electric artificial laryngeal device 1, the sound source sound corresponding to the sound the user P is currently uttering is input to the vocal tract of the user P after a short delay (for example, the time required for processing in the signal processing unit 20 and elsewhere, about 50 ms to 70 ms). This delay is very small, however, and humans are insensitive to it, so problems such as a listener finding the uttered sound of the user P unnatural are unlikely to arise.

Next, the signal processing unit 20 of the electric artificial laryngeal device 1 shown in FIG. 1 is described in detail with reference to the drawings. FIG. 2 is a block diagram showing a configuration example of the signal processing unit 20 included in the electric artificial laryngeal device shown in FIG. 1.

As shown in FIG. 2, the signal processing unit 20 includes a speech feature quantity extraction unit 21, a sound source feature quantity estimation unit 22, a database 23, and a sound source signal generation unit 24.

The speech feature quantity extraction unit 21 extracts, from the utterance signal generated by the sound collection unit 10, a speech feature quantity that characterizes the articulation processing in the vocal tract of the user P. The speech feature quantity is based on, for example, the spectral envelope (the coarse shape of the frequency spectrum).

For example, the speech feature quantity extraction unit 21 applies a short-time Fourier transform (STFT) to the utterance signal with a frame length of 25 ms and a frame shift of 5 ms, and selectively extracts the coarse-shape component from the resulting frequency spectrum, thereby obtaining spectral envelopes continuously. One way to do this is to take the inverse Fourier transform of the log-amplitude frequency spectrum to obtain a cepstrum, selectively extract the low-order components of that cepstrum, and then apply a Fourier transform again. Arranging the spectral envelopes obtained in this way in sequence along the time axis yields a spectrogram like those shown in FIGS. 5 and 6. The speech feature quantity extraction unit 21 then obtains the speech feature quantity by, for example, jointly reducing the dimensionality of the spectral envelopes in a segment formed by concatenating each frame with the four frames before and after it.
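The extraction steps above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the FFT size (512), the number of retained cepstral coefficients (30), the Hanning window, the use of PCA for the dimensionality reduction, the output dimension (50), and all function names are choices made for this example, since the description gives only the frame length, the frame shift, and the segment width. The segment is taken here as four frames on each side of the current frame.

```python
import numpy as np

def spectral_envelopes(signal, fs=16000, frame_ms=25.0, shift_ms=5.0,
                       n_fft=512, n_lifter=30):
    """Log spectral envelope per frame via low-order cepstrum (liftering)."""
    frame_len = int(fs * frame_ms / 1000)        # 400 samples at 16 kHz
    shift = int(fs * shift_ms / 1000)            # 80 samples at 16 kHz
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // shift
    # Keep only low-quefrency coefficients (symmetric for a real cepstrum).
    lifter = np.zeros(n_fft)
    lifter[:n_lifter] = 1.0
    lifter[-(n_lifter - 1):] = 1.0
    env = np.empty((n_frames, n_fft // 2 + 1))
    for i in range(n_frames):
        frame = signal[i * shift:i * shift + frame_len] * window
        log_mag = np.log(np.abs(np.fft.rfft(frame, n_fft)) + 1e-10)
        cepstrum = np.fft.irfft(log_mag, n_fft)  # inverse FFT of log amplitude
        env[i] = np.fft.rfft(cepstrum * lifter, n_fft).real  # smoothed log spectrum
    return env

def stack_context(frames, context=4):
    """Concatenate each frame with `context` frames on each side."""
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    n = frames.shape[0]
    return np.hstack([padded[k:k + n] for k in range(2 * context + 1)])

def pca_compress(x, n_dims=50):
    """Reduce dimensionality with PCA (an assumed choice of method)."""
    centered = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_dims].T

# Example with one second of noise standing in for an utterance signal.
rng = np.random.default_rng(0)
utterance = rng.standard_normal(16000)
features = pca_compress(stack_context(spectral_envelopes(utterance)))
```

In practice the PCA basis would be computed once from training data and reused at run time; it is recomputed per call here only to keep the sketch self-contained.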

The sound source feature quantity estimation unit 22 estimates a sound source feature quantity, which represents the characteristics of a sound source sound corresponding to the articulation processing in the vocal tract of the user P, based on the speech feature quantity extracted by the speech feature quantity extraction unit 21 and the statistical model stored in the database 23. The sound source feature quantity is, for example, the fundamental frequency.

An example of a method of constructing the statistical model stored in the database 23 is now described with reference to the drawings. FIGS. 3 and 4 are graphs showing an example of a method of constructing the statistical model.

The statistical model is constructed by associating the sound uttered by a person with a laryngeal abnormality for a given word (hereinafter, "first uttered sound") with the sound uttered by a person with a normal larynx for the same word (hereinafter, "second uttered sound"). The first uttered sound is what the person with a laryngeal abnormality produces by articulating, in the vocal tract, the sound source sound output by a conventional electric artificial laryngeal device (hereinafter, "first sound source sound"). The second uttered sound is what the person with a normal larynx produces by articulating, in the vocal tract, the sound source sound output by the vocal cords (hereinafter, "second sound source sound").

FIG. 3(a) is a graph showing the signal waveforms of the utterance signal generated by collecting the first uttered sound (hereinafter, "first utterance signal") and the utterance signal generated by collecting the second uttered sound (hereinafter, "second utterance signal"). FIG. 3(b) is a graph showing how the first utterance signal and the second utterance signal are associated. Both graphs are for the case in which the person with a laryngeal abnormality and the person with a normal larynx utter the same words.

As shown in FIG. 3(a), even when the person with a laryngeal abnormality and the person with a normal larynx utter the same words, speaking rate differs between individuals, so a time shift can arise between the first utterance signal and the second utterance signal.

Therefore, as shown in FIG. 3(b), this time shift is corrected by comparing the speech feature quantity extracted from the first utterance signal (hereinafter, "first speech feature quantity") with the speech feature quantity extracted from the second utterance signal (hereinafter, "second speech feature quantity"). This makes it possible to construct a statistical model that can estimate the sound source feature quantity accurately. The first and second speech feature quantities can be extracted by, for example, the same method as that used in the speech feature quantity extraction unit 21 shown in FIG. 2.

First, the patterns of the first voice feature amount and the second voice feature amount are compared, and, using portions with similar features as clues, a correspondence relationship in which the shift in the time direction has been corrected (the broken lines in FIG. 3(b)) is defined. Then, according to that correspondence relationship, the first voice feature amount is associated with the sound source feature amount extracted from the second utterance signal (hereinafter referred to as the "second sound source feature amount"). Because the second voice feature amount and the second sound source feature amount are both extracted from the second utterance signal, there is no time shift between them. Various well-known methods can be used to extract the second sound source feature amount from the second utterance signal; for example, the method shown in Reference 1 below may be applied.

(Reference 1)
H. Kawahara, H. Katayose, A. de Cheveigne, and R. D. Patterson.
Fixed point analysis of frequency to instantaneous frequency mapping for accurate estimation of F0 and periodicity.
Proc. EUROSPEECH, pp. 2781-2784, Budapest, Hungary, Sep. 1999.
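
The time alignment described above (the broken lines in FIG. 3(b)) can be illustrated with dynamic time warping (DTW). The description does not name a specific alignment algorithm, so DTW is an assumption here, as are the function names `dtw_path` and `align_training_pairs` and the use of scalar features with an absolute-difference distance. The sketch aligns the first and second voice feature sequences and then pairs each first voice feature with the second sound source feature of the matched frame.

```python
def dtw_path(a, b):
    """Return the minimum-cost monotonic alignment path between sequences a and b.

    Features are scalars here for simplicity (an assumption); the local
    distance is the absolute difference.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end to recover the frame-to-frame correspondence.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return path


def align_training_pairs(first_voice, second_voice, second_source):
    """Pair first voice features with second sound source features.

    second_voice and second_source come from the same second utterance
    signal, so they share frame indices and need no alignment between them.
    """
    return [(first_voice[i], second_source[j])
            for i, j in dtw_path(first_voice, second_voice)]
```

For example, `align_training_pairs([1.0, 1.0, 5.0, 9.0], [1.0, 5.0, 5.0, 9.0], [100.0, 120.0, 125.0, 140.0])` pairs the two slow opening frames of the first speaker with the single opening frame of the second speaker. With vector-valued features, the absolute difference would be replaced by a vector distance such as the Euclidean distance.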

A statistical model is constructed by performing this association between the first voice feature amount and the second sound source feature amount for a variety of words. Such a statistical model can be constructed using, for example, a Gaussian mixture model (GMM). In FIG. 4(a) and FIG. 4(b), each of the first voice feature amount and the second sound source feature amount is shown as a scalar to simplify illustration and description. It is preferable, however, that each of them be a vector composed of a plurality of components, because the sound source feature amount can then be estimated more accurately.

The graph in FIG. 4(a) is a histogram of the data of the first voice feature amount and the second sound source feature amount. The graph in FIG. 4(b) is a statistical model constructed by applying a GMM to the data shown in FIG. 4(a). In the graph (statistical model) of FIG. 4(b), a higher portion of the graph indicates a higher probability of occurrence of the corresponding combination of the first voice feature amount and the second sound source feature amount.

The sound source feature amount estimation unit 22 estimates the sound source feature amount based on this statistical model and the voice feature amount extracted by the voice feature amount extraction unit 21. If an estimation process that takes correlation in the time direction into account is used here, the sound source feature amount estimation unit 22 can estimate the sound source feature amount accurately. Various well-known methods can be used for estimation that takes correlation in the time direction into account; for example, the method shown in Reference 2 below may be applied.

(Reference 2)
T. Toda, M. Nakagiri, K. Shikano.
Statistical voice conversion techniques for body-conducted unvoiced speech enhancement.
IEEE Transactions on Audio, Speech and Language Processing, Vol. 20, No. 9, pp. 2505-2517, Sep. 2012.

For example, the sound source feature amount estimation unit 22 applies the voice feature amount extracted by the voice feature amount extraction unit 21 to the first voice feature amount in the statistical model and obtains the corresponding second sound source feature amount (for example, the one for which the probability of occurrence is highest). The sound source feature amount estimation unit 22 outputs the second sound source feature amount thus obtained as the estimated sound source feature amount.
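
As an illustration of this lookup, the following sketch estimates a scalar sound source feature from a scalar voice feature using a joint GMM whose parameters are already known. It uses the conditional expectation (a minimum mean-square-error estimate), which is a common choice in GMM-based conversion and is swapped in here for simplicity; it is not necessarily the maximum-probability estimate mentioned above, and it ignores correlation in the time direction. The function names and the dictionary keys (`weight`, `mean_x`, `mean_y`, `var_x`, `cov_xy`) are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional normal distribution at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def estimate_source_feature(x, components):
    """Estimate the sound source feature y from the voice feature x.

    components: list of dicts describing a joint GMM over (x, y), each with
    'weight', 'mean_x', 'mean_y', 'var_x' and 'cov_xy' (hypothetical keys).
    Returns the conditional mean E[y | x] under the mixture.
    """
    # Unnormalised posterior probability of each mixture component given x.
    resp = [c["weight"] * gaussian_pdf(x, c["mean_x"], c["var_x"]) for c in components]
    total = sum(resp)
    if total == 0.0:
        # x is far from every component; fall back to the prior weights.
        resp = [c["weight"] for c in components]
        total = sum(resp)

    estimate = 0.0
    for r, c in zip(resp, components):
        # Conditional mean of y given x within a single Gaussian component.
        cond_mean = c["mean_y"] + c["cov_xy"] / c["var_x"] * (x - c["mean_x"])
        estimate += (r / total) * cond_mean
    return estimate
```

In practice the voice feature would be a vector and the GMM parameters would be fitted to the aligned training pairs, in which case the scalar variance and covariance become matrices.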

Finally, the sound source signal generation unit 24 generates a sound source signal having the sound source feature amount estimated by the sound source feature amount estimation unit 22 (for example, if the sound source feature amount is a fundamental frequency, a sound source signal whose waveform has that fundamental frequency) and outputs it to the sound source signal reproduction unit 30 shown in FIG. 1.
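
A minimal sketch of this generation step, assuming the sound source feature amount is a per-frame fundamental frequency, is shown below. The sinusoidal waveform, the function name `synthesize_source`, and the default frame length, sample rate, and amplitude are all assumptions made for illustration; an actual device may use a different excitation waveform.

```python
import math


def synthesize_source(f0_per_frame, frame_len=80, sample_rate=16000, amplitude=0.5):
    """Generate a waveform whose fundamental frequency follows f0_per_frame.

    f0_per_frame: estimated fundamental frequency in Hz, one value per frame.
    The phase is accumulated sample by sample so that the waveform stays
    continuous when the fundamental frequency changes between frames.
    """
    samples = []
    phase = 0.0
    for f0 in f0_per_frame:
        step = 2.0 * math.pi * f0 / sample_rate  # phase advance per sample
        for _ in range(frame_len):
            samples.append(amplitude * math.sin(phase))
            phase += step
    return samples
```

Accumulating the phase, rather than recomputing it from the sample index for each frame, avoids discontinuities (audible clicks) at frame boundaries.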

In this signal processing unit 20, the sound source feature amount estimation unit 22 estimates the sound source feature amount based on the voice feature amount extracted from the utterance signal. It is therefore possible to eliminate the influence of fluctuations in the sound source sound and to accurately estimate a sound source feature amount that corresponds to the articulation processing in the vocal tract.

Furthermore, in this signal processing unit 20, the sound source feature amount estimation unit 22 can estimate the sound source feature amount simply and accurately by using a statistical model constructed in advance. In particular, in this signal processing unit 20, the sound source feature amount is estimated based on a statistical model constructed using the second sound source feature amount, which indicates the features of the second sound source sound output by the vocal cords of a person with a normal larynx. The sound source sound output by the sound source signal reproduction unit 30 can therefore be brought close to the natural sound source sound that the vocal cords of a person with a normal larynx would output.

When the statistical model described above is constructed, the first sound source feature amount, which is extracted from the first utterance signal and indicates the features of the first sound source sound, may be made to fall within the distribution range of the second sound source feature amount. The statistical model is then constructed with the first sound source feature amount and the second sound source feature amount matched (for example, a statistical model for men is constructed from a first utterance signal and a second utterance signal that are both masculine, or a statistical model for women is constructed from a first utterance signal and a second utterance signal that are both feminine). This is preferable because the sound source feature amount estimation unit 22 can then accurately estimate sound source feature amounts within that distribution range.

In this case, for example, the sound source feature amount desired by the user P (hereinafter referred to as the "target sound source feature amount") is determined first. Specifically, for example, the voice pitch (fundamental frequency) desired by the user P is determined. Then a first utterance signal from which a first sound source feature amount that matches or approximates the target sound source feature amount can be extracted, and a second utterance signal from which a second sound source feature amount that matches or approximates the target sound source feature amount can be extracted, are each acquired, and the statistical model is constructed according to the method described above.

Such a first utterance signal can be acquired by collecting the first uttered sound produced by the user P or another person using an electric artificial laryngeal device whose output has been adjusted so that the above first sound source feature amount is obtained. Such a first utterance signal can also be acquired by adjusting a first utterance signal already recorded in a database or the like so that its first sound source feature amount approaches the target sound source feature amount. A statistical model with a widened distribution range of the first sound source feature amount may also be constructed by simultaneously using a wide variety of first utterance signals obtained by adjusting the output of the electric artificial laryngeal device or by adjusting the first sound source feature amount.

Such a second utterance signal can be acquired by selecting a person with a normal larynx whose vocal cords yield the above second sound source feature amount and collecting the second uttered sound produced by that person. Such a second utterance signal can also be acquired by adjusting a second utterance signal already recorded in a database or the like so that its second sound source feature amount approaches the target sound source feature amount.
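
One way to picture the adjustment of a recorded signal toward the target sound source feature amount, assuming that feature is the fundamental frequency, is to scale an extracted F0 contour so that its average matches the target. The description does not specify how the adjustment is performed, so the approach below (matching the geometric mean of the voiced frames), the function name `shift_f0_to_target`, and the convention that unvoiced frames are marked with 0 are all assumptions. Resynthesizing a waveform from the adjusted contour is outside the scope of this sketch.

```python
import math


def shift_f0_to_target(f0_contour, target_f0):
    """Scale an F0 contour so that its geometric mean equals target_f0.

    f0_contour: per-frame fundamental frequency in Hz; 0 marks an unvoiced
    frame (an assumed convention) and is left unchanged.
    """
    voiced = [f for f in f0_contour if f > 0]
    if not voiced:
        return list(f0_contour)
    # Average in the log domain, since pitch is perceived roughly logarithmically.
    log_mean = sum(math.log(f) for f in voiced) / len(voiced)
    ratio = target_f0 / math.exp(log_mean)
    return [f * ratio if f > 0 else 0.0 for f in f0_contour]
```

Scaling by a constant ratio preserves the relative rises and falls of the contour, so accent and intonation patterns in the recorded utterance are retained while the overall pitch moves toward the target.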

<Modifications, etc.>
[1] In the embodiment of the present invention described above, the electric artificial laryngeal device 1 has mainly been described as varying the fundamental frequency of the sound source sound so as to correspond to the uttered sound (in particular, to the articulation processing in the vocal tract). However, the amplitude (power) of the sound source sound may instead be varied so as to correspond to the uttered sound, or both the fundamental frequency and the amplitude of the sound source sound may be varied so as to correspond to the uttered sound.

If the electric artificial laryngeal device 1 is configured so that it can vary not only the fundamental frequency but also the amplitude of the sound source sound, it can output sound source sounds suited to a variety of languages: not only languages in which accent and intonation are often conveyed by variation in the fundamental frequency of the sound source sound (for example, Japanese), but also languages in which accent and intonation are often conveyed by variation in the amplitude of the sound source sound (for example, English).

[2] The electric artificial laryngeal device 1 is preferably configured to switch the output of the sound source sound on and off according to the behavior of the user P (for example, whether or not the user presses a button or presses the main body against the throat).

In this case, for a very short time immediately after the electric artificial laryngeal device 1 starts outputting the sound source sound, a sound source sound having a predetermined sound source feature amount is output. Immediately afterwards, however, a sound source sound corresponding to the uttered sound of the user P is output, so problems such as a listener finding the uttered sound of the user P unnatural are unlikely to arise.

[3] For a person whose vocal cords will stop functioning in the future (that is, a person who will become the user P described above), such as a patient scheduled to have the larynx surgically removed, it is preferable to collect and record uttered sounds produced with that person's own vocal cords (hereinafter referred to as "own-vocal-cord uttered sounds") while the vocal cords are still functioning.

The recorded own-vocal-cord uttered sounds are uttered sounds produced by a person with a normal larynx and are therefore included in the second uttered sound described above. It is therefore preferable to construct the statistical model using second uttered sounds that include these own-vocal-cord uttered sounds. Moreover, the sound source feature amount extracted from the signal of the own-vocal-cord uttered sounds can be regarded as exactly the sound source feature amount that the user P desires, so it is preferable to construct the statistical model with that sound source feature amount as the target sound source feature amount described above.

A statistical model constructed in this way reflects the characteristics of the user P's speech (accent, intonation, and so on) from when the user P had a normal larynx. By using this statistical model in the electric artificial laryngeal device 1 described above, it therefore becomes possible to output a sound source sound that effectively reproduces the characteristics of the user P's speech from when the user P had a normal larynx.

The more own-vocal-cord uttered sound is recorded, the better, but about 50 sentences (an amount that takes about 3 to 5 minutes to read aloud) may be sufficient.

INDUSTRIAL APPLICABILITY
The present invention can be suitably used for an electric artificial laryngeal device that inputs a sound source sound into the vocal tract of a person with an abnormal larynx.

DESCRIPTION OF SYMBOLS
1: Electric artificial laryngeal device
10: Sound collection unit
20: Signal processing unit
21: Voice feature amount extraction unit
22: Sound source feature amount estimation unit
23: Database
24: Sound source signal generation unit
30: Sound source signal reproduction unit
P: User

Claims

A sound collection unit that collects the utterance sound generated by the articulation processing of the sound source sound input to the user's vocal tract, and generates an utterance signal;
A signal processing unit that generates a sound source signal corresponding to the utterance signal generated by the sound collection unit;
A sound source signal reproducing unit that reproduces the sound source signal generated by the signal processing unit and outputs a sound source sound for input to the vocal tract;
An electric artificial laryngeal device comprising:

The signal processing unit is
A voice feature amount extraction unit that extracts a voice feature amount indicating characteristics of articulation processing in the user's vocal tract from the utterance signal generated by the sound collection unit;
A sound source feature amount estimation unit that estimates a sound source feature amount indicating a feature of a sound source sound corresponding to articulation processing in the user's vocal tract based on the sound feature amount extracted by the sound feature amount extraction unit;
A sound source signal generation unit that generates the sound source signal having the sound source feature amount estimated by the sound source feature amount estimation unit;
The electric artificial laryngeal device according to claim 1, comprising:

The signal processing unit further comprises a database in which a statistical model indicating a correspondence relationship between the voice feature amount and the sound source feature amount is recorded;
3. The electric artificial laryngeal device according to claim 2, wherein the sound source feature amount estimation unit estimates the sound source feature amount based on the statistical model recorded in the database.

The statistical model is constructed by associating a first voice feature amount, extracted from a first utterance signal generated by collecting a first uttered sound produced by a person with an abnormal larynx for a certain word, with a second sound source feature amount, extracted from a second utterance signal generated by collecting a second uttered sound produced by a person with a normal larynx for the same word,
The first uttered sound is produced when a first sound source sound input to the vocal tract of the person with an abnormal larynx is subjected to articulation processing,
The first voice feature amount indicates a feature of articulation processing in the vocal tract of the person with an abnormal larynx,
The second uttered sound is produced when a second sound source sound output from the vocal cords of the person with a normal larynx is subjected to articulation processing in the vocal tract,
The electric artificial laryngeal device according to claim 3, wherein the second sound source feature amount indicates a feature of the second sound source sound.

5. The statistical model is characterized in that a first sound source feature amount indicating a feature of the first sound source sound extracted from the first utterance signal is within a distribution range of the second sound source feature amount. The electric artificial laryngeal device described in 1.

6. The sound source feature amount indicates a fundamental frequency of the sound source sound, and the second sound source feature amount indicates a fundamental frequency of the second sound source sound. The electric artificial laryngeal device as described.

The statistical model is constructed by correcting the shift in the time direction between the first utterance signal and the second utterance signal, based on a correspondence relationship between the first voice feature amount and a second voice feature amount extracted from the second utterance signal, and then associating the first voice feature amount with the second sound source feature amount,
The electric artificial laryngeal device according to any one of claims 4 to 6, wherein the second voice feature amount indicates a feature of articulation processing in the vocal tract of the person with a normal larynx.