JP5052107B2

JP5052107B2 - Voice reproduction device and voice reproduction method

Info

Publication number: JP5052107B2
Application number: JP2006317304A
Authority: JP
Inventors: 秀之水野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2006-11-24
Filing date: 2006-11-24
Publication date: 2012-10-17
Anticipated expiration: 2026-11-24
Also published as: JP2008129524A

Description

本発明は、体内伝達音から音声を再生する音声再現装置及び音声再現方法に関する。 The present invention relates to a sound reproduction device and a sound reproduction method for reproducing sound from in-body transmitted sound.

In recent years, the mainstream of person-to-person communication has shifted from conventional fixed-line telephones to mobile communication using mobile phones, and tens of millions of mobile phones are in use in Japan alone. In addition, communication is no longer mediated by voice transmission alone: technologies for conveying various kinds of information, such as e-mail, photographs, and videophone calls, have been developed one after another and have come into widespread use.
However, because the main purpose of a mobile phone is still person-to-person conversation, its spread has also had many adverse social effects. Calls in quiet places such as libraries, loud calls late at night, and calls on trains can fairly be called a nuisance. The only available countermeasure is to raise awareness that such behavior runs counter to social morals, and in practice it is difficult to stop.
In view of this situation, a "voiceless telephone" technique has been proposed that is based on the principle of recognizing speech from body-conducted speech and that can output (synthesize) voice without the user actually speaking aloud (see Non-Patent Documents 1 and 2).

With such techniques, there is a prospect that voice can, in principle, be reproduced from body-conducted sound and transmitted.
Non-Patent Document 1: Tomoki Toda, Kiyohiro Shikano, "Voiceless telephone", Proceedings of the National Meeting of the Acoustical Society of Japan, pp. 369-370, 3-6-21, 2005-9
Non-Patent Document 2: Yoshitaka Nakajima, Hideki Kashioka, Nick Campbell, Kiyohiro Shikano, "Non-audible murmur recognition", IEICE Transactions, Vol. J87-D-II, No. 9, pp. 1757-1764, 2004-9

As described in "Background Art", reproducing voice from body-conducted sound has become possible in principle. In practical terms, however, it remains difficult to reproduce a user's own individual voice in a way that is easy for the user. The problem is described below.

To use the "voiceless telephone" described above like an ordinary telephone, the generated voice must of course reproduce the individuality of the speaker. This requires collecting a certain amount of the individual's voice and body-conducted sound and learning from it. During such collection, however, the user has to keep producing monotonous, uninteresting utterances for a long time, and such vocalization work is generally disliked. Devices that require this procedure are therefore difficult to popularize, so in practice this learning method cannot be adopted and reproducing the individual becomes difficult. This problem is common to techniques for reproducing voice from body-conducted sound in general and is not limited to the "voiceless telephone" described above.
The present invention has been made in view of these points, and its object is to provide a technique capable of reproducing a user's individual voice from body-conducted sound without requiring burdensome work from the user.

To solve the above problem, the present invention provides a voice reproduction device comprising: a body-conducted sound collection unit that collects body-conducted sound caused by utterance; a voice collection unit that collects voice; a first transmission unit that temporally synchronizes the body-conducted sound data collected by the body-conducted sound collection unit during utterance with vocal cord movement with the voice data collected at the same time by the voice collection unit, transmits them to a computer for use in learning a voice association model that associates body-conducted sound with voice, and further transmits to the computer the body-conducted sound data collected by the body-conducted sound collection unit during utterance without vocal cord movement; and a first reception unit that receives the voice data that the computer generates, using the voice association model, from the body-conducted sound data collected during utterance without vocal cord movement.

Here, "body-conducted sound caused by utterance" means sound that is produced during utterance by the movement of speech organs such as the area around the mouth and the tongue and that is transmitted by vibration of human soft tissue. It includes both body-conducted sound caused by utterance with vocal cord movement and body-conducted sound caused by utterance without vocal cord movement. "Vocal cord movement" means movement that vibrates the vocal cords or narrows the glottis, and "utterance without vocal cord movement" means "Non-Audible Murmur (NAM)".

The voice reproduction device of the present invention collects normal voice with the voice collection unit during normal utterance with vocal cord movement, and at the same time collects the body-conducted sound caused by that utterance with the body-conducted sound collection unit. The device temporally synchronizes the voice data and the body-conducted sound data collected in this way and transmits them to the computer for use in learning a voice association model that associates body-conducted sound with voice. The computer can learn the voice association model using the temporally synchronized voice data and body-conducted sound data. Because the learning data consists solely of data collected during normal utterance, the user does not need to spend a long time beforehand producing monotonous, uninteresting utterances to collect it. Moreover, because learning takes place during normal calls, which are generally considered the most common form of use, the model can be learned sufficiently without burdening the user with learning work.
Body-conducted sound is produced by the movement of speech organs such as the area around the mouth and the tongue, and is hardly affected by whether or not the utterance involves vocal cord movement. There is therefore no problem in using features extracted from body-conducted sound collected during normal utterance with vocal cord movement as learning data. On the contrary, because voice and body-conducted sound are collected simultaneously from normal utterance with vocal cord movement, more appropriate learning data can be collected than when the two are collected separately. Simultaneous collection also makes it easy to synchronize the two in time at the moment of collection, so the amount of computation during learning can be reduced compared with associating them statistically at learning time. Furthermore, it is unlikely that association errors will increase the variance of the model and thereby degrade the quality of the reproduced voice.
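The pairing of simultaneously collected data described above can be sketched roughly as follows. This is only an illustrative sketch, not the method of the specification: it assumes that both streams start at the same instant and share one sampling rate, and the function names and the frame length are hypothetical.

```python
from typing import List, Sequence, Tuple

Frame = Tuple[float, ...]


def split_into_frames(samples: Sequence[float], frame_len: int) -> List[Frame]:
    """Cut a sample stream into consecutive, non-overlapping frames.

    Trailing samples that do not fill a whole frame are dropped.
    """
    n_frames = len(samples) // frame_len
    return [tuple(samples[i * frame_len:(i + 1) * frame_len]) for i in range(n_frames)]


def make_training_pairs(
    body_samples: Sequence[float],
    voice_samples: Sequence[float],
    frame_len: int,
) -> List[Tuple[Frame, Frame]]:
    """Pair frame k of the body-conducted stream with frame k of the voice stream.

    Assumption (not from the specification): both streams were recorded
    simultaneously at the same sampling rate, so equal frame indices cover
    the same time span and no statistical alignment step is needed.
    """
    body_frames = split_into_frames(body_samples, frame_len)
    voice_frames = split_into_frames(voice_samples, frame_len)
    # zip stops at the shorter stream, so every pair is time-aligned.
    return list(zip(body_frames, voice_frames))
```

Because the two streams are already aligned at collection time, pairing reduces to matching frame indices, which is the reduction in learning-time computation that the paragraph above refers to.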

Preferably, the voice reproduction device of the present invention further comprises a voice playback unit that plays back voice from the voice data received by the first reception unit. This allows the voice reproduced from body-conducted sound to be fed back to the speaker. Normally, people control their own speech by constantly listening to their own voice. Indeed, people with acquired hearing impairment tend to find speaking difficult, and speaking for a long time in an environment where one cannot hear one's own voice leads to badly distorted speech or makes speaking itself difficult. By feeding the voice reproduced from body-conducted sound back to the user, the user can speak well using body-conducted sound.

Preferably, the voice reproduction device of the present invention further comprises a bone conduction vibration unit that collects bone-conducted sound caused by utterance and plays back bone-conducted sound. In this case, the first transmission unit further temporally synchronizes the body-conducted sound data collected by the body-conducted sound collection unit during utterance with vocal cord movement with the bone-conducted sound data collected at the same time by the bone conduction vibration unit, and transmits them to the computer for use in learning a bone-conducted sound association model that associates body-conducted sound with bone-conducted sound. The first reception unit receives the bone-conducted sound data that the computer generates, using the bone-conducted sound association model, from the body-conducted sound data collected during utterance without vocal cord movement, and the bone conduction vibration unit plays back bone-conducted sound from the received data. The bone-conducted sound reproduced from the body-conducted sound can thus be fed back to the speaker.

When only the voice reproduced from body-conducted sound is fed back to the speaker as described above, the feedback is heard through the speaker's ears alone. The speaker then feels the same sense of strangeness as when listening to a recording of his or her own voice. This is because people normally hear their own voice in the inner ear as a superposition of sound conducted through the skull and sound arriving through the ear, so voice heard through the ear alone differs from what they normally hear. With this preferred configuration, by contrast, the bone-conducted sound reproduced from body-conducted sound can be fed back to the speaker, so the speaker can speak using body-conducted sound without that sense of strangeness.

In a configuration where only voice is fed back to the speaker, the speaker's own voice may also overlap with the voice of the other party, making the other party difficult to hear. This problem does not arise when bone-conducted sound is fed back: because bone-conducted sound does not get mixed up with the other party's voice, smooth conversation is possible without hindering the speaker from hearing the other party.

Furthermore, in very quiet environments such as libraries, even sound leaking from headphones can be a problem, which makes feeding back the voice reproduced from body-conducted sound difficult in itself. This problem does not arise when bone-conducted sound is fed back. Bone-conducted sound does not leak, so the reproduced utterance can be fed back to the speaker regardless of the environment.

In this configuration as well, the user does not need to spend a long time beforehand producing monotonous, uninteresting utterances to collect learning data for the bone-conducted sound association model. Learning takes place during normal calls, which are generally considered the most common form of use, so the model can be learned sufficiently without burdening the user with learning work. In addition, because bone-conducted sound and body-conducted sound are collected simultaneously from normal utterance with vocal cord movement, more appropriate learning data can be collected than when the two are collected separately, and synchronizing them in time is easy.

Preferably, the voice reproduction device of the present invention further comprises an utterance state input unit that accepts an input operation indicating whether the current utterance is with or without vocal cord movement. When the input operation indicates utterance with vocal cord movement, the first transmission unit temporally synchronizes the body-conducted sound data collected by the body-conducted sound collection unit with the voice data collected at the same time by the voice collection unit and transmits them to the computer, and the voice playback unit does not play back voice. When the input operation indicates utterance without vocal cord movement, the first transmission unit transmits only the body-conducted sound data collected by the body-conducted sound collection unit to the computer, the first reception unit receives voice data, and the voice playback unit plays back voice from that voice data. Here, "transmits only the body-conducted sound data to the computer" means that neither voice data nor bone-conducted sound data is transmitted together with the body-conducted sound data. Transmitting the body-conducted sound data together with control data or the like still counts as transmitting "only the body-conducted sound data".
This makes it easy to switch between model learning and voice playback from body-conducted sound.
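The device-side behavior for the two utterance states can be sketched as below. This is a hedged illustration only: the type and function names (`UtteranceState`, `Payload`, `build_payload`, `playback_enabled`) are hypothetical and do not appear in the specification, and the sample lists stand in for already synchronized data.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class UtteranceState(Enum):
    VOICED = "with vocal cord movement"    # normal speech: collect learning data
    NAM = "without vocal cord movement"    # non-audible murmur: reproduce voice


@dataclass
class Payload:
    body_sound: List[float]
    voice: Optional[List[float]]  # None means "body-conducted sound only"


def build_payload(
    state: UtteranceState, body_sound: List[float], voice: List[float]
) -> Payload:
    """Decide what the first transmission unit sends to the computer."""
    if state is UtteranceState.VOICED:
        # Send both streams (assumed here to be time-synchronized already).
        return Payload(body_sound=body_sound, voice=voice)
    # Without vocal cord movement, only body-conducted sound is sent.
    return Payload(body_sound=body_sound, voice=None)


def playback_enabled(state: UtteranceState) -> bool:
    """The voice playback unit plays received voice only in the NAM state."""
    return state is UtteranceState.NAM
```

Under these assumptions, a single user-set state determines both what is transmitted and whether playback occurs, so switching modes requires no other processing on the device.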

Preferably, the voice reproduction device of the present invention further comprises an utterance state input unit that accepts an input operation indicating whether the current utterance is with or without vocal cord movement. When the input operation indicates utterance with vocal cord movement, the first transmission unit temporally synchronizes the body-conducted sound data collected by the body-conducted sound collection unit with the bone-conducted sound data collected at the same time by the bone conduction vibration unit and transmits them to the computer, and the bone conduction vibration unit does not play back bone-conducted sound. When the input operation indicates utterance without vocal cord movement, the first transmission unit transmits only the body-conducted sound data collected by the body-conducted sound collection unit to the computer, the first reception unit receives bone-conducted sound data, and the bone conduction vibration unit plays back bone-conducted sound from that data.
This makes it easy to switch between model learning and voice playback from body-conducted sound.

Preferably, the computer comprises: a second reception unit that receives data transmitted from the voice reproduction device; a first feature extraction unit that extracts features of body-conducted sound from body-conducted sound data; a second feature extraction unit that extracts features of voice from voice data; a voice learning unit that treats temporally synchronized body-conducted sound features and voice features as mutually corresponding learning data and, through a learning process, calculates the parameters of a voice association model representing the correspondence between arbitrary body-conducted sound features and arbitrary voice features; a voice association model application unit that uses the parameters calculated by the voice learning unit and body-conducted sound features to calculate the voice features corresponding to those body-conducted sound features; a voice restoration unit that generates voice data from the voice features calculated by the voice association model application unit; and a second transmission unit that transmits the voice data generated by the voice restoration unit to the voice reproduction device.
When the second reception unit receives temporally synchronized body-conducted sound data and voice data, the first feature extraction unit extracts body-conducted sound features from the received body-conducted sound data, the second feature extraction unit extracts voice features from the received voice data, and the voice learning unit calculates the parameters of the voice association model from these features. When the second reception unit receives only body-conducted sound data, the first feature extraction unit extracts body-conducted sound features from the received data, the voice association model application unit calculates the corresponding voice features from the extracted features and the parameters calculated by the voice learning unit, the voice restoration unit generates voice data from the calculated voice features, and the second transmission unit transmits the generated voice data to the voice reproduction device.

Here, the computer switches between the model learning process and the model application process depending on whether the second reception unit has received temporally synchronized body-conducted sound data and voice data or only body-conducted sound data. The process can thus be switched appropriately according to the speaker's utterance state without any complicated processing in the voice reproduction device.
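The computer-side switching described above can be sketched as follows. This is a toy illustration under stated assumptions, not the model of the specification: `ToyAssociationModel` is a hypothetical nearest-neighbor lookup standing in for the voice association model, each feature is reduced to a single number, and feature extraction and voice restoration are omitted.

```python
from typing import List, Optional, Tuple


class ToyAssociationModel:
    """Hypothetical stand-in for the voice association model.

    It stores (body feature, voice feature) pairs and answers queries with
    the voice feature of the nearest stored body feature.
    """

    def __init__(self) -> None:
        self.pairs: List[Tuple[float, float]] = []

    def learn(self, body_features: List[float], voice_features: List[float]) -> None:
        # Time-synchronized features correspond index by index.
        self.pairs.extend(zip(body_features, voice_features))

    def apply(self, body_features: List[float]) -> List[float]:
        if not self.pairs:
            raise RuntimeError("model has not been trained yet")
        return [min(self.pairs, key=lambda p: abs(p[0] - b))[1] for b in body_features]


def handle_received(
    model: ToyAssociationModel,
    body_features: List[float],
    voice_features: Optional[List[float]] = None,
) -> Optional[List[float]]:
    """Choose learning or application from what was received."""
    if voice_features is not None:
        # Both streams arrived: model learning process, nothing to send back.
        model.learn(body_features, voice_features)
        return None
    # Only body-conducted sound arrived: model application process.
    return model.apply(body_features)
```

In this sketch the presence or absence of voice features in the received data is the only mode signal, which mirrors the point that the voice reproduction device itself needs no extra processing to select the mode.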

Preferably, the computer comprises: a second reception unit that receives data transmitted from the voice reproduction device; a first feature extraction unit that extracts features of body-conducted sound from body-conducted sound data; a third feature extraction unit that extracts features of bone-conducted sound from bone-conducted sound data; a bone-conducted sound learning unit that treats temporally synchronized body-conducted sound features and bone-conducted sound features as mutually corresponding learning data and, through a learning process, calculates the parameters of a bone-conducted sound association model representing the correspondence between arbitrary body-conducted sound features and arbitrary bone-conducted sound features; a bone-conducted sound association model application unit that uses the parameters calculated by the bone-conducted sound learning unit and body-conducted sound features to calculate the bone-conducted sound features corresponding to those body-conducted sound features; a bone-conducted sound restoration unit that generates bone-conducted sound data from the bone-conducted sound features calculated by the bone-conducted sound association model application unit; and a second transmission unit that transmits the bone-conducted sound data generated by the bone-conducted sound restoration unit to the voice reproduction device.
When the second reception unit receives temporally synchronized body-conducted sound data and bone-conducted sound data, the first feature extraction unit extracts body-conducted sound features from the received body-conducted sound data, the third feature extraction unit extracts bone-conducted sound features from the received bone-conducted sound data, and the bone-conducted sound learning unit calculates the parameters of the bone-conducted sound association model from these features. When the second reception unit receives only body-conducted sound data, the first feature extraction unit extracts body-conducted sound features from the received data, the bone-conducted sound association model application unit calculates the corresponding bone-conducted sound features from the extracted features and the parameters calculated by the bone-conducted sound learning unit, the bone-conducted sound restoration unit generates bone-conducted sound data from the calculated features, and the second transmission unit transmits the generated bone-conducted sound data to the voice reproduction device.

Here, the computer switches between the model learning process and the model application process depending on whether the second reception unit has received temporally synchronized body-conducted sound data and bone-conducted sound data or only body-conducted sound data. The process can thus be switched appropriately according to the speaker's utterance state without any complicated processing in the voice reproduction device.

As described above, the present invention makes it possible to reproduce a user's individual voice from body-conducted sound without requiring burdensome work from the user.

The best mode for carrying out the present invention is described below with reference to the drawings.
[First Embodiment]
The first embodiment of the present invention is described below.
<Configuration>
FIG. 1A is a conceptual diagram illustrating the overall system configuration of this embodiment.

As illustrated in FIG. 1A, the system of this embodiment comprises a voice reproduction device 10 and a computer 20, which are connected via a connection line 30 so that they can exchange electrical signals. The voice reproduction device 10 has a speaker 11 such as headphones or earphones for playing back the voice reproduced from body-conducted sound (corresponding to the "voice playback unit"), a body-conducted sound microphone 12 for picking up body-conducted sound (corresponding to the "body-conducted sound collection unit"), a voice collection microphone 13 for picking up normal voice (corresponding to the "voice collection unit"), a main body 14 electrically connected to these, and a changeover switch 14a that accepts an input operation indicating whether the current utterance is with or without vocal cord movement (corresponding to the "utterance state input unit").

Here, the body-conducted sound microphone 12 is, for example, the body-surface-attached stethoscope-type microphone described in Non-Patent Document 2. For optimal pickup of body-conducted sound, it is desirable to attach the microphone so that part of the upper portion of its diaphragm rests on the bony part called the "mastoid process", located just behind the ear canal at the base of the speaker's skull (see Non-Patent Document 2). The computer 20 may be a general PC (Personal Computer) composed of a CPU (Central Processing Unit), RAM (Random-Access Memory), and the like, a portable device such as a mobile phone or PDA (Personal Digital Assistant) incorporating a CPU, RAM, and the like, or a dedicated device capable of the computation required by this embodiment. The voice reproduction device 10 and the computer 20 may be housed in separate casings or in the same casing. The connection line 30 may be any type suited to the transmission format of the voice and body-conducted sound, such as an audio cable, an optical fiber, or a network cable. The voice reproduction device 10 and the computer 20 may simply be connected digitally, or may be connected through network connection equipment such as a modem. Alternatively, the voice reproduction device 10 may be configured not to perform D/A or A/D conversion, with the voice reproduction device 10 and the computer 20 connected via D/A and A/D converters.

FIG. 1B is a conceptual diagram illustrating the configuration of the main body 14 of this embodiment.
As illustrated in FIG. 1B, the main body 14 of this embodiment includes the changeover switch 14a, a control unit 14b, switches 14c and 14d, A/D converters 14e and 14f, a synchronization unit 14g, a D/A converter 14h, an amplifier 14i, a transmission unit 14j, and a reception unit 14k. The control unit 14b and the synchronization unit 14g are implemented, for example, by loading a predetermined program into a known computer. The transmission unit 14j and the reception unit 14k are communication devices suited to the transmission format (for example, a network card or an optical transmission module). The A/D converters 14e and 14f may be implemented as a single physical circuit.

As illustrated in FIG. 1B, the reception unit 14k is electrically connected to the D/A converter 14h via the switch 14c, and the D/A converter 14h is electrically connected to the speaker 11 via the amplifier 14i. The transmission unit 14j is connected to the synchronization unit 14g. The synchronization unit 14g is electrically connected to the voice-collecting microphone 13 via the switch 14d and the A/D converter 14f, and to the body-conducted sound microphone 12 via the A/D converter 14e. The output signal of the changeover switch 14a can be input to the control unit 14b, and the control unit 14b can supply control signals to the switches 14c and 14d. The main body 14 executes each process under the control of the control unit 14b.
A predetermined program is loaded into the computer 20 of this embodiment, and each functional component is realized by the CPU executing that program. FIG. 2 is a block diagram illustrating the functional configuration of the computer 20 realized in this way.

As illustrated in FIG. 2, the computer 20 of this embodiment includes a reception unit 20a, a determination unit 20b, a first feature extraction unit 20c, a second feature extraction unit 20d, storage units 20e and 20g, a voice learning unit 20f, a voice association model application unit 20h, a voice restoration unit 20i, a transmission unit 20j, a temporary memory 20k, and a control unit 20m. The reception unit 20a and the transmission unit 20j are communication devices suited to the transmission format and operate under the control of the CPU. The storage units 20e and 20g and the temporary memory 20k are, for example, RAM, registers, a hard disk, or a storage area combining these. The determination unit 20b, the first feature extraction unit 20c, the second feature extraction unit 20d, the voice learning unit 20f, the voice association model application unit 20h, the voice restoration unit 20i, and the control unit 20m are implemented by executing a predetermined program on the CPU. The computer 20 executes each process under the control of the control unit 20m. Unless otherwise stated, data computed by each process is first stored in the temporary memory 20k and read out as needed.

<Operation of the voice reproduction device 10>
Next, the operation of the voice reproduction device 10 of this embodiment will be described.
The user sets the changeover switch 14a according to whether they will speak normally, with vocal cord movement, or speak without vocal cord vibration. The switching state of the changeover switch 14a is input to the control unit 14b as an electrical signal. The control unit 14b controls the switches 14c and 14d as follows, according to the switching state of the changeover switch 14a indicated by that signal.

That is, when the changeover switch 14a is set to the state indicating speech with vocal cord movement, the control unit 14b turns the switch 14c OFF and the switch 14d ON. When the changeover switch 14a is set to the state indicating speech without vocal cord movement, the control unit 14b turns the switch 14c ON and the switch 14d OFF. When an utterance with or without vocal cord movement is made in the corresponding state, the voice reproduction device 10 operates as follows.
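
The switch control described above can be summarized in a few lines of Python. This is only an illustrative sketch: the names `SwitchState` and `control_switches` are hypothetical and do not appear in the patent, and the real control unit 14b drives hardware switches rather than returning a value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwitchState:
    """Hypothetical container for the states of switches 14c and 14d."""
    sw_14c_on: bool  # playback path: reception unit 14k -> D/A converter 14h
    sw_14d_on: bool  # capture path: voice-collecting microphone 13 -> 14f


def control_switches(vocal_cord_movement: bool) -> SwitchState:
    """Mimic control unit 14b reacting to changeover switch 14a."""
    if vocal_cord_movement:
        # Normal speech: collect voice for learning, no playback.
        return SwitchState(sw_14c_on=False, sw_14d_on=True)
    # Speech without vocal cord movement: playback only, no voice capture.
    return SwitchState(sw_14c_on=True, sw_14d_on=False)
```

The two switches are always in opposite states, so the device is either collecting paired training data or playing back reproduced voice, never both.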

[Operation when speaking with vocal cord movement (switch 14c: OFF, switch 14d: ON)]
When the speaker speaks with vocal cord movement, the normal voice is picked up by the voice-collecting microphone 13 and converted into an analog electrical signal. At the same time, the in-body transmitted sound accompanying the utterance is picked up by the body-conducted sound microphone 12 and converted into an analog electrical signal. The analog signal of the in-body transmitted sound and the analog signal of the voice are converted into digital electrical signals by the A/D converters 14e and 14f, respectively, and input to the synchronization unit 14g. The synchronization unit 14g synchronizes the digital signal of the in-body transmitted sound and the digital signal of the voice in time and passes them to the transmission unit 14j. This synchronization is performed, for example, by associating the two digital signals with each other at each discrete time step, in the order in which they were input. Under the control of the control unit 14b, control data, such as data indicating the type of signal, is also attached to the time-synchronized digital signals. The transmission unit 14j transmits the digital signals it receives to the computer 20 over the connection line 30, either as plain digital data or as network data based on a specific protocol. Alternatively, the voice reproduction device 10 may omit the A/D converters 14e and 14f and transmit the time-synchronized analog signals of the in-body transmitted sound and the voice to the computer 20 as they are. In that case, the computer 20 converts the analog electrical signals into digital signals.
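
The sample-by-sample synchronization described above can be sketched as follows. The packet layout (a dictionary with `signal_type` and `frames` keys) and the function name `synchronize` are assumptions made for illustration; the patent only states that the two digital signals are associated at each discrete time step in input order and that control data indicating the signal type is attached.

```python
def synchronize(body_samples, voice_samples):
    """Pair body-conducted and voice samples in input order.

    Returns a hypothetical packet: control data plus time-aligned pairs.
    Samples without a partner (unequal lengths) are dropped in this sketch.
    """
    n = min(len(body_samples), len(voice_samples))
    pairs = list(zip(body_samples[:n], voice_samples[:n]))
    return {
        "signal_type": "body+voice",  # control data: both signals present
        "frames": pairs,              # (body sample, voice sample) per time step
    }
```

Because both microphones are sampled during the same utterance, pairing by index is sufficient here; no search for matching segments is involved.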

The data transmitted to the computer 20 in this way is used by the computer 20 to train a voice association model that relates body-conducted sound to voice. The processing of the computer 20 is described later. While the changeover switch 14a is set to the state indicating speech with vocal cord movement, the control unit 14b does not allow playback by the speaker 11 or the bone-conduction vibrator 115.

[Operation when speaking without vocal cord movement (switch 14c: ON, switch 14d: OFF)]
When the speaker speaks without vocal cord movement, the in-body transmitted sound accompanying the utterance is picked up by the body-conducted sound microphone 12 and converted into an analog electrical signal. The collected in-body transmitted sound is converted into a digital electrical signal by the A/D converter 14e and input to the synchronization unit 14g. Under the control of the control unit 14b, the synchronization unit 14g attaches control data, such as data indicating the type of signal, to the digital signal of the in-body transmitted sound and passes it to the transmission unit 14j. The transmission unit 14j transmits the digital signal it receives to the computer 20 over the connection line 30, either as plain digital data or as network data based on a specific protocol. Alternatively, the analog electrical signal of the collected in-body transmitted sound may be transmitted to the computer 20 as it is. In that case, the computer 20 converts the analog electrical signal into a digital electrical signal.

The computer 20 generates voice data from the received in-body transmitted sound data using the voice association model, and transmits the generated voice data to the voice reproduction device 10 as a digital electrical signal (the operation of the computer 20 is described later). The digital voice signal transmitted from the computer 20 is received by the reception unit 14k of the voice reproduction device 10 and input to the D/A converter 14h via the switch 14c. The D/A converter 14h converts the digital voice signal into an analog signal and feeds it to the speaker 11. The speaker 11 plays back the voice based on the analog signal at a volume set by the user or set in advance. If the computer 20 is configured to transmit an analog voice signal, the D/A converter 14h is unnecessary, and the speaker 11 plays back the voice directly from the transmitted analog signal.

<Operation of the computer 20>
Next, the operation of the computer 20 is illustrated.
FIG. 3 is a flowchart for explaining the operation of the computer 20 of this embodiment. The operation is described below with reference to this figure.
First, the electrical signal transmitted from the voice reproduction device 10 is received by the reception unit 20a (step S1). The received signal is converted, as necessary, into data suitable for processing by the computer 20; the converted data is buffered in the temporary memory 20k and sent to the determination unit 20b, for example frame by frame.

The determination unit 20b determines whether the received data contains voice data (step S2). It makes this determination, for example, by referring to the control data included in the received data. If the received data is determined to contain voice data, steps S3 to S6 below are executed; if it is determined not to contain voice data, steps S7 to S10 below are executed. This control is performed by the control unit 20m.
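
The branch at step S2 amounts to a lookup on the control data. A minimal sketch follows, assuming a hypothetical received-data layout in which the control data carries a boolean `contains_voice` flag; the patent does not specify the encoding of the control data.

```python
LEARNING_STEPS = ("S3", "S4", "S5", "S6")     # feature extraction and model learning
CONVERSION_STEPS = ("S7", "S8", "S9", "S10")  # feature conversion and voice restoration


def select_steps(received):
    """Mimic determination unit 20b: choose the processing path (step S2)."""
    # 'contains_voice' is an assumed field name for the signal-type control data.
    if received["control"]["contains_voice"]:
        return LEARNING_STEPS
    return CONVERSION_STEPS
```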

[When the received data is determined to contain voice data (S3 to S6)]
First, the first feature extraction unit 20c analyzes the in-body transmitted sound data passed from the determination unit 20b and extracts the feature quantities Y_j of the in-body transmitted sound (j = 1, ..., J, where J is a natural number) (step S3). The second feature extraction unit 20d analyzes the voice data passed from the determination unit 20b (the voice data time-synchronized with the in-body transmitted sound data of step S3) and extracts the feature quantities X_j of the voice (step S4). Examples of the feature quantities extracted here include the spectrum, the fundamental frequency, the sound source component, the aperiodic component, and their dynamic features (first-order and second-order time differences). Well-known methods such as LPC analysis, cepstrum analysis, or STRAIGHT analysis are used for the analysis. The natural number J is the number of feature types extracted, and j is an identifier for the type of feature.
Next, the feature quantities Y_j of the in-body transmitted sound extracted by the first feature extraction unit 20c and the feature quantities X_j of the voice extracted by the second feature extraction unit 20d are stored in the storage unit 20e in association with each other, for example frame by frame, where a frame is a predetermined time interval (step S5).
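
Steps S3 to S5 can be illustrated with a small sketch that frames two time-synchronized signals and pairs one feature per frame. The feature used here, log frame energy, is a deliberately simple stand-in chosen for illustration; it is not one of the analysis methods named in the text (LPC, cepstrum, STRAIGHT). The function names and the frame size and hop values are likewise assumptions.

```python
import math


def split_frames(signal, size, hop):
    """Cut a sample sequence into overlapping frames of `size` samples."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]


def log_energy(frame):
    """Stand-in feature: log of the mean squared amplitude of one frame."""
    return math.log(sum(s * s for s in frame) / len(frame) + 1e-12)


def paired_features(body, voice, size=4, hop=2):
    """Return [(Y, X), ...]: body and voice features paired frame by frame."""
    y = [log_energy(f) for f in split_frames(body, size, hop)]
    x = [log_energy(f) for f in split_frames(voice, size, hop)]
    # The signals are time-synchronized, so frame k of one matches frame k of the other.
    return list(zip(y, x))
```

Since the two signals share a time axis, the pairing is by frame index alone, which is the property the learning step relies on.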

Next, the voice learning unit 20f reads the mutually associated feature quantities Y_j of the in-body transmitted sound and X_j of the voice from the storage unit 20e, treats them as corresponding training data, and through a learning process computes the parameters of a voice association model that expresses the correspondence between the feature quantity of an arbitrary in-body transmitted sound and that of an arbitrary voice (step S6). This processing is carried out separately for each feature type (for each j), and the computed model parameters are stored in the storage unit 20g for each j. As the voice association model, a Gaussian mixture model (GMM) can be used, for example, as in Non-Patent Document 1. In the method of Non-Patent Document 1, however, the two feature quantities are aligned using dynamic programming, whereas in this embodiment the two are time-synchronized data and no alignment by dynamic programming is needed. This is another characteristic of this embodiment. Because no such alignment is required, the computation time on the computer 20 is reduced, and the embodiment also avoids the problem in which alignment errors increase the variance of the model and consequently degrade the quality of the reproduced voice. Step S6 need not be executed every time; it may instead be executed each time a predetermined number of feature pairs X_j, Y_j has been collected.

[Specific example of model learning]
A specific example of model learning is shown below.
The voice association model in this example is the following GMM.

p_j(X_j, Y_j) = Σ_{m=1}^{M} λ_m^j・N(X_j, Y_j ; M_m^j, Σ_m^j) …(1)

Here, the feature quantities X_j and Y_j are each vectors, p_j(X_j, Y_j) is the joint probability density of X_j and Y_j, M is a natural number indicating the number of normal distributions that make up the GMM, N(X_j, Y_j ; M_m^j, Σ_m^j) denotes the m-th (m = 1, ..., M) normal distribution, and λ_m^j denotes the weight of that m-th normal distribution. M_m^j denotes the set consisting of the m-th mean vector μ_m^Xj of X_j and the m-th mean vector μ_m^Yj of Y_j. Σ_m^j denotes the set consisting of the covariance matrix Σ_m^Xj of X_j corresponding to μ_m^Xj, the covariance matrix Σ_m^Yj of Y_j corresponding to μ_m^Yj, and the covariance matrix Σ_m^YjXj for the pair of X_j (corresponding to μ_m^Xj) and Y_j (corresponding to μ_m^Yj).

The well-known EM algorithm is generally used to train a GMM (see, for example, K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi and T. Kitamura, "Speech parameter generation algorithm for HMM-based speech synthesis", Proc. ICASSP'2000, pp. 1315-1318). A GMM training method using the EM algorithm is illustrated below.
First, the following relationship can be defined for the feature quantities X_j and Y_j.

E[X_j | Y_j] = Σ_{m=1}^{M} P(c_m | Y_j)・(W_j^m・Y_j + b_j^m) …(2)

Here, E denotes the expected value, W_j^m is the m-th transformation matrix, and b_j^m is the m-th bias constant. P(c_m | Y_j) is the probability that the feature quantity Y_j arises from the m-th normal distribution N(X_j, Y_j ; M_m^j, Σ_m^j), and is expressed as follows.

P(c_m | Y_j) = λ_m^j・N(Y_j ; μ_m^Yj, Σ_m^Yj) / Σ_{n=1}^{M} λ_n^j・N(Y_j ; μ_n^Yj, Σ_n^Yj) …(3)

The Q function of the EM algorithm is then given by equation (4).

In the E step of the EM algorithm, Q is computed by evaluating equation (4) for the parameters λ_m^j, M_m^j, and Σ_m^j. The parameter M_m^j is a value determined from the multiple feature quantities X_j, Y_j that form the training data (which feature quantities X_j, Y_j correspond to which m is unknown). Once the parameter M_m^j is determined, the parameter Σ_m^j is determined as well.

In the M step of the EM algorithm, the value of λ_m^j that maximizes Q, as computed by equation (4) with the parameters M_m^j and Σ_m^j used in the E step substituted in, is taken as the value of λ_m^j for the next E step. Specifically, for example, equation (4), with M_m^j and Σ_m^j from the E step substituted and λ_m^j treated as the variable, is partially differentiated with respect to λ_m^j, and the λ_m^j for which the partial derivative is zero becomes the value of λ_m^j for the next E step. In addition, in this M step, the parameters M_m^j and Σ_m^j are determined from the training feature quantities X_j, Y_j so that Q is maximized in equation (4) with the λ_m^j used in the E step substituted in. These parameters are determined, for example, by partial differentiation with respect to the variables on which M_m^j depends. That is, the parameters M_m^j and Σ_m^j are first expressed through a function F that associates the training feature quantities X_j, Y_j with m. Equation (4), with the parameters M_m^j and Σ_m^j expressed in this way and the λ_m^j used in the E step substituted in, is then partially differentiated with respect to the variables of F; the values of those variables for which the partial derivatives are zero are found, and these determine the parameters M_m^j and Σ_m^j for the next E step.

The E step and M step described above are repeated until Q converges to within a predetermined range, and the parameters λ_m^j, M_m^j, and Σ_m^j at convergence are adopted as the model parameters of the GMM.
Once the parameters λ_m^j, M_m^j, and Σ_m^j have been determined, comparing equation (2) with equation (4) yields the transformation matrix W_j^m and the bias constant b_j^m as
W_j^m = Σ_m^YjXj・(Σ_m^Yj)^-1 …(5)
b_j^m = μ_m^Xj - Σ_m^YjXj・(Σ_m^Yj)^-1・μ_m^Yj …(6)
(end of the description of [Specific example of model learning]).
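
For readers who want to see the E step and M step in executable form, the following sketch trains a joint GMM on paired scalar features (x, y). It uses the standard textbook EM updates (responsibilities, then weighted means and covariances) and stops when the log-likelihood stops changing; it is not claimed to reproduce the exact Q function of equation (4) or the function-F formulation in the text. Scalar features, the variance floor, and all function names are assumptions made to keep the example short.

```python
import math


def gauss2(x, y, mu, cov):
    """Density of a 2-D normal at (x, y); cov = (s_xx, s_xy, s_yy)."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = x - mu[0], y - mu[1]
    quad = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def em_step(data, weights, means, covs, floor=1e-6):
    """One E step plus one M step; returns updated parameters and log-likelihood."""
    resp, loglik = [], 0.0
    for x, y in data:  # E step: responsibility of each component for each pair
        p = [w * gauss2(x, y, mu, c) for w, mu, c in zip(weights, means, covs)]
        total = sum(p)
        loglik += math.log(total)
        resp.append([v / total for v in p])
    new_w, new_mu, new_cov = [], [], []
    for m in range(len(weights)):  # M step: re-estimate weight, mean, covariance
        r = [row[m] for row in resp]
        nm = sum(r)
        mx = sum(ri * x for ri, (x, _) in zip(r, data)) / nm
        my = sum(ri * y for ri, (_, y) in zip(r, data)) / nm
        sxx = sum(ri * (x - mx) ** 2 for ri, (x, _) in zip(r, data)) / nm + floor
        syy = sum(ri * (y - my) ** 2 for ri, (_, y) in zip(r, data)) / nm + floor
        sxy = sum(ri * (x - mx) * (y - my) for ri, (x, y) in zip(r, data)) / nm
        new_w.append(nm / len(data))
        new_mu.append((mx, my))
        new_cov.append((sxx, sxy, syy))
    return new_w, new_mu, new_cov, loglik


def fit(data, weights, means, covs, tol=1e-9, max_iter=100):
    """Repeat EM steps until the log-likelihood changes by less than tol."""
    prev = None
    for _ in range(max_iter):
        weights, means, covs, loglik = em_step(data, weights, means, covs)
        if prev is not None and abs(loglik - prev) < tol:
            break
        prev = loglik
    return weights, means, covs
```

The small `floor` added to the variances keeps each covariance invertible when a component collapses onto a few points; it is a common practical safeguard rather than part of the method described above.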

[When the received data is determined not to contain voice data (S7 to S10)]
First, the first feature extraction unit 20c analyzes the in-body transmitted sound data passed from the determination unit 20b and extracts the feature quantities Y_j' of the in-body transmitted sound (step S7). The extracted feature quantities Y_j' are passed to the voice association model application unit 20h. Using the parameters of the voice association model read from the storage unit 20g and the feature quantities Y_j', the voice association model application unit 20h computes, for each j, the voice feature quantity X_j' corresponding to Y_j' (step S8). Specifically, for example, the results of equations (5) and (6) above and the feature quantity Y_j' are substituted into equation (2), and the result is taken as X_j'.
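
Step S8 can be sketched for the scalar case, where the matrices in equations (5) and (6) reduce to ordinary numbers. The code below assumes the conversion is the posterior-weighted sum of per-component linear maps, with the posterior computed from the marginal density of the body-conducted feature; the function names and the parameter layout (`means[m] = (mu_x, mu_y)`, `covs[m] = (s_xx, s_xy, s_yy)`) are illustrative choices, not taken from the patent.

```python
import math


def normal_pdf(y, mean, var):
    """Density of a 1-D normal distribution."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def convert(y, weights, means, covs):
    """Estimate the voice feature x from the body-conducted feature y.

    means[m] = (mu_x, mu_y); covs[m] = (s_xx, s_xy, s_yy) for component m.
    """
    # Unnormalized posterior of each component given y alone.
    post = [w * normal_pdf(y, mu[1], c[2]) for w, mu, c in zip(weights, means, covs)]
    total = sum(post)
    x_hat = 0.0
    for p, mu, c in zip(post, means, covs):
        w_m = c[1] / c[2]           # equation (5), scalar case
        b_m = mu[0] - w_m * mu[1]   # equation (6), scalar case
        x_hat += (p / total) * (w_m * y + b_m)  # equation (2)
    return x_hat
```

With a single component the result is the familiar linear regression of x on y; with several components the posterior selects, softly, which regression line applies to the observed y.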

The computed voice feature quantities X_j' are passed to the voice restoration unit 20i, which synthesizes a speech waveform from them and generates voice data (step S9). The waveform is synthesized by the synthesis method that is the counterpart of the analysis method used to extract the voice feature quantities. For example, if the second feature extraction unit 20d extracts features by LPC analysis, the voice restoration unit 20i synthesizes by LPC synthesis. If the second feature extraction unit 20d extracts features by cepstrum analysis, the voice restoration unit 20i synthesizes by cepstrum synthesis. If the second feature extraction unit 20d extracts features by STRAIGHT analysis, the voice restoration unit 20i synthesizes by STRAIGHT synthesis.
The voice data generated by the voice restoration unit 20i is transmitted from the transmission unit 20j to the voice reproduction device 10 and played back by the speaker 11 of the voice reproduction device 10 as described above.
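
The requirement that synthesis be the counterpart of analysis can be illustrated with LPC, where the analysis (inverse) filter and the all-pole synthesis filter undo each other. In this sketch the prediction coefficients are assumed to be given; estimating them from speech, and deriving them from the converted feature quantities, is outside the example. The function names are hypothetical.

```python
def lpc_residual(signal, coeffs):
    """Analysis filter: e[n] = s[n] - sum_k a_k * s[n-k]."""
    out = []
    for n, s in enumerate(signal):
        pred = sum(a * signal[n - k]
                   for k, a in enumerate(coeffs, start=1) if n - k >= 0)
        out.append(s - pred)
    return out


def lpc_synthesize(excitation, coeffs):
    """All-pole synthesis filter: s[n] = e[n] + sum_k a_k * s[n-k]."""
    out = []
    for n, e in enumerate(excitation):
        pred = sum(a * out[n - k]
                   for k, a in enumerate(coeffs, start=1) if n - k >= 0)
        out.append(e + pred)
    return out
```

Feeding the residual from `lpc_residual` into `lpc_synthesize` with the same coefficients returns the original samples, which is why features obtained with one analysis method must be resynthesized with the matching synthesis method.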

[Second Embodiment]
Next, another embodiment of the present invention is described. In this embodiment, not only the voice but also the bone-conducted sound is reproduced from the body-conducted sound. The description below focuses on the differences from the first embodiment and abbreviates matters common to both.

<Configuration>
FIG. 4 is a conceptual diagram illustrating the overall configuration of the system of this embodiment. In FIG. 4, parts common to the first embodiment are given the same reference numerals as in FIG. 1A.
As illustrated in FIG. 4, the system of this embodiment comprises a voice reproduction device 110 and a computer 120, which are connected via the connection line 30 so that they can exchange electrical signals. The voice reproduction device 110 has a bone-conduction vibrator (corresponding to the "bone-conduction vibration unit") that both picks up the bone-conducted sound caused by an utterance and plays back bone-conducted sound, the speaker 11 (corresponding to the "voice reproduction unit"), the body-conducted sound microphone 12 (corresponding to the "body-conducted sound collection unit"), the voice-collecting microphone 13 (corresponding to the "voice collection unit"), the main body 14, and the changeover switch 14a (corresponding to the "speech state input unit"). As in the first embodiment, the computer 120 may be a general PC, a portable device with a built-in CPU, RAM and the like, or a dedicated device capable of the computation required by this embodiment. The voice reproduction device 110 and the computer 120 may be housed in separate casings or in the same casing. Alternatively, the voice reproduction device 110 may be configured not to perform D/A or A/D conversion, in which case the voice reproduction device 110 and the computer 120 are connected via a D/A converter and an A/D converter.

FIG. 5 is a conceptual diagram illustrating the configuration of the main body 114 of this embodiment. In FIG. 5, parts common to the first embodiment are given the same reference numerals as in FIG. 1B.
As illustrated in FIG. 5, the main body 114 of this embodiment includes the changeover switch 14a, the control unit 14b, switches 14c, 14d, 114a and 114b, A/D converters 14e, 14f and 114c, the synchronization unit 14g, D/A converters 14h and 114d, the amplifier 14i, the transmission unit 14j, and the reception unit 14k. The A/D converters 14e, 14f and 114c may be implemented as a single physical circuit, and the D/A converters 14h and 114d may likewise be implemented as a single physical circuit.

As illustrated in FIG. 5, the reception unit 14k is electrically connected to the D/A converter 114d via the switch 114b, and the D/A converter 114d is electrically connected to the bone-conduction vibrator 115. The reception unit 14k is also electrically connected to the D/A converter 14h via the switch 14c, and the D/A converter 14h is electrically connected to the speaker 11 via the amplifier 14i. The transmission unit 14j is connected to the synchronization unit 14g. The synchronization unit 14g is electrically connected to the bone-conduction vibrator 115 via the switch 114a and the A/D converter 114c, to the voice-collecting microphone 13 via the switch 14d and the A/D converter 14f, and to the body-conducted sound microphone 12 via the A/D converter 14e. The output signal of the changeover switch 14a can be input to the control unit 14b, and the control unit 14b can supply control signals to the switches 14c, 14d, 114a and 114b. The main body 114 executes each process under the control of the control unit 14b.

A predetermined program is loaded into the computer 120 of this embodiment, and each functional component is realized by the CPU executing that program. FIG. 6 is a block diagram illustrating the functional configuration of the computer 120 realized in this way. In FIG. 6, parts shared with the first embodiment carry the same reference numerals as in FIG. 2.

As illustrated in FIG. 6, the computer 120 of this embodiment includes a reception unit 20a, a determination unit 20b, a first feature quantity extraction unit 20c, a second feature quantity extraction unit 20d, a third feature quantity extraction unit 120d, storage units 120e and 120g, a voice learning unit 20f, a bone conduction sound learning unit 120f, a voice association model application unit 20h, a bone conduction sound association model application unit 120h, a voice restoration unit 20i, a bone conduction sound restoration unit 120i, a transmission unit 20j, a temporary memory 20k, and a control unit 20m. The third feature quantity extraction unit 120d, the bone conduction sound learning unit 120f, the bone conduction sound association model application unit 120h, and the bone conduction sound restoration unit 120i are realized by executing a predetermined program on the CPU. The computer 120 executes each process under the control of the control unit 20m.

<Operation of the voice reproduction device 110>
Next, the operation of the voice reproduction device 110 of this embodiment will be described.
The user sets the changeover switch 14a according to whether the utterance will be a normal one with vocal cord movement or one without vocal cord vibration. The switching state of the changeover switch 14a is input to the control unit 14b as an electrical signal. According to the switching state indicated by that signal, the control unit 14b controls the switches 14c, 14d, 114a, and 114b as follows.

That is, when the changeover switch 14a is set to the state indicating an utterance with vocal cord movement, the control unit 14b turns the switch 14c OFF, the switch 14d ON, the switch 114a ON, and the switch 114b OFF. Conversely, when the changeover switch 14a is set to the state indicating an utterance without vocal cord movement, the control unit 14b turns the switch 14c ON, the switch 14d OFF, the switch 114a OFF, and the switch 114b ON. When an utterance with or without vocal cord movement is made in the corresponding state, the voice reproduction device 110 operates as follows.
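As a compact restatement of the two switch configurations just described, here is a minimal Python sketch. The names `SwitchStates` and `switch_states` are hypothetical and do not appear in the patent; the sketch only encodes the ON/OFF table and the signal paths given for FIG. 5, not any actual control-unit implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwitchStates:
    """ON (True) / OFF (False) state of each switch driven by control unit 14b."""
    sw_14c: bool   # path from reception unit 14k to D/A converter 14h (speaker 11)
    sw_14d: bool   # path between voice collection microphone 13 and sync unit 14g
    sw_114a: bool  # path between bone conduction vibrator 115 and sync unit 14g
    sw_114b: bool  # path from reception unit 14k to D/A converter 114d (vibrator 115)


def switch_states(with_vocal_cord_movement: bool) -> SwitchStates:
    """Map the mode selected on changeover switch 14a to the four switch states."""
    if with_vocal_cord_movement:
        # Learning mode: capture voice and bone conduction sound, no playback.
        return SwitchStates(sw_14c=False, sw_14d=True, sw_114a=True, sw_114b=False)
    # Conversion mode: capture only body-transmitted sound, enable both playback paths.
    return SwitchStates(sw_14c=True, sw_14d=False, sw_114a=False, sw_114b=True)
```

In both modes the body conduction sound microphone 12 stays connected, which is why no switch for it appears in the table.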

[Operation during an utterance with vocal cord movement (switch 14c: OFF, switch 14d: ON, switch 114a: ON, switch 114b: OFF)]
When the speaker utters with vocal cord movement, the normal voice is picked up by the voice collection microphone 13 and converted into an analog electrical signal. At the same time, the bone conduction sound accompanying the utterance is picked up by the bone conduction vibrator 115 and converted into an analog electrical signal, and the body-transmitted sound is picked up by the body conduction sound microphone 12 and converted into an analog electrical signal. The analog signals of the body-transmitted sound, the bone conduction sound, and the voice are converted into digital electrical signals by the A/D converters 14e, 114c, and 14f, respectively, and input to the synchronization unit 14g. The synchronization unit 14g synchronizes the three digital signals in time and sends them to the transmission unit 14j. As in the first embodiment, control data, for example data indicating the signal type, is also added to these signals under the control of the control unit 14b. The transmission unit 14j transmits the digital electrical signals to the computer 120 via the connection line 30, either as digital data or as network data based on a specific protocol. As in the modification described for the first embodiment, the collected body-transmitted sound, bone conduction sound, and voice may instead be transmitted to the computer 120 as analog signals, in which case the computer 120 performs the analog-to-digital conversion.

The data transmitted to the computer 120 in this way is used by the computer 120 to learn a voice association model that associates body-transmitted sound with voice and a bone conduction sound association model that associates body-transmitted sound with bone conduction sound. The processing of the computer 120 is described later. While the changeover switch 14a is set to the state indicating an utterance with vocal cord movement, the control unit 14b does not cause the speaker 11 to perform reproduction.

[Operation during an utterance without vocal cord movement (switch 14c: ON, switch 14d: OFF, switch 114a: OFF, switch 114b: ON)]
When the speaker utters without vocal cord movement, the body-transmitted sound accompanying the utterance is picked up by the body conduction sound microphone 12 and converted into an analog electrical signal. This signal is converted into a digital electrical signal by the A/D converter 14e and input to the synchronization unit 14g. Under the control of the control unit 14b, the synchronization unit 14g adds control data, for example data indicating the signal type, to the digital signal of the body-transmitted sound and sends it to the transmission unit 14j. The transmission unit 14j transmits the digital electrical signal to the computer 120 via the connection line 30, either as digital data or as network data based on a specific protocol. The analog electrical signal of the collected body-transmitted sound may instead be transmitted to the computer 120 as it is, in which case the computer 120 performs the analog-to-digital conversion.

For the received body-transmitted sound data, the computer 120 generates voice data using the voice association model and bone conduction sound data using the bone conduction sound association model, and transmits the generated data to the voice reproduction device 110 as digital electrical signals (the operation of the computer 120 is described later). The digital electrical signals of the voice and the bone conduction sound are received by the reception unit 14k of the voice reproduction device 110. The digital voice signal is input to the D/A converter 14h via the switch 14c; the D/A converter 14h converts it into an analog signal and inputs it to the speaker 11. The digital bone conduction sound signal is input to the D/A converter 114d via the switch 114b; the D/A converter 114d converts it into an analog signal and inputs it to the bone conduction vibrator 115. The speaker 11 reproduces voice based on the input analog voice signal at a volume set by the user or set in advance, and the bone conduction vibrator 115 reproduces bone conduction sound based on the input analog bone conduction sound signal.
If the computer 120 is configured to transmit the bone conduction sound as an analog signal, the D/A converter 114d is unnecessary, and the bone conduction vibrator 115 reproduces the bone conduction sound directly from the transmitted analog signal.

<Operation of the computer 120>
Next, the operation of the computer 120 is illustrated. FIG. 7 is a flowchart for explaining the operation of the computer 120 of this embodiment, and the description below follows that figure.
First, the electrical signal transmitted from the voice reproduction device 110 is received by the reception unit 20a (step S21). The received signal is converted, as necessary, into data suitable for processing by the computer 120; the converted data is buffered in the temporary memory 20k and sent to the determination unit 20b, for example frame by frame.
The determination unit 20b determines whether the received data includes voice data or bone conduction sound data (step S22). If it does, steps S23 to S29 below are executed; if the received data is determined not to include voice data, steps S30 to S35 below are executed. This control is performed by the control unit 20m.
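The branch taken at step S22 can be pictured with a small Python sketch. It assumes, purely for illustration, that each received frame is a dictionary keyed by stream name (`"body"`, `"voice"`, `"bone"`); the patent only says that control data such as a signal type accompanies the signals, so this layout and the function name `select_branch` are hypothetical.

```python
def select_branch(frame: dict) -> str:
    """Sketch of step S22: pick the learning path or the conversion path.

    A frame that carries voice or bone conduction sound data came from an
    utterance with vocal cord movement and is used for learning (S23 to S29).
    A frame with only body-transmitted sound is converted (S30 to S35).
    """
    has_reference = "voice" in frame or "bone" in frame
    return "learn" if has_reference else "convert"
```

For example, `select_branch({"body": [0.1, 0.2]})` selects the conversion path, while a frame that also carries `"voice"` and `"bone"` entries selects the learning path.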

[When the received data is determined to include voice data (S23 to S29)]
First, the first feature quantity extraction unit 20c analyzes the body-transmitted sound data transferred from the determination unit 20b and extracts the feature quantity (Y_j) of the body-transmitted sound (step S23). The second feature quantity extraction unit 20d analyzes the voice data transferred from the determination unit 20b (the voice data temporally synchronized with the body-transmitted sound data of step S23) and extracts the feature quantity (X_j) of the voice (step S24). Further, the third feature quantity extraction unit 120d analyzes the bone conduction sound data transferred from the determination unit 20b (the bone conduction sound data temporally synchronized with the body-transmitted sound data of step S23) and extracts the feature quantity (Z_j) of the bone conduction sound (step S25). The feature quantities extracted here are the same as those exemplified in the first embodiment.

Next, the feature quantity (Y_j) of the body-transmitted sound extracted by the first feature quantity extraction unit 20c and the feature quantity (X_j) of the voice extracted by the second feature quantity extraction unit 20d are associated with each other, for example frame by frame (a frame being a predetermined time interval), and stored in the storage unit 120e (step S26). Likewise, the feature quantity (Y_j) of the body-transmitted sound extracted by the first feature quantity extraction unit 20c and the feature quantity (Z_j) of the bone conduction sound extracted by the third feature quantity extraction unit 120d are associated with each other frame by frame and stored in the storage unit 120e (step S27).
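The frame-by-frame association of steps S26 and S27 amounts to pairing entries that share the same frame index. The following Python sketch illustrates this under the assumption that each stream has already been reduced to one feature value per frame; the function name `store_pairs` and the dictionary standing in for the storage unit 120e are hypothetical.

```python
def store_pairs(store, body_feats, voice_feats, bone_feats):
    """Sketch of steps S26 and S27: pair synchronized features by frame index.

    store is a dict with the lists "voice_pairs" and "bone_pairs", standing in
    for the storage unit 120e. Each *_feats argument holds one feature per frame.
    """
    if not (len(body_feats) == len(voice_feats) == len(bone_feats)):
        raise ValueError("streams must be synchronized to the same number of frames")
    for y, x, z in zip(body_feats, voice_feats, bone_feats):
        store["voice_pairs"].append((y, x))  # (Y_j, X_j) for the voice association model
        store["bone_pairs"].append((y, z))   # (Y_j, Z_j) for the bone conduction model
    return store
```

Because the streams arrive already synchronized, the pairing is a plain index match and needs no separate alignment pass.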

Next, the voice learning unit 20f reads the mutually associated feature quantities (Y_j) of the body-transmitted sound and (X_j) of the voice from the storage unit 120e, treats them as corresponding learning data, and calculates by a learning process the parameters of the voice association model, which expresses the correspondence between the feature quantity of an arbitrary body-transmitted sound and that of an arbitrary voice (step S28). Similarly, the bone conduction sound learning unit 120f reads the mutually associated feature quantities (Y_j) of the body-transmitted sound and (Z_j) of the bone conduction sound from the storage unit 120e, treats them as corresponding learning data, and calculates by a learning process the parameters of the bone conduction sound association model, which expresses the correspondence between the feature quantity of an arbitrary body-transmitted sound and that of an arbitrary bone conduction sound (step S29). The bone conduction sound association model is, for example, a Gaussian mixture model (GMM), as in Non-Patent Document 1, and its learning method is, for example, as described in the first embodiment.
As in the first embodiment, learning the bone conduction sound association model in this embodiment does not require aligning the two feature sequences by dynamic programming. This shortens the computation time on the computer 120 and avoids the problem that alignment errors increase the variance of the model and consequently degrade the quality of the reproduced bone conduction sound. Steps S28 and S29 need not be executed every time; they may be executed each time a predetermined number of pairs of feature quantities X_j, Y_j and Z_j, Y_j have been collected.
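For orientation, one common way to write a GMM association model is as a joint density over the frame-paired feature quantities. The formulation below is an assumption borrowed from standard GMM-based feature mapping, not a transcription of the first embodiment; the symbols M, w_m, \mu_m, \Sigma_m, and \lambda are introduced here for illustration only.

```latex
% Joint-density GMM over paired frames (Y_j, Z_j); all model symbols are illustrative.
\[
p(Y_j, Z_j \mid \lambda) = \sum_{m=1}^{M} w_m \,
  \mathcal{N}\!\left(
    \begin{bmatrix} Y_j \\ Z_j \end{bmatrix};
    \begin{bmatrix} \mu_m^{(Y)} \\ \mu_m^{(Z)} \end{bmatrix},
    \begin{bmatrix}
      \Sigma_m^{(YY)} & \Sigma_m^{(YZ)} \\
      \Sigma_m^{(ZY)} & \Sigma_m^{(ZZ)}
    \end{bmatrix}
  \right),
\qquad
\hat{\lambda} = \arg\max_{\lambda} \prod_{j} p(Y_j, Z_j \mid \lambda)
\]
```

Because the pairs (Y_j, Z_j) are already synchronized frame by frame, a likelihood of this form can be evaluated on them directly, with maximization typically carried out by the EM algorithm. The same form would apply to the voice association model with X_j in place of Z_j.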

[When the received data is determined not to include voice data (S30 to S35)]
First, the first feature quantity extraction unit 20c analyzes the body-transmitted sound data transferred from the determination unit 20b and extracts the feature quantity (Y_j') of the body-transmitted sound (step S30). The extracted feature quantity (Y_j') is transferred to the voice association model application unit 20h and the bone conduction sound association model application unit 120h.

As in the first embodiment, the voice association model application unit 20h uses the parameters of the voice association model read from the storage unit 120g and the feature quantity (Y_j') of the body-transmitted sound to calculate, for each j, the voice feature quantity (X_j') corresponding to (Y_j') (step S31). The calculated voice feature quantity (X_j') is transferred to the voice restoration unit 20i, which synthesizes a voice waveform from it and generates voice data (step S32).

Likewise, the bone conduction sound association model application unit 120h uses the parameters of the bone conduction sound association model read from the storage unit 120g and the feature quantity (Y_j') of the body-transmitted sound to calculate, for each j, the bone conduction sound feature quantity (Z_j') corresponding to (Y_j') (step S33). The calculated bone conduction sound feature quantity (Z_j') is transferred to the bone conduction sound restoration unit 120i, which synthesizes a bone conduction sound waveform from it and generates bone conduction sound data (step S34).
The voice data generated by the voice restoration unit 20i and the bone conduction sound data generated by the bone conduction sound restoration unit 120i are transmitted from the transmission unit 20j to the voice reproduction device 110 and reproduced by the speaker 11 and the bone conduction vibrator 115 as described above.
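To make the model application in steps S31 and S33 concrete, the sketch below maps one body-transmitted sound feature to a target feature as the conditional expectation under a joint GMM. This is an assumed, commonly used conversion rule, restricted to scalar features for brevity; the patent defers the actual method to the first embodiment. The function names and the parameter keys (`w`, `mu_y`, `mu_t`, `var_yy`, `cov_ty`) are hypothetical, and waveform synthesis (steps S32 and S34) is not covered.

```python
import math


def gaussian_pdf(value: float, mean: float, var: float) -> float:
    """Density of a one-dimensional normal distribution."""
    return math.exp(-0.5 * (value - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def convert_frame(y: float, components: list) -> float:
    """Estimate the target feature (X_j' or Z_j') from the source feature Y_j'.

    Each component is a dict with mixture weight "w", source mean "mu_y",
    target mean "mu_t", source variance "var_yy", and cross-covariance "cov_ty".
    The result is the posterior-weighted sum of per-component conditional means.
    """
    likelihoods = [c["w"] * gaussian_pdf(y, c["mu_y"], c["var_yy"]) for c in components]
    total = sum(likelihoods)
    if total == 0.0:
        raise ValueError("source feature has zero likelihood under every component")
    estimate = 0.0
    for like, c in zip(likelihoods, components):
        # Conditional mean of the target given y within this component.
        cond_mean = c["mu_t"] + c["cov_ty"] / c["var_yy"] * (y - c["mu_y"])
        estimate += (like / total) * cond_mean
    return estimate
```

The same function serves both models: called with the voice association parameters it yields X_j', and with the bone conduction sound association parameters it yields Z_j'.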

[Modifications, etc.]
The present invention is not limited to the embodiments described above. For example, in the first and second embodiments, the switch 14c may be configured so that it can be turned OFF during an utterance without vocal cord movement. The system can then be used even in quiet environments where leakage of the feedback sound would be a problem. This configuration is particularly effective in the second embodiment, where feedback by bone conduction sound remains available even without voice feedback.

The various processes described above are not only executed in time series as described but may also be executed in parallel or individually, according to the processing capability of the executing device or as needed. Needless to say, other changes may be made as appropriate without departing from the spirit of the present invention.
When the above configuration is realized by a computer, the processing content of the functions that each device should have is described by a program. Executing this program on the computer realizes the above processing functions on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. Any computer-readable recording medium may be used, such as a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing a process, the computer reads the program stored in its own recording medium and executes processing according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program each time the program is transferred to it from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to this computer and the processing functions are realized only through execution instructions and result acquisition. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

Industrial applications of the present invention include, for example, silent telephones using non-audible murmur (NAM).

FIG. 1(a) is a conceptual diagram illustrating the configuration of the entire system of the first embodiment. FIG. 1(b) is a conceptual diagram illustrating the configuration of the main body of the first embodiment.
FIG. 2 is a block diagram illustrating the functional configuration of the computer of the first embodiment.
FIG. 3 is a flowchart for explaining the operation of the computer of the first embodiment.
FIG. 4 is a conceptual diagram illustrating the configuration of the entire system of the second embodiment.
FIG. 5 is a conceptual diagram illustrating the configuration of the main body of the second embodiment.
FIG. 6 is a block diagram illustrating the functional configuration of the computer of the second embodiment.
FIG. 7 is a flowchart for explaining the operation of the computer of the second embodiment.

Explanation of symbols

10, 100 Voice reproduction device
20, 120 Computer

Claims

A voice reproduction device that reproduces voice from body-transmitted sound, the device comprising:
a body-transmitted sound collection unit that collects body-transmitted sound resulting from an utterance;
a voice collection unit that collects voice;
a first transmission unit that, during an utterance with vocal cord movement, temporally synchronizes the body-transmitted sound data collected by the body-transmitted sound collection unit with the voice data collected at the same time by the voice collection unit and transmits them to a computer that learns a voice association model associating body-transmitted sound with voice, and that, during an utterance without vocal cord movement, transmits the body-transmitted sound data collected by the body-transmitted sound collection unit to the computer;
a first reception unit that receives voice data generated by the computer, using the voice association model, from the body-transmitted sound data collected by the body-transmitted sound collection unit during the utterance without vocal cord movement;
a voice reproduction unit that reproduces voice from the voice data received by the first reception unit; and
a bone conduction vibration unit that collects bone conduction sound resulting from an utterance and reproduces bone conduction sound, wherein
the first transmission unit
further, during an utterance with vocal cord movement, temporally synchronizes the body-transmitted sound data collected by the body-transmitted sound collection unit with the bone conduction sound data collected at the same time by the bone conduction vibration unit and transmits them to the computer, which learns a bone conduction sound association model associating body-transmitted sound with bone conduction sound,
the first reception unit
further receives bone conduction sound data generated by the computer, using the bone conduction sound association model, from the body-transmitted sound data collected by the body-transmitted sound collection unit during the utterance without vocal cord movement, and
the bone conduction vibration unit
reproduces bone conduction sound from the bone conduction sound data received by the first reception unit.

The voice reproduction device according to claim 1, further comprising
an utterance state input unit that accepts an input operation indicating whether an utterance is with or without vocal cord movement, wherein
when the input operation to the utterance state input unit indicates an utterance with vocal cord movement, the first transmission unit temporally synchronizes the body-transmitted sound data collected by the body-transmitted sound collection unit with the bone conduction sound data collected at the same time by the bone conduction vibration unit and transmits them to the computer, and the bone conduction vibration unit does not reproduce bone conduction sound, and
when the input operation to the utterance state input unit indicates an utterance without vocal cord movement, the first transmission unit transmits only the body-transmitted sound data collected by the body-transmitted sound collection unit to the computer, the first reception unit receives the bone conduction sound data, and the bone conduction vibration unit reproduces bone conduction sound from the bone conduction sound data.

The sound reproduction device according to claim 1 or 2,
The above calculator
A second receiver for receiving data transmitted from the sound reproduction device;
A first feature amount extraction unit for extracting a feature amount of in-vivo transmitted sound using in-body transmitted sound data;
A third feature quantity extraction unit for extracting the feature quantity of the bone conduction sound using the bone conduction sound data;
The feature values of the internal transmission sound and the bone conduction sound that are synchronized in time are used as learning data that correspond to each other. A bone conduction sound learning unit for calculating a parameter of a bone conduction sound correspondence model indicating a correspondence relationship of
A bone-conducted sound association model applying unit that calculates the bone-conducted sound feature amount corresponding to the body-conducted sound feature amount using the parameters calculated by the bone-conducted sound learning unit and the body-conducted sound feature amount; ,
Using the bone conduction sound feature amount calculated by the bone conduction sound association model application unit, a bone conduction sound restoration unit that generates bone conduction sound data;
A second transmission unit for transmitting the bone conduction sound data generated by the bone conduction sound restoration unit to the sound reproduction device;
When the second receiving unit receives temporally synchronized in-body transmitted sound data and bone conduction sound data, the first feature amount extraction unit extracts the in-body transmitted sound feature amount using the in-body transmitted sound data received by the second receiving unit, the third feature amount extraction unit extracts the bone conduction sound feature amount using the bone conduction sound data received by the second receiving unit, and the bone conduction sound learning unit calculates the parameters of the bone conduction sound association model using these feature amounts;
When the second receiving unit receives only in-body transmitted sound data, the first feature amount extraction unit extracts the in-body transmitted sound feature amount using the in-body transmitted sound data received by the second receiving unit, the bone conduction sound association model application unit calculates the bone conduction sound feature amount corresponding to the extracted in-body transmitted sound feature amount using that feature amount and the parameters calculated by the bone conduction sound learning unit, the bone conduction sound restoration unit generates bone conduction sound data using the calculated bone conduction sound feature amount, and the second transmission unit transmits the bone conduction sound data generated by the bone conduction sound restoration unit to the sound reproduction device.
A sound reproduction device characterized by the above.
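The two operating modes of the calculator described in this claim (learning when synchronized in-body transmitted sound data and bone conduction sound data arrive together, conversion when only in-body transmitted sound data arrives) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the claims: the class name Calculator, the frame length FRAME, the use of per-frame RMS amplitude as the feature amount, the affine least-squares fit as the association model, and the constant-amplitude restoration. The claims do not specify the feature type or the model form.

```python
from math import sqrt

FRAME = 4  # hypothetical frame length in samples


def extract_features(samples):
    """Stand-in feature extractor: RMS amplitude of each complete frame."""
    frames = [samples[i:i + FRAME]
              for i in range(0, len(samples) - FRAME + 1, FRAME)]
    return [sqrt(sum(x * x for x in f) / FRAME) for f in frames]


def fit_affine(xs, ys):
    """Least-squares fit of y = slope * x + intercept on paired features."""
    if not xs or len(xs) != len(ys):
        raise ValueError("paired, non-empty feature sequences are required")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        return 0.0, mean_y  # degenerate case: predict the mean
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


def restore(features):
    """Stand-in restoration: one constant-amplitude frame per feature."""
    return [a for a in features for _ in range(FRAME)]


class Calculator:
    """Hypothetical calculator with a learning mode and a conversion mode."""

    def __init__(self):
        self.params = None  # (slope, intercept) of the association model

    def receive(self, body_data, bone_data=None):
        body_feat = extract_features(body_data)
        if bone_data is not None:
            # Synchronized pair received: learn the association model.
            bone_feat = extract_features(bone_data)
            self.params = fit_affine(body_feat, bone_feat)
            return None
        # Only in-body transmitted sound received: apply the learned model.
        if self.params is None:
            raise RuntimeError("no association model has been learned yet")
        slope, intercept = self.params
        bone_feat = [slope * x + intercept for x in body_feat]
        return restore(bone_feat)  # data to send back to the device


calc = Calculator()
body = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
calc.receive(body, [2 * x for x in body])  # learning mode
restored = calc.receive([4, 4, 4, 4])      # conversion mode
```

In this toy setup the bone conduction data is exactly twice the in-body data, so the fitted model doubles each feature. A real system would use richer spectral features and a statistical mapping in place of these stand-ins.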

A sound reproduction method for reproducing voice from in-body transmitted sound, the method comprising:
A step in which an in-body transmitted sound collection unit collects in-body transmitted sound resulting from utterance;
A step in which a voice collection unit collects voice;
A step of extracting a feature amount of the in-body transmitted sound using data of the in-body transmitted sound collected by the in-body transmitted sound collection unit during utterance accompanied by vocal cord movement;
A step of extracting a feature amount of the voice using data of the voice collected by the voice collection unit during utterance accompanied by vocal cord movement;
A step in which a voice learning unit takes the feature amount of the in-body transmitted sound and the feature amount of the voice corresponding to the same sound collection time as mutually corresponding learning data and calculates, by a learning process, the parameters of a voice association model indicating the correspondence between the feature amount of an arbitrary in-body transmitted sound and the feature amount of the voice;
A step in which a first feature amount extraction unit extracts the feature amount of the in-body transmitted sound using data of the in-body transmitted sound collected by the in-body transmitted sound collection unit during utterance not accompanied by vocal cord movement;
A step in which a voice association model application unit calculates the voice feature amount corresponding to the in-body transmitted sound feature amount, using the parameters calculated by the voice learning unit and the feature amount of the in-body transmitted sound collected by the in-body transmitted sound collection unit during utterance not accompanied by vocal cord movement;
A step of generating voice data using the voice feature amount calculated by the voice association model application unit;
A step in which a voice reproduction unit reproduces voice from the voice data;
A step in which a bone conduction vibration unit collects bone conduction sound resulting from utterance;
A step of extracting a feature amount of the bone conduction sound using data of the bone conduction sound collected by the bone conduction vibration unit during utterance accompanied by vocal cord movement;
A step in which a bone conduction sound learning unit takes the feature amount of the in-body transmitted sound and the feature amount of the bone conduction sound corresponding to the same sound collection time as mutually corresponding learning data and calculates, by a learning process, the parameters of a bone conduction sound association model indicating the correspondence between the feature amount of an arbitrary in-body transmitted sound and the feature amount of the bone conduction sound;
A step in which a bone conduction sound association model application unit calculates the bone conduction sound feature amount corresponding to the in-body transmitted sound feature amount, using the parameters calculated by the bone conduction sound learning unit and the feature amount of the in-body transmitted sound collected by the in-body transmitted sound collection unit during utterance not accompanied by vocal cord movement;
A step in which a bone conduction sound restoration unit generates bone conduction sound data using the bone conduction sound feature amount calculated by the bone conduction sound association model application unit; and
A step in which the bone conduction vibration unit reproduces the bone conduction sound from the bone conduction sound data,
The sound reproduction method being characterized by comprising the above steps.
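The overall flow of the method claim (learn two association models from utterances with vocal cord movement, then apply both to an utterance without vocal cord movement) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names learn_pairs, apply_model and reproduce are hypothetical, feature amounts are represented as plain tuples of floats, and a nearest-neighbor lookup over the stored feature pairs is swapped in for the association model, whose actual form the claims leave open. Feature extraction and waveform restoration are omitted.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def learn_pairs(body_feats, target_feats):
    """'Learning' stand-in: store features paired by sound collection time."""
    if not body_feats or len(body_feats) != len(target_feats):
        raise ValueError("features must be non-empty and time-aligned")
    return list(zip(body_feats, target_feats))


def apply_model(params, body_feats):
    """Map each in-body feature to the target paired with its nearest
    stored in-body feature."""
    return [min(params, key=lambda pair: sq_dist(pair[0], b))[1]
            for b in body_feats]


def reproduce(voiced, unvoiced_body_feats):
    """Learn from voiced data, then convert unvoiced in-body features.

    voiced: dict with time-aligned lists under "body", "voice", "bone".
    Returns (voice features, bone conduction sound features).
    """
    voice_model = learn_pairs(voiced["body"], voiced["voice"])
    bone_model = learn_pairs(voiced["body"], voiced["bone"])
    return (apply_model(voice_model, unvoiced_body_feats),
            apply_model(bone_model, unvoiced_body_feats))


# Toy data: two frames recorded during utterance with vocal cord movement.
voiced = {
    "body": [(0.0, 0.0), (1.0, 1.0)],
    "voice": [(10.0,), (20.0,)],
    "bone": [(5.0,), (7.0,)],
}
# Two frames recorded during utterance without vocal cord movement.
voice_out, bone_out = reproduce(voiced, [(0.9, 1.2), (0.1, -0.1)])
```

The same in-body transmitted sound features drive both outputs: one stream of voice features for the listener and one stream of bone conduction sound features fed back to the speaker through the bone conduction vibration unit.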