JP2008042740A

JP2008042740A - Non-audible murmur pickup microphone

Info

Publication number: JP2008042740A
Application number: JP2006217028A
Authority: JP
Inventors: Yoshitaka Nakajima; 淑貴中島
Original assignee: Nara Institute of Science and Technology NUC
Current assignee: Nara Institute of Science and Technology NUC
Priority date: 2006-08-09
Filing date: 2006-08-09
Publication date: 2008-02-21

Abstract

PROBLEM TO BE SOLVED: To provide a non-audible murmur pickup microphone (NAM microphone) that, with a simple configuration, picks up a non-audible murmur, converts it into a voice signal that a listener or a voice recognition means can easily recognize, and outputs that signal.

SOLUTION: The NAM microphone X includes an NAM propagation unit 12, a soft member that is brought into contact with a skin surface 1a and propagates a non-audible murmur (NAM) travelling through the human body; a microphone 11 that transduces the NAM propagated through the NAM propagation unit 12 into an electric signal; and a high-pass filter 22 that applies high-pass filtering with a cutoff frequency of about 1000 Hz to the voice signal obtained by the microphone 11.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a non-audible murmur pickup microphone that picks up non-audible murmur speech.

In recent years, with the spread of mobile phones and their communication networks, it has become possible to communicate with other people by voice (conversation) anytime and anywhere. In addition, various devices equipped with voice recognition means, such as personal computers and car navigation systems, can execute processing in response to voice commands.
On the other hand, there are many situations in which speaking is restricted, either to avoid disturbing people nearby, as in trains and libraries, or because the content of the conversation is confidential. If voice calls and voice commands to devices could be made in such situations without the spoken content leaking to the surroundings, voice communication would become available more fully on demand, remote control of equipment by voice would be promoted, and various kinds of work would become more efficient.
Furthermore, even people who cannot produce normal speech because of a disorder of the pharynx (the vocal cords and the like) can often produce a non-audible murmur. If calls and device commands by non-audible murmur became possible, convenience for people with such disorders would improve greatly.
In response, Patent Document 1 proposes a communication interface system that takes voice input by picking up a non-audible murmur (NAM). A non-audible murmur is speech that does not involve regular vibration of the vocal cords (an unvoiced sound); it is a vibration sound (breath sound) that propagates through the soft tissue of the body and is inaudible from outside. In other words, it is speech produced in the human vocal tract by breath sounds without vocal cord vibration. For example, non-audible speech (breath sound) that, in a soundproof room, cannot be heard by people 1 to 2 m away is defined as a "non-audible murmur", whereas audible speech in which an unvoiced sound is uttered loudly enough to be heard by people 1 to 2 m away, by narrowing the vocal tract (especially the oral cavity) and raising the velocity of the air flowing through it, is defined as an "audible whisper".
Because a non-audible murmur signal cannot be picked up by an ordinary microphone, which detects vibration of the acoustic space, it is usually picked up by a flesh-conduction microphone that picks up flesh-conducted sound inside the body. Since the flesh-conduction microphone has mainly been used to pick up non-audible murmurs (NAM), it is also called a NAM microphone; details are given in Patent Document 1 and elsewhere.
The NAM microphone (flesh-conduction microphone) comprises a soft member, made of silicone or the like, that is brought into close contact with the skin surface of the human body and propagates the non-audible murmur, and a microphone that converts the non-audible murmur propagating through the soft member into an electric signal. The NAM microphone is worn so that the soft member is in close contact with the skin surface over the sternocleidomastoid muscle, directly below the mastoid process of the skull in the region below the auricle, and picks up flesh-conducted sound (the non-audible murmur) that is generated in the vocal tract and travels through the soft tissue of the body (muscle, fat, and other tissue apart from bone).
Because the NAM microphone mainly picks up only sound that has propagated through the soft tissue of the human body, the body itself acts as a noise-removal filter, and an acoustic signal with a high S/N ratio can be obtained even when the ambient noise is loud. In other words, the NAM microphone is highly resistant to airborne noise.

Incidentally, as long as an audible whisper is amplified to a sufficient volume, even an ordinary person with no special training can understand what was said at a high recognition rate.
A non-audible murmur, on the other hand, has the problem that the listener finds the utterance hard to understand (the recognition rate is low) even when the sound is simply amplified. The same is true of voice recognition means.
To address this, Non-Patent Document 1, for example, describes a technique that converts a non-audible murmur signal obtained by a NAM microphone (flesh-conduction microphone) into a signal of normally uttered speech (voiced sound) on the basis of a Gaussian mixture model, which is one example of a model used in statistical spectral conversion.
Patent Document 2 describes a technique that estimates the pitch frequency of normal speech (voiced sound) by comparing the power of the non-audible murmur signals obtained by two NAM microphones (flesh-conduction microphones) and, on the basis of that estimate, converts the non-audible murmur signal into a signal of normally uttered speech (voiced sound).
By using the techniques described in Non-Patent Document 1 and Patent Document 2, a non-audible murmur signal obtained through a body-conduction microphone can be converted into a normal speech (voiced sound) signal that is relatively easy for the listener to understand.
Patent Document 1: WO 2004/021738 pamphlet
Patent Document 2: JP 2006-086877 A
Non-Patent Document 1: Tomoki Toda et al., "Conversion from non-audible murmur (NAM) to normal speech based on a Gaussian mixture model", IEICE Technical Report, SP2004-107, pp. 67-72, December 2004

However, the conventional techniques that convert a non-audible murmur into normal audible speech on the basis of a voice conversion model (the techniques described in Patent Document 1 and Patent Document 2) have the problem that voice signal transmission is delayed by the time the conversion takes. In addition, a device implementing those techniques needs arithmetic means (a microcomputer) or the like to execute the signal conversion based on the voice conversion model, so its power consumption is relatively high and its cost is also high. For voice calls or voice commands to devices, a portable or body-worn voice input/output device has to be built, and in that case high power consumption also means that a sufficient continuous operating time cannot be secured.
Moreover, as Patent Document 2 also notes, a non-audible murmur is an unvoiced sound that does not involve regular vibration of the vocal cords. As described in Patent Document 1 and Patent Document 2, converting the signal of a non-audible murmur, an unvoiced sound, into a normal speech (voiced sound) signal uses a voice conversion model that combines a vocal tract feature conversion model, representing how the vocal tract converts acoustic features (from the features of the input signal to those of the output signal), with a vocal cord feature conversion model, representing how the sound source (the vocal cords) converts acoustic features. Processing with such a model necessarily includes creating (estimating) pitch information that is absent from the input. As a result, converting a non-audible murmur signal into a normal speech (voiced sound) signal yields a signal that contains unnatural intonation or erroneous sounds that were never uttered, and the recognition rate of the listener or the voice recognition means falls.
The present invention has been made in view of the above circumstances, and its object is to provide a non-audible murmur pickup microphone (NAM microphone) that, with a very simple configuration, picks up a non-audible murmur, converts it into a voice signal that a listener or voice recognition means can easily recognize, and outputs that signal.

To achieve the above object, the present invention is a non-audible murmur pickup microphone (hereinafter, NAM microphone) that picks up a non-audible murmur (speech produced in the human vocal tract by breath sounds without vocal cord vibration) and comprises the following components (1) to (3).
(1) A soft member that is brought into close contact with the skin surface of the human body and propagates the non-audible murmur.
(2) A microphone that converts the non-audible murmur propagating through the soft member into an electric signal.
(3) A high-pass filter that applies high-pass filtering to the non-audible murmur signal obtained by the microphone.
The cutoff frequency of the high-pass filter is preferably about 800 to 1400 Hz, for example. In that case, the slope of the high-pass filter is preferably about −16 to −14 dB/oct.
It was found that the signal obtained by passing the non-audible murmur signal through the high-pass filter (the output signal of the high-pass filter) becomes, simply by being amplified, about as easy to understand (recognize) as an audible whisper signal. Moreover, the high-pass filter can be implemented with a very simple circuit that consumes very little power (for example, a capacitor and a resistor).
Microphones generally include an amplifier that amplifies the voice signal; in the NAM microphone according to the present invention, it is desirable to provide an amplifier that amplifies the non-audible murmur signal between the microphone and the high-pass filter.
This allows the low-level non-audible murmur signal to be amplified to a level sufficient for input to the high-pass filter.
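As a rough illustration of how the preferred 800 to 1400 Hz cutoff range maps onto a capacitor-and-resistor realization, the following Python sketch computes the resistance for a given cutoff using the standard first-order relation f_c = 1/(2πRC). The function name is hypothetical, and the capacitor is assumed to be the 0.1 μF value mentioned in the embodiment; the resulting resistances are illustrative, not values stated in the specification.

```python
import math


def resistor_for_cutoff(cutoff_hz: float, capacitance_f: float) -> float:
    """Resistance in ohms that gives the requested -3 dB cutoff
    for a first-order C-R high-pass filter: R = 1 / (2*pi*f_c*C)."""
    return 1.0 / (2.0 * math.pi * cutoff_hz * capacitance_f)


# Assumption: the 0.1 uF capacitor from the embodiment is reused here.
CAPACITANCE_F = 0.1e-6

for cutoff_hz in (800.0, 1000.0, 1400.0):
    r_ohm = resistor_for_cutoff(cutoff_hz, CAPACITANCE_F)
    print(f"{cutoff_hz:6.0f} Hz -> {r_ohm:7.1f} ohm")
```

Under these assumptions the resistance falls as the cutoff rises, so the whole preferred range is reachable by changing a single resistor.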

With the NAM microphone according to the present invention, picking up a non-audible murmur yields a voice signal that is easy for the listener to understand, about as easy as an audible whisper.
In addition, the speech obtained with the NAM microphone according to the present invention is stable: unlike the normal speech obtained by the conventional method (normal speech (voiced sound) produced by converting a non-audible speech signal with a model that combines a vocal tract feature conversion model and a sound source feature conversion model), it contains neither unnatural intonation nor erroneous sounds that were never uttered.
Furthermore, the present invention requires no signal conversion based on a voice conversion model (a relatively heavy processing load), so no delay arises in voice signal transmission and no arithmetic means (microcomputer) or the like is needed to execute such processing. Consequently, even when the NAM microphone according to the present invention is built into a small device such as a mobile phone or a body-worn device, it causes almost no increase in the weight or volume of the device and almost no reduction in continuous operating time.

Embodiments of the present invention are described below with reference to the accompanying drawings to aid understanding of the invention. The following embodiment is one example that embodies the present invention and does not limit its technical scope.
FIG. 1 is a schematic configuration diagram (partly a block diagram) of the NAM microphone X according to an embodiment of the present invention; FIG. 2 is a schematic diagram showing the NAM microphone X worn on a human body; FIG. 3 shows the voiceprint of a non-audible murmur and the voiceprint of an audible whisper; FIG. 4 is a graph showing the results of a first evaluation experiment (naturalness evaluation) of the NAM microphone X; and FIG. 5 is a graph showing the results of a second evaluation experiment (word recognition accuracy evaluation) of the NAM microphone X.

First, the configuration of the NAM microphone X according to the embodiment of the present invention is described with reference to FIG. 1.
FIG. 1(a) is an overall configuration diagram of the NAM microphone X; FIG. 1(b) is a front view (seen from the skin-contact side) of the voice detection unit 10, which forms part of the NAM microphone X; and FIG. 1(c) is a block diagram showing the configuration of the signal processing unit 20, which forms part of the NAM microphone X. In FIG. 1(a), the voice detection unit 10 is shown in cross section.
The NAM microphone X is a non-audible murmur pickup microphone that picks up a non-audible murmur (NAM), that is, speech produced in the human vocal tract by breath sounds without vocal cord vibration.
As shown in FIG. 1(a), the NAM microphone X includes a voice detection unit 10, a signal processing unit 20, and a signal line 30 connecting them; the signal processing unit 20 is housed in an ear-mounted housing 27.
The voice detection unit 10 is worn on the skin surface of the human body and converts into an electric signal the vibration of the non-audible murmur (NAM) that is generated in the vocal tract and propagates through the soft tissue of the body (mainly the flesh).
The signal processing unit 20 performs various kinds of processing (signal amplification, transmission to an external device, and so on) on the detection signal (voice signal) from the voice detection unit 10.

The voice detection unit 10 is described below with reference to FIGS. 1(a) and 1(b).
The voice detection unit 10 includes a microphone 11, a NAM propagation unit 12, an inner cover member 13, an adhesive sound insulation unit 14, and an outer cover member 15.
The NAM propagation unit 12 is a soft member that propagates the non-audible murmur travelling through the human body when one of its surfaces, 12a, is in close contact with the skin surface 1a of the human body 1. The NAM propagation unit 12 is made of a material whose acoustic impedance is close to that of human flesh, such as urethane elastomer or silicone. This allows the non-audible murmur to propagate efficiently from the human body (skin) into the NAM propagation unit 12.
The microphone 11 converts the non-audible murmur (vibration) propagating through the NAM propagation unit 12 (soft member) into an electric signal. The whole of its sound-sensing part (the left side face in FIG. 1(a)), which senses the vibration of the sound, is in direct contact with the NAM propagation unit 12. The microphone 11 therefore detects the vibration (sound) of the NAM propagation unit 12 with high sensitivity. The output signal of the microphone 11 (the non-audible murmur signal) is transmitted to the signal processing unit 20 through the signal line 30.
The inner cover member 13 covers the whole of the NAM propagation unit 12 except its contact surface with the skin surface 1a (hereinafter, skin contact surface 12a). That is, the inner cover member 13 is a container-like member open on one face; the microphone 11 is housed inside it and the NAM propagation unit 12 (soft member) fills it with almost no gaps (the microphone 11 is embedded in the NAM propagation unit 12).
The adhesive sound insulation unit 14 is made of urethane elastomer, an adhesive soft material, and is formed so as to cover the entire outside of the inner cover member 13. It also has a portion that contacts and adheres to the skin surface 1a of the human body 1 (hereinafter, skin adhesion portion 14a). The skin adhesion portion 14a is formed all the way around the skin contact surface 12a of the NAM propagation unit 12.
The outer cover member 15 covers the entire outside of the adhesive sound insulation unit 14 except the skin adhesion portion 14a and forms the exterior of the voice detection unit 10. That is, the outer cover member 15 is a container-like member open on one face; the inner cover member 13, which encloses the NAM propagation unit 12 and the microphone 11, is housed inside it, and the adhesive sound insulation unit 14 (urethane elastomer) fills it with almost no gaps.

By adhering to the skin surface 1a at its skin adhesion portion 14a, the adhesive sound insulation unit 14 brings the skin contact surface 12a of the NAM propagation unit 12 into close contact with the skin surface 1a and holds the voice detection unit 10 bonded to the human body (the worn state). Making the NAM propagation unit 12 from urethane elastomer (an adhesive soft material) is preferable, because both the skin contact surface 12a of the NAM propagation unit 12 and the skin adhesion portion 14a then adhere to the skin surface 1a and the voice detection unit 10 is held on the body more firmly.
The adhesive sound insulation unit 14 also functions as a sound insulator that prevents airborne disturbance sound from reaching the microphone 11: it covers the entire outside of the inner cover member 13, and its skin adhesion portion 14a adheres to the skin surface 1a all the way around the skin contact surface 12a of the NAM propagation unit 12.

Next, the configuration of the signal processing unit 20 is described with reference to FIG. 1(c).
The signal processing unit 20 includes an amplifier 21, a high-pass filter 22, an A/D conversion unit 23, a wireless communication unit 24, an antenna 25, and a battery 26.
The amplifier 21 amplifies the non-audible murmur signal transmitted from the voice detection unit 10 (the voice signal obtained by the microphone 11).
The high-pass filter 22 applies high-pass filtering to the non-audible murmur signal transmitted from the voice detection unit 10 (the voice signal obtained by the microphone 11) and outputs the filtered signal to the subsequent circuit.
The amplifier 21 is placed between the microphone 11 of the voice detection unit 10 and the high-pass filter 22, and amplifies the non-audible murmur signal before it enters the high-pass filter 22.
This allows the low-level non-audible murmur signal to be amplified to a level sufficient for input to the high-pass filter 22.
The high-pass filter 22 shown in FIG. 1(c) is a very simple C-R circuit (differentiator circuit) consisting of a capacitor and a resistor. For example, the capacitance may be about 0.1 μF and the resistance about 1.6 kΩ (cutoff frequency ≈ 1 kHz).
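The stated component values can be checked against the quoted cutoff with a short Python sketch. It assumes an ideal first-order C-R high-pass stage, for which f_c = 1/(2πRC) and |H(f)| = (f/f_c)/√(1 + (f/f_c)²); the function names are hypothetical, and the actual circuit in FIG. 1(c) may behave differently.

```python
import math


def rc_cutoff_hz(resistance_ohm: float, capacitance_f: float) -> float:
    """-3 dB cutoff frequency of an ideal first-order C-R high-pass filter."""
    return 1.0 / (2.0 * math.pi * resistance_ohm * capacitance_f)


def highpass_gain_db(freq_hz: float, cutoff_hz: float) -> float:
    """Magnitude response in dB of an ideal first-order high-pass filter."""
    ratio = freq_hz / cutoff_hz
    magnitude = ratio / math.sqrt(1.0 + ratio * ratio)
    return 20.0 * math.log10(magnitude)


# Values quoted in the description: about 1.6 kilo-ohm and about 0.1 microfarad.
cutoff = rc_cutoff_hz(1.6e3, 0.1e-6)
print(f"cutoff = {cutoff:.1f} Hz")

for freq in (100.0, 250.0, 500.0, 750.0, 1000.0, 2000.0, 4000.0):
    print(f"{freq:6.0f} Hz: {highpass_gain_db(freq, cutoff):6.1f} dB")
```

With these values the computed cutoff comes out just under 1 kHz, consistent with the "≈ 1 kHz" in the text. An ideal single C-R stage rolls off at roughly 6 dB per octave well below its cutoff.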

The A/D conversion unit 23 converts the non-audible speech signal (analog signal), which has been amplified by the amplifier 21 and filtered by the high-pass filter 22, into a digital signal at a predetermined sampling frequency. For example, the A/D conversion unit 23 performs A/D conversion at a sampling frequency of about 8 kHz.
The wireless communication unit 24 wirelessly transmits the non-audible speech signal digitized by the A/D conversion unit 23, through the antenna 25, to an external device that has a communication function. For example, it transmits the digital voice signal to the external device in accordance with the well-known Bluetooth communication standard.
The battery 26 supplies power to the components of the signal processing unit 20 (the amplifier 21, the A/D conversion unit 23, and the wireless communication unit 24). If the microphone 11 of the voice detection unit 10 is of a type that needs driving power (for example, a bias-type condenser microphone), the battery 26 also supplies power to the microphone 11. If the microphone 11 is of a type that needs no driving power (for example, a back-electret condenser microphone), no power needs to be supplied to the voice detection unit 10.

FIG. 2 is a schematic diagram showing the NAM microphone X worn on a human body.
As shown in FIG. 2, the voice detection unit 10 is worn on the human body (held by the adhesive sound insulation unit 14) so that the skin contact surface 12a of the NAM propagation unit 12 is in close contact with the skin surface over the sternocleidomastoid muscle, directly below the mastoid process of the skull in the region below the auricle. The non-audible murmur (NAM) generated in the vocal tract thus propagates efficiently from the flesh of the body into the NAM propagation unit 12 without bone or the like getting in the way.
The signal processing unit 20 is worn on the human body by hooking its housing, the ear-mounted housing 27, onto the ear.
In this way, the non-audible murmur signal detected by the voice detection unit 10 and high-pass filtered is transmitted wirelessly from the signal processing unit 20 to an external device, hands-free. Accordingly, if the external device that communicates with the signal processing unit 20 has, for example, a voice recognition function that works on the non-audible murmur signal and an automatic control function that controls the device's own operation according to the recognized speech, the wearer of the NAM microphone X can control the external device remotely, hands-free, by uttering a non-audible murmur that does not leak to the surroundings.
If a component that receives a non-audible murmur signal from an external device and outputs it as sound is added to the signal processing unit 20, the result is a telephone device that works by non-audible murmur. In that case, for example, the wireless communication unit 24 is given a function for receiving a non-audible murmur signal from the external device. The signal processing unit 20 is further provided with a D/A conversion unit that converts the voice signal (digital signal) received by the wireless communication unit 24 into an analog signal, an amplifier that amplifies the D/A-converted voice signal (analog signal), and a speaker (earphone) that outputs the amplified voice signal. The NAM microphone X thereby becomes a telephone device that works by non-audible murmur.

The non-audible murmur (NAM) collected as a flesh-conducted sound and the audible whisper collected as an air-conducted sound are both unvoiced sounds (speech without vocal cord vibration). However, as described above, the utterance content of a NAM is difficult to recognize (hard to hear) when the signal is merely amplified, whereas the utterance content of an audible whisper is relatively easy to recognize.
FIG. 3 shows (a) the voiceprint of a NAM collected by a conventional NAM microphone (without filter processing) and (b) the voiceprint of an audible whisper collected as an air-conducted sound by a general microphone.
Comparing FIG. 3(a) with FIG. 3(b), the NAM (a) tends to have stronger signal intensity in frequency components of 750 Hz or less than the whisper (b), whose utterance content is relatively easy to recognize. This suggests that, in the NAM signal, the components above 750 Hz are mainly signal components (S) that contribute to identifying the utterance content, while the components at or below 750 Hz are mainly noise components (N) that do not.

From the above, the S/N ratio is expected to improve if frequency components of 750 Hz or less are removed from the NAM signal collected as a flesh-conducted sound by the high-pass filter 22 (that is, a low-cut filter).
In fact, it was found that when a NAM is collected with the NAM microphone X, merely amplifying the signal yields a voice signal that is easy for the listener to hear (easy to recognize). Its ease of listening is comparable to that of the audible whisper.
In addition, unlike normal speech (voiced sound) produced by conversion based on a voice conversion model, the sound obtained by the NAM microphone according to the present invention is stable: it contains neither speech with unnatural intonation nor erroneous speech that was never uttered.
Furthermore, even when the high-pass filter 22 is incorporated in a small device such as a mobile phone or a body-worn device, it hardly increases the weight or volume of the device and does not shorten the continuous operating time.
Moreover, the NAM microphone X needs no computationally heavy processing such as signal conversion based on a voice conversion model, so no delay arises in voice signal transmission. This enables smooth conversation and remote control of devices by NAM.
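The low-cut operation described above can be illustrated with a minimal sketch of a discrete-time first-order CR-type high-pass filter. This is not the circuit of the embodiment: the function names, the 1000 Hz cutoff, the 8 kHz sampling rate and the test tones are assumptions chosen for illustration, and a single first-order stage rolls off at only about 6 dB/oct.

```python
import math


def rc_highpass(samples, cutoff_hz, sample_rate_hz):
    """Discrete first-order CR-type high-pass filter (hypothetical helper).

    Uses the recurrence y[n] = alpha * (y[n-1] + x[n] - x[n-1]),
    where alpha = RC / (RC + dt) and RC = 1 / (2 * pi * cutoff).
    """
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate_hz
    alpha = rc / (rc + dt)
    out = []
    prev_x = 0.0
    prev_y = 0.0
    for x in samples:
        y = alpha * (prev_y + x - prev_x)
        out.append(y)
        prev_x, prev_y = x, y
    return out


def rms(values):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def tone(freq_hz, sample_rate_hz, n):
    """Unit-amplitude sine tone of n samples."""
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate_hz) for i in range(n)]


FS = 8000          # assumed sampling rate in Hz
CUTOFF = 1000      # assumed cutoff frequency in Hz

low = tone(200, FS, FS)     # stands in for a component below the cutoff
high = tone(2000, FS, FS)   # stands in for a component above the cutoff

low_gain = rms(rc_highpass(low, CUTOFF, FS)) / rms(low)
high_gain = rms(rc_highpass(high, CUTOFF, FS)) / rms(high)

print(f"gain at 200 Hz:  {low_gain:.2f}")
print(f"gain at 2000 Hz: {high_gain:.2f}")
```

The 200 Hz tone is attenuated much more strongly than the 2000 Hz tone, which is the behavior expected of a low-cut filter. A steeper slope would require a higher-order design.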

Next, a first evaluation experiment and a second evaluation experiment are described. Both evaluated the relationship between the characteristics of the high-pass filter used in the NAM microphone X and the microphone performance.
In both experiments, 22 subjects (10 men and 12 women) listened to 22 types of sample voices (the first to 22nd sample voices), which differ in sound source or in signal processing (high-pass filter processing), and answered predetermined evaluation items according to the impressions (evaluation results) obtained by listening.
The first to tenth sample voices are filtered NAM recordings. The flesh-conducted sound produced when a given speaker uttered a given sample text (including sentences and words) as NAM was recorded with a NAM microphone X whose high-pass filter had a cutoff frequency of 200, 400, 600, ..., 2000 Hz (in 200 Hz steps), respectively, and a slope characteristic of −120 dB/oct.
The 11th to 20th sample voices are also filtered NAM recordings. The flesh-conducted sound produced when a given speaker uttered the sample text as NAM was recorded with a NAM microphone X whose high-pass filter had a cutoff frequency of 200, 400, 600, ..., 2000 Hz (in 200 Hz steps) and a slope characteristic of −23.0, −18.6, −16.8, −15.6, −15.0, −14.4, −13.9, −13.6, −13.2 and −13.0 dB/oct, respectively.
The 21st sample voice is the audible whisper recorded with a normal microphone (a microphone that collects air-conducted sound).
The 22nd sample voice is NAM recorded with a conventional NAM microphone that performs no high-pass filter processing.
The sample text is a newspaper article or the like of about 20 to 30 words.
All sample voices were recorded and played back as digital audio (PCM audio) with 16-bit quantization and a sampling rate of 8 kHz.
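The 22 sample conditions listed above can be written down as a small table. The sketch below is only an illustration: the `Sample` class, its field names and the source labels are hypothetical, while the cutoff frequencies and slope values are the ones stated in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sample:
    """One sample voice condition (hypothetical structure for illustration)."""
    number: int
    source: str
    cutoff_hz: Optional[int]          # None when no high-pass filter is applied
    slope_db_per_oct: Optional[float]


# Cutoff frequencies 200, 400, ..., 2000 Hz in 200 Hz steps.
CUTOFFS_HZ = list(range(200, 2001, 200))

# Slopes of the gentler filters used for samples 11 to 20, in order of cutoff.
GENTLE_SLOPES = [-23.0, -18.6, -16.8, -15.6, -15.0,
                 -14.4, -13.9, -13.6, -13.2, -13.0]


def build_samples():
    samples = []
    # Samples 1 to 10: steep filter, -120 dB/oct.
    for i, fc in enumerate(CUTOFFS_HZ):
        samples.append(Sample(i + 1, "NAM", fc, -120.0))
    # Samples 11 to 20: gentler slopes paired with the same cutoffs.
    for i, (fc, slope) in enumerate(zip(CUTOFFS_HZ, GENTLE_SLOPES)):
        samples.append(Sample(i + 11, "NAM", fc, slope))
    # Samples 21 and 22: unfiltered references, as listed in the text above.
    samples.append(Sample(21, "audible whisper (air-conducted)", None, None))
    samples.append(Sample(22, "NAM (conventional microphone, no filter)", None, None))
    return samples


SAMPLES = build_samples()

for s in SAMPLES:
    print(s.number, s.source, s.cutoff_hz, s.slope_db_per_oct)
```

Every cutoff in the table lies below 4 kHz, the Nyquist frequency of the 8 kHz recordings.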

<First evaluation experiment (naturalness evaluation experiment)>
In the first evaluation experiment, each subject listened to 11 pairs of sample voices, each pair consisting of two types chosen arbitrarily from the 22 types, and for each pair selected the sample voice that felt more natural as conversational speech. In this experiment, a sample voice selected as natural by more subjects can be said to have higher naturalness.
FIG. 4 is a graph showing the results of the first evaluation experiment (naturalness evaluation experiment). For each of the 22 types of sample signals, it shows the ratio at which subjects selected that sample signal as more natural than the other sample signal it was compared with (two-alternative selection rate).
As can be seen from FIG. 4, the NAM without high-pass filter processing (the 21st sample voice v21) was rated natural at a rate (selection rate) below 50%, whereas the 15th sample voice v15 (high-pass filter processing with a cutoff frequency of 1000 Hz and a slope characteristic of −15.0 dB/oct) was rated natural at a rate above 80%. Since the selection rate of the audible whisper (the 22nd sample voice v22) is slightly over 90%, simply applying high-pass filter processing to NAM improves its naturalness to a level comparable to that of the audible whisper.
Further, as can be seen from FIG. 4, regardless of the slope characteristic, NAM subjected to high-pass filter processing with a cutoff frequency of 800 to 1200 Hz (the fourth to sixth and 14th to 16th sample voice signals v4 to v6 and v14 to v16) has a selection rate above 60%, which indicates improved naturalness.
Further, as can be seen from FIG. 4, signals filtered with a relatively gentle slope characteristic (particularly the 14th to 20th sample signals) are better suited to improving naturalness than signals filtered with a steep slope characteristic (−120 dB/oct).
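The two-alternative selection rate plotted in FIG. 4 can be computed as in the following sketch. The tallying rule (times chosen divided by times presented) is an assumption about how the rate was derived, the function name is hypothetical, and the trial data are invented purely to exercise the code; they are not results from the experiment.

```python
from collections import Counter


def selection_rates(trials):
    """Return {sample: times chosen / times presented}.

    Each trial is a tuple (sample_a, sample_b, chosen), where chosen is the
    sample the subject judged more natural. Hypothetical helper.
    """
    presented = Counter()
    chosen_count = Counter()
    for a, b, chosen in trials:
        if chosen not in (a, b):
            raise ValueError("chosen sample must be one of the presented pair")
        presented[a] += 1
        presented[b] += 1
        chosen_count[chosen] += 1
    return {s: chosen_count[s] / presented[s] for s in presented}


# Invented example trials (not data from the experiment).
trials = [
    ("A", "B", "A"),
    ("A", "C", "C"),
    ("A", "D", "A"),
    ("B", "D", "D"),
    ("C", "B", "C"),
]

rates = selection_rates(trials)
for sample, rate in sorted(rates.items()):
    print(sample, f"{rate:.2f}")
```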

<Second evaluation experiment (word recognition accuracy evaluation experiment)>
In the second evaluation experiment, subjects listened to sample voices chosen arbitrarily from the 22 types, each of which is a reading of a newspaper article (that is, a meaningful text), and the word recognition accuracy (the percentage of words the subject recognized correctly) was evaluated.
FIG. 5 is a graph showing the results of the second evaluation experiment (word recognition accuracy evaluation experiment). For each of the 22 types of sample signals (readings of newspaper articles), it shows the subjects' word recognition accuracy (the percentage of words recognized correctly).
As can be seen from FIG. 5, the NAM without high-pass filter processing (the 21st sample voice v21) has a word recognition accuracy of about 85%, whereas the 14th sample voice v14 (high-pass filter processing with a cutoff frequency of 800 Hz and a slope characteristic of −15.6 dB/oct) has a word recognition accuracy of about 90%.
Further, as can be seen from FIG. 5, NAM subjected to high-pass filter processing with a relatively gentle slope characteristic (−23.0 to −13.9 dB/oct) and a cutoff frequency in the range of 200 to 1400 Hz (the 14th to 17th sample voice signals v14 to v17) achieves word recognition accuracy equal to or higher than that of NAM without high-pass filter processing.
Similarly, when the slope characteristic is steep (−120 dB/oct), NAM subjected to high-pass filter processing with a cutoff frequency of 200 to 600 Hz (the first to third sample voice signals v1 to v3) achieves word recognition accuracy equal to or higher than that of NAM without high-pass filter processing.
From the above, it can be seen that in the NAM microphone X, setting the cutoff frequency of the high-pass filter 22 to about 800 to 1400 Hz (particularly about 800 to 1000 Hz) and the slope characteristic to about −14 to −16 dB/oct makes it possible to collect speech that is natural and easy to hear (words are easy to recognize).
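The text does not specify how the percentage of correctly recognized words was scored. The sketch below shows one plausible way to do it, under stated assumptions: the words a subject wrote down are aligned with the reference text using Python's standard `difflib.SequenceMatcher`, and the matched words are divided by the number of reference words. The function name and the example sentences are invented for illustration.

```python
from difflib import SequenceMatcher


def word_accuracy(reference, heard):
    """Fraction of reference words found, in order, in what was heard.

    Hypothetical scoring rule: align the two word sequences and count the
    words in matching blocks. Case is ignored; punctuation is not handled.
    """
    ref = reference.lower().split()
    hyp = heard.lower().split()
    if not ref:
        raise ValueError("reference text must contain at least one word")
    matcher = SequenceMatcher(None, ref, hyp, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref)


# Invented example (not material from the experiment).
reference = "the city council approved the new budget on monday"
heard = "the city council approved a new budget monday"

accuracy = word_accuracy(reference, heard)
print(f"word recognition accuracy: {accuracy:.1%}")
```

In this example seven of the nine reference words are matched, because the substituted word and the omitted word are not counted.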

The embodiment described above is a NAM microphone X that wirelessly transmits the NAM signal after high-pass filter processing (a wireless-type microphone), but the present invention is not limited to this.
For example, a wired-type NAM microphone is also conceivable, in which the A/D conversion unit 23 and the wireless communication unit 24 are removed from the NAM microphone X shown in FIG. 1 and an output terminal for connection to the audio signal (analog) input terminal of an external device is provided instead.
The high-pass filter 22 may also be configured with a known circuit other than a CR circuit, as long as it has a cutoff frequency of about 750 Hz to 1400 Hz.
Further, the slope characteristic of the high-pass filter 22 is not limited to about −16 dB/oct to −14 dB/oct.

The present invention is applicable to microphones for picking up non-audible murmurs.

FIG. 1 is a schematic configuration diagram (partly a block diagram) of the NAM microphone X according to an embodiment of the present invention.
FIG. 2 is a schematic diagram showing the NAM microphone X attached to a human body.
FIG. 3 is a diagram showing the voiceprint of a non-audible murmur and the voiceprint of an audible whisper.
FIG. 4 is a graph showing the results of the first evaluation experiment (naturalness evaluation experiment) for the NAM microphone X.
FIG. 5 is a graph showing the results of the second evaluation experiment (word recognition accuracy evaluation experiment) for the NAM microphone X.

Explanation of symbols

X ... Non-audible murmur pickup microphone according to an embodiment of the present invention
10 ... Voice detection unit
11 ... Microphone
12 ... NAM propagation unit
13 ... Inner cover member
14 ... Adhesive sound insulation unit
15 ... Outer cover member
20 ... Signal processing unit
21 ... Amplifier
22 ... High-pass filter
23 ... A/D conversion unit
24 ... Wireless communication unit
25 ... Antenna
26 ... Battery
27 ... Ear-mounting housing
30 ... Signal line

Claims

A non-audible murmur pickup microphone for picking up a non-audible murmur, which is speech produced by breath sound without vocal cord vibration in the human vocal tract, the microphone comprising:
a soft member that propagates the non-audible murmur by being in close contact with the skin surface of the human body;
a microphone that converts the non-audible murmur propagating through the soft member into an electrical signal; and
a high-pass filter that performs high-pass filter processing on the non-audible murmur signal obtained by the microphone.

The non-audible murmur pickup microphone according to claim 1, wherein the cutoff frequency of the high-pass filter is approximately 800 Hz to approximately 1400 Hz.

The non-audible murmur pickup microphone according to claim 2, wherein the slope characteristic of the high-pass filter is approximately −16 dB/oct to approximately −14 dB/oct.

The non-audible murmur pickup microphone according to any one of claims 1 to 3, further comprising an amplifier that amplifies the non-audible murmur signal between the microphone and the high-pass filter.