JP2019087798A

JP2019087798A - Voice input device

Info

Publication number: JP2019087798A
Application number: JP2017212581A
Authority: JP
Inventors: Hideyuki Masuda; Kenji Ishihara
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2017-11-02
Filing date: 2017-11-02
Publication date: 2019-06-06
Anticipated expiration: 2037-11-02
Also published as: JP7143579B2

Abstract

To appropriately transmit uttered voice even when the voice is produced at low volume. SOLUTION: A drive control unit 130 uses an actuator 131 to generate vibration that is applied to the user's throat. In response to this vibration, the user can produce voice from the mouth at a volume low enough that surrounding people cannot hear it. A microphone 140 picks up this voice and outputs a voice signal. A voice conversion unit 150 applies to the voice signal voice conversion processing that facilitates recognition. Specifically, the voice conversion unit 150 performs voice recognition processing on the voice signal to output character information, further synthesizes a voice signal from the character information, or replaces the waveform of a consonant section of the voice signal with the voice signal waveform of the appropriate consonant obtained by the voice recognition processing. SELECTED DRAWING: Figure 1

Description

The present invention relates to a voice input device suitable for voice input in a mobile phone or the like.

The electric artificial larynx is one technical means for enabling a person without vocal cords to speak (see, for example, Patent Document 1). The electric artificial larynx is a device that applies vibration to a person's throat and creates, in the oral cavity, air vibration similar to that produced by vibrating vocal cords. By applying this vibration to the throat and changing the shape of the mouth, the user can produce voice from the mouth. Using the electric artificial larynx, a person without vocal cords can hold an ordinary conversation and can also make telephone calls.

Patent Document 1: International Publication No. 1999/12501

In a telephone, the voice signal picked up by the microphone is transmitted as it is. Therefore, if the picked-up voice signal is hard to understand, the voice signal heard by the other party is also hard to understand. One example of such a situation is a call made on a mobile phone in a train in a voice low enough that surrounding people cannot hear it. When the user of the mobile phone speaks in a low voice, the voice signal obtained by the microphone of the mobile phone has an extremely low level, and its waveform is distorted compared with the waveform obtained when speaking at normal volume. Consequently, even if the voice signal obtained by the microphone is sent to the other party, it is difficult for the other party to recognize what the user of the mobile phone is saying. This problem arises both when a person without vocal cords makes a call in a low voice using an electric artificial larynx and when a healthy person makes a call in a low voice.

The present invention has been made in view of the above circumstances, and its object is to provide technical means that make it possible to appropriately transmit the uttered voice even when the voice is produced at low volume.

The present invention provides a voice input device comprising: voice acquisition means for acquiring a voice signal representing the voice produced from a user's mouth in response to vibration applied to the user's throat; and voice conversion means for applying to the voice signal voice conversion processing that facilitates recognition.

According to the present invention, even if the volume of the voice signal acquired by the voice acquisition means is so low that surrounding people cannot hear it, the voice conversion means converts the voice signal into information that is easy to recognize. Therefore, even when the voice is produced at low volume, the uttered voice can be transmitted appropriately.

FIG. 1 is a block diagram showing the configuration of a mobile phone including a voice input device according to one embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of the voice conversion unit in the embodiment.

Hereinafter, an embodiment of the present invention will be described with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of a mobile phone 1000 including a voice input device 100 according to one embodiment of the present invention. In addition to the voice input device 100, FIG. 1 shows the transmission unit 201 and the reception unit 202 of the mobile phone 1000 and the head 10 of the user of the mobile phone 1000. The mobile phone 1000 also has an antenna and various other devices found in an ordinary mobile phone, but these are omitted from FIG. 1.

In the voice input device 100, the sound insulation hood 132 and the actuator 131 housed inside it are fixed to the user's neck by the belt 133. The actuator 131 functions as means for applying vibration to the user's throat, with its vibrating surface in contact with the throat. The sound insulation hood 132 is means for shielding the actuator 131 so that its vibration sound does not leak to the surroundings.

The tilt sensor 151 is an ear-hook type sensor that is worn on the ear of the user of the mobile phone 1000 and detects the tilt of the user's head.

The sound generation start button 134 is a push button for instructing the start of sound generation. The user of the mobile phone 1000 holds in the hand the operating element on which the sound generation start button 134 is provided, and turns the button ON when starting to speak. The sound generation start button 134 may be a button switch that is ON only while it is pressed, or a button switch that toggles from OFF to ON and from ON to OFF each time it is pressed.

The control unit 110 functions as the control center of the voice input device 100 and also as the control center of the entire mobile phone 1000.

The operation display unit 120 is, for example, a touch panel. It displays various kinds of information to the user and accepts the user's operations.

The drive control unit 130 is drive control means that, under the control of the control unit 110, applies a periodic drive pulse waveform to the actuator 131 and thereby generates the vibration applied to the user's throat. More specifically, when the sound generation start button 134 is turned ON, the control unit 110 causes the drive control unit 130 to start outputting the drive pulse waveform.

While the actuator 131 applies vibration to the throat, the user can produce a desired voice by opening and closing the mouth while changing the shape of the oral cavity 1. The microphone 140 is means for picking up the user's voice, like the microphone provided in an ordinary mobile phone. In the present embodiment it additionally functions as voice acquisition means for acquiring a voice signal representing the voice produced from the user's mouth in response to the vibration applied to the user's throat.

In the present embodiment, the volume of the voice produced using the actuator 131 can be controlled. More specifically, by operating the operation display unit 120, the user can set the pulse width of the drive pulse waveform that the drive control unit 130 applies to the actuator 131. Lengthening the pulse width of the drive pulse waveform increases the volume of the voice produced from the user's mouth, and shortening the pulse width decreases it. When the pulse width of the drive pulse waveform is made shorter than a predetermined length, the volume of the voice produced from the user's mouth can be set to a level low enough that surrounding people cannot hear it. Normally, when an electric artificial larynx is used for conversation, it is driven with a large amplitude to secure the volume of the produced voice. When making a call on the mobile phone 1000 in a train, however, the actuator 131 is driven with a small amplitude so that the voice produced from the user's mouth is too quiet for surrounding people to hear.

The voice input device 100 in the present embodiment also has a function of giving the voice produced from the mouth of the user, who speaks using the vibration of the actuator 131, intonation, that is, pitch change, corresponding to changes in the tilt of the user's head.

More specifically, the output signal of the tilt sensor 151 is supplied to the control unit 110. Based on the output signal of the tilt sensor 151, the control unit 110 generates intonation information indicating a pitch conversion ratio for the voice produced by the user. The pitch conversion ratio is the ratio of the pitch of the voice after pitch conversion to its pitch before pitch conversion.

Here, the tilt angle is defined as 0° when the front of the user's head faces straight ahead. The tilt angle changes from 0° in the positive direction when the front of the head turns upward, and from 0° in the negative direction when the front of the head turns downward.

In this case, when the tilt angle changes from 0° in the positive direction, the control unit 110 increases the pitch conversion ratio indicated by the intonation information from 1, and when the tilt angle changes from 0° in the negative direction, the control unit 110 decreases the pitch conversion ratio indicated by the intonation information from 1.
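The description fixes only the direction of the mapping (ratio 1 at 0°, above 1 for upward tilt, below 1 for downward tilt), not its shape. The following is a minimal sketch of one possible mapping. The exponential form, the `semitones_per_degree` scale, and the `max_tilt_deg` clamp are all assumptions made for illustration and do not appear in the source.

```python
def pitch_conversion_ratio(tilt_deg: float,
                           semitones_per_degree: float = 0.1,
                           max_tilt_deg: float = 45.0) -> float:
    """Map a head tilt angle (degrees, upward = positive) to a pitch conversion ratio.

    Hypothetical mapping: the source only requires ratio == 1 at 0 degrees,
    ratio > 1 for positive tilt, and ratio < 1 for negative tilt.
    """
    # Clamp the angle so extreme head poses cannot produce extreme pitches (assumption).
    clamped = max(-max_tilt_deg, min(max_tilt_deg, tilt_deg))
    # Exponential mapping in semitones: 0 deg -> 1.0, up -> above 1, down -> below 1.
    return 2.0 ** (clamped * semitones_per_degree / 12.0)
```

An exponential mapping is chosen here so that equal upward and downward tilts produce equal musical intervals; a linear mapping would satisfy the description just as well.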

The drive control unit 130 controls the period of the drive pulse waveform applied to the actuator 131 based on the intonation information output by the control unit 110. When the pitch conversion ratio indicated by the intonation information is 1, the drive control unit 130 applies a drive pulse waveform with a standard period to the actuator 131, so that voice with a standard pitch is produced from the user's mouth. When the pitch conversion ratio increases from 1, the drive control unit 130 shortens the period of the drive pulse waveform in accordance with the increase, so that the pitch of the voice produced from the user's mouth rises above the standard pitch. When the pitch conversion ratio decreases from 1, the drive control unit 130 lengthens the period of the drive pulse waveform in accordance with the decrease, so that the pitch of the voice produced from the user's mouth falls below the standard pitch.
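The relationship between the pitch conversion ratio, the pulse period, and the pulse width can be sketched as follows. This is an illustration under stated assumptions, not the circuit of the drive control unit 130: the function name, the sample-based time base, and the standard period of 80 samples (100 Hz at an assumed 8 kHz sample rate) are hypothetical.

```python
from typing import List


def drive_pulse_train(pitch_ratio: float,
                      pulse_width_samples: int,
                      n_samples: int,
                      standard_period_samples: int = 80) -> List[int]:
    """Generate a periodic 0/1 drive pulse waveform.

    The period is the standard period divided by the pitch conversion ratio,
    so a ratio above 1 shortens the period (higher pitch) and a ratio below 1
    lengthens it (lower pitch). The pulse width is set independently and,
    per the description, governs the volume of the produced voice.
    """
    if pitch_ratio <= 0:
        raise ValueError("pitch_ratio must be positive")
    period = max(1, round(standard_period_samples / pitch_ratio))
    # A pulse cannot be wider than one period.
    width = max(0, min(pulse_width_samples, period))
    return [1 if (n % period) < width else 0 for n in range(n_samples)]
```

For example, `drive_pulse_train(2.0, 2, 160)` halves the period to 40 samples, so twice as many pulses fit in the same window as with a ratio of 1.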

The voice conversion unit 150 is means for applying, to the voice signal output from the microphone 140, voice conversion processing that facilitates recognition. Three modes are provided for this voice conversion processing: a voice/character conversion mode, a voice/voice conversion mode, and a voice processing mode. The voice/character conversion mode performs voice recognition processing on the voice signal output from the microphone 140 and outputs character information. The voice/voice conversion mode performs voice recognition processing on the voice signal output from the microphone 140 and synthesizes a voice signal based on the resulting character information. The voice processing mode performs voice recognition processing on the voice signal output from the microphone 140 and, based on the result, processes that voice signal into a voice signal that is easy to recognize. In the present embodiment, the user of the mobile phone 1000 can select a desired mode from among the three by operating the operation display unit 120 and have the voice conversion unit 150 execute it.

FIG. 2 is a block diagram showing the configuration of the voice conversion unit 150. As shown in FIG. 2, the voice conversion unit 150 includes a voice recognition unit 151, a voice synthesis unit 152, a voice processing unit 153, and a switch 154.

The voice recognition unit 151 is means for performing voice recognition on the voice signal output from the microphone 140 and outputting character information. When the pulse width of the drive pulse waveform applied to the actuator 131 is set short and the voice produced from the user's mouth is too quiet for surrounding people to hear, the level of the vowel sections of the voice signal output from the microphone 140 is very low, and the level of the consonant sections is lower still. In this case, the vowels in the voice signal can be recognized, but the consonants are difficult to recognize.

Therefore, the voice recognition unit 151 in the present embodiment determines word boundaries in the voice signal based on the intonation (pitch change) of the voice signal and then performs voice recognition. Specifically, the voice recognition unit 151 stores a database of pitch change patterns measured for various words used in conversation. When a pitch change pattern that matches one of the patterns in the database appears in the voice signal output from the microphone 140, the voice recognition unit 151 treats the section of the voice signal corresponding to that pitch change pattern as one word and performs voice recognition on it.
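The word-boundary step described above can be sketched as a pattern search over a pitch contour. The source does not say how a "match" is judged, so the mean-absolute-difference criterion, the `tolerance` parameter, the greedy left-to-right scan, and all names below are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def find_word_sections(contour: Sequence[float],
                       patterns: Dict[str, Sequence[float]],
                       tolerance: float = 0.5) -> List[Tuple[int, int, str]]:
    """Return (start, end, pattern_key) sections whose pitch change matches a stored pattern.

    contour  : per-frame pitch values of the input signal (hypothetical units).
    patterns : hypothetical database mapping a key to a stored pitch change pattern.
    """
    sections = []
    i = 0
    while i < len(contour):
        best = None  # (error, key, length) of the best match starting at frame i
        for key, pattern in patterns.items():
            window = contour[i:i + len(pattern)]
            if not pattern or len(window) < len(pattern):
                continue
            # Compare pitch *changes*: subtract each sequence's starting pitch first.
            err = sum(abs((w - window[0]) - (p - pattern[0]))
                      for w, p in zip(window, pattern)) / len(pattern)
            if err <= tolerance and (best is None or err < best[0]):
                best = (err, key, len(pattern))
        if best is not None:
            sections.append((i, i + best[2], best[1]))
            i += best[2]  # treat the matched section as one word and continue after it
        else:
            i += 1
    return sections
```

Each returned section would then be handed to the recognizer as a single word. A practical matcher would also need to tolerate differences in speaking rate, which this sketch does not attempt.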

In addition, while performing the voice recognition processing, the voice recognition unit 151 in the present embodiment refers to the voice recognition results obtained so far and performs voice recognition on sections whose recognition is not yet complete (sections containing a phoneme of unknown type). Suppose, for example, that the type of consonant in a certain section of the voice signal is unknown. In this case, the voice recognition processing estimates the type of the consonant from the context indicated by the characters recognized in the sections before and after that section. Specifically, several consonant candidates are selected based on the voice signal of the section whose consonant type is unknown. The candidates are then tried one at a time in that section, and it is determined whether the characters in the span consisting of that section and the sections before and after it form a meaningful sentence. The consonant that yields a meaningful sentence is selected as the voice recognition result.
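The candidate-substitution procedure above can be sketched as follows. How "meaningful" is judged is not specified in the source, so the `is_meaningful` callback, the `?` placeholder convention, and the toy romanized lexicon are hypothetical stand-ins (a real system would presumably use a language model or dictionary over whole phrases).

```python
from typing import Callable, Iterable, Optional


def resolve_unknown_consonant(template: str,
                              candidates: Iterable[str],
                              is_meaningful: Callable[[str], bool]) -> Optional[str]:
    """Try each candidate consonant in the unknown slot and keep the first that makes sense.

    template      : recognized text with '?' marking the section whose consonant is unknown.
    candidates    : consonant candidates chosen from the signal of that section.
    is_meaningful : hypothetical predicate deciding whether the filled text is meaningful.
    """
    for consonant in candidates:
        filled = template.replace("?", consonant, 1)
        if is_meaningful(filled):
            return consonant
    return None  # no candidate produced meaningful text


# Toy stand-in for a meaning check: membership in a small romanized word list.
LEXICON = {"sakana", "takara"}


def in_lexicon(text: str) -> bool:
    return text in LEXICON
```

With this toy lexicon, `resolve_unknown_consonant("?akana", ["t", "k", "s"], in_lexicon)` selects `"s"`, because only "sakana" is a known word.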

The voice synthesis unit 152 is means for synthesizing a voice signal based on the character information output from the voice recognition unit 151. Specifically, the voice synthesis unit 152 stores a database of voice waveforms of various speech units such as consonants and vowels. It reads from the database the voice waveforms of the consonant and vowel speech units indicated by the character information and joins them on the time axis to synthesize a voice signal.
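The concatenation step can be illustrated with a deliberately small sketch. The unit labels, the dictionary-based database, and the plain end-to-end joining are assumptions; the source does not describe how unit boundaries are smoothed, and this sketch does no smoothing.

```python
from typing import Dict, List, Sequence


def synthesize(units: Sequence[str],
               unit_db: Dict[str, Sequence[float]]) -> List[float]:
    """Join stored speech-unit waveforms on the time axis.

    units   : speech-unit labels derived from the character information (hypothetical).
    unit_db : hypothetical database mapping each label to its waveform samples.
    Raises KeyError if a unit is missing from the database.
    """
    signal: List[float] = []
    for unit in units:
        signal.extend(unit_db[unit])  # append this unit's samples after the previous unit
    return signal
```

Selecting a different `unit_db` corresponds to choosing a different voice type, as in the preferred aspect described next.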

In a preferred aspect, the voice synthesis unit 152 stores databases of speech-unit voice waveforms corresponding to various types of voice, such as a husky male voice and a clear female voice. By operating the operation display unit 120, the user can select a database of the desired type and use it for voice synthesis.

The voice synthesis unit 152 in the present embodiment has means for controlling the pitch of the synthesized voice signal based on the intonation information. The intonation information is generated by the control unit 110 based on the output signal of the tilt sensor 151 and indicates the change in pitch of the voice produced from the user's mouth. Accordingly, the voice signal output from the voice synthesis unit 152 has a pitch change similar to that of the voice produced from the user's mouth.

The voice processing unit 153 is means for processing the voice signal output from the microphone 140 into a voice signal that is easy to recognize, based on the voice recognition result of the voice recognition unit 151. As described above, when the user uses the actuator 131 to speak at a volume too low for surrounding people to hear, the volume of the voice signal output from the microphone 140 is extremely low, particularly in the consonant sections, and recognition is difficult. In the present embodiment, therefore, a database of voice signal waveforms representing various consonants is stored in the voice processing unit 153 in advance. In the voice signal output from the microphone 140, the voice processing unit 153 replaces the waveform of each section determined to be a consonant by the voice recognition processing of the voice recognition unit 151 with the appropriate voice signal waveform for that consonant from the database. The voice processing unit 153 then amplifies the voice signal after this replacement to an appropriate level that is easy to hear and outputs it.
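The replace-then-amplify processing can be sketched as below. Several details are assumptions not given in the source: consonant sections arrive as `(start, end, consonant)` sample ranges, a stored waveform is truncated or zero-padded to fit its section, the stored waveforms are at a level comparable to the quiet input, and amplification is a single fixed `gain`.

```python
from typing import Dict, List, Sequence, Tuple


def process_voice(signal: Sequence[float],
                  consonant_sections: Sequence[Tuple[int, int, str]],
                  consonant_db: Dict[str, Sequence[float]],
                  gain: float = 2.0) -> List[float]:
    """Replace recognized consonant sections with stored waveforms, then amplify.

    consonant_sections : (start, end, consonant) ranges from the recognizer (hypothetical format).
    consonant_db       : hypothetical database of consonant waveforms.
    """
    out = list(signal)
    for start, end, consonant in consonant_sections:
        length = end - start
        template = list(consonant_db[consonant])
        # Fit the stored waveform to the section: truncate or zero-pad (assumption).
        out[start:end] = (template + [0.0] * length)[:length]
    # Amplify the whole signal after replacement, as the description specifies.
    return [x * gain for x in out]
```

Only the consonant sections are replaced, so the vowel sections still carry the user's own voice, which is the point of this mode.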

The switch 154 is means for selecting one of the character information output by the voice recognition unit 151, the voice signal output by the voice synthesis unit 152, and the voice signal output by the voice processing unit 153, and outputting it to the transmission unit 201 shown in FIG. 1. When the voice conversion unit 150 is set to the voice/character conversion mode, the voice recognition unit 151 is activated, and the character information it outputs is supplied to the transmission unit 201 via the switch 154. When the voice conversion unit 150 is set to the voice/voice conversion mode, the voice recognition unit 151 and the voice synthesis unit 152 are activated, and the voice signal output by the voice synthesis unit 152 is supplied to the transmission unit 201 via the switch 154. When the voice conversion unit 150 is set to the voice processing mode, the voice recognition unit 151 and the voice processing unit 153 are activated, and the voice signal output by the voice processing unit 153 is supplied to the transmission unit 201 via the switch 154.
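The mode-dependent routing performed by the switch 154 can be summarized in a short dispatch sketch. The `Mode` enum, the `convert` function, the `("text" | "audio", payload)` return convention, and the three callbacks are all hypothetical names introduced for illustration.

```python
from enum import Enum
from typing import Any, Callable, Tuple


class Mode(Enum):
    VOICE_TO_TEXT = "voice/character conversion"
    VOICE_TO_VOICE = "voice/voice conversion"
    VOICE_PROCESSING = "voice processing"


def convert(mode: Mode,
            signal: Any,
            recognize: Callable[[Any], str],
            synthesize: Callable[[str], Any],
            process: Callable[[Any, str], Any]) -> Tuple[str, Any]:
    """Run the units the selected mode activates and return what goes to the transmitter."""
    text = recognize(signal)  # the recognition unit is active in all three modes
    if mode is Mode.VOICE_TO_TEXT:
        return ("text", text)
    if mode is Mode.VOICE_TO_VOICE:
        return ("audio", synthesize(text))
    if mode is Mode.VOICE_PROCESSING:
        return ("audio", process(signal, text))
    raise ValueError(f"unknown mode: {mode!r}")
```

Tagging the result as text or audio mirrors the fact that the transmitter sends either character information or a voice signal depending on the mode.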

In FIG. 1, the transmission unit 201 is means for transmitting a voice signal or character information to the other party. When the voice conversion unit 150 is set to the voice/character conversion mode, the transmission unit 201 transmits the character information output from the voice conversion unit 150 to the telephone of the party with whom the user of the mobile phone 1000 is speaking. When the voice conversion unit 150 is set to the voice/voice conversion mode or the voice processing mode, the transmission unit 201 transmits the voice signal output from the voice conversion unit 150 to that telephone. When starting a call, the mobile phone 1000 negotiates with the other party's telephone. During this negotiation, the mobile phone 1000 sends the other party's telephone information indicating whether the transmission unit 201 will transmit character information or a voice signal. This allows the other party's telephone to display character information when it is received and to emit sound when a voice signal is received.

The reception unit 202 is means for receiving a voice signal from the other party's telephone. The voice signal received by the reception unit 202 is sent to the speaker 170 via the adder 163 and emitted by the speaker 170.

In the present embodiment, the user of the mobile phone 1000 can monitor the processing result of the voice conversion unit 150 by operating the operation display unit 120. For example, while the voice/character conversion mode is set, the user can turn the switch 162 ON by operating the operation display unit 120. As a result, the character information output by the voice recognition unit 151 of the voice conversion unit 150 is sent to the operation display unit 120 via the switch 162 and displayed there. While the voice/voice conversion mode or the voice processing mode is set, the user can turn the switch 161 ON by operating the operation display unit 120. As a result, the voice signal output by the voice synthesis unit 152 or the voice processing unit 153 of the voice conversion unit 150 is sent to the speaker 170 via the switch 161 and the adder 163 and emitted from the speaker 170.
The above is a detailed description of the mobile phone 1000 according to the present embodiment.

In the present embodiment, when the user of the mobile phone 1000 makes a call on the mobile phone 1000 in a train, for example, the user operates the operation display unit 120 to set the pulse width of the drive pulse waveform applied to the actuator 131 to its minimum value. The user then turns the sound generation start button 134 ON and changes the shape of the oral cavity 1. As a result, voice is produced from the user's mouth at a volume that surrounding people cannot hear, and a voice signal representing this voice is output from the microphone 140. At this time, by tilting the head, the user can change the intonation information and thereby give intonation, that is, pitch change, to the voice produced from the mouth.

When the voice/character conversion mode is set, the voice recognition unit 151 of the voice conversion unit 150 performs voice recognition on the voice signal output from the microphone 140 and outputs character information. The voice signal output from the microphone 140 has a low level, and its consonants in particular are difficult to recognize. The voice recognition unit 151 therefore determines word boundaries in the voice signal based on the pitch changes appearing in it, and estimates each consonant from the context indicated by the characters before and after it, thereby generating character information from the voice signal. The character information output from the voice recognition unit 151 is sent by the transmission unit 201 to the other party's telephone and displayed by that telephone.

When the voice/voice conversion mode is set, the voice recognition unit 151 of the voice conversion unit 150 performs voice recognition processing on the voice signal output from the microphone 140 and outputs character information. The voice synthesis unit 152 synthesizes a voice signal from this character information, controls the pitch of the voice signal based on the intonation information, and outputs it. The voice signal output from the voice synthesis unit 152 is sent by the transmission unit 201 to the other party's telephone and emitted from that telephone's speaker.

When the voice processing mode is set, the voice recognition unit 151 of the voice conversion unit 150 performs voice recognition processing on the voice signal output from the microphone 140. Based on the result, the voice processing unit 153 replaces the waveform of each consonant section in the voice signal output from the microphone 140 with the voice signal waveform of the appropriate consonant, amplifies the voice signal after replacement to an appropriate level, and outputs it. The voice signal output from the voice processing unit 153 is sent by the transmission unit 201 to the other party's telephone and emitted from that telephone's speaker.

As described above, according to the present embodiment, the user of the mobile phone 1000 can use the vibration of the actuator 131 to speak at a volume low enough that surrounding people cannot hear, and can thereby input voice to the voice input device 100 of the mobile phone 1000. Even when voice is input quietly in this way, in the voice input device 100 the voice conversion unit 150 applies voice conversion processing that facilitates recognition to the voice signal obtained by the microphone 140, and the transmission unit 201 transmits the result to the other party. The user can therefore convey the desired information to the other party appropriately even when speaking quietly.

In the present embodiment, in the voice/character conversion mode, the voice signal obtained by the microphone 140 can be converted into character information and transmitted to the other party. The user can therefore convey the desired information to the other party accurately.

In the present embodiment, in the voice/voice conversion mode, the voice signal obtained by the microphone 140 can be converted into character information, and a voice signal can be synthesized from that character information and transmitted to the other party. The user can therefore convey the desired information to the other party in a form close to that of an ordinary call.

In the present embodiment, in the voice processing mode, only the waveform of the consonant sections of the voice signal obtained by the microphone 140 is replaced with the voice signal waveform of the appropriate consonant before transmission to the other party. The user can therefore convey to the other party a voice close to the one the user actually produces. This mode suits users who want their own voice to reach the other party.
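The relationship between the three modes can be summarized as a small dispatch routine. This is a structural sketch only: the `Mode` enum, the `convert` function, and the three callables standing in for the voice recognition unit 151, the voice synthesis unit 152, and the voice processing unit 153 are hypothetical names, and the return value format is an assumption.

```python
from enum import Enum


class Mode(Enum):
    VOICE_TO_CHARACTER = "voice/character conversion"
    VOICE_TO_VOICE = "voice/voice conversion"
    VOICE_PROCESSING = "voice processing"


def convert(mode, signal, recognize, synthesize, process):
    """Route a microphone signal through the conversion selected by mode.

    recognize(signal) -> text              (stands in for unit 151)
    synthesize(text) -> audio              (stands in for unit 152)
    process(signal, text) -> audio         (stands in for unit 153)
    Returns a (kind, payload) pair to hand to the transmission side.
    """
    text = recognize(signal)  # every mode starts with voice recognition
    if mode is Mode.VOICE_TO_CHARACTER:
        return ("text", text)
    if mode is Mode.VOICE_TO_VOICE:
        return ("audio", synthesize(text))
    if mode is Mode.VOICE_PROCESSING:
        return ("audio", process(signal, text))
    raise ValueError(f"unknown mode: {mode!r}")
```

Passing the three stages as callables keeps the sketch independent of any particular recognizer or synthesizer.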

The voice input device 100 according to the present embodiment can also be used by people whose vocal cords are intact. Both people without vocal cords and people with vocal cords can therefore use the mobile phone 1000 equipped with the voice input device 100 to make quiet calls that surrounding people cannot hear. Because a person without vocal cords makes calls in the same manner as a person with vocal cords, the device can motivate people without vocal cords to make calls using the mobile phone 1000.

<Other Embodiments>
Embodiments of the present invention have been described above, but other embodiments are also conceivable. Examples are as follows.

(1) In the above embodiment, the period of the drive pulse waveform supplied to the actuator 131 is controlled by the intonation information obtained from the output signal of the tilt sensor 151, but this period control may be omitted. In that case, the voice recognition unit 151 may determine word boundaries in the voice signal output from the microphone 140 based on the pitch change indicated by the intonation information.

(2) In the above embodiment, the intonation information is generated by the tilt sensor 151 worn on the user's head. However, the intonation information may be generated by a sensor other than a tilt sensor. For example, an acceleration sensor or the like may be attached to a part of the user's body and used to generate the intonation information. The sensor for generating the intonation information may also be attached to a part of the body other than the head. Alternatively, the user may operate an operator such as a slide switch to generate the intonation information.
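How sensor-derived intonation information might set the period of the drive pulse waveform, as discussed in (1) and (2), can be sketched as follows. The description does not give the mapping from sensor output to pitch, so the linear tilt-to-semitone mapping, the clamping range, the base pitch, and both function names are assumptions chosen only to make the idea concrete.

```python
def tilt_to_pitch_ratio(tilt_deg, max_tilt_deg=30.0, max_semitones=6.0):
    """Map a tilt angle to a pitch ratio relative to the base pitch.

    Assumption: tilt is clamped to +/- max_tilt_deg and mapped linearly to
    +/- max_semitones; 12 semitones correspond to a ratio of 2.
    """
    clamped = max(-max_tilt_deg, min(max_tilt_deg, tilt_deg))
    semitones = max_semitones * clamped / max_tilt_deg
    return 2.0 ** (semitones / 12.0)


def drive_pulse_period_s(tilt_deg, base_pitch_hz=100.0):
    """Return the drive pulse period in seconds for a given tilt angle.

    A higher pitch ratio means a higher vibration frequency and therefore
    a shorter period.
    """
    return 1.0 / (base_pitch_hz * tilt_to_pitch_ratio(tilt_deg))
```

Replacing `tilt_deg` with a normalized reading from an acceleration sensor or a slide switch position would leave the rest of the sketch unchanged.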

(3) In the voice/character conversion mode of the above embodiment, the intonation information may be transmitted to the other party's telephone in synchronization with the character information output from the voice recognition unit 151. In this case, the other party's telephone may synthesize a voice signal from the received character information and control the pitch of that voice signal based on the intonation information received in synchronization with the character information.

(4) In the voice/voice conversion mode of the above embodiment, instead of using the intonation information obtained from the output signal of the tilt sensor 151, intonation information indicating the ratio of the pitch of the voice signal output from the microphone 140 to a standard pitch may be generated, and the pitch of the voice signal synthesized by the voice synthesis unit 152 may be controlled based on this intonation information.
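The pitch-ratio form of intonation information in (4) can be illustrated as below. The description does not name a pitch estimation method; plain autocorrelation is used here as a stand-in, and the search range, the standard pitch value passed by the caller, and all function names are assumptions.

```python
import math


def sine_frame(freq_hz, sample_rate, n_samples):
    """Generate a test frame containing a pure tone (helper for demonstration)."""
    return [math.sin(2.0 * math.pi * freq_hz * n / sample_rate) for n in range(n_samples)]


def estimate_pitch(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate pitch in Hz from the lag with the largest autocorrelation.

    Returns None when no lag in the search range gives a positive score
    (for example, for a silent frame).
    """
    lag_min = int(sample_rate / fmax)
    lag_max = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_score = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None


def pitch_ratio(measured_hz, standard_hz):
    """Intonation information: measured pitch relative to the standard pitch."""
    return measured_hz / standard_hz


def controlled_pitch(synth_base_hz, ratio):
    """Pitch to request from the synthesizer for this frame."""
    return synth_base_hz * ratio
```

For instance, a frame produced by `sine_frame(200.0, 8000, 400)` can be passed to `estimate_pitch`, and the resulting ratio against a chosen standard pitch can then scale the synthesizer's own base pitch.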

(5) An operator for indicating the utterance timing of the beginning of a word may be provided, and the voice recognition unit 151 may detect the timing of the beginning of a word in the voice signal based on operation of this operator and perform voice recognition accordingly. Alternatively, the sound generation start button 134 of the above embodiment may be used as the operator that indicates the utterance timing of the beginning of a word.

(6) In the voice/voice conversion mode, the voice signal is synthesized from the character information obtained by the voice recognition processing, but the voice signal may instead be synthesized from an intermediate result obtained in the course of the voice recognition processing, such as formant information.

(7) In the voice processing mode, the voice signal may be processed by simple waveform conversion that does not use voice recognition processing. For example, consonant sections in the voice signal are detected based on the level and frequency of the voice signal obtained from the microphone 140, and level conversion or consonant enhancement is applied only to the detected consonant sections. Such processing brings the voice signal closer to that of a normal voice.
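A recognition-free variant like the one in (7) might look like the sketch below. The description only says that level and frequency are used; treating a frame as a consonant when it has a low average level and a high zero-crossing rate is an assumption (it fits fricatives better than plosives), as are the thresholds, the frame length, the fixed gain, and the function names.

```python
def frame_features(frame):
    """Return (mean absolute level, zero-crossing rate) for one frame."""
    level = sum(abs(s) for s in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    zcr = crossings / (len(frame) - 1)
    return level, zcr


def enhance_consonants(signal, frame_len=4, zcr_threshold=0.5,
                       level_threshold=0.2, gain=4.0):
    """Amplify only the frames classified as consonant sections.

    Assumption: a frame is a consonant candidate when its zero-crossing rate
    is at least zcr_threshold and its level is below level_threshold.
    frame_len must be at least 2. A trailing partial frame is left unchanged.
    """
    out = list(signal)
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        level, zcr = frame_features(frame)
        if zcr >= zcr_threshold and level < level_threshold:
            out[start:start + frame_len] = [s * gain for s in frame]
    return out
```

The tiny default `frame_len` is for readability only; real audio would use frames of several milliseconds.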

(8) The voice input device 100 may transmit the voice signal obtained from the microphone 140 to a server via a network, have the server execute the processing of the voice conversion unit 150, receive the resulting voice signal or character information from the server, and transmit it to the other party through the transmission unit 201. In this aspect, the mobile phone 1000 does not need to include the voice conversion unit 150, so an increase in the cost of the mobile phone 1000 can be avoided. To achieve the object of the present invention, it suffices that some device executes a voice acquisition process of acquiring a voice signal representing the voice uttered from the user's mouth in response to vibration applied to the user's throat, and a voice conversion process of applying, to the voice signal, conversion that facilitates recognition. Which devices execute the voice acquisition process and the voice conversion process, and how those devices are arranged, is arbitrary.

(9) In the above embodiment, the voice input device according to the present invention is used for a telephone, but its use is not limited to this. The voice input device according to the present invention is applicable to voice input devices in general, including, for example, text input devices that use voice.

1000: mobile phone, 201: transmission unit, 202: reception unit, 100: voice input device, 110: control unit, 120: operation display unit, 130: drive control unit, 131: actuator, 132: sound insulation hood, 133: belt, 140: microphone, 150: voice conversion unit, 161, 163, 154: switches, 163: adder, 151: voice recognition unit, 152: voice synthesis unit, 153: voice processing unit, 10: head, 1: oral cavity.

Claims

1. A voice input device comprising:
voice acquisition means for acquiring a voice signal representing a voice uttered from a user's mouth in response to vibration applied to the user's throat; and
a voice conversion unit for applying, to the voice signal, voice conversion processing that facilitates recognition.

2. The voice input device according to claim 1, further comprising drive control means for generating vibration to be applied to the throat of the user.

3. The voice input device according to claim 1, wherein the voice conversion unit performs voice recognition processing of the voice signal to generate character information.

4. The voice input device according to claim 1, wherein the voice conversion unit performs voice recognition processing of the voice signal, and synthesizes a voice signal based on a voice recognition processing result.

5. The voice input device according to claim 1 or 2, wherein the voice conversion unit performs voice recognition processing of the voice signal and, based on the voice recognition processing result, processes the voice signal into a voice signal that is easy to recognize.

6. The voice input device according to any one of claims 3 to 5, wherein the voice input means comprises an operator for operating the period of the vibration, and the voice conversion unit divides the voice signal on a time axis based on a change in pitch of the voice signal and performs the voice recognition processing.

7. The voice input device according to any one of claims 3 to 6, wherein the voice conversion unit executes the voice recognition processing based on the result of the character information obtained by the voice recognition processing.