JP2006119377A

JP2006119377A - Voice input device and method, and program and storage medium

Info

Publication number: JP2006119377A
Application number: JP2004307249A
Authority: JP
Inventors: Yutaka Hiyama; 豊檜山
Original assignee: Canon Electronics Inc
Current assignee: Canon Electronics Inc
Priority date: 2004-10-21
Filing date: 2004-10-21
Publication date: 2006-05-11

Abstract

PROBLEM TO BE SOLVED: To provide a voice input device and method, a program, and a storage medium with which voice input and voice recognition can be performed even for speech that is low in volume and not clearly articulated.

SOLUTION: In the voice input device, a throat microphone 1 is worn near the throat, where the vocal cords are located. When the wearer speaks, it detects the vocal cord vibration corresponding to the speech and converts the vibration into an electric signal, which is sent to an A/D conversion unit 2. The A/D conversion unit 2 converts the analog signal from the throat microphone 1 into a digital signal and sends it to a feature extraction unit 3. The feature extraction unit 3 transforms the digital signal into the frequency domain, extracts feature parameters there, and sends them to a collation unit 4 as a sequence of feature parameters of the utterance. The collation unit 4 compares the parameter sequence with the standard patterns of the words stored in a recognition dictionary unit 5 and selects the closest word. A CPU 6 processes the word selected from the recognition dictionary unit 5 as an input character.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a voice input device and method, a program, and a storage medium, and more particularly to a voice input/output device and method, a program, and a storage medium that make use of the vibration of the vocal cords when a person speaks.

Keyboards, touch panels, mice, and the like have conventionally been used as general input means for electronic equipment. In recent years, however, simpler input means have been sought, and input devices based on handwritten character recognition using a touch panel, input devices based on voice recognition, and the like have been devised.

For example, the voice-recognition input device disclosed in Patent Document 1 converts the air vibration produced by speech into an electric signal using a microphone or the like, samples the signal waveform, and compares the sampled data with standard patterns to recognize the speech uttered by a person, thereby inputting characters.

Patent Document 1: Japanese Patent Laid-Open No. 3-120595

However, a conventional voice recognition device requires the user to actually speak at a certain volume. When other people are nearby, the user either disturbs them or feels embarrassed. In addition, when the ambient noise is loud, the recognition rate drops.

An object of the present invention is to provide a voice input device and method, a program, and a storage medium that can perform voice input and voice recognition even for speech that is low in volume and not clearly articulated.

To achieve the above object, the voice input device according to claim 1 comprises conversion means for detecting the vibration of the vocal cords when a person utters a voice and converting the vibration into an electric signal, and recognition means for recognizing, from the converted electric signal, the voice uttered by the person as a voice signal.

The voice input method according to claim 3 comprises a conversion step of detecting the vibration of the vocal cords when a person utters a voice and converting the vibration into an electric signal, and a recognition step of recognizing, from the converted electric signal, the voice uttered by the person as a voice signal.

The control program according to claim 5 causes a computer to execute the voice input method according to claim 3.

The computer-readable storage medium according to claim 6 stores the control program according to claim 5.

According to the present invention, the vibration of the vocal cords when a person utters a voice is detected and converted into an electric signal, and the voice uttered by the person is recognized as a voice signal from the converted electric signal. Voice input and voice recognition can therefore be performed even with low-volume speech, without the speech being overheard by other people nearby and without being affected by ambient noise.

Embodiments of the present invention are described in detail below with reference to the drawings.

FIG. 1 is a block diagram schematically showing the configuration of a voice input device according to a first embodiment of the present invention.

In the voice input device of FIG. 1, the throat microphone 1 (conversion means) is worn near the throat, where the vocal cords are located. When the wearer utters a voice (which may be too quiet for people nearby to hear), the throat microphone 1 detects the vocal cord vibration corresponding to that voice and converts the vibration into an electric signal. The converted electric signal is sent to the A/D conversion unit 2. The A/D conversion unit 2 converts the analog signal sent from the throat microphone 1 into a digital signal and sends the digital signal to the feature extraction unit 3. The feature extraction unit 3 transforms the digital signal into the frequency domain, extracts feature parameters in the frequency domain, and sends them to the collation unit 4 (recognition means) as a sequence of feature parameters of the utterance. The collation unit 4 compares this sequence with the standard pattern of each word stored in the recognition dictionary unit 5 and selects the closest word. The data used for these standard patterns is not the data used in ordinary voice recognition but data created by directly detecting vocal cord vibration. The CPU 6 processes the word selected from the recognition dictionary unit 5 as an input character.

As one example of this processing, when a person's name is input, the CPU 6 retrieves that person's data (for example, customer data in a portable information terminal used by a salesperson) from the memory 7 and displays it on the display unit 8. The key input unit 9 provides cursor-movement keys, an enter key, and the like. The cursor keys are used, for example, to move to the next screen when the data does not fit on one screen, or to select an item when entering customer data. The enter key is pressed to confirm data that has been entered by voice once it has been correctly recognized. As a result, voice input and voice recognition can be performed even for low-volume, unclear speech, without the speech being overheard by other people nearby and without being affected by ambient noise.
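The chain from the feature extraction unit 3 to the collation unit 4 can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the patent: the frame length `FRAME`, the number of spectral bins `BANDS`, the naive DFT used as the frequency transform, the frame-by-frame Euclidean distance used for matching, and the synthetic `tone` signals that stand in for digitized throat-microphone output.

```python
import cmath
import math

FRAME = 64  # samples per analysis frame (hypothetical value)
BANDS = 8   # spectral bins kept as feature parameters (hypothetical value)


def extract_features(samples):
    """Feature extraction unit 3 (sketch): one magnitude spectrum per frame."""
    features = []
    for start in range(0, len(samples) - FRAME + 1, FRAME):
        frame = samples[start:start + FRAME]
        spectrum = []
        for k in range(1, BANDS + 1):
            # Naive DFT of bin k; a real device would likely use an FFT.
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / FRAME)
                      for n, x in enumerate(frame))
            spectrum.append(abs(acc) / FRAME)
        features.append(spectrum)
    return features


def distance(seq_a, seq_b):
    """Mean Euclidean distance between two feature sequences, frame by frame."""
    n = min(len(seq_a), len(seq_b))
    if n == 0:
        return math.inf
    return sum(math.dist(a, b) for a, b in zip(seq_a, seq_b)) / n


def match_word(features, dictionary):
    """Collation unit 4 (sketch): pick the word whose standard pattern is closest."""
    return min(dictionary, key=lambda word: distance(features, dictionary[word]))


def tone(freq_bin, frames=4):
    """Synthetic stand-in for a digitized throat-microphone signal."""
    return [math.sin(2 * math.pi * freq_bin * n / FRAME)
            for n in range(FRAME * frames)]


# Recognition dictionary unit 5 (sketch): standard patterns built from the same
# kind of signal that will be recognized, not from ordinary air-conducted speech.
dictionary = {
    "yes": extract_features(tone(2)),
    "no": extract_features(tone(5)),
}

# A quieter version of the "no" signal is still closest to the "no" pattern.
utterance = [0.3 * s for s in tone(5)]
print(match_word(extract_features(utterance), dictionary))
```

The sketch compares sequences frame by frame and therefore assumes utterances of similar length; a practical matcher would need some form of time alignment, which the text above does not specify.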

FIG. 2 is a block diagram schematically showing the configuration of a voice input device according to a second embodiment of the present invention.

The configuration of this embodiment differs from that of FIG. 1 in that it has a voice data unit 10, a voice synthesis unit 11, a D/A conversion unit 12, and a speaker 13 in place of the display unit 8 and the key input unit 9 of FIG. 1. In other respects it is basically the same as the configuration of FIG. 1; components identical to those of FIG. 1 are given the same reference numerals, and redundant description of them is omitted.

In the voice input/output device of FIG. 2, the CPU 6 temporarily stores each word recognized by the collation unit 4 in the memory 7 and, at the same time, concatenates the words input so far and performs syntax analysis on them. When the result is recognized as one sentence, the CPU 6 reads voice data optimized for that sentence, such as its reading, accent, intonation, and pauses, from the voice data unit 10 and sends it to the voice synthesis unit 11 (synthesis means). The voice synthesis unit 11 synthesizes a voice waveform based on the received voice data and sends it to the D/A conversion unit 12. The D/A conversion unit 12 converts the digital signal into an analog signal, and the resulting analog voice signal is output as sound by the speaker 13 (output means). As a result, a person who cannot speak at sufficient volume, or who can speak only quietly, can be heard clearly at sufficient volume. Although not shown, the output volume can also be adjusted if a volume control or the like is provided.
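The word buffering and sentence detection performed by the CPU 6 can be sketched as follows. This is only an illustration under stated assumptions: the class name `SentenceBuffer` is hypothetical, the syntax analysis is replaced by a plain lookup of known sentences, and the prosody dictionary stands in for the contents of the voice data unit 10. The patent does not describe how the analysis or the voice data are actually implemented.

```python
class SentenceBuffer:
    """Sketch of the CPU 6 loop: store words, detect a sentence, fetch its voice data."""

    def __init__(self, speech_data):
        self.words = []                 # plays the role of the word store in memory
        self.speech_data = speech_data  # plays the role of the voice data unit 10

    def add_word(self, word):
        """Store a recognized word; return voice data once the words form a sentence."""
        self.words.append(word)
        sentence = " ".join(self.words)
        # Stand-in for syntax analysis (assumption): a sentence is "recognized"
        # when the concatenated words match a known entry.
        prosody = self.speech_data.get(sentence)
        if prosody is None:
            return None                 # not yet a complete sentence
        self.words.clear()              # start collecting the next sentence
        return prosody                  # would be passed to the voice synthesis unit 11


# Hypothetical voice data: reading, accent position, and pause length per sentence.
speech_data = {
    "good morning": {"reading": "good morning", "accent": 1, "pause_ms": 200},
}

buffer = SentenceBuffer(speech_data)
first = buffer.add_word("good")      # no sentence yet
second = buffer.add_word("morning")  # sentence complete, voice data returned
print(first, second)
```

In this sketch the buffer keeps growing if the words never form a known sentence; a real device would need a timeout or reset rule, which the description leaves open.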

It goes without saying that the object of the present invention is also achieved by supplying a system or apparatus with a storage medium (or recording medium) on which the program code of software that realizes the functions of the above embodiments is recorded, and having the computer (or CPU or MPU) of that system or apparatus read and execute the program code stored in the storage medium.

In this case, the program code read from the storage medium itself realizes the functions of the above embodiments, and the storage medium storing the program code constitutes the present invention.

It also goes without saying that the invention covers not only the case where the functions of the above embodiments are realized by the computer executing the program code it has read, but also the case where an operating system (OS) or the like running on the computer performs part or all of the actual processing based on the instructions of the program code, and the functions of the above embodiments are realized by that processing.

Furthermore, it goes without saying that the invention also covers the case where the program code read from the storage medium is written into a memory provided on a function expansion card inserted into the computer or in a function expansion unit connected to the computer, after which a CPU or the like provided on that function expansion card or function expansion unit performs part or all of the actual processing based on the instructions of the program code, and the functions of the above embodiments are realized by that processing.

The program need only be able to realize the functions of the above embodiments on a computer, and it may take the form of object code, a program executed by an interpreter, script data supplied to an OS, or the like.

The recording medium for supplying the program may be any medium capable of storing the program, for example, a RAM, an NV-RAM, a floppy (registered trademark) disk, an optical disk, a magneto-optical disk, a CD-ROM, an MO, a CD-R, a CD-RW, a DVD (DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), a magnetic tape, a non-volatile memory card, or another ROM. Alternatively, the program may be supplied by downloading it from another computer, database, or the like (not shown) connected to the Internet, a commercial network, a local area network, or the like.

FIG. 1 is a block diagram schematically showing the configuration of the voice input device according to the first embodiment of the present invention.
FIG. 2 is a block diagram schematically showing the configuration of the voice input device according to the second embodiment of the present invention.

Explanation of symbols

1 Throat microphone
2 A/D conversion unit
3 Feature extraction unit
4 Collation unit
5 Recognition dictionary unit
6 CPU
7 Memory
8 Display unit
9 Key input unit
10 Voice data unit
11 Voice synthesis unit
12 D/A conversion unit
13 Speaker

Claims

1. A voice input device comprising: conversion means for detecting vibration of the vocal cords when a person utters a voice and converting the vibration into an electric signal; and recognition means for recognizing, from the converted electric signal, the voice uttered by the person as a voice signal.

2. The voice input device according to claim 1, further comprising: synthesis means for synthesizing the voice signal recognized by the recognition means; and output means for outputting the synthesized voice signal as a voice.

3. A voice input method comprising: a conversion step of detecting vibration of the vocal cords when a person utters a voice and converting the vibration into an electric signal; and a recognition step of recognizing, from the converted electric signal, the voice uttered by the person as a voice signal.

4. The voice input method according to claim 3, further comprising: a synthesis step of synthesizing the voice signal recognized in the recognition step; and an output step of outputting the synthesized voice signal as a voice.

5. A control program for causing a computer to execute the voice input method according to claim 3.

6. A computer-readable storage medium storing the control program according to claim 5.