JP2001290495A

JP2001290495A - Device and method for speech recognition, and storage medium

Info

Publication number: JP2001290495A
Application number: JP2000103491A
Authority: JP
Inventors: Shigeru Nishikawa; 成西川
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2000-04-05
Filing date: 2000-04-05
Publication date: 2001-10-19

Abstract

PROBLEM TO BE SOLVED: To authenticate a user and respond only when the user himself or herself makes a speech input. SOLUTION: Whether or not the user's speech signal has been input to a high-sensitivity bone conduction microphone 105 is detected. A signal analysis section 108 extracts a user feature amount and a speech feature amount from the input speech signal. A user authentication section 106 refers to a user feature amount stored beforehand in a RAM 102 and computes the degree of similarity between it and the extracted user feature amount. When the user is not authenticated, the process stops here. A speech recognition section 109 performs speech recognition on the speech signal by referring to the speech feature amount stored in the RAM 102. The speech recognition result obtained by the section 109 is converted into character information by a CPU 101, the information is displayed in a text display area 401, and at the same time a message requesting the next speech input is displayed for the user in an information display area 403.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of the Invention] The present invention relates to a speech recognition apparatus to which speech is input in order to control a device by speech or to process speech as information, a control method thereof, and a storage medium.

[0002]

[Description of the Related Art] Conventional speech recognition apparatuses to which speech is input in order to control a device by speech or to process speech as information include telephones with a speech recognition function, car navigation systems, and personal computers.

[0003] When the security of a user's personal information is required in using these apparatuses, it is desirable that an apparatus having such a speech recognition function respond to, and be operable by, only the voice of the user himself or herself.

[0004] One technique for responding to (performing speech recognition on) only the voice of the user himself or herself is disclosed in, for example, Japanese Patent Laid-Open No. 8-186654.

[0005] Japanese Patent Laid-Open No. 8-186654 relates to a portable terminal device provided with speech recognition input means, and is characterized in that the speech recognition input means comprises speech determination means for determining, based on a personal speech feature amount that identifies the user of the portable terminal device, whether or not input speech is the voice of that user, and means for outputting the speech recognition data only when the speech determination means determines that the input speech is the voice of that user.

[0006]

[Problems to Be Solved by the Invention] However, an apparatus having a conventional speech recognition function may respond when the voice of the user is recorded on a storage medium such as a tape and the recorded data (the user's own voice) is input to the apparatus. In addition, speech may not be recognized correctly under the influence of noise, such as the voices of other people nearby. Conventional apparatuses did not take these problems into account.

[0007] An object of the present invention is to solve the above problems and to provide a speech recognition apparatus that authenticates the user and responds only when the user himself or herself has input speech, a control method thereof, and a storage medium.

[0008]

[Means for Solving the Problems] To achieve the object of the present invention, the present invention has, for example, the following arrangement. That is, there is provided a speech recognition apparatus which, based on a speech signal input by a user, performs speech recognition and extraction of a user feature amount that identifies the user, and which responds using the speech recognition result when the user is authenticated as a result of user authentication using the user feature amount, the apparatus comprising: input means for inputting the speech signal using a bone conduction microphone; speech feature amount extraction means for extracting a speech feature amount from the input speech signal; user feature amount extraction means for extracting the user feature amount from the input speech signal; speech recognition means for performing speech recognition using the speech feature amount; authentication means for authenticating the user using the user feature amount; and response means for responding using the speech recognition result when the user is authenticated.

[0009]

[Embodiments of the Invention] Embodiments of the present invention will be described below in detail with reference to the accompanying drawings.

[0010] [First Embodiment] This embodiment describes the processing in each internal unit of a speech recognition apparatus that identifies a legitimate user by voice and outputs input speech as character information only when the legitimate user has input the speech.

[0011] FIG. 1 is a block diagram showing the inside of the speech recognition apparatus according to this embodiment.

[0012] Reference numeral 1 denotes the speech recognition apparatus described above.

[0013] Reference numeral 101 denotes a CPU which performs the processes described later based on the various program codes stored in the ROM 103.

[0014] Reference numeral 102 denotes a RAM which stores the speech feature amount used when the speech recognition unit 109 performs speech recognition, the user feature amount used when the user authentication unit 106 authenticates the user, and various other control data and data. It also provides a work area used by the CPU 101 while executing program code.

[0015] Reference numeral 103 denotes a ROM which stores the various program codes executed by the CPU 101. It also stores character codes and the like for displaying characters on the display unit 110.

[0016] Reference numeral 105 denotes a high-sensitivity bone conduction microphone. Worn on the user's ear, it uses a known technique to detect vibrations inside the ear, including high-frequency components, with higher sensitivity than an ordinary microphone. As a result, the user's voice can be detected with little influence from ambient noise. The high-sensitivity bone conduction microphone 105 also converts the detected vibrations inside the ear into an electrical signal (acoustoelectric conversion).

[0017] Reference numeral 106 denotes a user authentication unit which, under the control of the CPU 101, refers to the user feature amount stored in the RAM 102 and authenticates the user by a method described later.

[0018] Reference numeral 107 denotes a signal amplification unit which amplifies the user's speech signal input via the high-sensitivity bone conduction microphone 105.

[0019] Reference numeral 108 denotes a signal analysis unit which calculates the pitch frequency and the power spectrum from the speech signal amplified by the signal amplification unit 107, using a method described later, and extracts the speech feature amount and the user feature amount described above. In this embodiment, the cepstrum is used as both the speech feature amount and the user feature amount.

[0020] Reference numeral 109 denotes a speech recognition unit which, under the control of the CPU 101, refers to the speech feature amount stored in the RAM 102 and performs speech recognition on the speech signal input by the user. DP matching is used as the speech recognition method, but other known methods and techniques, such as HMMs, may obviously be used instead.
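
The DP matching mentioned above can be illustrated with a minimal sketch. It assumes that the stored speech feature amounts are reference templates, each a sequence of feature vectors, and that recognition picks the template with the smallest accumulated distance. The function names, the Euclidean frame distance, and the template dictionary are assumptions for illustration and are not taken from the publication.

```python
import math


def dp_matching_distance(a, b):
    """Accumulated dynamic-programming (DTW-style) distance between two
    sequences of feature vectors. Hypothetical helper, not from the source."""
    def frame_dist(x, y):
        # Euclidean distance between two feature vectors (an assumption).
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))

    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best accumulated distance aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch b
                                 cost[i][j - 1],      # stretch a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def recognize(features, templates):
    """Return the label of the stored template closest to the input."""
    return min(templates,
               key=lambda label: dp_matching_distance(features, templates[label]))
```

Because the alignment may stretch either sequence, an utterance spoken more slowly than its template can still match it with a small distance.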

[0021] Reference numeral 110 denotes a display unit composed of a CRT, a liquid crystal screen, or the like, which shows the result of speech recognition and other information to the user. In this embodiment, when the user who input the speech is authenticated by the user authentication unit 106, the CPU 101 converts the result of the speech recognition by the speech recognition unit 109 into character information, and the converted result is displayed on the display unit 110. FIG. 4 shows an example of the display screen when the result of the speech recognition by the speech recognition unit 109, converted into character information by the CPU 101, is displayed on the display unit 110.

[0022] Reference numeral 401 denotes a text display area, which displays the result of converting into character information the speech signal input by a legitimate user authenticated by the user authentication unit 106. The character information displayed in the text display area 401 in the figure uses kanji; this is described later.

[0023] Reference numeral 402 denotes a cursor.

[0024] Reference numeral 403 denotes an information display area, in which system messages and the like from the speech recognition apparatus 1 are displayed. For example, when the signal level of the speech signal input to the high-sensitivity bone conduction microphone 105 is at or below a predetermined level described later, a message such as "Your voice is too quiet." is displayed in this area. These messages are stored in the ROM 103 as character code data.

[0025] Reference numeral 404 denotes a user name display area, in which the user uses the operation unit 111 to enter a keyword (described later) at the time of authentication.

[0026] In the display screen configuration described above, once all of the speech signals input to the speech recognition apparatus 1 have been converted into characters and displayed in the text display area, a message prompting the user to input the next speech is displayed in the information display area. In the figure, for example, "Please input speech." is displayed.

[0027] Returning to FIG. 1, reference numeral 111 denotes an operation unit composed of various switches and the like, which is used to make various settings of the speech recognition apparatus 1.

[0028] Reference numeral 112 denotes an interface unit (hereinafter, I/F) to which a peripheral device such as a printer can be connected. For example, the character information output by the speech recognition apparatus 1 can be output to a printer via the I/F 112 and printed on paper.

[0029] Reference numeral 113 denotes a bus connecting the units described above.

[0030] The speech feature amount and the user feature amount stored in the RAM 102 are assumed to have been updated in advance with the user's latest training speech data.

[0031] Next, the flow of the processing performed by the internal units of the speech recognition apparatus 1 of this embodiment, configured as described above, is explained with reference to the flowchart shown in FIG. 2.

[0032] In step S101, it is detected whether or not the user's speech signal has been input to the high-sensitivity bone conduction microphone 105. When a speech signal is detected at the high-sensitivity bone conduction microphone 105, the process proceeds to the next step, step S102.

[0033] In step S102, the signal level of the speech signal input to the high-sensitivity bone conduction microphone 105 is detected. If this signal level is equal to or higher than the reference level set with the operation unit 111, the process proceeds to the next step, step S103. The reference level may instead be stored in the ROM 103, in which case it need not be set with the operation unit 111. If the signal level of the speech signal is below the reference level, a message such as "Your voice is too quiet." is displayed in the information display area 403.

[0034] In step S103, the speech signal input via the high-sensitivity bone conduction microphone 105 is amplified by the signal amplification unit 107 to the target signal level set with the operation unit 111. The target signal level may instead be stored in the ROM 103, in which case it need not be set with the operation unit 111.
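
Steps S102 and S103 can be sketched as follows. The publication does not say how the signal level is measured, so this sketch assumes the root-mean-square (RMS) value of a block of samples; the function names and the numeric levels in the usage note are likewise hypothetical.

```python
import math


def rms_level(samples):
    """RMS level of a block of samples (the level measure is an assumption)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def gate_and_amplify(samples, reference_level, target_level):
    """Return (amplified_samples, message).

    Step S102: reject input whose level is below the reference level and
    return the message to show in the information display area.
    Step S103: otherwise scale the samples so that their RMS level equals
    the target level.
    """
    level = rms_level(samples)
    if level < reference_level or level == 0.0:
        return None, "Your voice is too quiet."
    gain = target_level / level
    return [s * gain for s in samples], None
```

For example, `gate_and_amplify(block, 0.05, 1.0)` would reject a block whose RMS level is 0.01 and scale a block whose RMS level is 0.1 by a gain of 10.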

[0035] In step S104, based on the speech signal amplified to the predetermined target signal level by the signal amplification unit 107, the signal analysis unit 108 extracts the user feature amount and the speech feature amount by a method described later.

[0036] In step S105, a message prompting the user to enter a keyword (for example, a user name) in the user name display area 404 is first displayed in the information display area 403. Meanwhile, under the control of the CPU 101, the user authentication unit 106 refers to the user feature amount stored in advance in the RAM 102 and calculates its similarity to the user feature amount extracted in step S104. A correlation coefficient is used as this similarity; that is, the correlation coefficient between the user feature amount stored in advance in the RAM 102 and the user feature amount extracted in step S104 is calculated. If this similarity is within the reference similarity set with the operation unit 111, and the keyword linked to this user feature amount data (the keyword that the user indicated by this user feature amount stored in the RAM 102 in advance) matches the keyword that the user entered in the user name display area 404 with the operation unit 111, the user who input the speech signal detected in step S101 is authenticated and the process proceeds to step S106. The reference similarity may instead be stored in the ROM 103, in which case it need not be set with the operation unit 111.
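
The authentication decision of step S105 can be sketched as below. The sketch uses the Pearson correlation coefficient as the similarity and assumes that "within the reference similarity" means the coefficient is at least a threshold. That interpretation, the threshold value 0.9, and the layout of the enrolled record are assumptions, not details given in the publication.

```python
import math


def correlation(x, y):
    """Pearson correlation coefficient of two equal-length feature vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant vector carries no shape to compare
    return cov / math.sqrt(vx * vy)


def authenticate(extracted, enrolled, entered_keyword, threshold=0.9):
    """Authenticate only if both conditions of step S105 hold.

    enrolled: hypothetical record {"features": [...], "keyword": "..."}
    holding the stored user feature amount and its linked keyword.
    """
    similar = correlation(extracted, enrolled["features"]) >= threshold
    return similar and entered_keyword == enrolled["keyword"]
```

Requiring both the voice match and the keyword match means that neither a similar voice alone nor knowledge of the keyword alone is enough to pass.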

[0037] If the user is not authenticated, the process ends here. Alternatively, because the user may simply have input the speech signal incorrectly, the process may return to step S101 up to a number of times set with the operation unit 111, giving the user that many further chances to input the speech signal. In that case, a message such as "Please input again." is displayed in the information display area 403.

[0038] In step S106, the speech recognition unit 109 refers to the speech feature amount stored in the RAM 102 and performs speech recognition on the speech signal, using the known method mentioned above. This speech recognition processing in the speech recognition unit 109 may be performed in parallel with, or before, step S105.

[0039] In step S107, the CPU 101 converts the result of the speech recognition performed by the speech recognition unit 109 in step S106 into character information. This character information is reading (pronunciation) information, and the CPU 101 converts the readings into kanji using the kanji conversion table shown in FIG. 6. The kanji conversion table is stored in the ROM 103. As a result, the converted kanji are displayed in the text display area 401 as the final character information, and a message prompting the user to input the next speech is displayed in the information display area 403.
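
The reading-to-kanji conversion of step S107 amounts to a table lookup. The contents of the table in FIG. 6 are not reproduced in this text, so the entries below are hypothetical stand-ins, as is the choice to pass unknown readings through unchanged.

```python
# Hypothetical stand-in for the kanji conversion table of FIG. 6;
# the real table contents are not given in the text.
KANJI_TABLE = {
    "おんせい": "音声",
    "にんしき": "認識",
}


def to_kanji(readings, table=KANJI_TABLE):
    """Convert a list of readings to display text.

    Readings found in the table are replaced by their kanji; readings
    not in the table are kept as they are (an assumption).
    """
    return "".join(table.get(reading, reading) for reading in readings)
```

With these sample entries, `to_kanji(["おんせい", "にんしき"])` yields the text that would be shown in the text display area 401.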

[0040] FIG. 3 shows the flow of the user's speech signal inside the speech recognition apparatus 1. The figure corresponds to the description above, so its explanation is omitted.

[0041] FIG. 5 shows a flowchart of the processing by which the signal analysis unit 108 calculates the speech feature amount and the user feature amount from the speech signal. The flowchart is described below.

[0042] In step S501, the speech signal is subjected to a fast Fourier transform (FFT). Because the purpose of this step is to extract the frequency components of the speech signal, the step is not limited to the FFT; a wavelet transform may be used instead.

[0043] In step S502, the frequency components of the speech signal are analyzed. Specifically, a histogram is generated for each of the frequency components of the speech signal extracted in step S501 (spectrum conversion).

[0044] In step S503, because speech contains the pitch, which is the fundamental frequency of the human vocal cords, the pitch is detected from the spectrum, and envelope information is detected as well.

[0045] In step S504, because the low-frequency component of the envelope information detected in step S503 contains vocal tract information, which is one kind of personal information about a person, the low-frequency component is extracted from the envelope information.

[0046] In step S505, the vocal tract information of the user extracted in step S504 is converted into a cepstrum, a feature parameter that represents personal information well.
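
Steps S501 to S505 resemble textbook cepstral analysis, which can be sketched as follows. This is not necessarily the exact procedure of FIG. 5: the sketch computes the real cepstrum directly from the log power spectrum, takes the low-order cepstral coefficients as the envelope (vocal tract) information, and takes the strongest cepstral peak in a pitch range as the pitch. The naive DFT standing in for the FFT, the coefficient count, and the 60 to 400 Hz search range are all assumptions.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform, standing in for the FFT."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def real_cepstrum(frame):
    """Real cepstrum: inverse DFT of the log power spectrum."""
    n = len(frame)
    # Small constant avoids log(0) for empty frequency bins.
    log_power = [math.log(abs(c) ** 2 + 1e-12) for c in dft(frame)]
    # The log power spectrum of a real frame is real and symmetric,
    # so its inverse DFT reduces to a cosine sum.
    return [sum(log_power[k] * math.cos(2 * math.pi * k * q / n)
                for k in range(n)) / n
            for q in range(n)]


def analyze(frame, sample_rate, n_envelope=12, f_min=60.0, f_max=400.0):
    """Return (pitch_hz, envelope_coeffs) for one frame.

    The frame must span at least one pitch period at f_max.
    """
    cep = real_cepstrum(frame)
    # Low-order coefficients describe the smooth spectral envelope.
    envelope_coeffs = cep[1:n_envelope + 1]
    # The pitch appears as a peak at the index equal to the pitch period.
    q_lo = int(sample_rate / f_max)
    q_hi = min(int(sample_rate / f_min), len(frame) // 2)
    peak_q = max(range(q_lo, q_hi + 1), key=lambda q: cep[q])
    return sample_rate / peak_q, envelope_coeffs
```

Separating the two regions of the cepstrum is what lets one analysis feed both uses described in the text: the envelope coefficients characterize the speaker's vocal tract, while the peak gives the pitch.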

[0047] By carrying out the processing method described above inside the speech recognition apparatus 1 as its control method, a legitimate user can be authenticated, and the legitimate user's speech recognized, independently of ambient noise. Thereafter, the apparatus can be made to perform various kinds of processing.

[0048] [Other Embodiments] The above-described embodiment may be applied to a system composed of a plurality of devices (for example, a host computer, an interface device, a reader, and a printer) or to an apparatus consisting of a single device (for example, a copying machine or a facsimile machine).

[0049] Needless to say, the object of the above-described embodiment is also achieved by supplying a system or an apparatus with a storage medium (or recording medium) on which is recorded software program code that realizes the functions of the embodiment, and by having a computer (or a CPU or MPU) of the system or apparatus read and execute the program code stored in the storage medium. In this case, the program code read from the storage medium itself realizes the functions of the embodiment, and the storage medium storing the program code constitutes the embodiment. Needless to say, this also covers not only the case where the functions of the embodiment are realized by the computer executing the program code it has read, but also the case where an operating system (OS) or the like running on the computer performs part or all of the actual processing based on the instructions of the program code, and the functions of the embodiment are realized by that processing.

[0050] Furthermore, needless to say, this also covers the case where the program code read from the storage medium is written into a memory provided in a function expansion card inserted into the computer or in a function expansion unit connected to the computer, after which a CPU or the like provided in the function expansion card or function expansion unit performs part or all of the actual processing based on the instructions of the program code, and the functions of the embodiment are realized by that processing.

[0051] When the above-described embodiment is applied to the storage medium, the storage medium stores program code corresponding to the flowcharts described above (shown in FIG. 2 and FIG. 5).

[0052]

[Effect of the Invention] The present invention has the effect of authenticating the user and responding only when the user himself or herself has input speech.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the inside of a speech recognition apparatus according to a first embodiment of the present invention.

FIG. 2 is a flowchart showing the flow of processing performed by the speech recognition apparatus according to the first embodiment of the present invention.

FIG. 3 is a diagram showing the flow of a user's speech signal inside the speech recognition apparatus.

FIG. 4 is a diagram showing an example of the display screen when the result of speech recognition by the speech recognition unit 109, converted into character information by the CPU 101, is displayed on the display unit 110.

FIG. 5 is a flowchart of the signal analysis unit 108.

FIG. 6 is a diagram showing a kanji conversion table.

Claims

[Claims]

1. A speech recognition apparatus which, based on a speech signal input by a user, performs speech recognition and extraction of a user feature amount specifying the user, and which responds using a speech recognition result when the user is authenticated as a result of user authentication using the user feature amount, comprising: input means for inputting the speech signal using a bone conduction microphone; speech feature amount extraction means for extracting a speech feature amount from the input speech signal; user feature amount extraction means for extracting the user feature amount from the input speech signal; speech recognition means for performing speech recognition using the speech feature amount; authentication means for authenticating the user using the user feature amount; and response means for responding using the speech recognition result when the user is authenticated.

2. The speech recognition apparatus according to claim 1, further comprising calculation means for calculating a similarity between the user feature amount extracted by the user feature amount extraction means and a user feature amount prepared in advance, wherein the authentication means authenticates the user when the similarity is within a predetermined range.

3. The speech recognition apparatus according to claim 2, wherein the user feature amount prepared in advance is updated with the latest training speech data of the user.

4. The speech recognition apparatus according to claim 1, wherein said speech recognition means performs speech recognition with reference to a user's speech feature quantity prepared in advance.

5. The speech recognition apparatus according to claim 4, wherein the speech feature amount of the user prepared in advance is updated with the latest training speech data of the user.

6. A speech recognition method which, based on a speech signal input by a user, performs speech recognition and extraction of a user feature amount specifying the user, and which responds using a speech recognition result when the user is authenticated as a result of user authentication using the user feature amount, comprising: an input step of inputting the speech signal to predetermined input means using a bone conduction microphone; a speech feature amount extraction step of extracting a speech feature amount from the input speech signal; a user feature amount extraction step of extracting the user feature amount from the input speech signal; a speech recognition step of performing speech recognition using the speech feature amount; an authentication step of authenticating the user using the user feature amount; and a response step of responding using the speech recognition result when the user is authenticated.

7. The speech recognition method according to claim 6, further comprising a calculation step of calculating a similarity between the user feature amount extracted in the user feature amount extraction step and a user feature amount prepared in advance, wherein in the authentication step the user is authenticated when the similarity is within a predetermined range.

8. The speech recognition method according to claim 7, wherein the user feature amount prepared in advance is updated with the latest training speech data of the user.

9. The speech recognition method according to claim 6, wherein in the speech recognition step, speech recognition is performed with reference to a user voice feature quantity prepared in advance.

10. The voice recognition method according to claim 9, wherein the voice feature amount of the user prepared in advance is updated with the latest training voice data of the user.

11. A storage medium storing program code which, when read by a computer, causes the computer to function as a speech recognition apparatus which, based on a speech signal input by a user, performs speech recognition and extraction of a user feature amount specifying the user, and which responds using a speech recognition result when the user is authenticated as a result of user authentication using the user feature amount, the storage medium comprising: program code for an input step of inputting the speech signal to predetermined input means using a bone conduction microphone; program code for a speech feature amount extraction step of extracting a speech feature amount from the input speech signal; program code for a user feature amount extraction step of extracting the user feature amount from the input speech signal; program code for a speech recognition step of performing speech recognition using the speech feature amount; program code for an authentication step of authenticating the user using the user feature amount; and program code for a response step of responding using the speech recognition result when the user is authenticated.

12. The storage medium according to claim 11, further comprising program code for a calculation step of calculating a similarity between the user feature amount extracted in the user feature amount extraction step and a user feature amount prepared in advance, wherein the program code for the authentication step authenticates the user when the similarity is within a predetermined range.

13. The storage medium according to claim 12, wherein the user feature amount prepared in advance is updated with the latest training voice data of the user.

14. The storage medium according to claim 13, wherein in the voice recognition step, voice recognition is performed with reference to a voice feature amount of a user prepared in advance.

15. The storage medium according to claim 14, wherein the voice feature amount of the user prepared in advance is updated by the latest training voice data of the user.