JP2001100786A

JP2001100786A - Method and device for speech recognition, and storage medium

Info

Publication number: JP2001100786A
Application number: JP27437199A
Authority: JP
Inventors: Kenichiro Nakagawa; 賢一郎中川; Tetsuo Kosaka; 哲夫小坂; Tsuyoshi Yagisawa; 津義八木沢; Katsuhiko Kawasaki; 勝彦川崎; Hiroki Yamamoto; 寛樹山本; Masaaki Yamada; 雅章山田
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 1999-09-28
Filing date: 1999-09-28
Publication date: 2001-04-13

Abstract

PROBLEM TO BE SOLVED: To prevent the same vocabulary from being misrecognized repeatedly in speech recognition processing, and to improve the recognition rate for speech input from a terminal connected through a network. SOLUTION: After a similarity calculation part 203 calculates the degree of similarity of the input speech, a penalty value subtraction part 204 subtracts a penalty value from it, and parts 205 and 206 select the vocabulary to be output as the recognition result in accordance with the resulting value.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a technique for recognizing input speech.

[0002] The present invention relates to processing performed when speech is misrecognized.

[0003] The present invention relates to speech recognition for unspecified speakers.

[0004]

[Description of the Related Art] There is a technique that captures the speech of an unspecified speaker, calculates the similarity between each recognition target vocabulary and the input speech, and outputs a recognition result based on that similarity. When a plurality of candidates are output as the recognition result, there is also a method in which the candidates are arranged in order of similarity and presented to the user, and the user selects one from the candidate group by voice or another input means.

[0005] In a system that recognizes speech data input from a plurality of clients, learning the microphone sensitivity, the noise level, and the speaker's characteristics can improve the recognition rate when speech input in the same environment is processed.

[0006]

[Problems to Be Solved by the Invention] Speech recognition technology does not always select the vocabulary intended by the user as the first-place recognition candidate without error. It is not guaranteed that the vocabulary the user intended and uttered obtains the highest similarity among the recognition target vocabulary (referred to here as the first-place recognition result). This is due to the user's speaking habits, ambient noise, microphone characteristics, and the characteristics of the speech recognition system.

[0007] For this reason, in a dialogue system using voice input, the user must personally check whether the recognition result of the utterance is correct and correct it if it is not. Existing systems generally use the following means.
1. The recognition results up to the Nth place are displayed on a screen, and the user either indicates the intended vocabulary with a button, touch panel, or the like, or utters the number displayed with the vocabulary, which the system recognizes to obtain the vocabulary intended by the user.
2. The system reads out the first- to Nth-place recognition results by speech synthesis and asks the user to confirm each vocabulary in turn. The user answers with a spoken "yes" or "no", a button, or the like.
3. All recognition results up to the Nth place are first read out by speech synthesis, and the user either indicates the intended vocabulary with a button, touch panel, or the like, or inputs it by voice by uttering the number assigned to the vocabulary.

[0008] Method 1 above allows selection using a screen, so the user can select the intended vocabulary quickly. However, it requires a terminal with a screen, which makes the system expensive.

[0009] Methods 2 and 3 above are common in systems that use a telephone line and can be operated from a terminal that outputs audio, such as a mobile phone. With method 2, little time is lost if the vocabulary intended by the user ranks high among the recognition results. If it ranks low, however, a long time passes before the intended vocabulary is read out, and the time loss becomes large. Moreover, if the intended vocabulary is not within the top N recognition results, the user must answer "no" to all N candidates and then utter the vocabulary again for re-input and re-recognition, which places a heavy burden on the user. Many systems also use method 2 with N = 1 in particular. Such a system repeats vocabulary input and confirmation. For a simple recognition task the burden on the user is small, but for a difficult recognition task there may be cases, for the reasons described above, in which the intended vocabulary simply cannot be input successfully.

[0010] Method 3 is effective for difficult recognition tasks, but the user must listen to a long system announcement in order to confirm. When the vocabulary intended by the user is not among the recognition results up to the Nth place, the burden on the user is as large as in method 2.

[0011] Speech recognition is generally a very heavy process and therefore requires a high-specification computer. Information terminals carried by users, on the other hand, are required to be portable, that is, compact, rather than high in specification. Portable information terminals that have only the minimum applications needed to communicate with a server, such as an Internet browser, have also appeared. It is therefore realistic to leave the speech recognition processing to the server and have the client handle only the transmission and reception of the data.

[0012] Speech recognition is strongly affected by the environment in which the speaker is located and by the speaker's voice, so even a speech recognition device designed for unspecified speakers improves greatly in performance by learning the speaker's voice and the noise surrounding the speaker. However, when the speech recognition system is divided into a server and many clients connected by some kind of line, what is learned for one client cannot be used for other clients. Consequently, either learning must be repeated every time a client connects to the server, or default learning results must be stored and managed for all clients. For this reason, it is important that the speech recognition server authenticates the client and performs speech recognition processing suited to that client.

[0013]

[Means for Solving the Problems] To solve the above problems of the prior art, the present invention provides a speech recognition method, apparatus, and storage medium that input speech, obtain the similarity between the input speech and dictionary data, subtract the penalty value corresponding to the dictionary data from the obtained similarity to obtain an evaluation value that takes the penalty value into account, and select, based on the evaluation value, the vocabulary to be output as the recognition result of the input speech.

[0014] To solve the above problems of the prior art, the present invention preferably selects the vocabulary with the highest-ranking evaluation values as the vocabulary to be output as the recognition result.

[0015] To solve the above problems of the prior art, the present invention preferably determines whether the selected vocabulary is erroneous and, if it is determined to be erroneous, updates the penalty value.

[0016] To solve the above problems of the prior art, in the present invention the penalty value is preferably a value held for each vocabulary.

[0017] To solve the above problems of the prior art, in the present invention the update of the penalty value preferably increases the value.

[0018] To solve the above problems of the prior art, the present invention preferably updates the penalty value to a smaller value each time the recognition process is repeated.

[0019] To solve the above problems of the prior art, the present invention preferably inputs the speech via a network.

[0020] To solve the above problems of the prior art, the present invention preferably outputs the selected vocabulary.

[0021] To solve the above problems of the prior art, the present invention preferably outputs the selected vocabulary via a network.

[0022] To solve the above problems of the prior art, the present invention preferably determines that the selected vocabulary is erroneous when an instruction selecting a vocabulary other than the first-place one is input in response to the output vocabulary.

[0023] To solve the above problems of the prior art, in the present invention the instruction is preferably input via a network.

[0024] To solve the above problems of the prior art, the present invention preferably changes the processing used to recognize the speech according to the identification information of the terminal that sent the speech via the network.

[0025] To solve the above problems of the prior art, the present invention preferably changes the parameters used to recognize the speech according to the identification information of the terminal that sent the speech via the network.

[0026] To solve the above problems of the prior art, the present invention preferably holds identification information of terminals that can connect via the network, together with information on speech recognition suited to each terminal.

[0027] To solve the above problems of the prior art, the present invention preferably changes the processing at recognition time according to the held information.

[0028] To solve the above problems of the prior art, the present invention preferably changes the parameters at recognition time according to the held information.

[0029] To solve the above problems of the prior art, when the identification information of the terminal that sent the speech via the network is not already held, the present invention preferably registers that identification information anew, together with information on speech recognition suited to that terminal.

[0030]

[Embodiments of the Invention] FIG. 1 is a functional block diagram of a speech recognition apparatus according to the present invention. In FIG. 1, 101 and 107 are telephones, 102 and 106 are public networks, 103 and 108 are microphones, 104 is a speaker, 105 is a display screen, 109 is a button, and 201 is the speech recognition apparatus.

[0031] Next, each element of the speech recognition apparatus (201) is described. The user's speech captured from the microphone (103) or the telephone (101) enters the speech recognition apparatus (201) through the speech capture unit (202), and the similarity calculation unit (203) calculates its similarity to the recognition target vocabulary data (210). The penalty value subtraction unit (204) then subtracts the penalty value (211) of each recognition vocabulary from this similarity. That is, the larger the penalty value, the more the similarity score is reduced. The result sorting unit (205) sorts the penalty-subtracted similarities in descending order, and the vocabulary ranked 1st to Nth from the top are announced (reported) to the user as the candidate group obtained by recognition, either as text on the display screen (105) or as audio output to the speaker (104), the telephone (107) connected to the user over the line, or the like.
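The subtract-then-sort step performed by units 204 and 205 can be sketched as follows. This is a minimal illustration, not the apparatus itself: the function name `rank_candidates`, the dictionary layout, and the sample station names and scores are all assumptions introduced here.

```python
def rank_candidates(similarities, penalties, n):
    """Return the top-n (word, score) pairs ranked by similarity minus penalty."""
    scored = [
        (word, sim - penalties.get(word, 0.0))  # a word with no entry has penalty 0
        for word, sim in similarities.items()
    ]
    # Sort in descending order of the penalty-adjusted score (unit 205).
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Hypothetical similarity scores for one utterance.
similarities = {"Shibuya": 0.9, "Hibiya": 0.85, "Ebisu": 0.4}
penalties = {"Shibuya": 1.0}   # "Shibuya" was rejected earlier
top2 = rank_candidates(similarities, penalties, 2)
```

With these sample values, "Shibuya" has the highest raw similarity but falls out of the top two once its penalty is subtracted.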

[0032] On receiving this recognition result, the user selects the intended vocabulary from the N output recognition results by voice from the telephone (107), a push button, the microphone (108), a button (109) attached to the display screen, a touch panel, or the like. This constitutes the user's confirmation of the system's speech recognition result. The confirmation result is sent to the misrecognition detection unit (207), which determines whether the speech recognition result was a misrecognition. If it was not, the vocabulary selected by the user is output from the result output unit (209) as the formal recognition result of this speech recognition apparatus. If it was, a notice to that effect is output instead. The result is judged not to be a misrecognition when the user's confirmation selects the first-place candidate, and a misrecognition otherwise.

[0033] The user's confirmation result is also sent to the penalty value setting unit (208), which updates the penalty values stored in the penalty value database (211) according to that result.

[0034] FIG. 2 is a flowchart of the speech recognition method according to the present invention. A system that recognizes a station name uttered by the user and confirms it is used as the example.

[0035] When the system starts up, the penalty array corresponding to all recognition target vocabulary is initialized to 0 (S101 to S103). These values serve as the penalty values. The system then outputs the announcement "Please say a station name" (S104) to prompt the user to utter a station name. It then captures the speech uttered by the user (S105) and performs speech recognition.

[0036] First, the parameter max_ruijido is initialized to −∞ (S106); "ruijido" is the romanized Japanese word for similarity. This parameter is a storage area that holds the maximum penalty-adjusted similarity, and is a variable that always contains the maximum value found so far. For every recognition target vocabulary, the similarity between the input speech and that vocabulary is calculated (S108). Each time the penalty value for a vocabulary is subtracted from its similarity, the result is compared with the current value of this parameter and the larger of the two is stored. This finds "recognition target vocabulary [max]", the vocabulary whose penalty-adjusted similarity is the largest (S109 to S111).

[0037] Next, the result is confirmed with the user. The system announces "Is 'recognition target vocabulary [max]' correct?" (S112) and has the user respond by pressing a "yes" or "no" button (S113). The vocabulary uttered in S112 is the vocabulary whose number was stored in max in S110. If the "no" button is pressed, the result was a misrecognition, so the system announces "recognition failed" (S115) and gives a penalty value of 1 to the recognized vocabulary uttered in S112, that is, the vocabulary identified by the value stored in the parameter max (S116). If it is determined in S114 that the "yes" button was pressed, the recognition is deemed successful and the recognition result is announced again (S117).

[0038] Finally, the penalty values associated with all recognition target vocabulary are updated (S118 to S120). Specifically, each penalty value is multiplied by 0.8 (S120). As a result, the penalty value immediately after a misrecognition is 0.8, and because it continues to be multiplied by 0.8 as the number of recognitions increases, it gradually approaches 0.
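One pass of the FIG. 2 flow (search for the maximum penalty-adjusted similarity, penalize on "no", then decay every penalty by 0.8) can be sketched as below. The function names, the list-based layout, and the sample similarity scores are assumptions made for illustration; only the values 1 and 0.8 and the variable name max_ruijido come from the description.

```python
import math

DECAY = 0.8  # multiplier applied to every penalty after each pass (S120)


def recognize(similarities, penalty):
    """Return the index that maximizes similarity minus penalty (S106 to S111)."""
    max_ruijido = -math.inf   # best penalty-adjusted similarity seen so far
    best = -1
    for i, sim in enumerate(similarities):
        score = sim - penalty[i]
        if score > max_ruijido:
            max_ruijido = score
            best = i
    return best


def update_penalties(penalty, best, confirmed):
    """Penalize a rejected word (S116), then decay every penalty (S118 to S120)."""
    if not confirmed:
        penalty[best] = 1.0
    for i in range(len(penalty)):
        penalty[i] *= DECAY


penalty = [0.0, 0.0, 0.0]
sims = [0.9, 0.8, 0.3]        # hypothetical scores, reused for the repeated utterance
first = recognize(sims, penalty)
update_penalties(penalty, first, confirmed=False)   # user pressed "no"
second = recognize(sims, penalty)
```

In this run the rejected word keeps a penalty of 0.8 after the first pass, so the same utterance now selects the runner-up instead of repeating the same error.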

[0039] When these processes are finished, the system asks the user whether to end (S121). If the user wishes to continue, the process returns to the announcement prompting the user to utter a station name (S104).

[0040] Next, the processing of the server of an extension call-transfer system that uses speech recognition as described above is explained with reference to the flowchart of FIG. 3.

[0041] When the system starts up, it initializes the penalty value database to 0 (S301) and waits for a call from a user (S302). When a call comes in, it plays the announcement "Who would you like to be connected to?" (S303) and prompts the user to utter a person's name (S304). For the speech captured here, the similarity (ruijido[i]) to each recognition target vocabulary is calculated, and the three vocabulary i = max₁, max₂, max₃ with the largest values of ruijido[i] − penalty[i] are found, in descending order (S305).

[0042] The first- to third-place recognition results obtained here are presented to the user as synthesized speech, and the user is asked to select the intended name. First, by inserting "recognition target vocabulary [maxᵢ]" (i = 1 to 3) into part of a message prepared in advance, the system generates the message "Say 1 for 'recognition target vocabulary [max₁]', 2 for 'recognition target vocabulary [max₂]', 3 for 'recognition target vocabulary [max₃]', or 4 if none of these is correct." and outputs it as a system announcement (S306). Next, it captures the user's speech (S307) and performs speech recognition with "1", "2", "3", and "4" as the recognition target vocabulary to determine which of them the speech captured in S307 is closest to (S308).

[0043] If the recognition result i here is 1 to 3, the call is transferred to "recognition target vocabulary [maxᵢ]" (S309 to S311), and the system returns to the penalty value initialization (S301) and then to the state of waiting for a call from a user (S302). If the recognition result i is 4, the recognition result is judged to be a misrecognition, and the penalty value of the vocabulary that was returned as the recognition result is set to that vocabulary's similarity to the input speech (S312). The penalty values of all recognition target vocabulary are then updated, and the process returns to prompting the user to utter a name again (S303).
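The top-three search of S305 and the penalty assignment of S312 might look like the sketch below. It assumes that all three announced candidates are penalized when the caller answers "4"; the text does not say this explicitly. The later update of all penalty values is left out because its rule is not given for this system. The function names and sample scores are hypothetical.

```python
import heapq


def top_three(ruijido, penalty):
    """Indices of the three largest ruijido[i] - penalty[i], best first (S305)."""
    return heapq.nlargest(
        3, range(len(ruijido)), key=lambda i: ruijido[i] - penalty[i]
    )


def reject_all(ruijido, penalty, candidates):
    """Caller answered "4": set each candidate's penalty to its similarity (S312).

    Assumption: every announced candidate is penalized, not only the first one.
    """
    for i in candidates:
        penalty[i] = ruijido[i]


ruijido = [0.7, 0.9, 0.2, 0.8, 0.6]   # hypothetical scores for five names
penalty = [0.0] * 5
first_try = top_three(ruijido, penalty)
reject_all(ruijido, penalty, first_try)
second_try = top_three(ruijido, penalty)   # same scores assumed on the retry
```

Setting the penalty equal to the similarity drives a rejected name's evaluation value toward 0 when the same utterance is scored again, so names that were not announced before move to the front.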

[0044] Next, a system that inputs station names using speech recognition as described above is explained with reference to the flowchart of FIG. 4.

[0045] When the system starts up, it initializes the penalty array to 0 (S401 to S403). It then plays an announcement prompting the user to utter a station name (S404) and captures the user's utterance (S405). Speech recognition is then performed with station names as the recognition target vocabulary, but only vocabulary whose penalty value is 0 is treated as recognition target vocabulary (S408). This somewhat reduces the processing required for recognition. For each recognition target vocabulary whose penalty value is 0, the similarity to the input speech is calculated (S409). Among the recognition target vocabulary whose penalty value is 0, the one with the highest similarity is selected as the recognition result (S406 to S412).

[0046] This recognition result is confirmed with the user by an announcement (S413). The user answers by saying "yes" or "no" (S414, S415). If the answer is "yes", the recognition result is announced again (S416) and the process ends. If the answer is "no", a penalty value is set for the recognition target vocabulary that was the recognition result. This value is the integer part of that recognition result's similarity multiplied by 10 (S417).

[0047] After the penalty value is set, the announcement prompting the user to utter a station name is played again (S418) and all penalty values are updated (S419 to S423). In this update, every penalty value that is not 0 is decremented by 1 (S422).
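The FIG. 4 variant, in which a penalized word is skipped entirely rather than scored lower, can be sketched as follows. The function names, the dictionary layout, the station names, and the similarity scale (here between 0 and 1) are assumptions for illustration; the rules "integer part of 10 times the similarity" and "decrement by 1" come from the description.

```python
def recognize_unpenalized(similarity_of, vocabulary, penalty):
    """Pick the most similar word among those whose penalty is 0 (S406 to S412)."""
    best_word, best_sim = None, float("-inf")
    for word in vocabulary:
        if penalty[word] != 0:
            continue                   # excluded from this pass (S408)
        sim = similarity_of(word)      # computed only for eligible words (S409)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word, best_sim


def penalize(penalty, word, sim):
    """Set the penalty to the integer part of 10 times the similarity (S417)."""
    penalty[word] = int(sim * 10)


def age_penalties(penalty):
    """Decrement every non-zero penalty by 1 (S419 to S423)."""
    for word in penalty:
        if penalty[word] != 0:
            penalty[word] -= 1


scores = {"Ueno": 0.35, "Ueda": 0.30, "Ome": 0.10}   # hypothetical similarities
penalty = {word: 0 for word in scores}
word, sim = recognize_unpenalized(scores.get, list(scores), penalty)
penalize(penalty, word, sim)           # user answered "no"
age_penalties(penalty)
retry, _ = recognize_unpenalized(scores.get, list(scores), penalty)
```

Because the similarity function is called only for words with a zero penalty, the skipped words cost no scoring time, which is the processing reduction the description mentions.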

[0048] In this system, the penalty value can be regarded as the number of recognition passes from which a recognition target vocabulary is excluded after it has been misrecognized.

[0049] The systems described so far are ones in which the user accesses a device with a speech recognition function by telephone, inputs speech, and receives the recognition result of that speech on the same telephone. The following describes a system in which the user's terminal connects to a speech recognition server via a network (the Internet, a LAN, or the like) linked by some kind of line.

[0050] FIG. 5 is a diagram showing the configuration of such a system.

[0051] This speech recognition system consists of a speech recognition server (521) and a plurality of speech recognition clients (501) connected to it via a network.

[0052] The speech recognition client (501) captures the user's speech waveform from a speech capture device such as a microphone (502), and the speech waveform transmission unit (503) sends it to the speech recognition server (521). The speech recognition client does not perform speech recognition itself. It only captures the speech and converts the speech waveform into a format suitable for transmission over the network. Because its processing is correspondingly light, it can run even in environments with limited processing performance.

[0053] The speech recognition server (521) is the part that performs speech recognition on the speech waveform received via the network. It is divided into a speech recognition unit (527), which performs the actual speech recognition, and a client management unit (522), which obtains from client information the speech recognition parameters needed for recognition. The client ID acquisition unit (523) obtains an ID unique to each client from the client's IP address and port number attached to the speech waveform data received via the network, and checks that ID against the client database (526).

[0054] If data corresponding to the client ID exists in the client database, the server has been accessed by that client before, and the speech recognition parameter acquisition unit (525) obtains the speech recognition parameters learned at that time. If no data corresponding to the client ID exists in the client database, the client is considered to be accessing the speech recognition server for the first time. In that case the speech recognition parameters are learned and stored in the client database together with the client ID.

[0055] The client's speech recognition parameters obtained in this way are sent to the speech recognition unit (527) together with the speech waveform, and speech recognition is performed.

[0056] FIG. 6 is a flowchart showing the processing executed in the system of FIG. 5.

[0057] The speech recognition client captures a speech waveform (S601) and transmits the waveform to the speech recognition server (S602).

[0058] When data arrives from a client (S611), the speech recognition server obtains a client ID from the IP address and port number of the sending client (S612). It then searches the client database. If speech recognition parameters corresponding to the obtained client ID exist there (S613), it retrieves them (S614). If not, it regards the client as accessing for the first time, learns the speech recognition parameters (S615), and stores the result in the client database (S616). Using these speech recognition parameters, it recognizes the speech waveform (S617) and transmits the recognition result to the client (S618).
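The server-side lookup of S612 to S616 amounts to a get-or-create on the client database. The sketch below is a minimal illustration under stated assumptions: the in-memory dictionary, the "ip:port" ID format, the parameter names `noise_level` and `mic_gain`, and the peak-based `learn_parameters` placeholder are all hypothetical, since the description does not specify how parameters are learned or stored.

```python
DEFAULT_PARAMS = {"noise_level": 0.0, "mic_gain": 1.0}   # hypothetical parameter set


def learn_parameters(waveform):
    """Placeholder for parameter learning (S615); the source does not say how."""
    peak = max((abs(s) for s in waveform), default=0.0)
    return {**DEFAULT_PARAMS, "mic_gain": 1.0 / peak if peak else 1.0}


def parameters_for(client_db, ip, port, waveform):
    """Return stored parameters for a known client, else learn and store them."""
    client_id = f"{ip}:{port}"               # ID derived from IP and port (S612)
    if client_id not in client_db:           # first access: no entry found (S613)
        client_db[client_id] = learn_parameters(waveform)   # S615, S616
    return client_db[client_id]              # stored parameters (S614)


client_db = {}
first = parameters_for(client_db, "192.0.2.10", 5004, [0.1, -0.5, 0.25])
again = parameters_for(client_db, "192.0.2.10", 5004, [0.9, 0.9])
```

On the second call the stored entry is returned unchanged, so the parameters learned on first contact are reused instead of being relearned.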

[0059] The speech recognition client displays the result from the speech recognition server (S603) and ends the processing.

[0060] FIG. 8 is a flowchart showing a processing example, different from that of FIG. 6, for the system of FIG. 5.

[0061] The client here can be regarded as a computer connected to the Internet. The client software is installed on the computer as a Web browser plug-in. At installation time, an ID unique to this client is generated. This ID must be unique among all clients, and is generated from the machine, the serial number of the client software, the IP address, and the like.
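One way to generate such an ID at installation time is sketched below. The text does not say how the ID is derived from these sources, so the use of a SHA-256 hash, the separator, and the function name `make_client_id` are assumptions for illustration only. A hash of this kind is only as unique as its inputs.

```python
import hashlib


def make_client_id(machine_name: str, serial_no: str, ip_address: str) -> str:
    """Derive a client ID from machine, software serial number, and IP address.

    Hypothetical helper: hashing the joined fields is an assumption of this
    sketch, not a method stated in the specification.
    """
    # Join with a separator that is unlikely to occur in the fields so that
    # ("ab", "c") and ("a", "bc") do not produce the same input string.
    material = "\x1f".join([machine_name, serial_no, ip_address])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()
```

The plug-in would compute this once during installation and store the result, so that the same ID is sent with every later request even if the IP address changes.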

[0062] When the user accesses the speech recognition server with a Web browser, the client captures the user's speech through a microphone connected to the computer (S601), obtains the ID that the plug-in created at installation (S701), and transmits it to the server together with the speech waveform (S703). The speech waveform sent to the server need not be the raw audio data captured from the microphone; it may be audio data compressed specifically for speech recognition. For a simple speech recognition task that requires only light processing, such as "yes" or "no", the client's plug-in may perform the recognition itself without sending the data to the speech recognition server. The client may capture the entire speech waveform and then send it to the server in one batch, or it may send the waveform in real time as it is captured.

[0063] The speech recognition server constantly monitors for data arriving from clients (S611). When data arrives, the server searches the client database for the client ID (S612). If the client ID is registered in the client database, the server acquires the corresponding speech recognition parameters (S613). If it is not registered, the server sets default values as the speech recognition parameters (S616). The server performs speech recognition using these speech recognition parameters and the speech waveform (S614), and transmits the result to the client (S615).

[0064] After transmitting the data to the server, the client waits for the recognition result to be returned from the server (S614). When the result is returned, the client presents it to the user (S615) and ends its processing.

[0065] After transmitting the result to the client, the server recalculates the speech recognition parameters specific to that client (S617). Specifically, the following are conceivable:
1. The lowest power in the speech waveform is calculated, and that value is used as the noise level.
2. The lowest power in the speech waveform is calculated, and the power spectrum in its vicinity is used as the noise power spectrum.
3. The highest power in the speech waveform is calculated, and that value is used as the microphone sensitivity level.
4. Speaker clustering is performed, and the result is used as speaker information.
These speech recognition parameters are stored in the client database together with the client ID (S618). Speech recognition can thus use parameters obtained from the speech waveform of the previous access, which makes it relatively easy to adapt to the client's environment.

[0066] Using the noise level of item 1 and the microphone sensitivity level of item 3, the threshold for speech/non-speech determination can be set according to the client's environment. Performing Spectral Subtraction with the noise power spectrum of item 2 makes it possible to estimate a clean, noise-free speech waveform from the input power spectrum. Using the speaker information of item 4 makes it possible to use an acoustic model suited to that speaker.
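The two uses described above can be sketched as follows. The spectral subtraction shown is the basic per-bin form (subtract the noise power and clamp at a floor). The threshold formula, which places the threshold a fixed fraction of the way between the noise level and the microphone sensitivity level, is an assumption of this sketch; the text states only that the threshold can be set from these two values.

```python
from typing import List, Sequence


def spectral_subtraction(
    input_power: Sequence[float],
    noise_power: Sequence[float],
    floor: float = 0.0,
) -> List[float]:
    """Subtract the noise power spectrum from the input, bin by bin.

    Results below `floor` are clamped so no bin ends up with negative power.
    """
    if len(input_power) != len(noise_power):
        raise ValueError("spectra must have the same number of bins")
    return [max(p - n, floor) for p, n in zip(input_power, noise_power)]


def speech_threshold(
    noise_level: float, mic_sensitivity_level: float, ratio: float = 0.1
) -> float:
    """Speech/non-speech threshold between noise level and peak level.

    The linear interpolation and the default ratio are assumptions.
    """
    return noise_level + ratio * (mic_sensitivity_level - noise_level)
```

A frame whose power exceeds `speech_threshold(...)` would be treated as speech; a quiet environment and a noisy one then get different thresholds from the same rule.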

[0067] In a large-scale speech recognition system in which many clients connect to one server, the client database may grow too large. If the client database is therefore limited to the most recent 10,000 entries, client data is more likely to be available for frequently used clients, while the client data of a client that was used only once in the past soon disappears from the database.

[0068] With the system described above, speech recognition parameters suited to each individual client can be used even when the speech recognition system is divided into a server and a large number of clients.

[Brief description of the drawings]

FIG. 1 is a functional configuration diagram of a speech recognition device according to the present invention.

FIG. 2 is a flowchart showing speech recognition processing according to the present invention.

FIG. 3 is a flowchart showing the processing of an extension call transfer system.

FIG. 4 is a flowchart showing the processing of a station name input system.

FIG. 5 is a configuration diagram of a system in which a server and clients are connected via a network.

FIG. 6 is a flowchart showing a first process in the system of FIG. 5.

FIG. 7 is a flowchart showing a second process in the system of FIG. 5.

Continued from the front page
(72) Inventor: Tsuyoshi Yagisawa, c/o Canon Inc., 3-30-2 Shimomaruko, Ota-ku, Tokyo
(72) Inventor: Katsuhiko Kawasaki, c/o Canon Inc., 3-30-2 Shimomaruko, Ota-ku, Tokyo
(72) Inventor: Hiroki Yamamoto, c/o Canon Inc., 3-30-2 Shimomaruko, Ota-ku, Tokyo
(72) Inventor: Masaaki Yamada, c/o Canon Inc., 3-30-2 Shimomaruko, Ota-ku, Tokyo
F-term (reference): 5D015 AA02 HH05 KK02 LL02 LL04 LL05 LL07 LL12

Claims

[Claims]

1. A speech recognition method comprising: inputting a voice; obtaining a similarity between the input voice and dictionary data; obtaining an evaluation value that takes a penalty value into account by subtracting the penalty value corresponding to the dictionary data from the obtained similarity; and selecting a vocabulary to be output as a recognition result of the input voice based on the evaluation value.

2. The speech recognition method according to claim 1, wherein a word having a higher evaluation value is selected as a vocabulary to be output as the recognition result.

3. The speech recognition method according to claim 1, wherein it is determined whether or not the selected vocabulary is incorrect, and if the selected vocabulary is determined to be incorrect, the penalty value is updated.

4. The speech recognition method according to claim 1, wherein the penalty value is a value held for each vocabulary.

5. The speech recognition method according to claim 3, wherein the updating of the penalty value increases the value.

6. The speech recognition method according to claim 1, wherein the penalty value is updated to a smaller value each time the recognition process is repeated.

7. The speech recognition method according to claim 1, wherein the speech is input via a network.

8. The speech recognition method according to claim 1, wherein the selected vocabulary is output.

9. The speech recognition method according to claim 1, wherein the selected vocabulary is output via a network.

10. The speech recognition method according to claim 3, wherein the selected vocabulary is determined to be incorrect when an instruction to select a vocabulary other than the first vocabulary is input for the output vocabulary.

11. The speech recognition method according to claim 10, wherein the instruction is input via a network.

12. The voice recognition method according to claim 7, wherein a process for recognizing the voice is changed according to the identification information of the terminal that has transmitted the voice via the network.

13. The speech recognition method according to claim 7, wherein parameters used for recognizing the voice are changed according to the identification information of the terminal that has transmitted the voice via the network.

14. The speech recognition method according to claim 1, wherein identification information of a terminal connectable via the network and information relating to speech recognition suitable for the terminal are stored.

15. The speech recognition method according to claim 12, wherein the processing at the time of recognition is changed according to the held information.

16. The speech recognition method according to claim 13, wherein parameters for the recognition are changed according to the stored information.

17. The speech recognition method according to claim 12, wherein, if the identification information of the terminal that has transmitted the voice via the network is not stored in advance, the identification information and information relating to speech recognition suitable for the terminal are newly registered.

18. A speech recognition apparatus comprising: an input unit for inputting a voice; a similarity deriving unit for obtaining a similarity between the input voice and dictionary data; an evaluation value deriving unit that obtains an evaluation value that takes a penalty value into account by subtracting the penalty value corresponding to the dictionary data from the obtained similarity; and a selection unit that selects a vocabulary to be output as a recognition result of the input voice based on the evaluation value.

19. The speech recognition apparatus according to claim 18, wherein said selection means selects a word having a higher evaluation value as a vocabulary to be output as said recognition result.

20. The speech recognition apparatus according to claim 18, further comprising: a determination unit for determining whether or not the selected vocabulary is incorrect; and a penalty value updating unit for updating the penalty value when the determination unit determines that the selected vocabulary is incorrect.

21. The speech recognition apparatus according to claim 18, further comprising a holding unit that holds the penalty value for each vocabulary.

22. The speech recognition apparatus according to claim 20, wherein the penalty value updating unit updates the value so as to increase the value.

23. The speech recognition apparatus according to claim 21, wherein the penalty value is updated to a smaller value each time the recognition process is repeated.

24. The speech recognition apparatus according to claim 18, wherein said input means inputs speech via a network.

25. The speech recognition apparatus according to claim 18, further comprising output means for outputting the selected vocabulary.

26. The speech recognition apparatus according to claim 18, further comprising output means for outputting the selected vocabulary via a network.

27. The speech recognition apparatus according to claim 20, wherein the determination unit determines that the selected vocabulary is incorrect when an instruction to select a vocabulary other than the first vocabulary is input for the output vocabulary.

28. The speech recognition apparatus according to claim 27, wherein the instruction is input via a network.

29. The speech recognition apparatus according to claim 24, further comprising control means for controlling a process for recognizing the voice in accordance with the identification information of the terminal that has transmitted the voice via the network.

30. The speech recognition apparatus according to claim 24, further comprising control means for changing a parameter used for recognizing the speech in accordance with the identification information of the terminal that has sent the speech via the network.

31. The speech recognition apparatus according to claim 18, further comprising recognition information holding means for holding identification information of a terminal connectable via said network and information relating to speech recognition suitable for the terminal.

32. The speech recognition apparatus according to claim 29, wherein the control unit changes the processing at the time of the recognition in accordance with the information held in the holding unit.

33. The speech recognition apparatus according to claim 30, wherein the control unit changes the parameter at the time of the recognition according to the information held in the holding unit.

34. The speech recognition apparatus according to claim 29, wherein, if the identification information of the terminal that has transmitted the voice via the network is not stored in advance, the identification information and information relating to speech recognition suitable for the terminal are newly registered.

35. A computer-readable storage medium storing: a control program for inputting a voice; a control program for obtaining a similarity between the input voice and dictionary data; a control program for obtaining an evaluation value that takes a penalty value into account by subtracting the penalty value corresponding to the dictionary data from the obtained similarity; and a control program for selecting a vocabulary to be output as a recognition result of the input voice based on the evaluation value.

36. The computer-readable storage medium according to claim 35, wherein a word having a higher evaluation value is selected as a vocabulary to be output as the recognition result.

37. The computer-readable storage medium according to claim 35, further storing: a control program for determining whether the selected vocabulary is incorrect; and a control program for updating the penalty value when the selected vocabulary is determined to be incorrect.

38. The computer-readable storage medium according to claim 35, wherein the penalty value is a value held for each vocabulary.

39. The computer-readable storage medium according to claim 37, wherein the updating of the penalty value increases the value.

40. The computer-readable storage medium according to claim 35, wherein the penalty value is updated to a smaller value each time the recognition process is repeated.

41. The computer-readable storage medium according to claim 35, wherein the voice is input via a network.

42. The computer-readable storage medium according to claim 35, wherein a control program for outputting the selected vocabulary is stored.

43. The computer-readable storage medium according to claim 35, wherein a control program for outputting the selected vocabulary via a network is stored.

44. The computer-readable storage medium according to claim 37, wherein a control program is stored for determining that the selected vocabulary is incorrect when an instruction to select a vocabulary other than the first vocabulary is input for the output vocabulary.

45. The computer-readable storage medium according to claim 44, wherein the instruction is input via a network.

46. The computer-readable storage medium according to claim 41, wherein a control program for changing a process of recognizing the voice according to the identification information of the terminal that has transmitted the voice via the network is stored.

47. The computer-readable storage medium according to claim 44, wherein a control program for changing a parameter used for recognizing the voice according to the identification information of the terminal that has transmitted the voice via the network is stored.

48. The computer-readable storage medium according to claim 35, wherein identification information of a terminal connectable via the network and a control program for reading and using information relating to speech recognition suitable for the terminal are stored.

49. The computer-readable storage medium according to claim 46, wherein the processing at the time of recognition is changed according to the stored information.

50. The computer-readable storage medium according to claim 47, wherein the parameter at the time of recognition is changed according to the stored information.

51. The computer-readable storage medium according to claim 46, wherein a control program is stored for newly registering the identification information and information relating to speech recognition suitable for the terminal when the identification information of the terminal that has transmitted the voice via the network is not stored in advance.