JPH06110495A

JPH06110495A - Speech recognition device

Info

Publication number: JPH06110495A
Application number: JP25658892A
Authority: JP
Inventors: Hiroshi Matsuura; 博松浦
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1992-09-25
Filing date: 1992-09-25
Publication date: 1994-04-22
Anticipated expiration: 2017-05-27
Also published as: JP3285954B2

Abstract

PURPOSE: To enable a user to time utterances easily. CONSTITUTION: Speech uttered by the user is quantized by an A/D converter 1 and then subjected to LPC analysis by an analysis/feature extraction part 2, continuous matching by a continuous matching part 3, and recognition by an HMM recognition part 5. The recognition result is passed to a control part 9, and the 1st candidate, or the 1st and 2nd candidates, are displayed on a display part 7. The displayed candidates are selectable, but if the user speaks again before making a selection, that speech is also recognized by the HMM recognition part 5 and its recognition candidates are passed to the control part 9. The 1st candidate, or the 1st and 2nd candidates, are then displayed in addition to the candidates already shown on the display part 7, and any of the previously displayed or newly added candidates can be selected through a selection part 8.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition device suitable for recognizing speech uttered by a person and controlling equipment or the like.

[0002]

[Prior Art] In this type of speech recognition device, when speech input becomes possible, the device prompts the user to speak by means of a display (for example, the character string "Please speak"), a sound (for example, a "beep"), or a voice message (for example, "Please speak"). When the user then speaks, the utterance is recognized by the speech recognition device and, for example, the first- and second-ranked recognition candidates are displayed. If the displayed recognition candidates include one that corresponds to what the user said, the user selects that candidate.

[0003]

[Problems to Be Solved by the Invention] However, conventional speech recognition devices have the problem that no recognition result is produced if the user speaks before the device prompts for an utterance.

[0004] Conventional speech recognition devices also have the problem that, even when the user speaks after being prompted, an erroneous recognition result is obtained if the user makes an inappropriate utterance (for example, "uh" or "oh, can I talk now?") before uttering the speech the device is meant to recognize. In this case the displayed recognition candidates do not include the one the user intended, so the user has to give up on selecting a candidate, cancel the recognition result, and redo the whole series of operations from the beginning.

[0005] Thus, with conventional speech recognition devices the user has to be conscious of when to utter the speech to be recognized. If the user carelessly makes an inappropriate utterance before it, an erroneous recognition result is obtained and the intended candidate cannot be selected.

[0006] The present invention has been made in view of the above circumstances, and its object is to provide a speech recognition device that makes it easy for the user to time an utterance.

[0007]

[Means for Solving the Problems] The present invention is a speech recognition device comprising speech input means for inputting speech uttered by a user, analysis/feature extraction means for analyzing the input speech and extracting features, and recognition means for recognizing the speech using the extracted features, characterized by further comprising: display means used to display one or more candidates recognized by the recognition means; selection means for selecting one of the candidates displayed on the display means; and control means for controlling the display means, the control means causing the display means to additionally display one or more of the candidates recognized by the recognition means from speech that is input through the speech input means before a candidate is selected by the selection means.

[0008]

[Operation] In the above configuration, speech uttered by the user is input through the speech input means and analyzed by the analysis/feature extraction means to extract its features; the recognition means then performs recognition using those features to obtain recognition candidates. The recognition candidates are passed to the control means. If the similarity of the second-ranked candidate is at or below a predetermined value, or if the difference between the similarity values of the first- and second-ranked candidates is at or above a predetermined value, the control means causes the display means to display only the first-ranked candidate; otherwise it causes the display means to display at least the candidates down to the second rank.
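The display rule in [0008] can be sketched as a small function. This is only an illustration: the function name, the similarity scale, and the two threshold values are assumptions, since the text speaks only of "predetermined values".

```python
def candidates_to_display(ranked, second_floor=0.5, gap_limit=0.2):
    """Pick the words to show from (word, similarity) pairs sorted best first.

    second_floor and gap_limit stand in for the "predetermined values";
    the numbers are hypothetical.
    """
    if len(ranked) < 2:
        return [word for word, _ in ranked]
    (first, s1), (second, s2) = ranked[0], ranked[1]
    # Show only the first-ranked candidate when the second one is weak,
    # or when the first one leads by a wide margin.
    if s2 <= second_floor or (s1 - s2) >= gap_limit:
        return [first]
    return [first, second]
```

With these assumed thresholds, a close pair such as ("Osaki", 0.90) and ("Kawasaki", 0.85) yields both words, while a weak or distant runner-up yields only the first.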

[0009] Using the selection means, the user performs an operation to select, from the candidates displayed on the display means, the candidate corresponding to what the user said. If none of the displayed candidates corresponds to what was said, the user utters the speech to be recognized again.

[0010] The speech re-uttered by the user is recognized by the recognition means in the same way as above, and its candidates are obtained. If the re-uttered speech is recognized by the recognition means before any of the already displayed recognition candidates has been selected, the control means displays the new candidates in addition to the candidates already displayed, regardless of the timing of the utterance prompt. If a candidate to be added is identical to one that is already displayed, the control means either does not display the added candidate or applies highlighting such as blinking, thereby reducing the number of needless selection candidates.
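The duplicate handling in [0010] might look like the following sketch. The function name, the return shape, and the choice to record highlighted words in a set are assumptions made for illustration; the text only says that a duplicate is either not shown again or is highlighted, for example by blinking.

```python
def add_candidates(displayed, new_words, highlight_duplicates=False):
    """Append newly recognized words to the words already on screen.

    Returns (updated_list, highlighted_words). A word that is already
    displayed is never listed twice; optionally it is marked for
    highlighting instead.
    """
    updated = list(displayed)
    highlighted = set()
    for word in new_words:
        if word in updated:
            if highlight_duplicates:
                highlighted.add(word)
            continue
        updated.append(word)
    return updated, highlighted
```

For example, adding "Osaki" to a screen that shows "Ueno" and "Mejiro" gives three selectable entries, while adding "Osaki" to a screen that already shows it leaves the list unchanged.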

[0011] This allows the user to select the candidate corresponding to what was said from among both the already displayed candidates and the additionally displayed candidates.

[0012] Thus, with the above configuration, the restrictions on the timing of utterance and the timing of selection are greatly relaxed, and usability is improved.

[0013]

[Embodiment] An embodiment of the present invention will be described below with reference to the drawings, taking as an example its application to a speech recognition device used in a station ticket vending machine.

[0014] FIG. 1 is a block diagram schematically showing the configuration of the speech recognition device of this embodiment.

[0015] In FIG. 1, reference numeral 1 denotes an A/D (analog/digital) converter that A/D-converts the speech signal (input speech) supplied to the device. The A/D converter 1 quantizes the input speech at, for example, a sampling frequency of 12 kHz with 12 bits.

[0016] The input speech quantized by the A/D converter 1 is supplied to an analysis/feature extraction unit 2, which analyzes the speech and extracts features. The analysis/feature extraction unit 2 computes the speech power of the quantized input speech and performs LPC (Linear Predictive Coding) analysis. The LPC analysis uses, for example, a frame length of 16 msec and a frame period of 8 msec, with a 16th-order LPC mel-cepstrum as the analysis parameter. The analysis in the analysis/feature extraction unit 2 is not limited to LPC analysis; BPF (Band Pass Filter) analysis or the like may also be used.
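To make the framing numbers concrete: at 12 kHz, a 16 msec frame holds 192 samples and an 8 msec frame period advances 96 samples, so consecutive frames overlap by half. The sketch below cuts a sample list into such frames and computes a per-frame power. Defining power as the mean squared sample value is an assumption, since the text does not give the formula, and the LPC mel-cepstrum computation itself is not shown.

```python
SAMPLE_RATE_HZ = 12_000                      # sampling frequency from [0015]
FRAME_LEN = SAMPLE_RATE_HZ * 16 // 1000      # 16 msec frame -> 192 samples
FRAME_SHIFT = SAMPLE_RATE_HZ * 8 // 1000     # 8 msec period -> 96 samples


def frame_powers(samples):
    """Return the mean-square power of each full frame (assumed definition)."""
    powers = []
    for start in range(0, len(samples) - FRAME_LEN + 1, FRAME_SHIFT):
        frame = samples[start:start + FRAME_LEN]
        powers.append(sum(x * x for x in frame) / FRAME_LEN)
    return powers
```

A 384-sample input (32 msec) therefore produces three overlapping frames, starting at samples 0, 96, and 192.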

[0017] The feature parameters obtained by the analysis/feature extraction unit 2 are supplied to a continuous matching unit 3. The continuous matching unit 3 performs matching continuously along the time axis against the recognition dictionaries, in predetermined PS units, that are registered in a phonetic segment (PS) composite dictionary unit (hereinafter, PS dictionary unit) 4, and obtains the first- through n-th-ranked label sequences (PS label sequences) and their similarities. Phonetic segments (PS) are described in detail in, for example, Japanese Patent Application No. 2-306061. The recognition dictionary consists of discrimination dictionaries created from a plurality of standard patterns for each PS (PS label).

[0018] The continuous matching by PS in the continuous matching unit 3 is performed using the composite LPC mel-cepstrum similarity measure given by the following equation.

[0019]

[Equation 1]

[0020] In equation (1), C is the LPC mel-cepstrum, and W_m^(Ki) and φ_m^(Ki) are, respectively, the weight obtained from the eigenvalues and the eigenvector for PS name Ki. (·) denotes the inner product and ‖ ‖ denotes the norm.
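Equation (1) itself is not reproduced in this text. One form that is consistent with the symbols defined in [0020], and that is assumed here purely for illustration, is a weighted sum of squared inner products normalized by the squared norm of C: S(Ki) = Σ_m W_m^(Ki) (C · φ_m^(Ki))² / ‖C‖². The sketch below computes that assumed form; it should not be read as the patent's exact equation.

```python
def composite_similarity(c, weights, eigvecs):
    """Assumed similarity: sum_m w_m * (c . phi_m)^2 / ||c||^2.

    c       : feature vector (e.g. LPC mel-cepstrum coefficients)
    weights : per-axis weights W_m for one PS
    eigvecs : eigenvectors phi_m for the same PS, each the length of c
    """
    norm_sq = sum(x * x for x in c)
    if norm_sq == 0:
        return 0.0  # avoid dividing by zero for an all-zero vector
    total = 0.0
    for w, phi in zip(weights, eigvecs):
        dot = sum(a * b for a, b in zip(c, phi))
        total += w * dot * dot
    return total / norm_sq
```

Under this assumed form, a vector that lies along a unit eigenvector with weight 1 scores 1.0, and the score falls as the vector turns away from the PS subspace.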

[0021] Of the PS label sequences obtained by the continuous matching unit 3, the first-ranked sequence is sent to an HMM recognition unit 5, which performs word matching using an HMM (hidden Markov model).

[0022] Word matching in the HMM recognition unit 5 is described next. Word matching is performed by passing the first-ranked PS label sequence sent from the continuous matching unit 3 through the HMM of each word (each category).

[0023] A general formulation of the HMM is given here. An HMM has N states S₁, S₂, …, S_N, and the initial state is distributed probabilistically over these N states. For speech, a model is used that makes a state transition with a certain probability (transition probability) at every fixed frame period. On a transition, a label is output with a certain probability (output probability), although null transitions, which change state without outputting a label, may also be introduced. Even when the output label sequence is given, the state transition sequence is not uniquely determined. Because only the label sequence can be observed, the model is called a hidden Markov model (HMM). An HMM model M is defined by the following six parameters.

[0024]
N: number of states (states S₁, S₂, …, S_N)
K: number of labels (labels R = 1, 2, …, K)
p_ij: transition probability, the probability of a transition from S_i to S_j
q_ij(k): the probability of outputting label k on the transition from S_i to S_j
m_i: initial state probability, the probability that the initial state is S_i
F: set of final states

Next, transition constraints reflecting the characteristics of speech are imposed on the model M. In speech, loop transitions that return from a state S_i to a previously visited state (S_i-1, S_i-2, …) are generally not allowed, because they would disturb the temporal order.

[0025] The HMM is evaluated by finding the probability Pr(O/M) that the model M outputs the first-ranked label sequence O₁ = o₁₁, o₂₁, …, o_T1. At recognition time, the HMM recognition unit 5 hypothesizes each model and, using the first-ranked label sequence (PS label sequence) sent from the continuous matching unit 3, searches for the model M that maximizes Pr(O/M). The models (their parameters) hypothesized by the HMM recognition unit 5 are obtained by HMM training and are stored in an HMM buffer 6.
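One common way to compute Pr(O/M) for a model of this kind is the forward algorithm; the text does not say which procedure the HMM recognition unit 5 uses, so the following is a sketch under that assumption. Labels are emitted on transitions, as in the q_ij(k) definition, and null transitions are left out for brevity. The dictionary layout with keys "m", "p", "q", and "F" is hypothetical.

```python
def sequence_probability(model, labels):
    """Forward algorithm for an HMM that emits a label on each transition.

    model["m"][i]       : initial probability of state i
    model["p"][i][j]    : transition probability i -> j
    model["q"][i][j][k] : probability of label k on transition i -> j
    model["F"]          : iterable of final state indices
    """
    alpha = list(model["m"])
    n = len(alpha)
    for k in labels:
        # Probability of being in state j after emitting the labels so far.
        alpha = [
            sum(alpha[i] * model["p"][i][j] * model["q"][i][j][k] for i in range(n))
            for j in range(n)
        ]
    return sum(alpha[i] for i in model["F"])


def best_model(models, labels):
    """Return the name of the word model with the highest Pr(O/M)."""
    return max(models, key=lambda name: sequence_probability(models[name], labels))
```

In use, `models` would map each word (for example, each station name) to its trained parameters, and `labels` would be the first-ranked PS label sequence encoded as integer indices.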

[0026] By recognizing the uttered input speech as described above, the input speech, for example a destination station name, can be recognized with high accuracy.

[0027] In addition to the A/D converter 1, analysis/feature extraction unit 2, continuous matching unit 3, PS dictionary unit 4, HMM recognition unit 5, and HMM buffer 6 described above, the speech recognition device of FIG. 1 has a display unit 7, a selection unit 8, and a control unit 9.

[0028] Under control of the control unit 9, the display unit 7 displays, for example, a prompt to utter the destination station name (here, the character string "Please speak") and the recognition result (recognition candidates) from the HMM recognition unit 5. The recognition result may be displayed as the first-ranked candidate only, as shown in FIGS. 3(a) and 3(b), or as a plurality of candidates, for example the first-ranked candidate (here, "Osaki" (大崎)) and the second-ranked candidate (here, "Kawasaki" (川崎)), as shown in FIG. 3(c).

[0029] The selection unit 8 is configured so that the user can select a recognition candidate displayed on the display unit 7. The selection unit 8 and the display unit 7 are implemented using, for example, a touch panel that combines a pressure-sensitive transparent tablet with a liquid crystal display (a display monitor such as a CRT display may also be used).

[0030] The control unit 9 controls the display on the display unit 7 according to, among other things, the recognition result of the HMM recognition unit 5, and controls an external device (here, the ticket vending machine) according to selection instruction information from the selection unit 8.

[0031] FIG. 2 shows the configuration of the display unit 7 and the selection unit 8 in the speech recognition device of FIG. 1.

[0032] As shown in FIG. 2, the display unit 7 comprises a liquid crystal display 71 and a display memory 72 that stores the display information to be shown on the display 71. The display information in the display memory 72 is written by the control unit 9 shown in FIG. 1. In addition to the utterance prompt and the recognition candidates described above, the content shown on the liquid crystal display 71 includes: display information prompting the user to confirm the displayed recognition candidate when only the first-ranked candidate is displayed (the character string "Please confirm"); display information for an area (key area) used for the user's on-screen confirmation operation (an item key labeled [Confirm]); display information for an area (key area) that accepts a re-utterance by the user when the displayed recognition candidate is wrong, that is, when the candidate the user intended is not displayed (an item key labeled [Reword]); and display information prompting the user to select among recognition results (the character string "Please select").

[0033] The selection unit 8, on the other hand, comprises: a pressure-sensitive sheet-type transparent tablet 81 laminated on the display screen of the display unit 7, that is, on the liquid crystal display 71, and formed integrally with the display 71; a designated coordinate detection unit 82 that detects the coordinate position on the surface of the transparent tablet 81 when the tablet is pressed by the user's finger or the like; and an instruction information determination unit 83. The instruction information determination unit 83 determines, from the coordinates detected by the designated coordinate detection unit 82 and the contents of the display memory 72, which item key (display information) on the screen the user has selected, and sends the determination result to the control unit 9 as selection instruction information.
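The determination made by the instruction information determination unit 83 is essentially a hit test of the touched coordinates against the key areas recorded in the display memory. The sketch below assumes each item key is stored as an axis-aligned rectangle; the `ItemKey` type, its fields, and the pixel values are hypothetical.

```python
from typing import NamedTuple, Optional


class ItemKey(NamedTuple):
    label: str   # e.g. "Confirm", "Reword", or a candidate word
    x: int       # left edge of the key area
    y: int       # top edge of the key area
    width: int
    height: int


def resolve_touch(display_memory, x, y) -> Optional[str]:
    """Return the label of the item key containing (x, y), or None."""
    for key in display_memory:
        if key.x <= x < key.x + key.width and key.y <= y < key.y + key.height:
            return key.label
    return None
```

A touch that lands outside every key area returns `None`, which a caller could simply ignore.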

[0034] In this configuration, suppose that the user, prompted to speak under control of the control unit 9, says "Osaki" as the destination station name, that the HMM recognition unit 5 performs recognition on that speech, and that a plurality of recognition candidates are sent to the control unit 9 as the recognition result. Suppose also that this recognition result includes "Osaki" (大崎) as the first-ranked candidate.

[0035] The control unit 9 is configured to accept all recognition results from the HMM recognition unit 5 during a fixed period that begins a fixed time before the utterance prompt timing and spans that timing (or during the period until the user selects a candidate). This relaxes the restriction on when the user may speak. Instead of setting a period during which the control unit 9 accepts recognition results from the HMM recognition unit 5, an operating period may be set for the HMM recognition unit 5 itself. Alternatively, a switch may be provided for setting the input period of the speech to be recognized, so that speech uttered by the user is input to the device only while the user holds the switch on, all of that input speech is recognized by the HMM recognition unit 5, and the recognition results are accepted by the control unit 9.
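The two acceptance periods described in [0035] can be expressed as simple predicates over timestamps. The function names and the lead and tail durations below are assumptions; the text only says the period starts "a fixed time before" the prompt and spans it, or alternatively lasts until a candidate is selected.

```python
def in_prompt_window(result_time, prompt_time, lead_s=2.0, tail_s=10.0):
    """Fixed window that opens lead_s before the prompt and closes tail_s after it."""
    return prompt_time - lead_s <= result_time <= prompt_time + tail_s


def before_selection(result_time, selection_time=None):
    """Alternative rule: accept every result until a candidate has been selected."""
    return selection_time is None or result_time < selection_time
```

Either predicate lets a result that arrives slightly before the prompt through, which is the point of the relaxed timing.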

[0036] The control unit 9 receives the plurality of recognition candidates sent from the HMM recognition unit 5, including the first-ranked candidate "Osaki", and causes the first-ranked candidate "Osaki" to be displayed on the screen of the display unit 7 (its liquid crystal display 71), for example as shown in FIG. 3(a). At the same time, the screen shows, for example, the prompt message "Please confirm" at the top, the item key [Confirm] at the lower right, and the item key [Reword] at the lower left.

[0037] If the recognition candidate "Osaki" displayed on the screen is correct, the user responds to the "Please confirm" request by touching (the area of) the [Confirm] item key at the lower right of the display screen with a finger on the transparent tablet 81.

[0038] The coordinates of the [Confirm] item key are then detected by the designated coordinate detection unit 82. From the coordinates detected by the designated coordinate detection unit 82 and the contents of the display memory 72, the instruction information determination unit 83 determines that the item key indicated by the display information shown at the detected coordinate position, that is, [Confirm], has been selected, and sends selection instruction information to that effect to the control unit 9. The control unit 9 then controls the ticket vending machine on the basis that the first-ranked candidate "Osaki" has been confirmed.

[0039] Instead of displaying a [Confirm] item key as in FIG. 3(a), the first-ranked recognition candidate itself may serve as the [Confirm] item key, as shown in FIG. 3(b), so that the user confirms by touching the display area of "Osaki" with a finger.

[0040] When the user says "Osaki" and the first-ranked candidate "Osaki" and the second-ranked candidate "Kawasaki" are displayed in a form that doubles as [Confirm] item keys, as shown in FIG. 3(c), the user responds to the "Please confirm" request by touching the display area of "Osaki" with a finger. The instruction information determination unit 83 then determines that "Osaki" has been selected and sends selection instruction information to that effect to the control unit 9.

[0041] Whether only the first-ranked candidate is displayed, as in FIG. 3(a) or 3(b), or candidates down to the second rank (a plurality of candidates) are displayed, as in FIG. 3(c), is decided by the control unit 9. The decision condition is, for example, whether the difference between the first- and second-ranked similarities exceeds a first predetermined value, or whether the second-ranked similarity value is below a second predetermined value. Alternatively, words (here, station names) that are easily misrecognized may be prepared in a table; if the second-ranked candidate is in the table, the first- and second-ranked candidates are displayed, and if it is not, only the first-ranked candidate is displayed.
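The table-based alternative at the end of [0041] can be sketched as a set lookup on the second-ranked word. The table contents below are invented for illustration; the source does not list which station names are considered easy to misrecognize.

```python
# Hypothetical table of easily misrecognized station names.
CONFUSABLE_WORDS = {"Osaki", "Kawasaki"}


def candidates_by_table(ranked_words, confusable=CONFUSABLE_WORDS):
    """Show the top two words only if the second-ranked word is in the table."""
    if len(ranked_words) >= 2 and ranked_words[1] in confusable:
        return ranked_words[:2]
    return ranked_words[:1]
```

Unlike the threshold rule, this variant needs no similarity values at display time, only the ranked words.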

[0042] Thus, in this embodiment, when the second-ranked candidate is unlikely to be correct, only the first-ranked candidate is displayed, making selection easier for the user.

[0043] In this embodiment, even if the user says "uh" where "Osaki" should have been said, so that the HMM recognition unit 5 yields "Ueno" (上野) as the first-ranked candidate and "Mejiro" (目白) as the second-ranked candidate and the display of FIG. 3(d) appears, a corrective utterance (re-utterance) can easily be made, as described below.

[0044] In this case, the user touches (the area of) the [Reword] item key with a finger on the transparent tablet 81.

[0045] Then, in the same way as when the [Confirm] item key was touched, the instruction information determination unit 83 in the selection unit 8 determines that [Reword] has been selected and sends selection instruction information to that effect to the control unit 9. The control unit 9 judges that rewording (re-utterance) has been requested, enters the rewording mode (re-utterance mode), and prompts for the re-utterance, for example with a "beep" sound.

[0046] When the user says "Osaki" in response to this prompt, the HMM recognition unit 5 performs recognition on that speech as described above. Suppose this recognition yields a result in which "Osaki", the word the user intended, is the first-ranked candidate, and that the similarity of the first-ranked candidate "Osaki" exceeds that of the second-ranked candidate by at least the first predetermined value.

[0047] The recognition result of the HMM recognition unit 5 is sent to the control unit 9. In the rewording mode, the control unit 9 receives the recognition result from the HMM recognition unit 5 and additionally displays its first-ranked candidate "Osaki" (via the display memory 72), as shown in FIG. 3(e), at the position following the candidates "Ueno" and "Mejiro" already displayed on the display unit 7 (its liquid crystal display 71) as in FIG. 3(d). The user can thus select the candidate corresponding to what was said from among both the already displayed candidates and the additionally displayed candidate. If the display condition for the second-ranked candidate described above is satisfied, for example because the difference between the first- and second-ranked similarities is at or below the predetermined value, the second-ranked candidate is also additionally displayed.

[0048] The above describes the case where the user selects the [Reword] item key to declare a rewording (re-utterance) to the device of FIG. 1 (its control unit 9), thereby putting it into the rewording mode, and then says "Osaki" again at the timing indicated by the "beep" or similar sound from the device.

[0049] In this embodiment, however, the rewording mode is not necessarily required for a re-utterance. This is explained below.

[0050] First, in this embodiment, as described above, the control unit 9 accepts all recognition results from the HMM recognition unit 5 for speech uttered during the fixed period spanning the utterance prompt timing (or before a candidate is selected). Therefore, as in the previous example, if the user says "uh" where "Osaki" should have been said, and then correctly says "Osaki" again while the display of FIG. 3(d) is shown, recognition candidates including the first-ranked candidate "Osaki" and the second-ranked candidate "Kawasaki" are obtained by the HMM recognition unit 5 and received by the control unit 9.

【００５１】[0051]

As long as the recognition candidates displayed on (the liquid crystal display 71 of) the display unit 7 have not yet been selected through the selection unit 8, the control unit 9 behaves as in rephrase mode: among the recognition candidates newly received from the HMM recognition unit 5, it additionally displays, for example, the first candidate "Osaki" next to the candidates received and displayed earlier, as shown in FIG. 3(e).
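The behavior of the control unit 9 described here can be sketched roughly as follows. The class `CandidateDisplay` and its method names are hypothetical, and ignoring results that arrive after a selection is an assumption made for this sketch; the specification only states that additional display takes place before a candidate is selected.

```python
from typing import List, Optional


class CandidateDisplay:
    """Toy model of the control unit 9 driving the display unit 7."""

    def __init__(self) -> None:
        self.shown: List[str] = []           # candidates currently on screen
        self.selected: Optional[str] = None  # set once the user picks one

    def on_recognition_result(self, candidates: List[str]) -> bool:
        """Append new candidates unless a selection was already made."""
        if self.selected is not None:
            return False  # assumption: later results are ignored
        self.shown.extend(candidates)  # earlier candidates are never erased
        return True

    def select(self, label: str) -> None:
        """Model the selection unit 8 choosing a displayed candidate."""
        if label not in self.shown:
            raise ValueError(f"{label!r} is not displayed")
        self.selected = label


display = CandidateDisplay()
display.on_recognition_result(["Ueno", "Mejiro"])  # result for "Eh"
display.on_recognition_result(["Osaki"])           # result for the re-utterance
display.select("Osaki")
print(display.shown, display.selected)
```

After the two results, the display holds "Ueno", "Mejiro", and "Osaki" in that order, which corresponds to the transition from FIG. 3(d) to FIG. 3(e).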

【００５２】[0052]

Thus, according to this embodiment, even if the user utters again without paying attention to the timing of the re-utterance, the control unit 9 receives the recognition result that the HMM recognition unit 5 produces for the re-uttered voice, and its first candidate (or its first and second candidates) is additionally displayed alongside the candidates already displayed. The user can then select the candidate corresponding to the uttered voice from among both the candidates already displayed and the candidates additionally displayed.

【００５３】[0053]

Therefore, even when the user simply says "Eh, Osaki" without any awareness of re-uttering, recognition candidates for "Eh" (including the first candidate "Ueno" and the second candidate "Mejiro") are obtained first and displayed (for example, the first and second candidates). Recognition candidates for the "Osaki" that follows "Eh" (including the first candidate "Osaki" and the second candidate "Kawasaki") are then obtained and additionally displayed (for example, the first candidate).

【００５４】[0054]

The reason for not erasing the candidates already displayed ("Ueno" and "Mejiro" in the example of FIG. 3(e)) is as follows.

【００５５】[0055]

First, in this embodiment, as described above, re-utterance is permitted at any timing (within a fixed period spanning the utterance-prompt timing of the device). Suppose a scheme that erases the candidates already displayed were applied instead. If the user correctly uttered "Osaki", so that candidates with "Osaki" ranked first were displayed as in FIG. 3(a) or (b), and the user then mistakenly made an unrelated utterance, or noise or the like entered, the display would change before the candidate could be selected, which is a problem.

【００５６】[0056]

To prevent this problem, this embodiment adopts a scheme that does not erase the candidates already displayed and instead additionally displays the newly recognized candidates. With this scheme, if the user correctly utters "Osaki", so that candidates with "Osaki" ranked first are displayed as in FIG. 3(a) or (b), and the user then mistakenly makes an unrelated utterance, or noise or the like enters, the candidates already displayed remain, so the user can still select the correct candidate.
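The difference between the two schemes can be made concrete with a small comparison. The function names and the assumption that the spurious input happens to be recognized as "Ueno" are illustrative only and do not come from the specification.

```python
from typing import Callable, List

Policy = Callable[[List[str], List[str]], List[str]]


def replace_policy(shown: List[str], new: List[str]) -> List[str]:
    """Erase what is displayed and show only the newest candidates."""
    return list(new)


def append_policy(shown: List[str], new: List[str]) -> List[str]:
    """Keep what is displayed and add the newest candidates after it."""
    return list(shown) + list(new)


def final_display(policy: Policy, results: List[List[str]]) -> List[str]:
    """Apply a display policy to a sequence of recognition results."""
    shown: List[str] = []
    for candidates in results:
        shown = policy(shown, candidates)
    return shown


# A correct utterance followed by noise that is (hypothetically)
# recognized as "Ueno" before the user has selected anything.
results = [["Osaki"], ["Ueno"]]
print(final_display(replace_policy, results))
print(final_display(append_policy, results))
```

Under the erasing scheme the correct candidate "Osaki" is no longer available for selection, whereas under the additional-display scheme it remains on screen.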

【００５７】[0057]

When rephrase mode is set, the candidates displayed at that point may be erased. In addition, when a candidate matching the one to be additionally displayed is already displayed, an unnecessary increase in selection candidates can be avoided by not performing the additional display of that candidate, or by highlighting that candidate with a predetermined display attribute (for example, blinking).
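The handling of a candidate that is already displayed might be sketched as below. The function `add_candidate`, the `highlight` flag, and the representation of blinking as a set of labels are all assumptions for illustration; the specification does not describe how the display attribute is stored.

```python
from typing import List, Set, Tuple


def add_candidate(
    shown: List[str], blinking: Set[str], label: str, highlight: bool
) -> Tuple[List[str], Set[str]]:
    """Add `label` to the display unless it is already shown.

    If it is already shown, either leave the display untouched or,
    when `highlight` is True, mark the existing entry as blinking.
    """
    if label in shown:
        if highlight:
            return shown, blinking | {label}
        return shown, blinking
    return shown + [label], blinking


shown, blinking = add_candidate(["Ueno", "Osaki"], set(), "Osaki", highlight=True)
print(shown, blinking)
```

In both variants the list of selectable candidates keeps its length when a duplicate arrives; only the highlighting differs.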

【００５８】[0058]

The above describes application to a voice recognition device used in a ticket vending machine at a station, but the present invention is not limited to this and is applicable to voice recognition devices in general.

【００５９】[0059]

[Effects of the Invention]

As described above, the voice recognition device of the present invention is provided with display means used to display one or more recognized candidates, selection means for selecting one of the candidates displayed on the display means, and control means for controlling the display means. For a voice that is input by the voice input means and recognized by the recognition means before a candidate is selected by the selection means, one or more of the corresponding candidates are additionally displayed on the display means under the control of the control means. Therefore, even if an erroneous recognition result is obtained at first, the user need only utter the intended voice again: the recognition candidates for the re-uttered voice are additionally displayed, and the correct candidate can be selected.

【００６０】[0060]

Furthermore, according to the present invention, suppose that the correct candidate is among those already displayed and that, before it is selected, the user mistakenly makes an utterance unrelated to the intended recognition target, or noise enters, so that an unrelated recognition is made. Even then the candidates already displayed are not erased, so the user can select the correct candidate.

【００６１】[0061]

Thus, according to the present invention, the restrictions on the timing of utterance and the timing of selection are greatly relaxed, which improves usability.

[Brief description of drawings]

FIG. 1 is a block diagram showing the basic configuration of a voice recognition device according to an embodiment of the present invention.

FIG. 2 is a block diagram showing the configuration of the display unit 7 and the selection unit 8 in the voice recognition device of FIG. 1.

FIG. 3 is a diagram showing example display screens for explaining the operation of the embodiment.

[Explanation of symbols]

1 ... A/D converter (voice input means), 2 ... analysis/feature extraction unit, 3 ... continuous matching unit, 4 ... PS dictionary unit, 5 ... HMM recognition unit, 6 ... HMM buffer, 7 ... display unit, 8 ... selection unit, 9 ... control unit, 71 ... liquid crystal display, 72 ... display memory, 81 ... transparent tablet, 82 ... indicated-coordinate detection unit, 83 ... instruction information determination unit.

Claims

[Claims]

1. A voice recognition device comprising voice input means for inputting a voice uttered by a user, analysis/feature extraction means for analyzing the voice input by the voice input means and extracting a feature amount, and recognition means for recognizing the voice using the feature amount extracted by the analysis/feature extraction means to obtain a plurality of recognition candidates, the voice recognition device further comprising: display means for displaying one or more of the recognition candidates obtained by the recognition means; selection means for selecting one of the recognition candidates displayed on the display means; and control means for controlling the display means, the control means causing the display means to additionally display one or more of the candidates recognized by the recognition means for a voice that is input by the voice input means before candidate selection by the selection means is performed.

2. The voice recognition device according to claim 1, wherein the control means causes the display means to display only the first-ranked candidate if the relative or absolute certainty of the second-ranked candidate among the candidates obtained by the recognition means is below a predetermined level, and otherwise causes the display means to display the candidates down to at least the second rank.

3. The voice recognition device according to claim 1, wherein, when a candidate matching the candidate to be additionally displayed is already displayed, the control means suppresses the additional display of that candidate or highlights that candidate with a predetermined display attribute.