JP2005189287A

JP2005189287A - Speech recognition device and speech recognition method

Info

Publication number: JP2005189287A
Application number: JP2003427319A
Authority: JP
Inventors: Mitsunobu Kaminuma; 充伸神沼; Masaru Yamazaki; 勝山崎; Akinobu Ri; 晃伸李
Original assignee: Nissan Motor Co Ltd
Current assignee: Nissan Motor Co Ltd
Priority date: 2003-12-24
Filing date: 2003-12-24
Publication date: 2005-07-14
Anticipated expiration: 2023-12-24
Also published as: JP4411965B2

Abstract

PROBLEM TO BE SOLVED: To provide an efficient speech recognition device that starts with a small-capacity dictionary and dynamically adds content to it by learning from the use environment. Conventionally, when speech recognition equipment is operated on a means of transportation such as a vehicle, the only choice has been between restricting the task to raise recognition accuracy and relaxing the task restriction at the cost of a large amount of memory.

SOLUTION: The speech recognition device recognizes an input speech signal by using a network language dictionary; an extracting means extracts other vocabulary items that have a large amount of frequency information on connection or appearance with respect to a vocabulary item obtained as the recognition result; and an adding means adds those other vocabulary items, together with the frequency information on connection and appearance, to the network language dictionary.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to a speech recognition device and a recognition method for limited applications, in particular the control of equipment mounted on means of transportation such as automobiles, or navigation devices and the like.

Conventionally, as described in Non-Patent Document 1 below, the dictionaries used in speech recognition devices have been of two kinds: a network grammar language dictionary, in which the grammar over words is described by a finite automaton, and a statistical language dictionary, which expresses statistical connection probabilities for words or between words. The former describes the connection patterns of words in advance. It has the advantages of a simple structure and a small dictionary capacity, but the user is required to follow the preset word-connection grammar, so high recognition accuracy is obtained only in limited applications.

A statistical language dictionary, on the other hand, stores a large amount of sample data and uses statistical methods to estimate the transition probabilities between vocabulary items (words, idioms, or morphemes). It is flexible with respect to the application but requires a large amount of memory. For relatively limited uses much of that memory is wasted, and the recognition rate is not as high as with the network grammar language dictionary.

Methods for solving these problems have been reported, for example in Non-Patent Document 2 and Non-Patent Document 3 below, but they have not reached a practical level.

Non-Patent Document 1: "Speech Recognition System", Ohmsha Publishing.
Non-Patent Document 2: Tsurumi, Lee, Saruwatari, Shikano, "Examination of an Algorithm Using Vocabulary N-gram and Network Grammar", Acoustical Society of Japan, 2002 Autumn Meeting.
Non-Patent Document 3: Konashi et al., "A Study on Automatic Grammar Generation for Spoken Dialogue System Using Dialogue Corpus", Acoustical Society of Japan, Autumn 2003 collected papers.

The object of the present invention is to provide, with the use of a network language dictionary in mind, a network language dictionary configured so that it can adapt flexibly to the pronunciation grammar used by the user.

To achieve the above object, the present invention has storage means that stores a network language dictionary and a statistical language dictionary. When an arbitrary vocabulary item A in the recognized vocabulary output by the recognition means, which recognizes the input speech, exists in the statistical language dictionary, a vocabulary item B that has a large amount of connection or appearance frequency information with respect to vocabulary A is extracted, and this frequency information and vocabulary B are added to the network language dictionary. (Here, frequency information means information on the appearance and connection of vocabulary: the appearance frequency and appearance probability of each vocabulary item, which vocabulary it transitions to, which vocabulary it is reached from, and what appearance frequency and appearance probability a given connected vocabulary sequence has. The same applies hereinafter.)

With this configuration, recorded content can be added to the network language dictionary at any time, even while the speech recognition device is operating, so that a dictionary matched to the conditions of use is built dynamically. As a result, only the basic grammar needs to be described at design time, and high recognition accuracy can be obtained while using memory capacity effectively. This makes it possible to produce a speech recognition device, and a language dictionary for it, that are also advantageous in cost.

In speech recognition, the input speech is AD-converted and the language expression ω that best matches the resulting discrete sequence x is estimated. To predict the language expression ω, a dictionary in which language expressions are described in advance (hereinafter, a language dictionary) is required. Two kinds have been proposed in the conventional art: a dictionary in which the grammar over vocabulary is described by a finite automaton (hereinafter, a network grammar language dictionary), and a statistical language dictionary that expresses the statistical connection probabilities between vocabulary items.

In a network grammar language dictionary, the system designer predicts the sentence patterns that a user may utter in the task and describes the permitted vocabulary connection patterns in advance. It has the advantages of a simple structure and a small dictionary capacity, and it is known to give extremely high recognition accuracy as long as the user provides the input the system designer intended. However, it is fragile against user utterances that are not included in the patterns. Human utterance patterns generally fluctuate: even when the same content is spoken, the pattern is not always the same because particles are omitted or inverted, the subject is dropped, and so on. Even if every utterance pattern were surveyed and predicted in advance and the grammar written accordingly, user utterances vary widely between individuals, and covering all of them is difficult. When the scope of the task is limited to some extent, as with voice commands, building a grammar that covers everything is not impossible. But adding grammar to raise coverage of such ambiguity and fluctuation inappropriately widens the range of sentences the grammar accepts, and the resulting increase in unnecessary hypotheses degrades the recognition rate.
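A network grammar of this kind can be pictured as a small finite automaton over vocabulary items. The following is a minimal sketch for illustration only, not the patent's implementation: the `GRAMMAR` table, its state names, the example words, and the `accepts` helper are all assumptions made for this example.

```python
# Hypothetical network grammar: state -> {word: next_state}.
GRAMMAR = {
    "start": {"air_conditioner": "device", "radio": "device"},
    "device": {"on": "end", "off": "end"},
}
FINAL_STATES = {"end"}


def accepts(words, grammar=GRAMMAR, start="start", finals=FINAL_STATES):
    """Return True if the word sequence follows a path the grammar allows."""
    state = start
    for word in words:
        transitions = grammar.get(state, {})
        if word not in transitions:
            return False  # the utterance falls outside the designed patterns
        state = transitions[word]
    return state in finals
```

An utterance in the designed order is accepted, while the same words in an inverted order are rejected, which is the fragility described above.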

A statistical language dictionary, on the other hand, estimates vocabulary-to-vocabulary transition probabilities from a large amount of sample data by statistical methods. It therefore requires a large amount of memory, which not only increases redundancy for a specific application but also causes problems in recognition accuracy and processing time.
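As a rough illustration of estimating transition probabilities from sample data, the sketch below counts adjacent word pairs and converts the counts to relative frequencies. The function name `bigram_probabilities` and the sample sentences are assumptions for this example. The text does not specify the estimator, so plain relative frequency is used here.

```python
from collections import Counter, defaultdict


def bigram_probabilities(sentences):
    """Estimate P(next | current) by relative frequency over tokenized sentences."""
    pair_counts = defaultdict(Counter)
    for words in sentences:
        for current, following in zip(words, words[1:]):
            pair_counts[current][following] += 1
    return {
        current: {
            following: count / sum(followers.values())
            for following, count in followers.items()
        }
        for current, followers in pair_counts.items()
    }


# Hypothetical sample data.
samples = [
    ["air_conditioner", "on"],
    ["air_conditioner", "on"],
    ["air_conditioner", "off"],
    ["radio", "on"],
]
probabilities = bigram_probabilities(samples)
```

Every observed pair needs its own entry, which hints at why such a dictionary grows large when it is trained on a broad corpus.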

As already stated, applications that use speech recognition inside aircraft, ships, vehicles, work vehicles, and other means of transportation, and whose scope is relatively limited, as in a navigation system, only need to accept speech limited to specific tasks such as address input and operation command input. Language dictionaries based on a network grammar have therefore been widely used. However, speech recognition with a network grammar language dictionary requires the acceptable grammar to be determined in advance, as described above, so one of the following conditions must be satisfied at the design stage:
1. The grammar the user will use is known in advance.
2. Every grammar the user may utter is described.
A language dictionary called an N-gram language dictionary (hereinafter, N-gram dictionary), on the other hand, allows a high degree of freedom in the acceptable grammar, but its recognition accuracy is lower than that of a network grammar language dictionary and its computation time is longer, so it has not been used for accepting task-limited speech. The N-gram dictionary is formed within the statistical language dictionary and holds an N-gram model (that is, connection probabilities based on N-grams) and a vocabulary dictionary. The N-gram model is the simplest and most widely used method for estimating vocabulary-to-vocabulary transition probabilities from a large amount of sample data by statistical means in a statistical language dictionary. When estimating the input language expression, it approximates the occurrence probability P(ω_1 ω_2 … ω_n) of the input vocabulary sequence ω_1 ω_2 … ω_n by a product of probabilities, each for one vocabulary item conditioned on the items immediately preceding it.
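For reference, the N-gram approximation is commonly written as shown below. This is the standard textbook form, offered as an assumption about the equation, which is not legible in this text. N is the order of the N-gram and ω_i is the i-th vocabulary item of the input sequence.

```latex
P(\omega_1 \omega_2 \cdots \omega_n) \approx \prod_{i=1}^{n} P(\omega_i \mid \omega_{i-N+1} \cdots \omega_{i-1})
```

With N = 2 (a bigram model), each factor reduces to P(ω_i | ω_{i-1}), the transition probability between adjacent vocabulary items.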

However, of the conditions required for the network grammar language dictionary above, condition 2 is difficult to meet because of design cost and similar issues: a large number of diverse grammars would have to be considered. What is desired, therefore, is a way of building a speech recognition dictionary that keeps the ability to accept free-form utterances, as an N-gram dictionary does, while dynamically acquiring, even while recognition is in progress, recognition performance close to that of a network grammar language dictionary (hereinafter, network language dictionary) under specific conditions. If the sample data used for training is large enough, an N-gram language dictionary automatically includes many connection patterns between vocabulary items, so unlike a network grammar it can accept grammar phrased in ways the designer did not imagine. However, speech recognition with a statistical language dictionary requires a great deal of memory and computation, and using it for recognition limited to a specific task is wasteful in terms of cost. Moreover, although it offers more freedom than a network language dictionary, its recognition rate is lower (Non-Patent Document 1).

In the method described in Non-Patent Document 2 (the GA method), the network grammar is determined in advance. Based on that information, the log-likelihood of each connected vocabulary sequence in the N-gram dictionary whose category matches the network grammar is multiplied by a coefficient, which adjusts the final recognition score. The larger the vocabulary contained in the network grammar, the more connected vocabulary is adjusted at output, and the closer the output comes to that of a dictionary using the network grammar language dictionary alone. Simply applying the GA method to a navigation system task is therefore expected to yield little benefit over a network language dictionary.

For these reasons, the present invention extracts information on the frequency of vocabulary and connected vocabulary present in the statistical language dictionary, and dynamically adds the extracted vocabulary and information, while the device is operating, to the existing network language dictionary that the designer first created by hand. In this way the network language dictionary is extended automatically even if the designer has not described in advance every grammar the user may utter. With this method, vocabulary appropriate to the target task can be added to the network language dictionary automatically. That is, (1) only the necessary vocabulary (vocabulary the user may utter) is added to the network language dictionary, and only when needed, and (2) the size of the network language dictionary can be changed as it goes.
In this description, the base network grammar language dictionary is assumed to use vocabulary connection patterns that the system designer can predict, but it is not limited to this. Other base network dictionaries are conceivable, for example one built by setting conditions according to the application or the user's preferences and retrieving a suitable base language from a large language database, or one built by extracting only vocabulary with a high appearance frequency.

As described above, Non-Patent Document 3 aims to improve the narrowing-down performance of the grammar by assigning transition probabilities to vocabulary that already exists in the grammar. The present invention differs in that it dynamically extends the grammar (adds new vocabulary) based on statistical information.

The configuration of the present invention is described below with reference to the drawings.

(First embodiment)
FIG. 1 shows the system of means for performing speech recognition processing according to the first embodiment of the present invention, and FIG. 2 shows the device configuration for this system. In the system diagram of FIG. 1, the speech input means 110 collects the speech uttered by the user and converts it into a digital speech signal that is easy to handle. It corresponds to the microphone 335 and the AD converter 330 in FIG. 2, and specifically consists of a speech input device, typified by a microphone, and an AD converter or similar device that discretizes the signal in real time. The speech signal is collected here and AD-converted into a discrete speech signal.

The speech recognition means 120 in FIG. 1 recognizes the input speech signal and sends out a recognition result signal W1, which has been converted into an information form such as text. This is realized by the arithmetic unit 320 and the storage device 310 in FIG. 2. The arithmetic unit 320 may be, for example, one or a combination of CPUs, MPUs, and DSPs of the kind that make up a system with computing functions, such as a general personal computer, microcomputer, or signal processing device, and it should preferably have enough computing power for real-time processing. The storage device may likewise be any device with the information storage capability used in general information processing equipment, such as cache memory, main memory, disk memory, flash memory, or ROM. The recognition result signal W1 sent from the speech recognition means is converted into information to be presented to the user, or into an operation signal for other equipment.

The storage means 130 in FIG. 1 holds the network language dictionary 131 used for pattern matching in speech recognition, the statistical language dictionary 132 containing the information for automatically generating the network language dictionary 131, and the phoneme and vocabulary dictionary 133, which acts as an intermediary when data is exchanged between the speech recognition means 120 and the network language dictionary. It corresponds to the storage device 310 in FIG. 2. Specifically, any device with the rewritable information storage capability used in general information processing equipment may be used, such as cache memory, main memory, disk memory, or flash memory.

The extraction means 150 in FIG. 1 takes vocabulary A, obtained as text by recognizing the user's utterance, and extracts from the statistical dictionary one or more vocabulary items B that have a large amount of connection frequency information with respect to A, the frequency information on the connection between vocabulary A and vocabulary B, and the frequency information on the appearance of vocabulary B. This can be realized by the arithmetic unit 320 and the storage device 310 in FIG. 2.

The adding means 140 in FIG. 1 adds to the network language dictionary 131 the vocabulary B extracted by the extraction means 150 in FIG. 1, the frequency information on the connection between vocabulary A and vocabulary B, and the frequency information on the appearance of vocabulary B.
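The division of work between the extraction means and the adding means can be sketched as two small functions. This is an illustrative sketch under assumptions, not the patented implementation: the statistical dictionary is represented here as a `unigram` count table and a `bigram` count table, and the function names, the nested-dict layout of the network dictionary, and the numeric threshold are all hypothetical.

```python
def extract(word_a, unigram, bigram, threshold):
    """Extraction: followers B of word_a whose connection count exceeds threshold.

    Returns tuples of (word_b, connection frequency A->B, appearance frequency of B).
    """
    return [
        (b, count, unigram.get(b, 0))
        for (a, b), count in bigram.items()
        if a == word_a and count > threshold
    ]


def add(network_dict, word_a, extracted):
    """Adding: attach each extracted word under word_a with its frequency information."""
    entry = network_dict.setdefault(word_a, {})
    for word_b, connection, appearance in extracted:
        entry[word_b] = {"connection": connection, "appearance": appearance}
    return network_dict


# Hypothetical counts standing in for the statistical language dictionary.
unigram = {"on": 7, "off": 3}
bigram = {
    ("air_conditioner", "on"): 5,
    ("air_conditioner", "off"): 1,
    ("radio", "on"): 2,
}
network = add({}, "air_conditioner", extract("air_conditioner", unigram, bigram, 2))
```

Keeping extraction separate from addition mirrors the two means in FIG. 1 and lets the threshold be tuned without touching how the network dictionary is stored.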

The flow of recognition processing in the configuration shown in FIGS. 1 and 2 is described next with reference to FIG. 3.

In FIG. 3, when the speech recognition system in the speech recognition device starts operating (START), the system is first initialized in step S110. The network language dictionary 131 is loaded into memory, the speech recognition system including the speech recognition means 120 is started, and the system waits for speech input from the user. Loading the statistical language dictionary 132 into memory at this point as well costs more but makes processing more efficient.

In step S120, the statistical language dictionary 132 is checked for a vocabulary item matching the vocabulary W(i) recognized by the speech recognition means 120. If a matching item is found in the statistical language dictionary 132 (Yes in step S120), the vocabulary W(i) is passed to step S130. If none is found (No in step S120), the system returns to the state immediately after initialization and again waits for speech recognition processing.

In step S130, if the recognized vocabulary W(i) exists in the vocabulary of the statistical language dictionary 132 and there is a vocabulary group Wc(i) whose frequency with respect to W(i) exceeds a threshold J (Yes in step S130), the vocabulary group Wc(i), the frequency information on the connection between W(i) and Wc(i), and the frequency information on the appearance of W(i) and Wc(i) are passed to step S140. If no such vocabulary group Wc(i) exists (No in step S130), the system returns to the state immediately after initialization and again waits for speech recognition processing.

In step S140, unnecessary vocabulary and frequency information are first deleted from what was extracted from the statistical language dictionary 132, namely the vocabulary group Wc(i), the frequency information on the connection between W(i) and Wc(i), and the frequency information on the appearance of W(i) and Wc(i). The vocabulary group Wc(i), the frequency information on the connection from W(i) to Wc(i), and the frequency information on the appearance of Wc(i) and W(i) are then added to the network language dictionary. Here, unnecessary vocabulary means vocabulary that corresponds to no task in the completed network language dictionary 131.

In step S160, the (i+1)-th unprocessed vocabulary W(i+1) is searched for. If an unprocessed vocabulary W(i+1) exists, processing goes through S150 and on to step S130; if not, the system returns to the state immediately after initialization and again waits for speech recognition processing. Here, the unprocessed vocabulary W(i+1) refers to the connected vocabulary group Wc(i) added in step S140.
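The loop through steps S120 to S160 can be sketched as a queue of unprocessed vocabulary. This is a simplified sketch under several assumptions: the statistical dictionary is a plain mapping from a word to its followers and their frequencies, the deletion of unnecessary vocabulary in S140 is omitted, a word with no qualifying followers is simply skipped, and the `seen` set that guards against cycles is not part of the described flow. The function name and data layout are hypothetical.

```python
from collections import deque


def update_network_dictionary(recognized, statistical_dict, network_dict, threshold):
    """Extend network_dict from one recognized word, loosely following FIG. 3."""
    # S120: the recognized word must exist in the statistical dictionary.
    if recognized not in statistical_dict:
        return network_dict  # back to waiting for speech input

    pending = deque([recognized])
    seen = {recognized}  # assumption: avoid revisiting words on cyclic transitions
    while pending:
        word = pending.popleft()
        # S130: followers whose frequency exceeds the threshold J.
        group = {
            follower: freq
            for follower, freq in statistical_dict.get(word, {}).items()
            if freq > threshold
        }
        if not group:
            continue
        # S140: add the group and its connection frequencies under the word.
        network_dict.setdefault(word, {}).update(group)
        # S160/S150: the added words become the next unprocessed vocabulary.
        for follower in group:
            if follower not in seen:
                seen.add(follower)
                pending.append(follower)
    return network_dict


# Hypothetical statistical dictionary: word -> {follower: frequency}.
stats = {"air_conditioner": {"on": 5, "off": 1}, "on": {"please": 4}}
result = update_network_dictionary("air_conditioner", stats, {}, 2)
```

Because each added word is itself looked up, a single recognized word can pull in a chain of followers, which corresponds to the multi-step connections discussed for the later embodiments.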

(Second embodiment)
FIG. 4 is a system diagram of the processing means according to the second embodiment of the present invention. The history means 160 stores an appearance history for each type of recognized vocabulary in the user's utterances and calculates the input frequency of a given vocabulary item. When that input frequency exceeds a certain threshold, the vocabulary is sent to the statistical language dictionary 132 in FIG. 4. The vocabulary type may be a single vocabulary item or a class formed from a set of items. This can be realized by the arithmetic unit 320 and the storage device 310 in FIG. 2. The threshold may be set in advance at design time or set dynamically according to the input frequency of all vocabulary. When the threshold is to be set externally, the input device 340 in FIG. 2 may be used. Specifically, the input device 340 can be any device through which the user enters individual vocabulary items directly, such as a switch, keyboard, or speech input device.

The flow of speech recognition processing in the configuration of FIG. 4 is described below with reference to FIG. 5.
In FIG. 5, in step S120, as in the first embodiment, speech recognition processing has finished and the recognized vocabulary W(i) is looked up in the statistical language dictionary 132. If the vocabulary is found (Yes in step S120), W(i) is passed to step S121.
In step S121, if information on the vocabulary W(i) received from step S120 already exists in the history means 160, the input count of W(i) is incremented; if not, W(i) is added to the history means 160 and a counter for its input frequency is initialized. When the input count of W(i) exceeds the threshold (Yes in step S121), W(i) is passed to step S130. While the threshold is not exceeded (No in step S121), the system returns to the state immediately after initialization and again waits for speech recognition processing. Vocabulary that has already been passed on once should be managed with a flag or the like, to avoid the wasted computation of sending it to S130 again.
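Step S121 amounts to a per-word counter with a threshold and a "already sent" flag. The sketch below is one possible reading, with assumptions stated plainly: the class name `History`, the method `record`, and the choice to count the first input as 1 are hypothetical, since the text only says the counter is initialized.

```python
class History:
    """Hypothetical stand-in for the history means 160."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = {}   # word -> number of times it has been input
        self.sent = set()  # flag: words already passed on to S130

    def record(self, word):
        """Count one input; return True only when the word should go to S130."""
        # Assumption: a new word starts at 0 and this input makes it 1.
        self.counts[word] = self.counts.get(word, 0) + 1
        if word in self.sent:
            return False  # already sent once, skip the repeated computation
        if self.counts[word] > self.threshold:
            self.sent.add(word)
            return True
        return False


history = History(threshold=2)
decisions = [history.record("air_conditioner") for _ in range(4)]
```

With a threshold of 2, the word is passed on at its third input and never again, which is the behavior the flag is meant to guarantee.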

The second embodiment has also been described for the two-step case in which vocabulary A and vocabulary B are connected. As in the first embodiment, however, there may be a three-step case in which vocabulary C is also connected, or multi-step connections with vocabulary D, vocabulary E, and so on. In those cases too, the connection frequency information and appearance frequency information for each vocabulary item at each step, together with vocabulary C, vocabulary D, and so on, are added to the network language dictionary one after another in the same way as above.

(Third embodiment)
The next-utterance prediction means 170 in FIG. 6 predicts the user's next-utterance vocabulary from the vocabulary recognized in the user's utterance, and sends this next-utterance vocabulary to the statistical language dictionary 132 in FIG. 6. This can be realized by the arithmetic unit 320 and the storage device 310 in FIG. 2. When the next utterance to be sent is additionally entered from outside, the input device 340 in FIG. 2 may be used.

The flow of the automatic network language dictionary generation process shown in FIG. 6 is described below with reference to FIG. 7.
In step S120, speech recognition processing has finished and the recognized vocabulary W(i) is looked up in the statistical language dictionary 132. If the recognized vocabulary is found in the statistical language dictionary 132 (Yes in step S120), W(i) is passed to step S122. If it is not found (No in step S120), the system returns to the state immediately after system initialization and again waits for speech recognition processing.
In step S122, the next-utterance vocabulary group Wn(i) is searched for with respect to the vocabulary W(i) received from step S120. If a next-utterance vocabulary group Wn(i) exists (Yes in step S122), processing proceeds to step S130; if not (No in step S122), the system returns to the state immediately after initialization and again waits for speech recognition processing. The rest is the same as in FIG. 5.
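Steps S120 and S122 can be sketched as a lookup that either yields a predicted group or signals a return to waiting. The text does not say how the next utterance is predicted, so the sketch assumes a fixed lookup table. `next_table`, the function name, and the example words are hypothetical.

```python
def next_utterance_group(recognized, statistical_dict, next_table):
    """Return the predicted next-utterance words Wn(i), or [] to keep waiting."""
    # S120: the recognized word must be present in the statistical dictionary.
    if recognized not in statistical_dict:
        return []
    # S122: look up the predicted next-utterance vocabulary group.
    return list(next_table.get(recognized, []))


# Hypothetical data: words known to the statistical dictionary and a prediction table.
statistical_dict = {"destination": {}, "volume": {}}
next_table = {"destination": ["home", "station"]}

group = next_utterance_group("destination", statistical_dict, next_table)
proceed_to_s130 = bool(group)  # an empty group means returning to the wait state
```

An empty result covers both "No" branches, so the caller only needs one check before moving on to S130.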

The third embodiment has been described for the case where a two-step vocabulary, vocabulary A followed by vocabulary B, is connected. A three-step case in which vocabulary C is connected, or a multi-step case that also connects vocabulary D, vocabulary E, and so on, is equally possible. In such cases too, the frequency information on connection and on appearance of each vocabulary at each step, together with vocabulary C, vocabulary D, and so on, may be added to the network language dictionary one after another in the same manner as above.

(Fourth embodiment)
The process of adding recognized vocabulary to the network language dictionary in the processing flow of FIG. 3 is described below using concrete vocabulary, with reference to FIG. 8.

The “language dictionary at time t-1” in FIG. 8 is the network language dictionary 210 before the change, and the “language dictionary at time t” is the network language dictionary 211 after the change, to which vocabulary has been added following speech recognition. When the recognized vocabulary “air conditioner” (エアコン) 220 is input, it is sent to the statistical language dictionary 230. In the statistical language dictionary 230, two vocabularies, “switch on” (入れて) and “turn on” (つけて), are found as vocabularies likely to follow the recognized vocabulary “air conditioner” 220. Both are extracted by the extraction means (150 in FIG. 1) and then added to the network language dictionary 211 by the adding means (140 in FIG. 1). In the network language dictionary 211 of the “language dictionary at time t”, the newly added “switch on” and “turn on” appear as connection vocabulary one level below “air conditioner”. Frequency information on the connections from “air conditioner” to “switch on” and “turn on”, and frequency information on appearance, are added as well.
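The extraction and addition described for FIG. 8 can be pictured with a small sketch. Everything here is an assumption made for illustration: the nested-dictionary layout, the frequency values, the `min_freq` cut-off, and the function names `extract` and `add` are hypothetical and are not taken from the patent.

```python
from copy import deepcopy
from typing import Dict

Links = Dict[str, Dict[str, float]]

# Hypothetical excerpt of statistical language dictionary 230.
STATISTICAL: Links = {
    "air conditioner": {"switch on": 0.4, "turn on": 0.3, "window": 0.01},
}

# Network language dictionary 210 (time t-1): word -> {next word: frequency}.
NETWORK_T_MINUS_1: Links = {
    "air conditioner": {"ON": 0.5, "OFF": 0.5},
    "CD": {},
    "radio": {},
}


def extract(word: str, min_freq: float = 0.1) -> Dict[str, float]:
    """Extraction means: followers of `word` whose frequency is at least min_freq."""
    followers = STATISTICAL.get(word, {})
    return {w: f for w, f in followers.items() if f >= min_freq}


def add(network: Links, word: str, extracted: Dict[str, float]) -> Links:
    """Adding means: return a copy with the extracted words linked below `word`."""
    updated = deepcopy(network)
    for follower, freq in extracted.items():
        # Existing links keep their frequency; new links take the extracted one.
        updated.setdefault(word, {}).setdefault(follower, freq)
    return updated


# Network language dictionary 211 (time t).
network_t = add(NETWORK_T_MINUS_1, "air conditioner", extract("air conditioner"))
```

This sketch only copies the extracted frequencies across, which corresponds to the simplest option of adding the information for the vocabulary concerned; it does not rebalance the existing links.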

For the frequency information on appearance, it is sufficient to add only the information for the vocabulary concerned. Alternatively, “air conditioner” and the other vocabularies in the first level of the network language dictionary 211, “CD” and “radio”, may all be given the same appearance probability. It is also possible to look up, in the statistical language dictionary 230, the appearance frequency information of the other first-level vocabularies of the network language dictionary 211, recalculate the balance ratio for connection and appearance among the three vocabularies “air conditioner”, “CD”, and “radio” and the unchanged vocabulary, and add the result.

Likewise, for the frequency information on connection, it is sufficient to add only the information for the vocabulary concerned, but the balance ratio may also be recalculated for the two connections from “air conditioner” to “switch on” and “turn on” and then added. For example, a probability of 1/4 may be assigned to each of the transitions to “switch on”, “turn on”, “ON”, and “OFF”.
Here, the balance ratio corresponds to the transition probability to each vocabulary that is a candidate for connection to a given vocabulary. For example, if before the change the only vocabularies connected to “air conditioner” are “ON” and “OFF”, the ratio is 1/2 for each. When the vocabulary “switch on” (入れる) is added to the network language dictionary 210, giving the network language dictionary 211 of the “language dictionary at time t”, the vocabularies that make the air conditioner perform the same operation are “switch on”, “turn on”, and “ON”, three in all, so the balance ratio in this case is 1/3.
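The arithmetic behind these balance ratios is small enough to check directly. The sketch below assumes the equal-share reading used in the examples above (1/2, 1/3, 1/4); the function names `uniform_balance` and `proportional_balance` are hypothetical, and the proportional variant is an added illustration of recalculating from observed counts, not something the text prescribes. Both functions assume a non-empty candidate set.

```python
from fractions import Fraction
from typing import Dict, Iterable


def uniform_balance(candidates: Iterable[str]) -> Dict[str, Fraction]:
    """Give every candidate connected to the same word an equal share."""
    words = list(candidates)
    return {w: Fraction(1, len(words)) for w in words}


def proportional_balance(counts: Dict[str, int]) -> Dict[str, Fraction]:
    """Alternative: rescale observed counts so that the shares sum to 1."""
    total = sum(counts.values())
    return {w: Fraction(c, total) for w, c in counts.items()}


before = uniform_balance(["ON", "OFF"])                      # 1/2 each
same_action = uniform_balance(["switch on", "turn on", "ON"])  # 1/3 each
all_four = uniform_balance(["switch on", "turn on", "ON", "OFF"])  # 1/4 each
```

Using `Fraction` keeps the shares exact, so a check that they sum to 1 does not suffer from floating-point rounding.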

Thus, taking three vocabularies, vocabulary A, vocabulary B, and vocabulary C, as an example, the balance ratio for the connection between two vocabularies concerns the following connection cases:
1. Vocabulary B and vocabulary C are at the same level, and one of them is connected after vocabulary A.
2. Vocabulary A and vocabulary C are at the same level, and one of them is connected before vocabulary B.
3. Vocabulary A and vocabulary C are at the same level, and one of them is connected after vocabulary B.
4. Vocabulary B and vocabulary C are at the same level, and one of them is connected before vocabulary A.
The balance ratio must be adjusted for each of these connections. A similar adjustment is needed for the balance ratio regarding the appearance of two of the three vocabularies.

(Fifth embodiment)
The process of adding vocabulary to the network language dictionary is described below using concrete vocabulary, with reference to FIG. 9.

The “language dictionary at time t-1” 410 in FIG. 9 is the network language dictionary before the change, and the “language dictionary at time t” 411 is the network language dictionary after the change, to which vocabulary has been added following speech recognition. When the vocabulary “air conditioner” (エアコン) 420, whose input frequency among the recognized vocabularies has exceeded the threshold, is input, it is sent to the statistical language dictionary 430.
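The threshold condition can be sketched as a simple per-word counter. This is an assumption-laden illustration: the class name `History`, its `record` method, and the threshold value are hypothetical, and "exceeds the threshold" is read here as a strict greater-than comparison.

```python
from collections import Counter


class History:
    """Hypothetical history store: counts how often each word was recognized."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.counts: Counter = Counter()

    def record(self, word: str) -> bool:
        """Count one recognition; return True once the count exceeds the threshold."""
        self.counts[word] += 1
        return self.counts[word] > self.threshold


history = History(threshold=2)
results = [history.record("air conditioner") for _ in range(3)]
# Only the third recognition would trigger the lookup in the statistical dictionary.
```

A word recognized for the first time is registered implicitly, because `Counter` starts every unseen key at zero.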

In the statistical language dictionary 430, two vocabularies, “switch on” (入れて) and “turn on” (つけて), are found as vocabularies likely to follow “air conditioner”. Both are extracted by the extraction means (150 in FIG. 4) and then added to the network language dictionary 411 by the adding means (140 in FIG. 4). In the network language dictionary 411 of the “language dictionary at time t”, the newly added “switch on” and “turn on” are placed one level below “air conditioner”. Frequency information on the connections from “air conditioner” to “switch on” and “turn on”, and frequency information on appearance, are added as well.

For the frequency information on appearance, it is sufficient to add only the information for the vocabulary concerned. Alternatively, “air conditioner” and the other vocabularies in the first level of the network language dictionary, “CD” and “radio”, may all be given the same appearance probability. It is also possible to look up, in the statistical language dictionary 430, the appearance frequency information of the other first-level vocabularies of the network language dictionary and recalculate the balance ratio of the three vocabularies “air conditioner”, “CD”, and “radio”.

For the frequency information on connection, it is sufficient to add only the information for the vocabulary concerned, but the balance ratio may also be recalculated for the two connections from “air conditioner” to “switch on” (入れて) and “turn on” (つけて). For example, a probability of 0.5 may be assigned to each of the transitions to “switch on”, “turn on”, “ON”, and “OFF”.

(Sixth embodiment)
The process of adding vocabulary to the network language dictionary is described concretely below with reference to FIG. 10. The “language dictionary at time t-1” in FIG. 10 is the network language dictionary 510 before the change, and the “language dictionary at time t” is the network language dictionary 511 after the change, to which vocabulary has been added following speech recognition. When the vocabulary “air conditioning” (空調) 520, predicted as the next utterance of the recognized vocabulary, is input, it is sent to the statistical language dictionary 530. FIG. 10 shows an example in which “ON” and “OFF” are found in the statistical language dictionary 530 as vocabularies likely to follow the input vocabulary “air conditioning”. Because “ON” and “OFF” are already contained in the “language dictionary at time t-1”, they are removed by the extraction means or the adding means, and only “air conditioning” is sent to the network language dictionary. The only vocabulary added here is therefore “air conditioning”, the vocabulary that was input and recognized. The frequency information on the connections from “air conditioning” to “ON” and “OFF”, and on the appearance of “air conditioning”, “ON”, and “OFF”, is nevertheless added. For the connection frequency information, either only the information for the vocabulary concerned may be added, or the balance ratio may be recalculated for the connections from “air conditioning” to “ON” and “OFF”.
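The point of FIG. 10 is that words already present are not added twice, while their connections to the new word still are. A minimal sketch, assuming a set of existing words and a list of candidate followers (the function name `merge_predicted` and the data layout are hypothetical):

```python
from typing import List, Set, Tuple


def merge_predicted(
    network_words: Set[str], predicted: str, followers: List[str]
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Return (words to add, connections to add) for a predicted word.

    Words already in the network dictionary are dropped from the word list,
    but every connection from the predicted word is kept.
    """
    new_words = [w for w in [predicted, *followers] if w not in network_words]
    new_links = [(predicted, f) for f in followers]
    return new_words, new_links


existing = {"air conditioner", "CD", "radio", "ON", "OFF"}
words, links = merge_predicted(existing, "air conditioning", ["ON", "OFF"])
```

Here only "air conditioning" ends up in the word list, while both connections to "ON" and "OFF" are kept for the frequency information.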

(Seventh embodiment)
A concrete example of the next utterance prediction in FIGS. 6 and 7 is described below.
A vocabulary B inferred from the recognized vocabulary A is predicted to be used by the user in the near future, and the vocabulary B and a vocabulary C that can be connected from the vocabulary B are added to the network language dictionary. The next utterance prediction means (170 in FIG. 6) is used to generate the vocabulary B from the vocabulary A. The next utterance prediction means 170 outputs, for example, i) synonyms of the vocabulary A, ii) antonyms of the vocabulary A, and iii) vocabulary associated with the vocabulary A.

Synonyms: for example, “air conditioner” (エアコン) → “air conditioning” (空調), “cooling” (冷房), “heating” (暖房), etc.
Antonyms: ON → OFF; “switch on” (入れる), “turn on” (付ける) → “turn off” (消す)
Associated words: Nissan → Nissan Prince, Nissan Satio, XX Nissan Sales, Skyline Museum, etc.
The next utterance prediction means 170 needs to hold, in advance, a data set in which the next utterances predicted for each input vocabulary are grouped.
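One way to picture such grouped data is a set of lookup tables keyed by relation type. The sketch below reuses the example words listed above; the table layout, the name `GROUPS`, and the function `predict_next` are hypothetical choices for illustration only.

```python
from typing import Dict, List

# Hypothetical grouped data held by the next utterance prediction means 170.
GROUPS: Dict[str, Dict[str, List[str]]] = {
    "synonym": {
        "air conditioner": ["air conditioning", "cooling", "heating"],
    },
    "antonym": {
        "ON": ["OFF"],
        "switch on": ["turn off"],
        "turn on": ["turn off"],
    },
    "association": {
        "Nissan": ["Nissan Prince", "Nissan Satio", "XX Nissan Sales", "Skyline Museum"],
    },
}


def predict_next(word: str) -> List[str]:
    """Collect synonyms, antonyms, and associated words for `word`, without duplicates."""
    predicted: List[str] = []
    for table in GROUPS.values():
        for candidate in table.get(word, []):
            if candidate not in predicted:
                predicted.append(candidate)
    return predicted
```

A word with no entry in any table simply yields an empty list, which corresponds to having no next-utterance group for that word.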

(Eighth embodiment)
In a speech recognition system whose purpose is device operation, changing the language dictionary dynamically requires setting links from the changed vocabulary, and from the connections of that vocabulary, to the corresponding device operations. Various approaches are conceivable; a few are outlined below.
(Method 1) Vocabulary extracted from the statistical language dictionary 630 is placed in the network language dictionary 610 in advance (FIG. 11).

Compared with the network language dictionary 610 designed by the designer (FIG. 11: state 1), the network language dictionary 611 has had vocabulary added from the statistical language dictionary at the time of factory shipment (FIG. 11: state 2). A large amount of vocabulary is added at this point, but the portions other than the vocabulary contained in the network language dictionary deliberately written by the designer (italic characters and broken lines) are insensitive, so they are not searched during speech recognition.
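Method 1 can be pictured as a per-entry flag that decides whether an entry takes part in the search. The sketch below is an illustration under assumptions: the `Entry` class, the `sensitive` field, and the functions `searchable` and `make_sensitive` are hypothetical names, and activating a word together with the words connected from it is one possible reading of making entries sensitive one after another.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Entry:
    sensitive: bool = False  # False: present in the dictionary but not searched
    next_words: List[str] = field(default_factory=list)


def searchable(dictionary: Dict[str, Entry]) -> List[str]:
    """Words that take part in recognition, in sorted order."""
    return sorted(w for w, e in dictionary.items() if e.sensitive)


def make_sensitive(dictionary: Dict[str, Entry], word: str) -> None:
    """Activate a word and the words connected from it."""
    dictionary[word].sensitive = True
    for follower in dictionary[word].next_words:
        dictionary[follower].sensitive = True


# State 2: designer-written entries are sensitive, pre-loaded ones are not.
dictionary = {
    "air conditioner": Entry(True, ["ON", "OFF"]),
    "ON": Entry(True),
    "OFF": Entry(True),
    "air conditioning": Entry(False, ["ON", "OFF"]),
    "cooling": Entry(False),
}
state2 = searchable(dictionary)
make_sensitive(dictionary, "air conditioning")  # towards state 3
state3 = searchable(dictionary)
```

Keeping the pre-loaded entries in place and flipping a flag avoids restructuring the dictionary at run time, at the cost of storing entries that are not yet used.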

The next utterance vocabulary input by the next utterance prediction means, and the vocabulary connected from it, are then made sensitive one after another (FIG. 11: state 3, “air conditioning” and the connections from “air conditioning” to “ON” and “OFF”).
(Method 2) Assign classes to the vocabulary in the statistical language dictionary
1. Assign part-of-speech information (noun, verb, adjective, particle, etc.) to the vocabulary in the statistical language dictionary.
2. For nouns and verbs directly related to operation, assign class information related to the operation task: number, input/output (ON/OFF).
For vocabulary connected from the input vocabulary, an operation task is assigned on the basis of the class information. As a rule, nouns and verbs for which no operation task exists, that is, which have no content relating to device operation, are not added. However, some vocabulary unrelated to operation is placed before the name of the target device, like “a little” (ちょっと) in the sequence “a little”, “radio”, “switch on” (「ちょっと」「ラジオ」「いれて」). In that case the recognition device may search for “a little radio” as a single vocabulary and force out an unrelated vocabulary. To avoid this, vocabulary such as “a little” (ちょっと) and “um” (えーっと) may be added.
(Method 3) Prompt the user for input
1. When vocabulary is added, display a dialog prompting the user to enter the operation information.
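The filtering rule of Method 2 can be sketched as a small predicate. All names and data here are hypothetical (`VOCAB`, `FILLERS`, `should_add`, the `task_class` labels), and treating parts of speech other than nouns and verbs as "not added unless listed as a filler" is an assumption, since the text only states the rule for nouns and verbs.

```python
from typing import Dict, Optional

# Hypothetical annotations: part of speech and operation-task class per word.
VOCAB: Dict[str, Dict[str, Optional[str]]] = {
    "radio": {"pos": "noun", "task_class": "device"},
    "switch on": {"pos": "verb", "task_class": "ON"},
    "weather": {"pos": "noun", "task_class": None},
    "a little": {"pos": "adverb", "task_class": None},
}

# Words unrelated to operation that may still be added to avoid misrecognition.
FILLERS = {"a little", "um"}


def should_add(word: str) -> bool:
    """Decide whether a word from the statistical dictionary is added."""
    if word in FILLERS:
        return True
    info = VOCAB.get(word)
    if info is None:
        return False
    # Nouns and verbs are added only when an operation task exists for them.
    return info["pos"] in ("noun", "verb") and info["task_class"] is not None
```

With these sample annotations, "radio" and "switch on" are added, "weather" is skipped because no operation task exists for it, and "a little" is added only because it is listed as a filler.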

The outline and embodiments of the present invention have been described briefly above. The examples given are intended solely to make the invention easier to understand and do not limit its scope. The embodiments can readily be implemented individually or in combination.

System diagram of the processing in the first embodiment.
Block diagram of the device configuration in the first embodiment.
Flowchart of the process of adding to the network language dictionary in the first embodiment.
System diagram of the processing in the second embodiment.
Flowchart of the process of adding to the network language dictionary in the second embodiment.
System diagram of the processing in the third embodiment.
Flowchart of the process of adding to the network language dictionary in the third embodiment.
Flow diagram of the processing using concrete vocabulary in the fourth embodiment.
Flow diagram of the processing using concrete vocabulary in the fifth embodiment.
Flow diagram of the processing using concrete vocabulary in the sixth embodiment.
Flow diagram of the processing using concrete vocabulary in the seventh embodiment.

Explanation of symbols

110: voice input means
120: voice recognition means
130: storage means
131: network language dictionary
132: statistical language dictionary
140: adding means
150: extraction means
160: history means
170: next utterance prediction means
210: language dictionary at time t-1
211: language dictionary at time t
220: recognized vocabulary at time t-1
230: statistical language dictionary
310: storage device
320: arithmetic device
330: AD converter
335: microphone
340: input device
410: language dictionary at time t-1
411: language dictionary at time t
420: recognized vocabulary at time t-1
430: statistical language dictionary
510: language dictionary at time t-1
511: language dictionary at time t
520: predicted next utterance vocabulary at time t-1
610: network language dictionary in state 1
611: network language dictionary in state 2
612: network language dictionary in state 3
620: predicted next utterance vocabulary at time t-1
630: statistical language dictionary

Claims

A speech recognition device used in a navigation system, comprising:
voice input means for inputting voice;
speech recognition means for recognizing the input voice and converting it into vocabulary;
storage means 1 for storing a network language dictionary;
storage means 2 for storing a statistical language dictionary;
extraction means for extracting, when an arbitrary vocabulary A included in the recognized vocabulary exists in the statistical language dictionary, one or more vocabularies B having high frequency information regarding connection to the vocabulary A in the statistical language dictionary; and
adding means for adding to the network language dictionary the vocabulary B, frequency information regarding the connection between the vocabulary B and either the vocabulary A or another vocabulary C included in the network language dictionary, and frequency information regarding the appearance of the vocabulary B.

A speech recognition device used in a navigation system, comprising:
voice input means for inputting voice;
speech recognition means for recognizing the input voice and converting it into vocabulary;
storage means 1 for storing a network language dictionary;
storage means 2 for storing a statistical language dictionary;
extraction means for extracting, when an arbitrary vocabulary A included in the recognized vocabulary exists in the statistical language dictionary, one or more vocabularies B having high frequency information regarding connection to the vocabulary A in the statistical language dictionary and high frequency information regarding appearance; and
adding means for adding to the network language dictionary the vocabulary B, frequency information regarding the connection between the vocabulary B and either the vocabulary A or another vocabulary C included in the network language dictionary, and frequency information regarding the appearance of the vocabulary B.

A speech recognition device used in a navigation system, comprising:
voice input means for inputting voice;
speech recognition means for recognizing the input voice and converting it into vocabulary;
storage means 1 for storing a network language dictionary;
storage means 2 for storing a statistical language dictionary;
history means for storing a history of the recognized vocabulary for each kind of vocabulary, newly registering in the history a vocabulary recognized for the first time, and, when the vocabulary is the same as one already registered, adding to the number of times that vocabulary has been input;
extraction means for extracting, when an arbitrary vocabulary A included in the recognized vocabulary exists in the statistical language dictionary and its input count exceeds a preset threshold, one or more vocabularies B having high frequency information regarding connection to the vocabulary A; and
adding means for adding to the network language dictionary the vocabulary B, frequency information regarding the connection between the vocabulary B and either the vocabulary A or another vocabulary C included in the network language dictionary, and frequency information regarding the appearance of the vocabulary B.

A speech recognition method used in a navigation system, comprising:
inputting voice into voice input means;
recognizing the input voice by speech recognition means and converting it into vocabulary;
setting up a network language dictionary and a statistical language dictionary in storage means;
storing, by history means, a history of the recognized vocabulary for each kind of vocabulary or vocabulary group, and counting the input frequency of the vocabulary;
extracting, by extraction means, when an arbitrary vocabulary A included in the recognized vocabulary exists in the statistical language dictionary and its input frequency exceeds a preset threshold, one or more vocabularies B having high frequency information regarding connection and high frequency information regarding appearance; and
adding, by adding means, to the network language dictionary the vocabulary B, frequency information regarding the connection between the vocabulary B and either the vocabulary A or another vocabulary C included in the network language dictionary, and frequency information regarding the appearance of the vocabulary B.

A speech recognition device used in a navigation system, comprising:
voice input means for inputting voice;
speech recognition means for recognizing the input voice and converting it into vocabulary;
storage means 1 for storing a network language dictionary;
storage means 2 for storing a statistical language dictionary;
next utterance prediction means for predicting the user's next utterance;
extraction means for extracting, when one or more vocabularies A predicted as the next utterance of the recognized vocabulary exist in the statistical language dictionary, one or more vocabularies B having high frequency information regarding connection to the vocabulary A; and
adding means for adding to the network language dictionary the vocabulary B, frequency information regarding the connection between the vocabulary B and either the vocabulary A or another vocabulary C included in the network language dictionary, and frequency information regarding the appearance of the vocabulary B.

A speech recognition method used in a navigation system, comprising:
inputting voice by voice input means;
recognizing the input voice by speech recognition means and converting it into vocabulary;
storing a network language dictionary in storage means 1;
storing a statistical language dictionary in storage means 2;
predicting the user's next utterance by next utterance prediction means;
extracting, by extraction means, when one or more vocabularies A predicted as the next utterance of the recognized vocabulary exist in the statistical language dictionary and the input frequency exceeds a preset threshold, one or more vocabularies B having high frequency information regarding connection and high frequency information regarding appearance; and
adding, by adding means, to the network language dictionary the vocabulary B, frequency information regarding the connection between the vocabulary B and either the vocabulary A or another vocabulary C included in the network language dictionary, and frequency information regarding the appearance of the vocabulary B.

A speech recognition device used in a vehicle, comprising:
voice input means for inputting voice;
speech recognition means for recognizing the input voice and converting it into vocabulary;
storage means 1 for storing a network language dictionary;
storage means 2 for storing a statistical language dictionary;
extraction means for extracting, when an arbitrary vocabulary A included in the recognized vocabulary exists in the statistical language dictionary, one or more vocabularies B having high frequency information regarding connection to the vocabulary A in the statistical language dictionary; and
adding means for adding to the network language dictionary the vocabulary B, frequency information regarding the connection between the vocabulary B and either the vocabulary A or another vocabulary C included in the network language dictionary, and frequency information regarding the appearance of the vocabulary B.

A speech recognition device used in a vehicle, comprising:
voice input means for inputting voice;
speech recognition means for recognizing the input voice and converting it into vocabulary;
storage means 1 for storing a network language dictionary;
storage means 2 for storing a statistical language dictionary;
extraction means for extracting, when an arbitrary vocabulary A included in the recognized vocabulary exists in the statistical language dictionary, one or more vocabularies B having high frequency information regarding connection to the vocabulary A in the statistical language dictionary and high frequency information regarding appearance; and
adding means for adding to the network language dictionary the vocabulary B, frequency information regarding the connection between the vocabulary B and either the vocabulary A or another vocabulary C included in the network language dictionary, and frequency information regarding the appearance of the vocabulary B.

A speech recognition device used in a vehicle, comprising:
voice input means for inputting voice;
speech recognition means for recognizing the input voice and converting it into vocabulary;
storage means 1 for storing a network language dictionary;
storage means 2 for storing a statistical language dictionary;
history means for storing a history of the recognized vocabulary for each kind of vocabulary, newly registering in the history a vocabulary recognized for the first time, and, when the vocabulary is the same as one already registered, adding to the number of times that vocabulary has been input;
extraction means for extracting, when an arbitrary vocabulary A included in the recognized vocabulary exists in the statistical language dictionary and its input count exceeds a preset threshold, one or more vocabularies B having high frequency information regarding connection to the vocabulary A; and
adding means for adding to the network language dictionary the vocabulary B, frequency information regarding the connection between the vocabulary B and either the vocabulary A or another vocabulary C included in the network language dictionary, and frequency information regarding the appearance of the vocabulary B.

In a method for creating a dictionary for a speech recognition device used in a vehicle, a speech recognition method comprising:
inputting voice into voice input means;
recognizing the input voice by speech recognition means and converting it into vocabulary;
setting up a network language dictionary and a statistical language dictionary in storage means;
storing, by history means, a history of the recognized vocabulary for each kind of vocabulary or vocabulary group, and counting the input frequency of the vocabulary;
extracting, by extraction means, when an arbitrary vocabulary A included in the recognized vocabulary exists in the statistical language dictionary and its input frequency exceeds a preset threshold, one or more vocabularies B having high frequency information regarding connection and high frequency information regarding appearance; and
adding, by adding means, to the network language dictionary the vocabulary B, frequency information regarding the connection between the vocabulary B and either the vocabulary A or another vocabulary C included in the network language dictionary, and frequency information regarding the appearance of the vocabulary B.

A speech recognition device used in a vehicle, comprising:
voice input means for inputting voice;
speech recognition means for recognizing the input voice and converting it into vocabulary;
storage means 1 for storing a network language dictionary;
storage means 2 for storing a statistical language dictionary;
next utterance prediction means for predicting the user's next utterance;
extraction means for extracting, when one or more vocabularies A predicted as the next utterance of the recognized vocabulary exist in the statistical language dictionary, one or more vocabularies B having high frequency information regarding connection to the vocabulary A; and
adding means for adding to the network language dictionary the vocabulary B, frequency information regarding the connection between the vocabulary B and either the vocabulary A or another vocabulary C included in the network language dictionary, and frequency information regarding the appearance of the vocabulary B.

A speech recognition method used in a vehicle, characterized in that:
Voice is input by voice input means,
The input speech is recognized by speech recognition means and converted into vocabulary,
A network language dictionary is stored in storage means 1,
A statistical language dictionary is stored in storage means 2,
The user's next utterance is predicted by next utterance prediction means,
When one or more vocabulary A predicted as the next utterance after the recognized vocabulary exist in the statistical language dictionary and the input frequency exceeds a preset threshold, one or more vocabulary B having a large amount of frequency information on connection and a large amount of frequency information on appearance are extracted by extraction means, and
The vocabulary B, frequency information relating to the connection of the vocabulary A to the vocabulary B or to another vocabulary C included in the network language dictionary, and frequency information relating to the appearance of the vocabulary B are added to the network language dictionary by adding means.
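
The prediction-driven variant in the two claims above can be sketched as follows. The claims do not state how the next utterance is predicted, so the predictor here (the most frequent successors of the recognized word) is purely an assumption, as are the bigram table, the function names, and the cutoff `min_conn`.

```python
# Hypothetical connection counts standing in for the statistical language
# dictionary (all words and numbers are invented for illustration).
CONNECTION = {
    ("go", "home"): 30,
    ("go", "to"): 25,
    ("go", "back"): 2,
    ("to", "station"): 18,
    ("to", "airport"): 12,
    ("to", "work"): 3,
    ("home", "now"): 9,
}


def successors(word, min_conn):
    """Words following `word` at least `min_conn` times, most frequent first."""
    found = [(b, c) for (a, b), c in CONNECTION.items() if a == word and c >= min_conn]
    return sorted(found, key=lambda item: -item[1])


def predict_next(recognized, top_n=2):
    """Assumed predictor: the top_n most frequent successors of the recognized word."""
    return [word for word, _ in successors(recognized, min_conn=1)[:top_n]]


def candidates_for_next_utterance(recognized, min_conn=5):
    """For each predicted vocabulary A, list vocabulary B strongly connected to A."""
    return {a: successors(a, min_conn) for a in predict_next(recognized)}
```

In this sketch the candidates are gathered before the user speaks the predicted word, which is the point of predicting the next utterance: the network dictionary can be extended ahead of the input it is meant to recognize.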

The speech recognition device according to any one of claims 1 to 3, claim 5, claim 7 to claim 9 or claim 11, characterized by comprising:
Extraction means for extracting frequency information relating to the connection between the vocabulary A and the vocabulary B from the statistical language dictionary; and
Adding means for adjusting the balance ratio between the frequency information relating to the connection between the vocabulary A and the vocabulary B and the frequency information relating to the connection between two of the vocabulary C, the vocabulary A, and the vocabulary B included in the network language dictionary, and for adding the vocabulary B, the frequency information relating to the connection between the vocabulary A and the vocabulary B, and frequency information relating to the appearance of the vocabulary B.

The speech recognition device according to any one of claims 1 to 3, claim 5, claim 7 to claim 9 or claim 11, characterized by comprising:
Extraction means for extracting frequency information relating to the appearance of the vocabulary A and the vocabulary B from the statistical language dictionary; and
Adding means for adjusting the balance ratio between the frequency information relating to the connection between the vocabulary A and the vocabulary B and the frequency information relating to the connection between two of the vocabulary C, the vocabulary A, and the vocabulary B included in the network language dictionary, and for adding the vocabulary B, the frequency information relating to the connection between the vocabulary A and the vocabulary B, and frequency information relating to the appearance of the vocabulary B.
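
The claims do not give a formula for the balance adjustment. One possible reading, sketched below under stated assumptions, is that the connection count taken from the statistical language dictionary is converted to a relative frequency, weighted by a balance factor, and the connections already leaving the vocabulary A in the network language dictionary are rescaled so that all of A's outgoing weights sum to 1. The function name, the parameter `balance`, and the normalization scheme are hypothetical.

```python
def add_with_balance(network_conn, word_a, word_b, stat_conn, stat_total, balance=0.5):
    """Merge one statistical-dictionary connection into the network dictionary.

    network_conn: {(from_word, to_word): weight} already in the network dictionary
    stat_conn:    count of (word_a, word_b) in the statistical dictionary
    stat_total:   total count of connections leaving word_a in that dictionary
    balance:      assumed factor weighting the imported relative frequency
    Returns a new dict; the input dict is left unchanged.
    """
    if stat_total <= 0 or not 0.0 <= balance <= 1.0:
        raise ValueError("stat_total must be positive and balance within [0, 1]")
    new_weight = balance * stat_conn / stat_total

    # Connections leaving word_a that the network dictionary already holds.
    outgoing = {k: v for k, v in network_conn.items() if k[0] == word_a}
    total = sum(outgoing.values())

    merged = dict(network_conn)
    if total > 0:
        # Rescale existing connections so they share the remaining weight.
        for key, weight in outgoing.items():
            merged[key] = (1.0 - new_weight) * weight / total
    merged[(word_a, word_b)] = new_weight
    return merged
```

For example, with existing weights 3 and 1 on two connections leaving "go", importing a pair seen 25 times out of 50 with `balance=0.4` gives the new connection a weight of about 0.2 and rescales the existing ones to about 0.6 and 0.2.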

A speech recognition apparatus characterized in that the adding means in the speech recognition apparatus according to any one of claims 1 to 3, 5, 7 to 9, 11 or 14 has a function of enabling, when a new vocabulary added to the network language dictionary is input by the user, a device operation linked to the input vocabulary.

A speech recognition apparatus characterized in that the adding means in the speech recognition apparatus according to any one of claims 1 to 3, 5, 7 to 9, 11, 14 or 15 has a function of not executing the processing for adding a candidate vocabulary to the network language dictionary when the candidate vocabulary to be added to the network language dictionary has no content related to device operation.

A speech recognition apparatus characterized in that the extraction means in the speech recognition apparatus according to any one of claims 1 to 3, claim 5, claim 7 to claim 9, claim 11, claim 14 or claim 15 has a function of deleting a candidate vocabulary from the network language dictionary when the candidate vocabulary to be added to the network language dictionary has no content related to device operation.
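
The last three claims tie dictionary expansion to device operation: added vocabulary is linked to an operation, and candidates unrelated to any operation are not added. A minimal sketch of that filtering follows; the table `OPERATIONS`, its operation names, and both function names are hypothetical, and dropping a candidate before it is added is only one reading of the deletion function.

```python
# Hypothetical mapping from vocabulary to device operations (names invented).
OPERATIONS = {
    "station": "navi.set_destination",
    "radio": "audio.select_source",
}


def filter_operable(candidates):
    """Keep only candidate vocabulary that is tied to some device operation."""
    return [word for word in candidates if word in OPERATIONS]


def add_operable(network_vocab, candidates):
    """Add operable candidates to the network vocabulary (a set, updated in
    place) and return the word -> operation links created for them."""
    links = {}
    for word in filter_operable(candidates):
        if word not in network_vocab:
            network_vocab.add(word)
            links[word] = OPERATIONS[word]
    return links
```

Calling `add_operable({"go"}, ["station", "weather", "radio"])` links "station" and "radio" to their operations and skips "weather", which has no associated operation and therefore would only consume dictionary capacity.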