JP2003271180A

JP2003271180A - Voice processor, voice processing method, program and recording medium

Info

Publication number: JP2003271180A
Application number: JP2002072718A
Authority: JP
Inventors: Atsuo Hiroe; 厚夫廣江
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2002-03-15
Filing date: 2002-03-15
Publication date: 2003-09-25
Anticipated expiration: 2022-03-15
Also published as: JP4048473B2

Abstract

PROBLEM TO BE SOLVED: To enable a user to register an unknown word during normal, continuous conversation with an interactive system, without having to be aware of a registration mode.

SOLUTION: A feature extraction unit 43 extracts feature parameters from voice input captured by a microphone 41 and passed through an AD conversion unit 42. A phoneme typewriter 45 outputs a phoneme sequence based on the feature parameters. A matching unit 44 refers to a dictionary database 52 to generate a word string based on the feature parameters and estimates unknown words. When the word string matches a prescribed template, a dialogue control unit 3 registers, in an associative storage unit 2, the combination of the unknown word contained in the word string and a category estimated from the word string. The present invention can be applied to an interactive system that carries on conversation based on voice recognition processing.

COPYRIGHT: (C)2003, JPO

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a voice processing device, a voice processing method, a program, and a recording medium, and more particularly to a voice processing device, voice processing method, program, and recording medium that make it possible, while continuously input voice signals are being recognized, to extract an unknown word contained in the input voice signal and register it easily.

[0002]

[Description of the Related Art] In dialogue systems, situations in which some name is registered by voice arise frequently. For example, a user may register his or her own name, give the dialogue system a name, or enter a place name or a shop name.

[0003] Conventionally, a simple way of realizing such voice registration is to shift to a registration mode by means of some command and to return to the normal dialogue mode once registration is complete. In this case, for example, the voice command "user name registration" shifts the system to the registration mode; when the user then utters a name, that name is registered, after which the system returns to the normal mode.

[0004]

[Problems to Be Solved by the Invention] However, such a voice registration method requires mode switching by command, which is unnatural as dialogue and bothersome for the user. Furthermore, when there are multiple objects to be named, the number of commands increases, making the method even more bothersome.

[0005] Moreover, if the user speaks a word other than a name (for example, "hello") during the registration mode, that word is also registered as a name. Likewise, if the user adds words other than the name, saying, for example, "My name is Taro." rather than just "Taro", the entire utterance ("My name is Taro.") is registered as the name.

[0006] The present invention has been made in view of such circumstances, and its object is to make it possible to register words during normal dialogue without making the user aware of a registration mode.

[0007]

[Means for Solving the Problems] A voice processing device of the present invention comprises: recognition means for recognizing continuous input voice; unknown word determination means for determining whether an unknown word is included in the recognition result produced by the recognition means; acquisition means for acquiring a word corresponding to the unknown word when the unknown word determination means determines that an unknown word is included; and registration means for registering the word acquired by the acquisition means in association with other information.

[0008] The device may further comprise pattern determination means for determining whether the recognition result matches a specific pattern, and the registration means may register the word when the pattern determination means determines that the recognition result matches the specific pattern.

[0009] The device may further comprise response generation means for generating a response to the input voice when the unknown word determination means determines that no unknown word is included, or when the pattern determination means determines that the recognition result does not match the specific pattern.

[0010] The registration means may register a category of the word as the other information.

[0011] The registration means may register the other information based on the pattern that the pattern determination means has determined to match.

[0012] The acquisition means may acquire the word by clustering unknown words.

[0013] The device may further comprise comparison means for comparing, for a given section of the input voice, the acoustic score obtained by matching against known words with the acoustic score obtained by recognition with a phoneme typewriter, and the comparison means may estimate that the section is an unknown word when the acoustic score obtained with the phoneme typewriter is better.

[0014] The comparison means may perform the comparison against the acoustic score obtained by matching with known words after applying a correction to the acoustic score obtained by recognition with the phoneme typewriter.

[0015] A voice processing method of the present invention includes: a recognition step of recognizing continuous input voice; a determination step of determining whether an unknown word is included in the recognition result produced by the processing of the recognition step; an acquisition step of acquiring a word corresponding to the unknown word when the processing of the determination step determines that an unknown word is included; and a registration step of registering the word acquired by the processing of the acquisition step in association with other information.

[0016] A program recorded on a recording medium of the present invention includes: a recognition step of recognizing continuous input voice; a determination step of determining whether an unknown word is included in the recognition result produced by the processing of the recognition step; an acquisition step of acquiring a word corresponding to the unknown word when the processing of the determination step determines that an unknown word is included; and a registration step of registering the word acquired by the processing of the acquisition step in association with other information.

[0017] A program of the present invention causes a computer to execute: a recognition step of recognizing continuous input voice; a determination step of determining whether an unknown word is included in the recognition result produced by the processing of the recognition step; an acquisition step of acquiring a word corresponding to the unknown word when the processing of the determination step determines that an unknown word is included; and a registration step of registering the word acquired by the processing of the acquisition step in association with other information.

[0018] In the present invention, continuous input voice is recognized, and when the recognition result includes an unknown word, a word corresponding to the unknown word is acquired and registered in association with other information.

[0019]

[Embodiments of the Invention] Embodiments of the present invention will be described below with reference to the drawings. FIG. 1 shows a configuration example of one embodiment of a dialogue system to which the present invention is applied.

[0020] This dialogue system carries on a spoken dialogue with a user (a human); for example, when voice is input, a name is extracted from the voice and registered.

[0021] That is, a voice signal based on the user's utterance is input to the voice recognition unit 1, which recognizes the input voice signal and outputs the text resulting from the recognition, together with other accompanying information, to the dialogue control unit 3 and the word acquisition unit 4 as necessary.

[0022] For a word that is not registered in the recognition dictionary of the voice recognition unit 1, the word acquisition unit 4 automatically stores its acoustic features so that the voice of that word can be recognized from then on.

[0023] That is, the word acquisition unit 4 obtains the pronunciation corresponding to the input voice with a phoneme typewriter and classifies it into one of several clusters. Each cluster has an ID and a representative phoneme sequence and is managed by its ID. The state of the clusters at this point will be described with reference to FIG. 2.

[0024] For example, suppose there have been three voice inputs: "aka" (red), "ao" (blue), and "midori" (green). In this case, the word acquisition unit 4 classifies the three inputs into three corresponding clusters, the "aka" cluster 21, the "ao" cluster 22, and the "midori" cluster 23, and attaches to each cluster a representative phoneme sequence ("a/k/a", "a/o", and "m/i/d/o/r/i" in the example of FIG. 2) and an ID ("1", "2", and "3" in the example of FIG. 2).

[0025] If the voice "aka" is input again, a corresponding cluster already exists, so the word acquisition unit 4 classifies the input into the "aka" cluster 21 and does not generate a new cluster. In contrast, when the voice "kuro" (black) is input, no corresponding cluster exists, so the word acquisition unit 4 newly generates a cluster 24 corresponding to "kuro" and attaches to it a representative phoneme sequence ("k/u/r/o" in the example of FIG. 2) and an ID ("4" in the example of FIG. 2).

[0026] Therefore, whether an input voice is a not-yet-acquired word can be determined by whether a new cluster has been generated. Details of such word acquisition processing are disclosed in Japanese Patent Application No. 2001-97843, previously proposed by the present applicant.
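The cluster bookkeeping described in paragraphs [0023] to [0026] can be sketched roughly as follows. This is only an illustration: the class name `WordAcquirer`, the edit-distance comparison, and the `max_distance` threshold are assumptions made here. The actual clustering method is the one disclosed in Japanese Patent Application No. 2001-97843 and may well differ.

```python
from dataclasses import dataclass, field


def edit_distance(a, b):
    """Levenshtein distance between two phoneme lists."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


@dataclass
class WordAcquirer:
    max_distance: int = 0  # hypothetical threshold; 0 means exact match only
    clusters: dict = field(default_factory=dict)  # cluster ID -> representative phonemes

    def acquire(self, phoneme_sequence):
        """Return (cluster_id, is_new) for a sequence such as "a/k/a"."""
        phonemes = phoneme_sequence.split("/")
        for cluster_id, representative in self.clusters.items():
            if edit_distance(phonemes, representative) <= self.max_distance:
                return cluster_id, False  # an existing cluster matches
        cluster_id = len(self.clusters) + 1
        self.clusters[cluster_id] = phonemes  # first member becomes the representative
        return cluster_id, True


acquirer = WordAcquirer()
inputs = ["a/k/a", "a/o", "m/i/d/o/r/i", "a/k/a", "k/u/r/o"]
results = [acquirer.acquire(s) for s in inputs]
```

The `is_new` flag plays the role described in paragraph [0026]: an input is a not-yet-acquired word exactly when a new cluster had to be created for it.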

[0027] The associative storage unit 2 stores information such as the category of each registered name (unknown word), for example whether it is a user name or a character name. In the example of FIG. 3, cluster IDs and category names are stored in association with each other: cluster IDs "1", "3", and "4" are associated with the "user name" category, and cluster ID "2" is associated with the "character name" category.
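As a minimal sketch of the association in FIG. 3, the mapping from cluster ID to category could be held in a dictionary. The class and method names below are hypothetical and chosen only for illustration; the source does not specify how the associative storage unit 2 is implemented.

```python
class AssociativeStore:
    """Maps cluster IDs to category names, as in the FIG. 3 example."""

    def __init__(self):
        self._category_by_cluster = {}

    def register(self, cluster_id, category):
        self._category_by_cluster[cluster_id] = category

    def category_of(self, cluster_id):
        # Returns None for a cluster that has not been registered.
        return self._category_by_cluster.get(cluster_id)

    def clusters_in(self, category):
        return sorted(
            cid for cid, cat in self._category_by_cluster.items() if cat == category
        )


store = AssociativeStore()
for cluster_id, category in [
    (1, "user name"),
    (2, "character name"),
    (3, "user name"),
    (4, "user name"),
]:
    store.register(cluster_id, category)
```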

[0028] The dialogue control unit 3 understands the content of the user's utterance from the output of the voice recognition unit 1 and controls registration of names (unknown words) based on that understanding. The dialogue control unit 3 also controls subsequent dialogue, based on the information on registered names stored in the associative storage unit 2, so that registered names can be recognized.

[0029] FIG. 4 shows a configuration example of the voice recognition unit 1.

[0030] The user's utterance is input to a microphone 41, which converts it into a voice signal in the form of an electric signal. This voice signal is supplied to an AD (analog-to-digital) conversion unit 42. The AD conversion unit 42 samples and quantizes the analog voice signal from the microphone 41, converting it into voice data, which is a digital signal. This voice data is supplied to a feature extraction unit 43.

[0031] For the voice data from the AD conversion unit 42, the feature extraction unit 43 extracts feature parameters such as the spectrum, power, linear prediction coefficients, cepstrum coefficients, and line spectrum pairs for each appropriate frame, and supplies them to a matching unit 44 and a phoneme typewriter unit 45.
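To make "for each appropriate frame" concrete, the following sketch computes just one of the listed features, the power, frame by frame. The frame length and frame shift are illustrative assumptions (25 ms and 10 ms at a 16 kHz sampling rate), as is the function name; the source does not give these values.

```python
import math


def frame_log_power(samples, frame_length=400, frame_shift=160):
    """Split samples into overlapping frames and return the log power of each.

    frame_length and frame_shift are illustrative values, not taken from
    the source.
    """
    powers = []
    for start in range(0, len(samples) - frame_length + 1, frame_shift):
        frame = samples[start:start + frame_length]
        mean_square = sum(x * x for x in frame) / frame_length
        powers.append(math.log(mean_square + 1e-10))  # small floor avoids log(0)
    return powers


# One second of a 440 Hz tone sampled at 16 kHz, as stand-in voice data.
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
features = frame_log_power(tone)
```

A real front end would compute further parameters per frame (for example cepstrum coefficients), but the framing loop would have the same shape.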

[0032] Based on the feature parameters from the feature extraction unit 43, the matching unit 44 finds the word string closest to the voice input to the microphone 41 (the input voice), referring as necessary to an acoustic model database 51, a dictionary database 52, and a language model database 53.

[0033] The acoustic model database 51 stores acoustic models representing the acoustic features of individual phonemes, syllables, and the like in the language of the voice to be recognized. An HMM (hidden Markov model), for example, can be used as the acoustic model. The dictionary database 52 stores a word dictionary describing pronunciation information for each word (phrase) to be recognized, and models describing the chaining relationships of phonemes and syllables.

[0034] A "word" here is a unit that is convenient to treat as a single item in recognition processing, and does not necessarily coincide with a word in the linguistic sense. For example, "Taro-kun" may be treated as one word as a whole, or as the two words "Taro" and "kun". An even larger unit such as "konnichiwa Taro-kun" ("hello, Taro") may also be treated as one word.

[0035] Similarly, a "phoneme" here is a unit that is convenient for processing when treated acoustically as a single unit, and does not necessarily coincide with a phoneme in the phonetic sense. For example, the "tou" part of "Tokyo" may be represented by the three phoneme symbols "t/o/u", or as "t/o:" using the symbol "o:", which denotes a long "o". It may also be represented as "t/o/o". In addition, a symbol representing silence may be provided, and silence may be further subdivided into, for example, "silence before an utterance", "a short silent section between utterances", "silence after an utterance", and "the silence of the geminate marker (small tsu)", with a symbol provided for each.

[0036] The language model database 53 describes information on how the words registered in the word dictionary of the dictionary database 52 chain (connect) together.

[0037] The phoneme typewriter unit 45 obtains a phoneme sequence corresponding to the input voice based on the feature parameters supplied from the feature extraction unit 43. For example, from the utterance "watashi no namae wa Taro desu." ("My name is Taro."), it obtains the phoneme sequence "w/a/t/a/sh/i/n/o/n/a/m/a/e/w/a/t/a/r/o:/d/e/s/u". An existing phoneme typewriter can be used for this purpose.

[0038] Anything other than a phoneme typewriter can be used instead, provided that it can obtain a phoneme sequence for arbitrary voice. For example, it is also possible to use voice recognition whose units are Japanese syllables (a, i, u, ..., ka, ki, ..., n), or voice recognition whose units are subwords, which are larger than phonemes and smaller than words.

[0039] A control unit 46 controls the operation of the AD conversion unit 42, the feature extraction unit 43, the matching unit 44, and the phoneme typewriter unit 45.

[0040] Next, the processing of the dialogue system of the present invention will be described with reference to the flowchart of FIG. 5.

[0041] In step S21, when the user inputs voice to the microphone 41, the microphone 41 converts the utterance into a voice signal in the form of an electric signal. Then, in step S22, the voice recognition unit 1 executes voice recognition processing.

[0042] Details of the voice recognition processing will be described with reference to FIG. 6. In step S41, the voice signal generated by the microphone 41 is converted by the AD conversion unit 42 into voice data, which is a digital signal, and supplied to the feature extraction unit 43.

[0043] In step S42, the feature extraction unit 43 receives the voice data from the AD conversion unit 42. The feature extraction unit 43 then proceeds to step S43, where it extracts feature parameters such as the spectrum, power, and their amounts of change over time for each appropriate frame, and supplies them to the matching unit 44.

[0044] In step S44, the matching unit 44 concatenates some of the word models stored in the dictionary database 52 to generate word strings. The words making up these word strings include not only known words registered in the dictionary database 52 but also "<OOV>", a symbol representing an unknown word that is not registered. This word string generation processing will be described in detail with reference to FIG. 7.

[0045] In step S61, the matching unit 44 computes acoustic scores in two ways for a given section of the input voice. That is, it computes the acoustic score resulting from matching the section against known words registered in the dictionary database 52, and the acoustic score of the result obtained by the phoneme typewriter unit 45 (in this case, part of "w/a/t/a/sh/i/n/o/n/a/m/a/e/w/a/t/a/r/o:/d/e/s/u"). An acoustic score expresses how close, as sound, a word string that is a candidate recognition result is to the input voice.

[0046] Next, the acoustic score resulting from matching a section of the input voice against known words registered in the dictionary database 52 is compared with the acoustic score of the result from the phoneme typewriter unit 45. However, matching against known words is performed in units of words, whereas matching in the phoneme typewriter unit 45 is performed in units of phonemes; because the scales differ, the scores are difficult to compare as they stand (in general, the acoustic score in units of phonemes takes the larger value). Therefore, to bring the scales into line so that the scores can be compared, the matching unit 44 applies a correction to the acoustic score of the result obtained by the phoneme typewriter unit 45 in step S62.

[0047] For example, the acoustic score from the phoneme typewriter unit 45 is multiplied by a coefficient, or a constant value or a value proportional to the frame length is subtracted from it. Since this processing is relative, it can of course be applied instead to the acoustic score resulting from matching against known words. Details of this processing are disclosed, for example, in "OOV-Detection in Large Vocabulary System Using Automatically Defined Word-Fragments as Fillers", EUROSPEECH99, Volume 1, pages 49-52.
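The correction of step S62 can be written as a one-line formula covering the three operations the text names (multiplying by a coefficient, subtracting a constant, and subtracting a value proportional to the frame length). The sketch below is an assumption-laden illustration: the function name, parameter names, and the numeric values are all hypothetical, and the source does not say which combination of corrections is actually used or how the values are chosen.

```python
def correct_typewriter_score(score, num_frames, scale=1.0, offset=0.0,
                             per_frame_penalty=0.0):
    """Apply the three corrections named in the text to a phoneme typewriter score.

    scale multiplies the score, offset is a constant subtracted from it, and
    per_frame_penalty is subtracted once per frame in the section. All three
    default to "no correction"; real values would have to be tuned.
    """
    return score * scale - offset - per_frame_penalty * num_frames


raw = -120.0  # hypothetical log-likelihood for a 30-frame section
corrected = correct_typewriter_score(raw, num_frames=30, per_frame_penalty=0.5)
```

Because the phoneme-unit score tends to be the larger one, each correction here lowers it (for a log-domain score), which is what lets it be compared fairly with the word-unit score.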

[0048] In step S63, the matching unit 44 compares the two acoustic scores (that is, it determines whether the acoustic score of the result recognized by the phoneme typewriter unit 45 is higher, i.e., better). If the acoustic score of the result recognized by the phoneme typewriter unit 45 is higher, the process proceeds to step S64, and the matching unit 44 estimates that the section is <OOV> (out of vocabulary), that is, an unknown word.

[0049] If it is determined in step S63 that the acoustic score of the result recognized by the phoneme typewriter unit 45 is lower than the acoustic score resulting from matching against known words, the process proceeds to step S66, and the matching unit 44 estimates that the section is a known word.

[0050] That is, for example, for the section corresponding to "tarou", the acoustic score of "t/a/r/o:" output by the phoneme typewriter unit 45 is compared with the acoustic score obtained by matching against known words. If the acoustic score of "t/a/r/o:" is higher, "<OOV>(t/a/r/o:)" is output as the word corresponding to that voice section; if the acoustic score of the known word is higher, that known word is output as the word corresponding to the voice section.

[0051] In step S65, n word strings (each a concatenation of several word models) that are expected to have high acoustic scores are generated preferentially.
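Steps S63 to S66 of FIG. 7 amount to a per-section choice between two hypotheses. A minimal sketch follows, assuming that the typewriter score has already been corrected, that higher scores are better, and that ties go to the known word (the source does not say how ties are handled). The function name, the competing known word "taru", and all score values are hypothetical.

```python
def label_section(known_word, known_score, phonemes, typewriter_score):
    """Pick the better-scoring hypothesis for one section of the input voice.

    typewriter_score is assumed to be already corrected (step S62).
    """
    if typewriter_score > known_score:
        return f"<OOV>({phonemes})"  # step S64: treat the section as unknown
    return known_word  # step S66: treat the section as a known word


# Hypothetical scores: the typewriter wins for "tarou", the dictionary for "watashi".
unknown = label_section("taru", -140.0, "t/a/r/o:", -135.0)
known = label_section("watashi", -80.0, "w/a/t/a/sh/i", -95.0)
```

The returned labels use the same notation as paragraph [0050], so an unknown section carries its phoneme sequence along with the <OOV> symbol.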

[0052] Returning to FIG. 6, in step S45 the phoneme typewriter unit 45, independently of the processing of step S44, performs recognition in units of phonemes on the feature parameters extracted in step S43 and outputs a phoneme sequence. For example, when the utterance "watashi no namae wa Taro desu." ("My name is Taro."), in which "Taro" is an unknown word, is input, the phoneme typewriter unit 45 outputs the phoneme sequence "w/a/t/a/sh/i/n/o/n/a/m/a/e/w/a/t/a/r/o:/d/e/s/u".

[0053] In step S46, the matching unit 44 computes an acoustic score for each word string generated in step S44. For word strings that do not contain <OOV> (an unknown word), an existing method is used, namely computing the likelihood by inputting the voice feature parameters to each word string (a concatenation of word models). For word strings that do contain <OOV>, on the other hand, the existing method cannot determine the acoustic score of the voice section corresponding to <OOV>, because no word model corresponding to <OOV> exists in advance. For that voice section, therefore, the acoustic score of the same section is taken from the recognition result of the phoneme typewriter, and that value, after correction, is adopted as the acoustic score of <OOV>. This is then integrated with the acoustic scores of the other, known-word portions to give the acoustic score of the word string.
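The scoring rule of step S46 can be sketched as below. Two points are assumptions rather than statements from the source: "integrating" the section scores is taken here to mean summing log-domain scores, and the correction is passed in as a caller-supplied function. The function name, tuple layout, and score values are likewise hypothetical.

```python
OOV_PREFIX = "<OOV>"


def word_string_score(sections, correct=lambda score, num_frames: score):
    """Sum per-section acoustic scores for one candidate word string.

    sections is a list of (word, known_score, typewriter_score, num_frames)
    tuples. For a known word, known_score is used as is. For an <OOV>
    section, known_score is ignored (it may be None) and the corrected
    typewriter score is used instead.
    """
    total = 0.0
    for word, known_score, typewriter_score, num_frames in sections:
        if word.startswith(OOV_PREFIX):
            total += correct(typewriter_score, num_frames)
        else:
            total += known_score
    return total


candidate = [
    ("watashi", -80.0, -75.0, 40),
    ("no", -20.0, -18.0, 10),
    ("<OOV>(t/a/r/o:)", None, -60.0, 30),
]
score = word_string_score(candidate, correct=lambda s, n: s - 0.5 * n)
```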

[0054] In step S47, the matching unit 44 keeps the top m word strings (m ≤ n) with the highest acoustic scores as candidate word strings. In step S48, the matching unit 44 refers to the language model database 53 and computes a language score for each candidate word string. The language score expresses how appropriate a word string that is a candidate recognition result is as language. The method of computing this language score is described in detail below.

[0055] Because the voice recognition unit 1 of the present invention also recognizes unknown words, the language model must support unknown words. As examples, the case of using a grammar or finite state automaton (FSA) that supports unknown words and the case of using a tri-gram (a kind of statistical language model) that likewise supports unknown words will be described.

[0056] An example of a grammar will be described with reference to FIG. 9. This grammar 61 is written in BNF (Backus-Naur Form). In FIG. 9, "$A" denotes a variable and "A|B" means "A or B". In addition, "[A]" means "A is optional", and "{A}" means "A is repeated zero or more times".

[0057] <OOV> is a symbol representing an unknown word; by writing <OOV> in the grammar, word strings containing unknown words can also be handled. Although "$ACTION" is not defined in FIG. 9, in practice it is defined as names of actions such as "stand up", "sit down", "bow", and "greet".

[0058] With this grammar 61, word strings that conform to the grammar stored in the database, such as "<start>/konnichiwa/<end>" ("hello"), "<start>/sayounara/<end>" ("goodbye"), and "<start>/watashi/no/namae/wa/<OOV>/desu/<end>" ("my name is <OOV>"), are accepted (parsed by this grammar), whereas word strings that do not conform to it, such as "<start>/kimi/no/<OOV>/namae/<end>" ("your <OOV> name"), are not accepted (not parsed by this grammar). Here "/" is a delimiter between words, and "<start>" and "<end>" are special symbols representing the silence before and after the utterance, respectively.

[0059] A parser is used to calculate the language score with this grammar. The parser divides word strings into those the grammar accepts and those it does not. For example, a language score of 1 is given to an accepted word string and a language score of 0 to a word string that is not accepted.

[0060] Therefore, given the two word strings "<start>/私/の/名前/は/<OOV>(t/a/r/o:)/です/<end>" and "<start>/私/の/名前/は/<OOV>(j/i/r/o:)/です/<end>", for example, both are replaced by "<start>/私/の/名前/は/<OOV>/です/<end>" before the language score is calculated, and a language score of 1 (accepted) is output for both.
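The scoring rule of the two preceding paragraphs can be sketched as follows. This is a minimal illustration, not the parser of the invention: the set `ACCEPTED` is a hypothetical stand-in for grammar 61 that lists only the three accepted word strings quoted in the text, and the function names are assumptions.

```python
import re

# Hypothetical stand-in for grammar 61 (FIG. 9): only the three accepted
# word strings quoted in the text are listed here.
ACCEPTED = {
    ("<start>", "こんにちは", "<end>"),
    ("<start>", "さようなら", "<end>"),
    ("<start>", "私", "の", "名前", "は", "<OOV>", "です", "<end>"),
}

# Matches an unknown-word token that carries its pronunciation, e.g. "<OOV>(t/a/r/o:)".
_OOV_WITH_PRONUNCIATION = re.compile(r"^<OOV>\(.*\)$")


def normalize(words):
    """Replace tokens such as '<OOV>(t/a/r/o:)' with the bare symbol '<OOV>'."""
    return tuple("<OOV>" if _OOV_WITH_PRONUNCIATION.match(w) else w for w in words)


def language_score(words):
    """Return 1 if the normalized word string is accepted, otherwise 0."""
    return 1 if normalize(words) in ACCEPTED else 0
```

Because the pronunciation is stripped before the lookup, the "t/a/r/o:" and "j/i/r/o:" variants receive the same score.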

[0061] Whether the grammar accepts a word string can also be determined by converting the grammar in advance into an equivalent (or approximately equivalent) finite state automaton (hereinafter, FSA) and checking whether the FSA accepts each word string.

[0062] FIG. 10 shows an example in which the grammar of FIG. 9 has been converted into an equivalent FSA. An FSA is a directed graph consisting of states (nodes) and paths (arcs). As shown in FIG. 10, S1 is the start state and S16 is the final state. As in FIG. 9, "$ACTION" actually has the names of actions registered in it.

[0063] Each path is labeled with a word, and a transition from one state to the next along a path consumes that word. A path labeled "ε", however, is a special transition that consumes no word (hereinafter, an ε-transition). For example, for "<start>/私/は/<OOV>/です/<end>" ("I am <OOV>"), the transition from the initial state S1 to state S2 consumes "<start>" and the transition from state S2 to state S3 consumes "私", but the transition from state S3 to state S5 is an ε-transition and consumes no word. In other words, it is possible to skip from state S3 to state S5 and then move on to the next state S6.

[0064] Whether a given word string is accepted by this FSA is determined by whether the final state S16 can be reached starting from the initial state S1.

[0065] For example, for "<start>/私/の/名前/は/<OOV>/です/<end>", the transition from the initial state S1 to state S2 consumes the word "<start>". Next, the transition from state S2 to state S3 consumes the word "私". Likewise, the successive transitions from state S3 to S4, S4 to S5, S5 to S6, and S6 to S7 consume "の", "名前", "は", and "<OOV>" in turn. The transition from state S7 to state S15 then consumes "です", and the transition from state S15 to state S16 consumes "<end>", so the final state S16 is reached. Therefore, "<start>/私/の/名前/は/<OOV>/です/<end>" is accepted by the FSA.

[0066] In contrast, for "<start>/君/の/<OOV>/名前/<end>", the transitions from state S1 to S2, S2 to S8, and S8 to S9 consume "<start>", "君", and "の", but no further transition is possible, so the final state S16 cannot be reached. Therefore, "<start>/君/の/<OOV>/名前/<end>" is not accepted by the FSA (it is rejected).
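The acceptance test described for the FSA can be sketched as below. The transition table is deliberately partial: it contains only the arcs that the text names explicitly for FIG. 10, so states such as S9 have no outgoing arcs here even though the figure presumably defines some. The names `ARCS`, `eps_closure`, and `accepts` are assumptions made for this illustration.

```python
EPS = None  # label used for ε-transitions, which consume no word

# Partial transition table: only arcs explicitly mentioned in the text.
ARCS = {
    "S1": [("<start>", "S2")],
    "S2": [("私", "S3"), ("君", "S8")],
    "S3": [("の", "S4"), (EPS, "S5")],
    "S4": [("名前", "S5")],
    "S5": [("は", "S6")],
    "S6": [("<OOV>", "S7")],
    "S7": [("です", "S15")],
    "S8": [("の", "S9")],
    "S15": [("<end>", "S16")],
}
START, FINAL = "S1", "S16"


def eps_closure(states):
    """Return the given states plus every state reachable via ε-transitions."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for label, nxt in ARCS.get(state, []):
            if label is EPS and nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure


def accepts(words):
    """Return True if the word string leads from START to FINAL."""
    current = eps_closure({START})
    for word in words:
        moved = {nxt for s in current for label, nxt in ARCS.get(s, []) if label == word}
        if not moved:
            return False  # no arc consumes this word, so the string is rejected
        current = eps_closure(moved)
    return FINAL in current
```

Tracking a set of current states is one simple way to handle ε-transitions without converting the automaton to a deterministic one first.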

[0067] Next, an example of calculating the language score when a tri-gram, one type of statistical language model, is used as the language model is described with reference to FIG. 11. A statistical language model is a language model that obtains the generation probability of a word string and uses it as the language score. For example, the language score of "<start>/私/の/名前/は/<OOV>/です/<end>" under the language model 71 of FIG. 11 is expressed as the generation probability of that word string, as shown in the second line. This is further expressed as a product of conditional probabilities, as shown in the third to sixth lines. Here, for example, "P(の|<start> 私)" denotes the probability that "の" appears under the condition that the word immediately before "の" is "私" and the word immediately before "私" is "<start>".

[0068] With the tri-gram, the expressions in the third to sixth lines of FIG. 11 are approximated by conditional probabilities over three consecutive words, as shown in the seventh to ninth lines. These probability values are obtained by referring to a tri-gram database 81 such as that shown in FIG. 12. The tri-gram database 81 is obtained in advance by analyzing a large amount of text.
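In general notation, the factorization and its tri-gram approximation can be written as follows. FIG. 11 itself is not reproduced in this text, so this is a standard formulation assumed to correspond to it, with $w_1 \cdots w_n$ standing for the words of the string (for the example above, $w_1$ is "<start>" and $w_n$ is "<end>").

```latex
\begin{align}
P(w_1 w_2 \cdots w_n)
  &= P(w_1)\, P(w_2 \mid w_1)\, \prod_{i=3}^{n} P(w_i \mid w_1 \cdots w_{i-1}) \\
  &\approx P(w_1)\, P(w_2 \mid w_1)\, \prod_{i=3}^{n} P(w_i \mid w_{i-2}\, w_{i-1})
\end{align}
```

The first line is the exact chain rule; the second keeps only the two preceding words in each condition, which is the tri-gram approximation.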

[0069] The example of FIG. 12 shows the probability P(w3|w1 w2) for three consecutive words w1, w2, w3. For example, when the three words w1, w2, w3 are "<start>", "私", "の", respectively, the probability is 0.12; when they are "私", "の", "名前", it is 0.01; and when they are "<OOV>", "です", "<end>", it is 0.87.

[0070] Of course, "P(W)" and "P(w2|w1)" are likewise obtained in advance.

[0071] By providing entries for <OOV> in the language model in this way, a language score can be calculated for word strings that contain <OOV>. The symbol <OOV> can therefore be output in the recognition result.

[0072] When other types of language model are used, a language score can likewise be calculated for word strings containing <OOV> by providing entries for <OOV>.

[0073] Furthermore, even with a language model that has no <OOV> entries, the language score can be calculated by using a mechanism that maps <OOV> to an appropriate word in the language model. For example, even with a tri-gram database that has no entry for "P(<OOV>|私 は)", the language score can be calculated by accessing the database with "P(太郎|私 は)" and treating the probability recorded there as the value of "P(<OOV>|私 は)".
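A tri-gram lookup with this kind of <OOV> mapping might look like the sketch below. The first three table entries are the values quoted for FIG. 12; the entry for "太郎" and its value 0.05 are invented purely for illustration, and the names `TRIGRAMS`, `OOV_PROXY`, and `trigram_prob` are assumptions.

```python
# Toy tri-gram table: (w1, w2, w3) -> P(w3 | w1 w2).
TRIGRAMS = {
    ("<start>", "私", "の"): 0.12,       # quoted from the FIG. 12 example
    ("私", "の", "名前"): 0.01,           # quoted from the FIG. 12 example
    ("<OOV>", "です", "<end>"): 0.87,    # quoted from the FIG. 12 example
    ("私", "は", "太郎"): 0.05,           # invented value, for illustration only
}

# Known word used in place of <OOV> when the table has no <OOV> entry.
OOV_PROXY = "太郎"


def trigram_prob(table, w1, w2, w3):
    """Look up P(w3 | w1 w2), falling back to the proxy word for <OOV>.

    Returns None when neither the original key nor the proxy key exists.
    """
    key = (w1, w2, w3)
    if key in table:
        return table[key]
    proxy_key = tuple(OOV_PROXY if w == "<OOV>" else w for w in key)
    return table.get(proxy_key)
```

With this table, `trigram_prob(TRIGRAMS, "私", "は", "<OOV>")` falls back to the "太郎" entry, while entries that mention <OOV> directly are used as they are.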

[0074] Returning to FIG. 6, in step S49 the matching unit 44 integrates the acoustic score and the language score. In step S50, the matching unit 44 selects the candidate word string with the best score, based on the integrated score obtained in step S49, and outputs it as the recognition result.

[0075] When a finite state automaton is used as the language model, the integration in step S49 may instead discard a word string whose language score is 0 and keep a word string whose language score is not 0 as it is.
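Steps S49 and S50 can be illustrated with the sketch below. The text does not state how the two scores are combined, so the weighted sum of a log-domain acoustic score and the log of the language score is an assumption, as are the function name `select_best` and the parameter `lm_weight`. Candidates with a language score of 0 are discarded, as described for the finite state automaton case.

```python
import math


def select_best(candidates, lm_weight=1.0):
    """Pick the candidate word string with the best integrated score.

    candidates: iterable of (words, acoustic_log_score, language_score).
    Returns the winning words, or None if every candidate is discarded.
    """
    best_words = None
    best_score = -math.inf
    for words, acoustic, language in candidates:
        if language <= 0:
            continue  # language score 0: the word string is discarded
        # Assumed integration rule: log-domain weighted sum.
        total = acoustic + lm_weight * math.log(language)
        if total > best_score:
            best_words, best_score = words, total
    return best_words
```

With a 0/1 language score, the log term is zero for every surviving candidate, so the choice reduces to the best acoustic score among accepted word strings.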

[0076] Returning to FIG. 5, after the voice recognition processing has been executed in step S22 as described above, in step S23 the control unit 46 of the voice recognition unit 1 determines whether the recognized word string contains an unknown word. If it does, the control unit 46 controls the word acquisition unit 4 to execute word acquisition processing in step S24 and acquire the unknown word.

[0077] The word acquisition processing is described in detail with reference to FIG. 13. In step S91, the word acquisition unit 4 extracts the feature parameters of the unknown word (<OOV>) from the voice recognition unit 1. In step S92, the word acquisition unit 4 determines whether the unknown word belongs to an already acquired cluster. If it does not, the word acquisition unit 4 generates a new cluster corresponding to the unknown word in step S93. Then, in step S94, the word acquisition unit 4 outputs the ID of the cluster to which the unknown word belongs to the matching unit 44 of the voice recognition unit 1.

[0078] If it is determined in step S92 that the unknown word belongs to an already acquired cluster, no new cluster needs to be generated, so the word acquisition unit 4 skips step S93, proceeds to step S94, and outputs the ID of the already acquired cluster to which the unknown word belongs to the matching unit 44.

[0079] The processing of FIG. 13 is performed for each unknown word.
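The control flow of steps S92 to S94 can be sketched as follows. This passage does not say how cluster membership is decided, so the sketch makes two loud assumptions: the unknown word is represented by its phoneme string rather than by feature parameters, and membership is judged by a string similarity threshold using Python's `difflib`. The class name `WordAcquirer` and its threshold are hypothetical.

```python
from difflib import SequenceMatcher


class WordAcquirer:
    """Assigns unknown words to clusters and returns the cluster ID."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed similarity cut-off
        self.clusters = {}          # cluster ID -> list of member pronunciations
        self._next_id = 1

    def _find_cluster(self, pronunciation):
        """S92: return the ID of a sufficiently similar cluster, or None."""
        for cluster_id, members in self.clusters.items():
            for member in members:
                if SequenceMatcher(None, pronunciation, member).ratio() >= self.threshold:
                    return cluster_id
        return None

    def acquire(self, pronunciation):
        cluster_id = self._find_cluster(pronunciation)
        if cluster_id is None:
            # S93: no existing cluster fits, so create a new one.
            cluster_id = self._next_id
            self._next_id += 1
            self.clusters[cluster_id] = []
        # Assumption: the new observation is also kept as a cluster member.
        self.clusters[cluster_id].append(pronunciation)
        return cluster_id  # S94: the ID handed back to the matching unit
```

Calling `acquire` once per unknown word mirrors the statement that the processing is performed for each unknown word.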

[0080] Returning to FIG. 5, after the word acquisition processing of step S24 ends, in step S25 the dialogue control unit 3 determines whether the word string acquired in the processing of step S24 matches a template. That is, it is determined here whether the word string of the recognition result means the registration of some name. If it is determined in step S25 that the word string of the recognition result matches a template, in step S26 the dialogue control unit 3 causes the associative storage unit 2 to store the cluster ID of the name in association with its category.

[0081] An example of the templates matched by the dialogue control unit 3 is described with reference to FIG. 14. In FIG. 14, "/A/" means "if the character string A is contained", and "A|B" means "A or B". "." denotes any single character, "A+" means one or more repetitions of A, and "(.)+" denotes an arbitrary character string.

[0082] This template 91 indicates that when the word string of the recognition result matches the regular expression on the left side of the figure, the action on the right side is executed. For example, when the recognition result is the word string "<start>/私/の/名前/は/<OOV>(t/a/r/o:)/です/<end>", the character string "私の名前は<OOV>です" generated from this recognition result matches the second regular expression in FIG. 14. The corresponding action, "register the cluster ID corresponding to <OOV> as a user name", is therefore executed. That is, when the cluster ID of "<OOV>(t/a/r/o:)" is "1", the category name of cluster ID "1" is registered as "user name", as shown in FIG. 3.

[0083] Similarly, when the recognition result is "<start>/君/の/名前/は/<OOV>(a/i/b/o)/だよ/<end>", the character string "君の名前は<OOV>だよ" ("your name is <OOV>") generated from it matches the first regular expression in FIG. 14, so if "<OOV>(a/i/b/o)" has cluster ID "2", the category of cluster ID "2" is registered as "character name".
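The template matching of steps S25 and S26 can be illustrated with Python regular expressions. The actual expressions of FIG. 14 are not reproduced in this text, so the two patterns below are hypothetical ones that merely match the two example sentences; the names `TEMPLATES`, `to_text`, and `register` are also assumptions. A plain dictionary stands in for the associative storage unit 2.

```python
import re

# Hypothetical templates: (pattern, category to register for the <OOV> cluster).
TEMPLATES = [
    (re.compile(r"君の名前は<OOV>だよ"), "character name"),
    (re.compile(r"私の名前は<OOV>です"), "user name"),
]


def to_text(words):
    """Join the recognized words into one string.

    Silence markers are dropped and '<OOV>(...)' tokens become plain '<OOV>'.
    """
    parts = []
    for word in words:
        if word in ("<start>", "<end>"):
            continue
        parts.append("<OOV>" if word.startswith("<OOV>") else word)
    return "".join(parts)


def register(words, oov_cluster_id, memory):
    """Store cluster ID -> category if a template matches; return the category."""
    text = to_text(words)
    for pattern, category in TEMPLATES:
        if pattern.search(text):  # "contained" semantics, like "/A/" in FIG. 14
            memory[oov_cluster_id] = category
            return category
    return None
```

Later dialogue logic can then look up, for example, which cluster ID is stored under "user name".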

[0084] Some dialogue systems register only one kind of word (for example, only the "user name"), in which case the template 91 and the associative storage unit 2 can be simplified. For example, the content of the template 91 can be "if the recognition result contains <OOV>, store its ID", and the associative storage unit 2 can store only that cluster ID.

[0085] The dialogue control unit 3 reflects the information registered in the associative storage unit 2 in this way in its subsequent dialogue decisions. For example, suppose the dialogue system needs to determine whether the user's utterance contains the name of the dialogue character and, if so, judge that it has been addressed and reply accordingly, or needs to have the dialogue character speak the user's name. In such cases, the dialogue control unit 3 can refer to the information in the associative storage unit 2 to obtain the word corresponding to the dialogue character (the entry whose category name is "character name") or the word corresponding to the user name (the entry whose category name is "user name").

[0086] On the other hand, if it is determined in step S23 that the recognition result contains no unknown word, or in step S25 that the recognition result does not match a template, the dialogue control unit 3 generates a response corresponding to the input voice in step S27. That is, in this case the name (unknown word) is not registered, and predetermined processing corresponding to the input voice from the user is executed.
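The branching among steps S23 to S27 can be summarized in a short sketch. The four callables and the dictionary passed in are hypothetical stand-ins for the word acquisition unit, the template check, the associative storage unit, and the response generator; returning `None` after a registration is also an assumption, since this passage does not say what follows step S26.

```python
def handle_utterance(words, acquire, match_template, memory, respond):
    """Process one recognized word string following the S23-S27 branches."""
    unknown = [w for w in words if w.startswith("<OOV>")]      # S23
    if unknown:
        cluster_ids = [acquire(w) for w in unknown]            # S24, per unknown word
        category = match_template(words)                       # S25
        if category is not None:
            for cluster_id in cluster_ids:
                memory[cluster_id] = category                  # S26
            return None
    # No unknown word, or no template matched: ordinary response.
    return respond(words)                                      # S27
```

Passing the collaborators in as arguments keeps the sketch independent of how clustering and template matching are actually implemented.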

[0087] When a grammar is used as the language model, a description equivalent to the phoneme typewriter can also be built into the grammar. FIG. 15 shows an example of such a grammar. In this grammar 101, the variable "$PHONEME" in the first line joins all phonemes with "|", meaning "or", so it denotes any one of the phoneme symbols. The variable "$OOV" denotes zero or more repetitions of "$PHONEME", that is, "any phoneme symbols concatenated zero or more times", which corresponds to a phoneme typewriter. Therefore, "$OOV" between "は" and "です" in the third line can accept any pronunciation.

[0088] In a recognition result obtained with this grammar 101, the part corresponding to "$OOV" is output as multiple symbols. For example, the recognition result for "私の名前は太郎です" ("my name is Taro") is "<start>/私/の/名前/は/t/a/r/o:/です/<end>". Converting this result into "<start>/私/の/名前/は/<OOV>(t/a/r/o:)/です" allows the processing from step S23 of FIG. 5 onward to be executed in the same way as when the phoneme typewriter is used.
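The conversion described here, from a run of phoneme symbols to a single <OOV> token, can be sketched as follows. The phoneme inventory `PHONEMES` is a hypothetical subset chosen for the example, not the contents of "$PHONEME" in FIG. 15, and the function name is an assumption.

```python
# Hypothetical phoneme inventory; the real "$PHONEME" list is in FIG. 15.
PHONEMES = {
    "a", "i", "u", "e", "o", "a:", "i:", "u:", "e:", "o:",
    "k", "s", "t", "n", "h", "m", "y", "r", "w",
    "g", "z", "d", "b", "p", "j", "N",
}


def merge_phonemes(tokens):
    """Collapse each run of consecutive phoneme symbols into '<OOV>(p1/p2/...)'."""
    merged, run = [], []
    for token in tokens:
        if token in PHONEMES:
            run.append(token)
            continue
        if run:
            merged.append("<OOV>(" + "/".join(run) + ")")
            run = []
        merged.append(token)
    if run:  # flush a run that reaches the end of the input
        merged.append("<OOV>(" + "/".join(run) + ")")
    return merged
```

After this step the word string has the same shape as the output obtained with the phoneme typewriter, so the later steps need no special handling.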

[0089] In the above, a category is registered as the information related to the unknown word, but other information may be registered instead.

[0090] FIG. 16 shows a configuration example of a personal computer 110 that executes the processing described above. The personal computer 110 has a built-in CPU (Central Processing Unit) 111. An input/output interface 115 is connected to the CPU 111 via a bus 114. A ROM (Read Only Memory) 112 and a RAM (Random Access Memory) 113 are connected to the bus 114.

[0091] Connected to the input/output interface 115 are an input unit 117 made up of input devices operated by the user, such as a mouse, keyboard, microphone, and AD converter, and an output unit 116 made up of output devices such as a display, speaker, and DA converter. Also connected to the input/output interface 115 are a storage unit 118, such as a hard disk drive, that stores programs and various data, and a communication unit 119 that communicates data via a network, typically the Internet.

[0092] A drive 120 that reads and writes data from and to recording media such as a magnetic disk 131, an optical disk 132, a magneto-optical disk 133, and a semiconductor memory 134 is connected to the input/output interface 115 as needed.

[0093] The voice processing program that causes the personal computer 110 to operate as a voice processing device to which the present invention is applied is supplied to the personal computer 110 stored on the magnetic disk 131 (including a floppy disk), the optical disk 132 (including a CD-ROM (Compact Disc-Read Only Memory) and a DVD (Digital Versatile Disc)), the magneto-optical disk 133 (including an MD (Mini Disc)), or the semiconductor memory 134. It is read by the drive 120 and installed on the hard disk drive built into the storage unit 118. The voice processing program installed in the storage unit 118 is loaded from the storage unit 118 into the RAM 113 and executed in response to a command from the CPU 111 that corresponds to a user command entered through the input unit 117.

[0094] The series of processes described above can be executed by hardware or by software. When the series of processes is executed by software, the programs constituting the software are installed from a network or a recording medium onto a computer built into dedicated hardware, or onto, for example, a general-purpose personal computer that can execute various functions by installing various programs.

[0095] As shown in FIG. 16, this recording medium is not limited to packaged media on which the program is recorded and which are distributed separately from the device itself to provide the program to users, such as the magnetic disk 131, the optical disk 132, the magneto-optical disk 133, or the semiconductor memory 134. It may also be the ROM 112 on which the program is recorded, or the hard disk included in the storage unit 118, provided to the user already built into the device.

[0096] In this specification, the steps describing the program recorded on the recording medium include not only processing performed in time series in the order described, but also processing that is executed in parallel or individually and not necessarily in time series.

[0097] In this specification, a system refers to a logical collection of multiple devices, regardless of whether the constituent devices are in the same housing.

【００９８】[0098]

[Effects of the Invention] As described above, according to the present invention, a word can be registered by voice. The registration can be performed without making the user aware of a registration mode. Unknown words can also be registered easily from continuous input voice containing both known and unknown words. Furthermore, registered words can be reflected in subsequent dialogue.

[Brief description of drawings]

FIG. 1 is a block diagram showing a configuration example of an embodiment of a dialogue system to which the present invention is applied.

FIG. 2 is a diagram showing an example of the state of clusters.

FIG. 3 is a diagram showing word registration.

FIG. 4 is a block diagram showing a configuration example of the voice recognition unit of FIG. 1.

FIG. 5 is a flowchart explaining the operation of the dialogue system of FIG. 1.

FIG. 6 is a flowchart explaining the details of the voice recognition processing in step S22 of FIG. 5.

FIG. 7 is a flowchart explaining the details of the operation of generating a word string in step S44 of FIG. 6.

FIG. 8 is a diagram showing an example of a grammar used in the language model database.

FIG. 9 is a diagram showing an example of a language model based on a finite state automaton.

FIG. 10 is a diagram showing an example of calculating a language score using a tri-gram.

FIG. 11 is a diagram showing an example of a tri-gram database.

FIG. 12 is a flowchart explaining the details of the word acquisition processing in step S24 of FIG. 5.

FIG. 13 is a diagram showing an example of a template.

FIG. 14 is a diagram showing an example of a grammar incorporating a phoneme typewriter.

FIG. 15 is a block diagram showing a configuration example of an embodiment of a computer to which the present invention is applied.

[Explanation of symbols]

1 voice recognition unit, 2 associative storage unit, 3 dialogue control unit, 4 word acquisition unit, 41 microphone, 42 AD conversion unit, 43 feature extraction unit, 44 matching unit, 45 phoneme typewriter unit, 46 control unit, 51 acoustic model database, 52 dictionary database, 53 language model database

Claims

[Claims]

1. A voice processing device that processes input voice and registers a word included in the input voice based on the processing result, comprising: recognition means for recognizing continuous input voice; unknown word determination means for determining whether the recognition result recognized by the recognition means contains an unknown word; acquisition means for acquiring a word corresponding to the unknown word when the unknown word determination means determines that an unknown word is contained; and registration means for registering the word acquired by the acquisition means in association with other information.

2. The voice processing device according to claim 1, further comprising pattern determination means for determining whether the recognition result matches a specific pattern, wherein the registration means registers the word when the pattern determination means determines that the recognition result matches the specific pattern.

3. The voice processing device according to claim 2, further comprising response generation means for generating a response corresponding to the input voice when the unknown word determination means determines that no unknown word is contained, or when the pattern determination means determines that the recognition result does not match the specific pattern.

4. The voice processing device according to claim 2, wherein the registration means registers the category of the word as the other information.

5. The registration means registers the other information based on the pattern determined to match by the pattern determination means.
The voice processing device according to.

6. The voice processing apparatus according to claim 1, wherein the acquisition unit acquires the word by clustering the unknown word.

7. The voice processing device according to claim 1, further comprising comparison means for comparing, for a predetermined section of the input voice, the acoustic score obtained by matching with known words and the acoustic score obtained by recognition with a phoneme typewriter, wherein the comparison means estimates the section to be an unknown word when the acoustic score obtained with the phoneme typewriter is superior.

8. The comparing means compares the acoustic score when the known word is matched with the acoustic score when the phonological typewriter is recognized, and then compares the acoustic score. The audio processing device according to claim 7.

9. A voice processing method for a voice processing device that processes input voice and registers a word included in the input voice based on the processing result, comprising: a recognition step of recognizing continuous input voice; a determination step of determining whether the recognition result recognized in the recognition step contains an unknown word; an acquisition step of acquiring a word corresponding to the unknown word when it is determined in the determination step that an unknown word is contained; and a registration step of registering the word acquired in the acquisition step in association with other information.

10. A recording medium on which a computer-readable program is recorded, the program being for a voice processing device that processes input voice and registers a word included in the input voice based on the processing result, the program comprising: a recognition step of recognizing continuous input voice; a determination step of determining whether the recognition result recognized in the recognition step contains an unknown word; an acquisition step of acquiring a word corresponding to the unknown word when it is determined in the determination step that an unknown word is contained; and a registration step of registering the word acquired in the acquisition step in association with other information.

11. A program that causes a computer controlling a voice processing device, which processes input voice and registers a word included in the input voice based on the processing result, to execute: a recognition step of recognizing continuous input voice; a determination step of determining whether the recognition result recognized in the recognition step contains an unknown word; an acquisition step of acquiring a word corresponding to the unknown word when it is determined in the determination step that an unknown word is contained; and a registration step of registering the word acquired in the acquisition step in association with other information.