JP2003131693A - Device and method for voice recognition

Info

Publication number: JP2003131693A
Application number: JP2001328083A
Authority: JP
Inventors: Satoko Tanaka; 聡子田中
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2001-10-25
Filing date: 2001-10-25
Publication date: 2003-05-09

Abstract

PROBLEM TO BE SOLVED: To recognize a word that does not appear in the dictionary data. SOLUTION: A first decoder 7 and a second decoder 8 are provided. The first decoder 7 decodes the feature quantities supplied from a voice analysis unit 2 into a phoneme string by using an acoustic model and a first language model, compares that phoneme string with the phoneme strings of the words and sentences in the dictionary data, and outputs a word and/or sentence having a similar phoneme string as the recognition result of the input voice signal. The second decoder 8, in turn, decodes the feature quantities supplied from the voice analysis unit 2 into a phoneme string by using the acoustic model and a second language model. A kana (Japanese syllabary) conversion unit 9 then converts this phoneme string into kana and outputs the conversion result as a recognition result of the input voice signal.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a voice recognition device and a voice recognition method that are applied, for example, when desired data is obtained from a database by voice input.

[0002]

[Description of the Related Art] In recent years, systems have been proposed in which information is input and a database is searched on the basis of that information to obtain desired information. One example is a system in which information such as an artist name or a song title is input and a database is searched on the basis of that information to obtain a desired song.

[0003] In such a system, information such as an artist name or a song title is input by voice as well as by manual operation. Voice recognition devices are now being developed to make voice input more convenient.

[0004] As shown in FIG. 8, a voice recognition device 100 includes a voice analysis unit 101, an acoustic model storage unit 102, a language model storage unit 103, a dictionary 104, and a decoder 105.

[0005] The voice analysis unit 101 performs frequency analysis on the input voice signal, for example by extracting the feature quantities needed for recognition at a predetermined sampling interval.

[0006] The acoustic model storage unit 102 stores an acoustic model. The acoustic model models the characteristics of the voice signal, for example as an HMM (Hidden Markov Model) for each phoneme.

[0007] The language model storage unit 103 stores a language model. The language model is information indicating the phoneme strings of the words, sentences, and the like to be recognized.

[0008] The dictionary 104 forms part of the language model 103 and stores dictionary data indicating the correspondence between the phoneme strings to be recognized and text data. The information used to connect acoustic models is modeled from this dictionary.

[0009] The decoder 105 applies the acoustic model and the language model to the feature quantities supplied from the voice analysis unit 101 to generate a phoneme string for a word, sentence, or the like, and compares the generated phoneme string with the phoneme strings of the words and sentences in the dictionary data. When the dictionary data contains a word or sentence having a similar phoneme string, the decoder 105 outputs that word or sentence as the recognition result of the input voice signal.

[0010] In the voice recognition device 100 described above, the voice analysis unit 101 first performs frequency analysis on the waveform of the input voice signal and extracts feature quantities. The extracted feature quantities are supplied to the decoder 105.

[0011] Next, the decoder 105 uses the acoustic model to model the feature quantities supplied from the voice analysis unit 101 phoneme by phoneme.

[0012] The decoder 105 then generates a phoneme string by connecting the modeled phonemes using the language model, and compares the generated phoneme string with the phoneme strings of the words and sentences in the dictionary data. When the dictionary data contains a word or sentence having a similar phoneme string, the decoder 105 outputs that word or sentence as the recognition result of the input voice signal.

[0013]

[Problems to Be Solved by the Invention] The voice recognition device 100 compares the phoneme string generated from the input voice signal with the phoneme strings of the words and sentences in the dictionary data, and takes a word or sentence as the recognition result only when the dictionary data contains one with a similar phoneme string.

[0014] It is therefore difficult for the voice recognition device 100 to produce a word or sentence that is not in the dictionary data as a recognition result. Moreover, a user who does not know which words and sentences are in the dictionary data will input words that cannot be recognized, and will find it difficult to make full use of the voice recognition device 100.

[0015] In addition, when a voice signal is input, the voice recognition device 100 takes as its recognition result a word or sentence that is similar not only in pronunciation but also in length. Consequently, even when the dictionary data contains a word or sentence that partially matches the input voice signal but differs in length, the device often selects a word or sentence of similar length instead of the partial match. In other words, when the user remembers only part of the word or sentence to be input, the voice recognition device 100 often returns a wrong word or sentence even though the correct one is stored in the dictionary 104. For example, if the dictionary data contains "富士山" (Fujisan, Mt. Fuji) and "宇治" (Uji) but not "富士" (Fuji), and the user says "Fuji" meaning "Fujisan", the recognition result is "Uji".

[0016] The accuracy of the voice recognition device 100 is therefore insufficient. With insufficient accuracy, it becomes difficult to obtain the desired information when, for example, a database is searched on the basis of the input information.
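The Fuji/Uji failure described in [0015] can be reproduced with a minimal sketch. It assumes that whole-string edit distance over phoneme symbols stands in for the similarity measure of decoder 105; the source does not specify the measure, and the phoneme spellings and function names below are illustrative only.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Hypothetical dictionary data: word -> phoneme string.
DICTIONARY = {
    "Fujisan": ["f", "u", "j", "i", "s", "a", "N"],
    "Uji": ["u", "j", "i"],
}


def closest_entry(phonemes):
    """Return the dictionary word whose phoneme string is nearest overall."""
    return min(DICTIONARY, key=lambda w: edit_distance(phonemes, DICTIONARY[w]))


spoken = ["f", "u", "j", "i"]  # the user says "Fuji", meaning "Fujisan"
print(closest_entry(spoken))
```

Under this assumed measure, "Fuji" is one edit away from "Uji" but three edits away from "Fujisan", so a length-sensitive comparison prefers the wrong entry even though the correct one is a prefix match.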

[0017] The present invention has been proposed in view of these circumstances. Its object is to provide a voice recognition device and a voice recognition method that can recognize the input both when a word or sentence not in the dictionary data is input and when a word or sentence that only partially matches one in the dictionary data is input.

[0018]

[Means for Solving the Problems] A voice recognition device according to the present invention comprises: voice analysis means for extracting feature quantities of an input voice signal; acoustic model storage means for storing an acoustic model that models the feature quantities of voice signals; language model storage means for storing a language model consisting of information indicating the probability with which phonemes appear; dictionary data storage means for storing dictionary data indicating the phoneme strings of a plurality of words and/or sentences; first decoding means for generating a phoneme string by applying the language model and the acoustic model to the feature quantities supplied from the voice analysis means, comparing the generated phoneme string with the phoneme strings of the words and/or sentences in the dictionary data, and, when the dictionary data contains a word and/or sentence having a similar phoneme string, outputting that word and/or sentence as the recognition result of the input voice signal; second decoding means for generating a phoneme string by applying the language model and the acoustic model to the feature quantities supplied from the voice analysis means; and kana conversion means for converting the phoneme string generated by the second decoding means into kana and outputting the kana conversion result as a recognition result of the input voice signal.

[0019] The voice recognition device according to the present invention further comprises partial match detection means for detecting a partial match between the words and/or sentences in the dictionary data and the word and/or sentence output from the kana conversion means and, when the dictionary data contains a partially matching word and/or sentence, outputting that word and/or sentence as a recognition result of the input voice signal.

[0020] The voice recognition device according to the present invention further comprises reliability detection means for evaluating the reliability of the word and/or sentence recognized by the first decoding means and outputting that word and/or sentence as a recognition result of the input voice signal in accordance with the reliability.

[0021] A voice recognition method according to the present invention stores, in storage means, an acoustic model that models the feature quantities of voice signals, a language model consisting of information indicating the probability with which phonemes appear, and dictionary data indicating the phoneme strings of a plurality of words and/or sentences, and extracts the feature quantities of an input voice signal. The method performs a first recognition process that generates a phoneme string by applying the language model and the acoustic model to the feature quantities, compares the generated phoneme string with the phoneme strings of the words and/or sentences in the dictionary data, and, when the dictionary data contains a word and/or sentence having a similar phoneme string, takes that word and/or sentence as the recognition result. The method also performs a second recognition process that generates a phoneme string by applying the language model and the acoustic model to the feature quantities, converts that phoneme string into kana, and takes the kana-converted word and/or sentence as a recognition result of the input voice signal.

[0022] The voice recognition method according to the present invention further detects a partial match between the words and/or sentences in the dictionary data and the word and/or sentence obtained by the second recognition process and, when the dictionary data contains a partially matching word and/or sentence, takes that word and/or sentence as a recognition result of the input voice signal.

[0023] The voice recognition method according to the present invention further evaluates the reliability of the word and/or sentence generated by the first recognition process and takes that word and/or sentence as a recognition result of the input voice signal in accordance with the reliability.

[0024]

[Embodiments of the Invention] Embodiments of the present invention are described in detail below with reference to the drawings.

First Embodiment

[0025] The first embodiment of the present invention is described in detail below with reference to the drawings.

[0026] The present invention can be applied, for example, to a voice recognition device 1 having the configuration shown in FIG. 1.

[0027] The voice recognition device 1 includes a voice analysis unit 2, a first language model storage unit 3, an acoustic model storage unit 4, a second language model storage unit 5, a first decoder 7, a second decoder 8, and a kana conversion unit 9.

[0028] The voice analysis unit 2 performs frequency analysis on the input voice signal, for example by extracting the feature quantities needed for recognition at a predetermined sampling interval. Specifically, the voice analysis unit 2 extracts the signal energy, the zero-crossing count, the frequency characteristics, and the amounts of change in these quantities. Linear predictive coding analysis (LPC), the fast Fourier transform (FFT), band-pass filters (BPF), and the like are used for the frequency analysis. The voice analysis unit 2 extracts these feature quantities as vectors, or quantizes them and extracts them as scalars.
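Two of the features named in [0028], signal energy and zero-crossing count, can be illustrated with a short sketch. This is only an illustration under stated assumptions: the non-overlapping framing, the function name `frame_features`, and the sample values are hypothetical, and the frequency-domain features (LPC, FFT, BPF) are omitted.

```python
def frame_features(samples, frame_len):
    """Return (energy, zero_crossings) for each non-overlapping frame."""
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        # Energy: sum of squared sample values in the frame.
        energy = sum(x * x for x in frame)
        # Zero crossings: sign changes between consecutive samples.
        zero_crossings = sum(
            1 for prev, cur in zip(frame, frame[1:]) if (prev < 0) != (cur < 0)
        )
        features.append((energy, zero_crossings))
    return features


signal = [1, -1, 1, -1, 2, 2, 2, 2]  # made-up samples
print(frame_features(signal, 4))     # [(4, 3), (16, 0)]
```

The first frame alternates in sign (low energy, three crossings), while the second is constant (higher energy, no crossings), which is the kind of contrast these two features capture.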

[0029] The first language model storage unit 3 stores a first language model and dictionary data. The dictionary data indicates the correspondence between the phoneme strings to be recognized by the voice recognition device 1 and text data. The first language model models, on the basis of the dictionary data, the probability with which phonemes appear.

[0030] The acoustic model storage unit 4 stores an acoustic model. The acoustic model models the feature quantities of the input voice signal, for example as an HMM (Hidden Markov Model) for each phoneme.

[0031] In this embodiment, the PLU (pseudo phoneme unit) is adopted as the unit for modeling speech. As shown in Table 1, the PLU scheme treats Japanese as 27 phoneme units, and a number of states is assigned to each phoneme. The number of states is the minimum number of frames for which a subword unit lasts; for example, the phoneme "a" has 3 states, which means that "a" lasts at least 3 frames. The three states approximately represent the onset, the steady state, and the release of the sound. Plosives such as the phonemes "b" and "g" are set to 2 states because they are inherently short. Breath (ASPIRATION) is also set to 2 states. Silence (SILENCE) is set to 1 state because it has no temporal variation.
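The minimum-duration reading of the state counts in [0031] can be sketched as a lookup. Only the five values stated in the text are included, since Table 1 is not reproduced here; the names `STATE_COUNTS` and `minimum_frames` are hypothetical.

```python
# State counts stated in the text; the remaining PLU entries of Table 1
# are not available here and are deliberately left out.
STATE_COUNTS = {
    "a": 3,           # vowel: onset, steady state, release
    "b": 2,           # plosive: inherently short
    "g": 2,           # plosive: inherently short
    "ASPIRATION": 2,  # breath
    "SILENCE": 1,     # no temporal variation
}


def minimum_frames(units):
    """Smallest number of frames a sequence of units can occupy."""
    return sum(STATE_COUNTS[u] for u in units)


print(minimum_frames(["SILENCE", "b", "a", "SILENCE"]))  # 1 + 2 + 3 + 1 = 7
```

A decoder built on such a table could not hypothesize the sequence above for an utterance shorter than seven frames.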

[0032]

[Table 1]

[0033] The second language model storage unit 5 stores a second language model. The second language model is a model in which the phonemes appear with equal or unequal probability. The second language model storage unit 5 may also store information other than the second language model, for example information for determining whether the input voice signal is Japanese.
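One way to read the "equal probability" case of the second language model in [0033] is a model that assigns every phoneme the same probability, so that no phoneme string is favored over another of the same length and the acoustic evidence alone decides the result. The sketch below makes that assumption explicit; the phoneme subset and function names are illustrative and are not the 27 PLUs of Table 1.

```python
import math


def uniform_phoneme_lm(phonemes):
    """Assign every phoneme the same probability."""
    p = 1.0 / len(phonemes)
    return {ph: p for ph in phonemes}


def sequence_log_prob(sequence, lm):
    """Log probability of a phoneme string under a context-free phoneme model."""
    return sum(math.log(lm[ph]) for ph in sequence)


PHONEMES = ["a", "i", "u", "f", "j", "k"]  # illustrative subset only
lm = uniform_phoneme_lm(PHONEMES)

# Strings of equal length receive identical language-model scores.
print(sequence_log_prob(["f", "u", "j", "i"], lm))
print(sequence_log_prob(["k", "u", "j", "i"], lm))
```

The "unequal" case mentioned in the text would correspond to replacing the uniform values with per-phoneme probabilities; the source does not say how those would be chosen.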

[0034] The first decoder 7 applies the acoustic model and the first language model to the feature quantities supplied from the voice analysis unit 2 to generate a phoneme string for a word or sentence, and compares the generated phoneme string with the phoneme strings of the words and sentences in the dictionary data. When the dictionary data contains a word or sentence having a similar phoneme string, the first decoder 7 outputs that word or sentence as the recognition result of the input voice signal.

[0035] The first decoder 7 may also be configured so that, when the dictionary data contains a plurality of words or sentences having similar phoneme strings, it evaluates the reliability of each one, assigns a score to each, and outputs either the word or sentence with the highest score or all words and sentences whose score is at or above a predetermined value as the recognition result of the input voice signal.

[0036] One way to evaluate the reliability of a word or sentence is to detect how much the length of each phoneme in the input speech differs from a predetermined length. Concretely, when "貝 (kai)" is input, each of the phonemes k, a, and i has a predetermined length, and the closer the length of each phoneme is to its predetermined length, the higher the reliability.
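The duration-based evaluation in [0036] can be sketched as follows. The source states only that reliability rises as each phoneme's length approaches its predetermined length; the 0 to 100 scale, the averaging formula, and the expected frame counts below are assumptions made for illustration.

```python
# Hypothetical predetermined lengths, in frames, for the phonemes of "kai".
EXPECTED_FRAMES = {"k": 5, "a": 10, "i": 8}


def duration_reliability(observed):
    """Score from 0 to 100 based on relative deviation from expected lengths.

    `observed` is a list of (phoneme, frame_count) pairs.
    """
    closeness = []
    for phoneme, frames in observed:
        expected = EXPECTED_FRAMES[phoneme]
        deviation = abs(frames - expected) / expected
        closeness.append(max(0.0, 1.0 - deviation))
    return 100.0 * sum(closeness) / len(closeness)


print(duration_reliability([("k", 5), ("a", 10), ("i", 8)]))  # 100.0
print(duration_reliability([("k", 5), ("a", 5), ("i", 8)]))   # lower: "a" is too short
```

Any monotone function of the deviation would satisfy the description equally well; the linear form is chosen only because it is easy to check by hand.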

[0037] The second decoder 8 applies the acoustic model and the second language model to the feature quantities supplied from the voice analysis unit 2 to generate a phoneme string for a word, sentence, or the like. The second decoder 8 supplies the generated phoneme string to the kana conversion unit 9.

[0038] The kana conversion unit 9 converts the phoneme string supplied from the second decoder 8 into kana, and outputs the kana conversion result as a recognition result of the input voice signal.
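The phoneme-to-kana step of [0038] can be sketched as a table lookup with longest-match-first grouping. The source does not describe the conversion procedure, so the table, the two-phoneme grouping rule, and the function name are assumptions; the table covers only the syllables needed for the examples in this document.

```python
# Tiny illustrative table: phoneme group -> katakana.
KANA_TABLE = {
    "a": "ア", "i": "イ", "u": "ウ",
    "ka": "カ", "ku": "ク", "sa": "サ",
    "ji": "ジ", "fu": "フ",
    "N": "ン",  # moraic nasal
}


def phonemes_to_kana(phonemes):
    """Convert a phoneme list to katakana, trying two-phoneme groups first."""
    kana = []
    pos = 0
    while pos < len(phonemes):
        for size in (2, 1):
            chunk = phonemes[pos:pos + size]
            key = "".join(chunk)
            if key in KANA_TABLE:
                kana.append(KANA_TABLE[key])
                pos += len(chunk)
                break
        else:
            # No table entry matches at this position.
            raise ValueError(f"no kana for phonemes {phonemes[pos:]!r}")
    return "".join(kana)


print(phonemes_to_kana(["f", "u", "j", "i"]))  # フジ
print(phonemes_to_kana(["k", "a", "i"]))       # カイ
```

Because the output is built from the phonemes themselves rather than looked up in dictionary data, this path can produce a result for input that the dictionary does not contain.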

[0039] The signal flow in the voice recognition device 1 described above is as follows.

[0040] First, speech is input to the voice recognition device 1. The input speech is supplied to the voice analysis unit 2.

[0041] Next, the voice analysis unit 2 performs frequency analysis on the input voice signal, for example by extracting the feature quantities needed for recognition at a predetermined sampling interval, and supplies the extracted feature quantities to the first decoder 7 and the second decoder 8.

[0042] The first decoder 7 uses the acoustic model to model the feature quantities supplied from the voice analysis unit 2 phoneme by phoneme.

[0043] Next, the first decoder 7 generates a phoneme string by applying the first language model to the modeled phonemes, and compares the generated phoneme string with the phoneme strings of the words and sentences in the dictionary data. When the dictionary data contains a word or sentence having a similar phoneme string, the first decoder 7 outputs that word or sentence as the recognition result of the input voice signal.

[0044] Meanwhile, the second decoder 8 uses the acoustic model to model the feature quantities supplied from the voice analysis unit 2 phoneme by phoneme.

[0045] Next, the second decoder 8 generates a phoneme string by applying the second language model to the modeled phonemes, and supplies the generated phoneme string to the kana conversion unit 9.

[0046] The kana conversion unit 9 then converts the phoneme string supplied from the second decoder 8 into kana, and outputs the kana conversion result as a recognition result of the input voice signal.

[0047] The hardware configuration of the voice recognition device 1 is as shown in FIG. 2. It consists of a microphone 10, an analog-to-digital conversion unit 11, a central processing unit (hereinafter, CPU) 12, a DRAM (Dynamic Random Access Memory) 13, a flash ROM (Read Only Memory) 14, and a mask ROM 15.

[0048] The microphone 10 converts the input speech into an electrical signal and supplies it to the analog-to-digital conversion unit 11. The analog-to-digital conversion unit 11 converts the electrical signal supplied from the microphone 10 from an analog signal into a digital signal. The CPU 12 performs the processing of the voice analysis unit 2, the first decoder 7, the second decoder 8, and the kana conversion unit 9.

[0049] The DRAM 13 temporarily stores data computed by the CPU 12. The flash ROM 14 is provided for debugging. The mask ROM 15 stores the acoustic model, the first language model, and the second language model.

[0050] The voice recognition device 1 described above includes the first decoder 7, the second decoder 8, and the kana conversion unit 9. In the voice recognition device 1, the first decoder 7 models the feature quantities supplied from the voice analysis unit 2 phoneme by phoneme using the acoustic model, and then connects the phonemes using the first language model to generate a phoneme string. The first decoder 7 compares the generated phoneme string with the phoneme strings of the words and sentences in the dictionary data and, when the dictionary data contains a word or sentence having a similar phoneme string, outputs that word or sentence as the recognition result of the input voice signal. The second decoder 8 likewise models the feature quantities supplied from the voice analysis unit 2 phoneme by phoneme using the acoustic model, and then connects the phonemes using the second language model to generate a phoneme string. The kana conversion unit 9 converts this phoneme string into kana and outputs the kana conversion result as a recognition result of the input voice signal.

[0051] The voice recognition device 1 can therefore output a recognition result for the input voice signal even when a word or sentence that is not in the dictionary data is input. For example, a user who does not know which words and sentences are in the dictionary data can still make full use of the voice recognition device 1. The voice recognition device 1 is also highly accurate, which makes it easy, for example, to search a database on the basis of the input information and obtain the desired information.

[0052] The voice recognition device 1 to which the present invention is applied may also include a reliability detection unit 20, as shown in FIG. 3.

[0053] The reliability detection unit 20 detects the reliability of the recognition results evaluated by the first decoder 7. It outputs words and sentences with high scores as recognition results in preference to those with low scores, and it may output the recognition result from the kana conversion unit 9 in preference to words and sentences whose score is at or below a predetermined value.

[0054] For example, suppose that the data output from the first decoder 7 and the kana conversion unit 9 are as shown in Table 2 below, and that the reliability detection unit 20 is set to output the word or sentence from the kana conversion unit 9 in preference to words and sentences whose score is less than 60. The voice recognition device 1 then operates as follows.

[0055]

[Table 2]

[0056] As shown in FIG. 4, in step S1 the reliability detection unit 20 first outputs "ウジ (uji)". The process then proceeds to step S2, where the user checks, for example on a display unit (not shown), whether the input voice signal was "ウジ (uji)". If it was, recognition of the voice signal ends.

[0057] If the input voice signal was not "ウジ (uji)", the process proceeds to step S3, where the reliability detection unit 20 outputs "フイ (fui)". The process then proceeds to step S4, where the user checks whether the input voice signal was "フイ (fui)". If it was, recognition of the voice signal ends.

[0058] If the input voice signal was not "フイ (fui)", the process proceeds to step S5, where the reliability detection unit 20 outputs "フジ (fuji)". The process then proceeds to step S6, where the user checks whether the input voice signal was "フジ (fuji)". If it was, recognition of the voice signal ends.

[0059] If the input voice signal was not "フジ (fuji)", the process proceeds to step S7, where the reliability detection unit 20 outputs "クジ (kuji)". The process then proceeds to step S8, where the user checks whether the input voice signal was "クジ (kuji)". If it was, recognition of the voice signal ends. If it was not, the process proceeds to step S9, where all the recognition results are judged to have been wrong, and recognition of the voice signal ends.
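The ordering rule of the reliability detection unit 20 and the confirmation loop of steps S1 to S9 can be sketched as below. Table 2 is not reproduced in this text, so the scores, and the assumption that "フイ" is the kana conversion result while the other three candidates come from the first decoder 7, are hypothetical values chosen only to reproduce the order ウジ, フイ, フジ, クジ described above.

```python
def presentation_order(dictionary_candidates, kana_result, threshold=60):
    """Order candidates: confident dictionary hits, then kana, then the rest."""
    ranked = sorted(dictionary_candidates, key=lambda c: c[1], reverse=True)
    confident = [word for word, score in ranked if score >= threshold]
    doubtful = [word for word, score in ranked if score < threshold]
    return confident + [kana_result] + doubtful


def confirm(order, intended):
    """Present candidates one by one; stop when the user accepts one."""
    for candidate in order:
        if candidate == intended:  # stands in for the user's check
            return candidate
    return None                    # every recognition result was wrong


# Hypothetical scores; the real Table 2 values are not available here.
candidates = [("ウジ", 80), ("フジ", 50), ("クジ", 30)]
order = presentation_order(candidates, kana_result="フイ")
print(order)                   # ['ウジ', 'フイ', 'フジ', 'クジ']
print(confirm(order, "フジ"))  # accepted at the third candidate
```

Placing the kana result ahead of low-scoring dictionary entries means the user sees a literal transcription of the utterance before unlikely dictionary words.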

[0060] As described above, providing the reliability detection unit 20 allows the voice recognition device 1 to recognize the input voice signal efficiently.

Second Embodiment

[0061] Next, a second embodiment of the present invention is described in detail with reference to the drawings.

[0062] The present invention can also be applied, for example, to a voice recognition device 30 having the configuration shown in FIG. 5. In the voice recognition device 30, elements identical to those of the voice recognition device 1 are given the same reference numerals, and their detailed description is omitted.

【００６３】音声認識装置３０は、音声分析部２と、第
１の言語モデル記憶部３と、音響モデル記憶部４と、第
２の言語モデル記憶部５と、第１のデコーダ７と、第２
のデコーダ８と、かな変換部９と、部分一致検出部３１
とを備える。The voice recognition device 30 includes a voice analysis unit 2, a first language model storage unit 3, an acoustic model storage unit 4, a second language model storage unit 5, a first decoder 7, a second decoder 8, a kana conversion unit 9, and a partial match detection unit 31.

【００６４】部分一致検出部３１は、かな変換部９によ
って変換された単語や文章と、辞書データに示されてい
る単語や文章とが部分的に一致しているか否かを検出す
る。そして、かな変換部９によって変換された単語や文
章と部分的に一致している単語や文章が辞書データに示
されているときには、部分一致検出部３１は、当該単語
や文章を、認識結果として出力する。The partial match detection unit 31 detects whether or not the word or sentence converted by the kana conversion unit 9 partially matches a word or sentence shown in the dictionary data. When the dictionary data shows a word or sentence that partially matches the word or sentence converted by the kana conversion unit 9, the partial match detection unit 31 outputs that word or sentence as a recognition result.
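
The partial match detection described above can be sketched as follows. This is only an illustration under stated assumptions: the source does not define how a partial match is measured, so the sketch treats one kana string containing the other as a partial match. The function name `partial_matches` and the sample dictionary entries are hypothetical.

```python
def partial_matches(kana_result, dictionary):
    """Return dictionary entries that partially match the kana result.

    Assumption: "partial match" means that one string contains the other.
    The source does not specify the matching criterion.
    """
    if not kana_result:
        return []
    return [
        entry
        for entry in dictionary
        # Skip empty entries, which would trivially be contained in anything.
        if entry and (kana_result in entry or entry in kana_result)
    ]


if __name__ == "__main__":
    # Hypothetical dictionary data held as kana strings.
    sample_dictionary = ["フジサン", "クジ", "ウジ"]
    print(partial_matches("フジ", sample_dictionary))
```

With this criterion, a shorter input such as "フジ" can still retrieve a longer dictionary entry such as "フジサン", which corresponds to the case where the lengths differ but the strings partially coincide.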

【００６５】以上説明した音声認識装置３０の信号の流
れは、以下に説明する通りとなる。The signal flow of the voice recognition device 30 described above is as described below.

【００６６】先ず、音声認識装置３０に音声が入力され
る。入力された音声は、音声分析部２へ供給される。First, a voice is input to the voice recognition device 30. The input voice is supplied to the voice analysis unit 2.

【００６７】次に音声分析部２が、入力された音声信号
に対して、例えば認識に必要な特徴量の抽出を所定のサ
ンプリング間隔で行うことなどにより周波数分析を行
い、抽出した特徴量を第１のデコーダ７と第２のデコー
ダ８とに供給する。Next, the voice analysis unit 2 performs frequency analysis on the input voice signal, for example by extracting the feature amounts necessary for recognition at a predetermined sampling interval, and supplies the extracted feature amounts to the first decoder 7 and the second decoder 8.

【００６８】第１のデコーダ７は、音響モデルを用い
て、音声分析部２から供給された特徴量を音素毎にモデ
ル化する。The first decoder 7 models the feature amount supplied from the speech analysis unit 2 for each phoneme using the acoustic model.

【００６９】次に、第１のデコーダ７は、モデル化され
た音素に対して、第１の言語モデルを適用することで音
素列を生成し、生成した音素列を辞書データに示されて
いる単語や文章の音素列と比較し、類似した音素列を有
する単語や文章が辞書データに示されているときに、当
該単語や文章を、入力された音声信号の認識結果として
出力する。Next, the first decoder 7 generates a phoneme string by applying the first language model to the modeled phonemes, and compares the generated phoneme string with the phoneme strings of the words and sentences shown in the dictionary data. When a word or sentence having a similar phoneme string is shown in the dictionary data, the first decoder 7 outputs that word or sentence as the recognition result of the input voice signal.
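
The comparison of a generated phoneme string against the dictionary data might look like the sketch below. The source does not say how similarity is measured, so this sketch assumes Levenshtein edit distance over romanized phoneme strings, with each letter standing in for one phoneme. The names `edit_distance` and `similar_entries`, the `max_distance` threshold, and the sample dictionary are all hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme strings (assumed measure)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similar_entries(phonemes, dictionary, max_distance=1):
    """Return dictionary words whose phoneme string is close to `phonemes`.

    `dictionary` maps each word to its phoneme string. Results are ordered
    by increasing distance, with ties broken by the word itself.
    """
    scored = [(edit_distance(phonemes, p), word) for word, p in dictionary.items()]
    return [word for dist, word in sorted(scored) if dist <= max_distance]


if __name__ == "__main__":
    # Hypothetical dictionary data: word -> phoneme string.
    sample = {"フジ": "fuji", "クジ": "kuji", "ウジ": "uji", "フジサン": "fujisan"}
    print(similar_entries("fuji", sample))
```

A real decoder would more likely rank candidates with acoustic and language model scores; the edit distance here only stands in for "similar phoneme string".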

【００７０】一方、第２のデコーダ８は、音響モデルを
用いて、音声分析部２から供給された特徴量を音素毎に
モデル化する。On the other hand, the second decoder 8 models the feature amount supplied from the speech analysis unit 2 for each phoneme using the acoustic model.

【００７１】次に、第２のデコーダ８は、モデル化され
た音素に対して、第２の言語モデルを適用することで音
素列を生成する。第２のデコーダ８は、生成した音素列
をかな変換部９へ供給する。Next, the second decoder 8 applies a second language model to the modeled phoneme to generate a phoneme string. The second decoder 8 supplies the generated phoneme string to the kana conversion unit 9.

【００７２】次に、かな変換部９は、第２のデコーダ８
から供給された音素列をかなに変換する。かな変換部９
は、かなに変換した結果を、入力された音声信号の認識
結果として出力するとともに、部分一致検出部３１へ供
給する。Next, the kana conversion unit 9 converts the phoneme string supplied from the second decoder 8 into kana. The kana conversion unit 9 outputs the result of the kana conversion as a recognition result of the input voice signal and also supplies it to the partial match detection unit 31.
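
As a rough illustration of the kana conversion step, the sketch below maps a romanized phoneme string to katakana by greedy longest-match lookup. The table `KANA_TABLE` is a deliberately tiny, hypothetical subset covering only the examples in this description, and the function name `phonemes_to_kana` is an assumption; the source does not describe the conversion procedure itself.

```python
# Hypothetical, minimal phoneme-to-kana table (not a complete syllabary).
KANA_TABLE = {
    "fu": "フ", "ji": "ジ", "ku": "ク", "sa": "サ",
    "u": "ウ", "i": "イ", "n": "ン",
}


def phonemes_to_kana(phonemes):
    """Convert a romanized phoneme string to kana by greedy longest match."""
    max_len = max(len(key) for key in KANA_TABLE)
    out = []
    pos = 0
    while pos < len(phonemes):
        # Try the longest chunk first so that "fu" wins over "u".
        for size in range(max_len, 0, -1):
            chunk = phonemes[pos:pos + size]
            if chunk in KANA_TABLE:
                out.append(KANA_TABLE[chunk])
                pos += len(chunk)
                break
        else:
            raise ValueError(f"no kana for phonemes at position {pos}: {phonemes[pos:]!r}")
    return "".join(out)


if __name__ == "__main__":
    for s in ("fuji", "fui", "fujisan"):
        print(s, phonemes_to_kana(s))
```

Because the output is produced directly from the phoneme string, a result can be returned even when no dictionary entry exists for the input.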

【００７３】次に、部分一致検出部３１は、かな変換部
９によって変換された単語や文章と、辞書データに示さ
れている単語や文章とが部分的に一致しているか否かを
検出し、部分的に一致している単語や文章が辞書データ
に示されているときには、当該単語や文章を、入力され
た音声信号の認識結果として出力する。Next, the partial match detection section 31 detects whether or not the word or sentence converted by the kana conversion section 9 partially matches the word or sentence shown in the dictionary data. When a partially matching word or sentence is shown in the dictionary data, the word or sentence is output as the recognition result of the input voice signal.

【００７４】以上説明した音声認識装置３０は、部分一
致検出部３１を備えており、入力された音声信号と部分
的に一致している単語や文章が辞書データに示されてい
るときには、当該単語や文章を認識結果とすることが可
能となる。The voice recognition device 30 described above includes the partial match detection unit 31, so that when a word or sentence partially matching the input voice signal is shown in the dictionary data, that word or sentence can be used as the recognition result.

【００７５】また、音声認識装置３０は、入力された音
声信号と比較して、長さが異なり部分的に一致している
単語や文章が辞書データに示されているときにも、当該
単語や文章を認識結果とすることが可能となる。Further, even when the dictionary data shows a word or sentence that differs in length from the input voice signal but partially matches it, the voice recognition device 30 can use that word or sentence as the recognition result.

【００７６】したがって、例えばユーザが部分的に記憶
している単語や文章を音声認識装置３０に入力したとき
にも、入力した単語や文章と部分的に一致している単語
や文章が辞書データに示されていれば、当該単語や文章
を認識結果して出力することができる。Therefore, even when the user inputs a word or sentence that he or she remembers only partially into the voice recognition device 30, if a word or sentence partially matching the input is shown in the dictionary data, that word or sentence can be output as the recognition result.

【００７７】なお、本発明を適用した音声認識装置１
は、図６に示すように、信頼性検出部３２を備えても良
い。The voice recognition device 1 to which the present invention is applied may include a reliability detecting section 32, as shown in FIG. 6.

【００７８】信頼性検出部３２は、第１のデコーダ７が
評価した認識結果の信頼性を検出し、スコアが高い単語
や文章を、スコアが低い単語や文章よりも優先的に認識
結果として出力したり、所定のスコア以下の単語や文章
よりも、かな変換部９から出力された認識結果や部分一
致検出部３１から出力された認識結果を優先的に出力し
たりする。The reliability detecting section 32 detects the reliability of the recognition result evaluated by the first decoder 7, and outputs the words or sentences with a high score as the recognition result with priority over the words or sentences with a low score. Alternatively, the recognition result output from the kana conversion unit 9 or the recognition result output from the partial match detection unit 31 is preferentially output over a word or a sentence having a predetermined score or less.

【００７９】例えば、第１のデコーダ７、かな変換部
９、及び部分一致検出部３１から出力されたデータが以
下の表３に示す通りであり、且つ、信頼性認識部３２
が、スコアが７０未満の単語や文章よりも、部分一致検
出部３１から出力された単語や文章を優先して出力する
とともに、スコアが６０未満の単語や文章よりも、かな
変換部９から出力された単語や文章を優先して出力する
設定とされているとき、音声認識装置３０の動作は、以
下に説明する通りとなる。For example, suppose that the data output from the first decoder 7, the kana conversion unit 9, and the partial match detection unit 31 are as shown in Table 3 below, and that the reliability recognizing unit 32 is set to output words and sentences from the partial match detection unit 31 in preference to words and sentences with a score of less than 70, and to output words and sentences from the kana conversion unit 9 in preference to words and sentences with a score of less than 60. The operation of the voice recognition device 30 is then as described below.

【００８０】[0080]

【表３】 [Table 3]

【００８１】図７に示すように、先ず、ステップＳ２１
において、信頼性認識部３２は「ウジ（ｕｊｉ）」を出
力する。次にステップＳ２２に進み、ユーザは、例えば
図示しない表示部上で、入力した音声信号が「ウジ（ｕ
ｊｉ）」であるか否かの確認を行う。入力した音声信号
が「ウジ（ｕｊｉ）」であるときには、音声信号の認識
が終了となる。As shown in FIG. 7, first, in step S21, the reliability recognizing unit 32 outputs "Uji (uji)". The process then proceeds to step S22, where the user confirms, for example on a display unit (not shown), whether or not the input voice signal is "Uji (uji)". When the input voice signal is "Uji (uji)", the recognition of the voice signal ends.

【００８２】入力した音声が「ウジ（ｕｊｉ）」ではな
いときにはステップＳ２３に進み、信頼性認識部３２は
「フジサン（ｆｕｊｉｓａｎ）」を出力する。次にステ
ップＳ２４に進み、ユーザは、入力した音声が「フジサ
ン（ｆｕｊｉｓａｎ）」であるか否かを確認する。入力
した音声が「フジサン（ｆｕｊｉｓａｎ）」であるとき
には、音声信号の認識が終了となる。When the input voice is not "uji", the process proceeds to step S23, and the reliability recognizing section 32 outputs "fujisan". Next, proceeding to step S24, the user confirms whether or not the inputted voice is "Fujisan". When the input voice is "Fujisan", the recognition of the voice signal ends.

【００８３】入力した音声が「フジサン（ｆｕｊｉｓａ
ｎ）」ではないときにはステップＳ２５に進み、信頼性
認識部３２は「フイ（ｆｕｉ）」を出力する。次にステ
ップＳ２６に進み、ユーザは、入力した音声が「フイ
（ｆｕｉ）」であるか否かを確認する。入力した音声が
「フイ（ｆｕｉ）」であるときには、音声信号の認識が
終了となる。When the input voice is not "Fujisan (fujisan)", the process proceeds to step S25, and the reliability recognizing unit 32 outputs "Fui (fui)". The process then proceeds to step S26, where the user confirms whether or not the input voice is "Fui (fui)". When the input voice is "Fui (fui)", the recognition of the voice signal ends.

【００８４】入力した音声が「フイ（ｆｕｉ）」ではな
いときにはステップＳ２７に進み、信頼性認識部３２は
「フジ（ｆｕｊｉ）」を出力する。次にステップＳ２８
に進み、ユーザは、入力した音声が「フジ（ｆｕｊ
ｉ）」であるか否かを確認する。入力した音声が「フジ
（ｆｕｊｉ）」であるときには、音声信号の認識が終了
となる。When the input voice is not "Fui (fui)", the process proceeds to step S27, and the reliability recognizing unit 32 outputs "Fuji (fuji)". The process then proceeds to step S28, where the user confirms whether or not the input voice is "Fuji (fuji)". When the input voice is "Fuji (fuji)", the recognition of the voice signal ends.

【００８５】入力した音声が「フジ（ｆｕｊｉ）」では
ないときにはステップＳ２９に進み、信頼性認識部３２
は「クジ（ｋｕｊｉ）」を出力する。次にステップＳ３
０に進み、ユーザは、入力した音声が「クジ（ｋｕｊ
ｉ）」であるか否かを確認する。入力した音声が「クジ
（ｋｕｊｉ）」であるときには、音声信号の認識を終了
する。また、「クジ（ｋｕｊｉ）」ではないときにはス
テップＳ３１に進み、認識結果が全て間違いであったと
判断する。そして、音声信号の認識を終了する。When the input voice is not "Fuji (fuji)", the process proceeds to step S29, and the reliability recognizing unit 32 outputs "Kuji (kuji)". The process then proceeds to step S30, where the user confirms whether or not the input voice is "Kuji (kuji)". When the input voice is "Kuji (kuji)", the recognition of the voice signal ends. If it is not "Kuji (kuji)", the process proceeds to step S31, and it is determined that all of the recognition results were incorrect. Then the recognition of the voice signal ends.
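
The order in which candidates are presented in steps S21 to S31 can be sketched as follows, assuming the two thresholds stated above (70 for partial match results, 60 for kana results). The scores used in the usage example are hypothetical placeholders chosen only to reproduce the order of the flowchart; they are not the values of Table 3, which is not reproduced here. The function name `order_candidates` and its parameters are likewise assumptions.

```python
def order_candidates(decoder_results, partial_results, kana_results,
                     partial_threshold=70, kana_threshold=60):
    """Order recognition candidates by the priority rules described above.

    decoder_results: list of (word, score) pairs from the first decoder.
    partial_results: words from the partial match detection unit.
    kana_results:    words from the kana conversion unit.
    Partial match results outrank decoder words scoring below
    `partial_threshold`; kana results outrank decoder words scoring
    below `kana_threshold`.
    """
    ranked = sorted(decoder_results, key=lambda item: item[1], reverse=True)
    high = [w for w, s in ranked if s >= partial_threshold]
    middle = [w for w, s in ranked if kana_threshold <= s < partial_threshold]
    low = [w for w, s in ranked if s < kana_threshold]
    return high + list(partial_results) + middle + list(kana_results) + low


if __name__ == "__main__":
    # Hypothetical scores, not taken from Table 3.
    decoder = [("フジ", 50), ("ウジ", 80), ("クジ", 40)]
    for word in order_candidates(decoder, ["フジサン"], ["フイ"]):
        print(word)  # the user would confirm or reject each candidate in turn
```

Presenting candidates in this order means the user is asked about the more reliable results first, which is the sense in which the confirmation loop is described as efficient.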

【００８６】以上説明したように、信頼性検出部３２を
備えることによって、音声認識装置３０は、入力された
音声信号の認識を、効率良く行うことが可能となる。As described above, by providing the reliability detecting section 32, the voice recognition device 30 can efficiently recognize the input voice signal.

【００８７】[0087]

【発明の効果】本発明に係る音声認識装置及び音声認識
方法は、入力された音声信号の特徴量を抽出し、当該特
徴量に対して言語モデル及び音響モデルを適用して音素
列を生成し、生成した音素列を辞書データに示されてい
る単語や文章の音素列と比較し、類似した音素列を有す
る単語や文章が辞書データに示されているときに、当該
単語や文章を、入力された音声信号の認識結果とする第
１の認識処理を行う。また、本発明に係る音声認識装置
及び音声認識方法は、当該特徴量に対して言語モデル、
音響モデルを適用することで入力された音声信号を音素
列に変換した後、当該音素列をかなに変換し、入力され
た音声信号の認識結果とする第２の認識処理を行う。The voice recognition device and voice recognition method according to the present invention extract a feature amount of an input voice signal, apply a language model and an acoustic model to the feature amount to generate a phoneme string, compare the generated phoneme string with the phoneme strings of the words and sentences shown in the dictionary data, and, when a word or sentence having a similar phoneme string is shown in the dictionary data, perform a first recognition process that uses that word or sentence as the recognition result of the input voice signal. The voice recognition device and voice recognition method according to the present invention also perform a second recognition process that converts the input voice signal into a phoneme string by applying a language model and an acoustic model to the feature amount, converts the phoneme string into kana, and uses the result as the recognition result of the input voice signal.

【００８８】すなわち、本発明に係る音声認識装置及び
音声認識方法は、辞書データに示されていない単語や文
章が入力されたときにも、入力された音声信号を認識結
果を出力することが可能となる。したがって、本発明に
係る音声認識装置及び音声認識方法は、入力された音声
信号を精度良く認識することが可能となる。That is, the voice recognition device and voice recognition method according to the present invention can output a recognition result for the input voice signal even when a word or sentence not shown in the dictionary data is input. Therefore, the voice recognition device and voice recognition method according to the present invention can recognize the input voice signal accurately.

【００８９】また、本発明に係る音声認識装置及び音声
認識方法は、第２の認識処理によって得られる単語や文
章と辞書データに示されている単語や文章とを比較し、
部分的に一致している単語や文章が辞書データに示され
ているときには、当該単語や文章を、入力された音声信
号の認識結果とする。Further, the voice recognition device and voice recognition method according to the present invention compare the word or sentence obtained by the second recognition process with the words and sentences shown in the dictionary data and, when a partially matching word or sentence is shown in the dictionary data, use that word or sentence as the recognition result of the input voice signal.

【００９０】したがって、本発明に係る音声認識装置及
び音声認識方法は、入力された音声信号と部分的に一致
している単語や文章が辞書データに示されているときに
も、認識結果を得ることが可能となる。また、本発明に
係る音声認識装置及び音声認識方法は、入力された音声
信号と比較して、長さが異なり部分的に一致している単
語や文章が辞書データに示されているときにも、当該単
語や文章を認識結果とすることが可能となる。Therefore, the voice recognition device and voice recognition method according to the present invention can obtain a recognition result even when the dictionary data shows only a word or sentence that partially matches the input voice signal. Further, even when the dictionary data shows a word or sentence that differs in length from the input voice signal but partially matches it, that word or sentence can be used as the recognition result.

【００９１】また、本発明に係る音声認識装置及び音声
認識方法は、第１の認識処理によって生成される単語や
文章の信頼性を評価し、信頼性に応じて入力された音声
信号の認識結果とする。Further, the voice recognition device and voice recognition method according to the present invention evaluate the reliability of the words and sentences generated by the first recognition process and determine the recognition result of the input voice signal according to that reliability.

【００９２】したがって、本発明に係る音声認識装置及
び音声認識方法によれば、入力された音声信号を効率良
く認識することが可能となる。Therefore, according to the voice recognition device and the voice recognition method of the present invention, the input voice signal can be efficiently recognized.

[Brief description of drawings]

【図１】本発明を適用した音声認識装置を示すブロック
図である。FIG. 1 is a block diagram showing a voice recognition device to which the present invention is applied.

【図２】同音声認識装置のハードウェア構成を示す図で
ある。FIG. 2 is a diagram showing a hardware configuration of the voice recognition device.

【図３】本発明を適用した他の音声認識装置を示すブロ
ック図である。FIG. 3 is a block diagram showing another voice recognition device to which the present invention is applied.

【図４】同音声認識装置における信頼性検出部の動作を
説明するためのフローチャートである。FIG. 4 is a flowchart for explaining an operation of a reliability detection unit in the voice recognition device.

【図５】本発明を適用したさらに他の音声認識装置を示
すブロック図である。FIG. 5 is a block diagram showing still another voice recognition device to which the present invention is applied.

【図６】本発明を適用したさらに他の音声認識装置を示
すブロック図である。FIG. 6 is a block diagram showing still another voice recognition device to which the present invention is applied.

【図７】同音声認識装置における信頼性検出部の動作を
説明するためのフローチャートである。FIG. 7 is a flowchart for explaining the operation of the reliability detecting unit in the voice recognition device.

【図８】従来の音声認識装置を示すブロック図である。FIG. 8 is a block diagram showing a conventional voice recognition device.

[Explanation of symbols]

１音声認識装置、２音声分析部、３第１の言語モ
デル記憶部、４音響モデル記憶部、５第２の言語モ
デル記憶部、７第１のデコーダ、８第２のデコー
ダ、９かな変換部
1 voice recognition device, 2 voice analysis unit, 3 first language model storage unit, 4 acoustic model storage unit, 5 second language model storage unit, 7 first decoder, 8 second decoder, 9 kana conversion unit

Claims

[Claims]

1. A voice recognition device comprising: voice analysis means for extracting a feature amount of an input voice signal; acoustic model storage means for storing an acoustic model that models feature amounts of voice signals; language model storage means for storing a language model consisting of information indicating the probability that a phoneme appears; dictionary data storage means for storing dictionary data indicating phoneme strings of a plurality of words and/or sentences; first decoding means for generating a phoneme string by applying the language model and the acoustic model to the feature amount supplied from the voice analysis means, comparing the generated phoneme string with the phoneme strings of the words and/or sentences shown in the dictionary data, and, when a word and/or sentence having a similar phoneme string is shown in the dictionary data, outputting that word and/or sentence as a recognition result of the input voice signal; second decoding means for generating a phoneme string by applying the language model and the acoustic model to the feature amount supplied from the voice analysis means; and kana conversion means for converting the phoneme string generated by the second decoding means into kana and outputting the result of the kana conversion as a recognition result of the input voice signal.

2. The voice recognition device according to claim 1, wherein both the word and/or sentence output from the first decoding means and the word and/or sentence output from the kana conversion means are output as recognition results of the input voice signal.

3. The voice recognition device according to claim 1, further comprising partial match detection means for detecting a partial match between the words and/or sentences shown in the dictionary data and the word and/or sentence output from the kana conversion means, and, when a partially matching word and/or sentence is shown in the dictionary data, outputting that word and/or sentence as a recognition result of the input voice signal.

4. The voice recognition device according to claim 1, further comprising reliability detection means for evaluating the reliability of the word and/or sentence recognized by the first decoding means and outputting the word and/or sentence as a recognition result of the input voice signal according to that reliability.

5. A voice recognition method using a storage unit that stores an acoustic model that models feature amounts of voice signals, a language model consisting of information indicating the probability that a phoneme appears, and dictionary data indicating phoneme strings of a plurality of words and/or sentences, the method comprising: performing a first recognition process of generating a phoneme string by applying the language model and the acoustic model to a feature amount of an input voice signal, comparing the generated phoneme string with the phoneme strings of the words and/or sentences shown in the dictionary data, and, when a word and/or sentence having a similar phoneme string is shown in the dictionary data, using that word and/or sentence as a recognition result; and performing a second recognition process of generating a phoneme string by applying the language model and the acoustic model to the feature amount, converting the phoneme string into kana, and using the word and/or sentence converted into kana as a recognition result of the input voice signal.

6. The voice recognition method according to claim 5, wherein both the word and/or sentence obtained by performing the first recognition process and the word and/or sentence obtained by performing the second recognition process are used as recognition results of the input voice signal.

7. The voice recognition method according to claim 5, wherein a partial match between the words and/or sentences shown in the dictionary data and the word and/or sentence obtained by the second recognition process is detected, and, when a partially matching word and/or sentence is shown in the dictionary data, that word and/or sentence is used as a recognition result of the input voice signal.

8. The voice recognition method according to claim 5, wherein the reliability of the word and/or sentence generated by the first recognition process is evaluated, and the word and/or sentence is used as a recognition result of the input voice signal according to that reliability.