JP2013250379A

JP2013250379A - Voice recognition device, voice recognition method and program

Info

Publication number: JP2013250379A
Application number: JP2012124247A
Authority: JP
Inventors: Shuichi Kawaguchi; 修市川口
Original assignee: Alpine Electronics Inc
Current assignee: Alpine Electronics Inc
Priority date: 2012-05-31
Filing date: 2012-05-31
Publication date: 2013-12-12

Abstract

PROBLEM TO BE SOLVED: To provide a voice recognition device, a voice recognition method and a program that improve the recognition accuracy of proper nouns contained in speech by correcting errors in the extraction of their acoustic features before comparing them with a user dictionary. SOLUTION: A voice recognition device 100 in an information terminal collects proper names on the basis of the user's usage of the terminal and creates a user dictionary. Input speech is converted into text by voice recognition based on its speech waveform, and the text portion relating to a proper noun is extracted from the converted speech. Specific characters in the extracted text portion are replaced to generate one or more words. The user dictionary is searched for these words, and when a word is found in the user dictionary, the text portion relating to the proper noun is replaced with that word.

Description

The present invention relates to a voice recognition device, a voice recognition method, and a program that improve recognition accuracy for input speech.

Voice recognition is increasingly used for information input in situations where manual input with a keyboard or the like is difficult, for example when operating a navigation device in a moving vehicle or entering text on a portable information terminal. As an established basic technique, voice recognition converts input speech into text data through two analyses: an analysis of its acoustic features by comparison with data built from a speech corpus, called an acoustic model, and an analysis of its linguistic features based on linguistic constraints on phoneme sequences, called a language model.

Speech input to an information device is affected by the speaker's voice quality, the performance of the input device, the surrounding environment and so on, so the central concern of voice recognition technology is how to improve recognition accuracy. In general, accuracy can be raised by increasing the sample data in the acoustic model and the language model, that is, the speech corpus and the registered words, but the larger data volume raises a concern about reduced processing speed.

Patent Document 1 describes a technique that can address this problem. It extracts a specific section from the speech recognized by a first speech recognition unit and analyzes that section with a more restrictive language model, thereby improving recognition accuracy. As a method of extracting the specific section, it discloses using a well-known "proper noun extraction technique" to extract a proper noun as the section and comparing it with a dedicated proper noun dictionary to improve accuracy.

Patent Document 2 describes a technique for improving recognition accuracy in a voice recognition device for in-vehicle navigation. It groups speech dictionaries by region and selects the dictionary used for voice recognition according to the current position of the vehicle, thereby improving recognition accuracy.

Patent Document 1: JP 2011-242613 A
Patent Document 2: Japanese Patent Laid-Open No. 7-64480

The technique disclosed in Patent Document 1 can be expected to improve recognition accuracy for specific expressions contained in speech to some degree, but that improvement is likely to be limited for the following reason. The technique re-recognizes the expression in the extracted section with a different language model. Consequently, if the recognition by the acoustic model already contained an error, re-recognizing the section cannot be expected to improve accuracy.

The same problem applies to the technique disclosed in Patent Document 2: if the recognition by the acoustic model contains an error, no improvement in accuracy can be expected.

The present invention has been made to solve these problems. For a proper noun contained in speech, it corrects errors made in the extraction of acoustic features and then compares the result with a user dictionary, thereby improving recognition accuracy.

The present invention is a voice recognition device in an information terminal, comprising: means for collecting proper names on the basis of the user's usage of the terminal and creating a user dictionary; means for recognizing input speech on the basis of its speech waveform and converting it into text; means for extracting a text portion relating to a proper noun from the speech converted into text; means for generating one or more words by replacing specific characters in the extracted text portion; and means for replacing the text portion relating to the proper noun with one of the generated words when that word is contained in the user dictionary.

Preferably, the means for converting the speech into text includes means for determining each phoneme contained in the speech waveform from probabilities based on an acoustic model, and the means for generating one or more words from the extracted text portion replaces specific characters in the extracted text portion by referring to the acoustic-model probability data for that speech waveform.

Preferably, the means for generating one or more words from the extracted text portion replaces a phoneme in the extracted text portion with the phoneme that was ranked as its next candidate when the speech was converted into text.

Preferably, the voice recognition device is a voice recognition device in an information terminal having a navigation function for a moving body, and the means for creating the user dictionary collects proper names relating to navigation on the basis of the user's movement history and creates the user dictionary from them.

Preferably, the means for creating the user dictionary collects proper names, including place names and facility names obtained from map data, on the basis of the current position of the moving body, the destination, or the route from the current position to the destination.

Preferably, the user dictionary holds, for each proper name, its reading, position coordinates, and registration date and time, and assigns each name a priority based on that information. When there are several candidate replacement words, the means for replacing the text portion relating to the proper noun chooses the replacement word according to the priority in the user dictionary.

The present invention is also a voice recognition method in an information terminal, comprising the steps of: collecting proper names on the basis of the user's usage history of the terminal and creating a user dictionary; recognizing input speech on the basis of its speech waveform and converting it into text; extracting a text portion relating to a proper noun from the speech converted into text; generating one or more words by replacing specific characters in the extracted text portion; and replacing the text portion relating to the proper noun with one of the generated words when that word is contained in the user dictionary.

Preferably, the step of converting the speech into text includes a step of determining each phoneme contained in the speech waveform from probabilities based on an acoustic model, and the step of generating one or more words from the extracted text portion replaces specific characters in the extracted text portion by referring to the acoustic-model probability data for that speech waveform.

The present invention is also a voice recognition program in an information terminal, comprising the steps of: collecting proper names on the basis of the user's usage history of the terminal and creating a user dictionary; recognizing input speech on the basis of its speech waveform and converting it into text; extracting a text portion relating to a proper noun from the speech converted into text; generating one or more words by replacing specific characters in the extracted text portion; and replacing the text portion relating to the proper noun with one of the generated words when that word is contained in the user dictionary.

According to the present invention, the characters of the extracted text portion relating to a proper noun are recombined before the portion is compared with the user dictionary. Even if the acoustic model made an error in recognizing that text portion, the error is therefore more likely to be corrected, and recognition accuracy can be expected to improve as a result. In particular, restricting the replacement to the phonemes that were ranked as the next candidate for each phoneme during conversion to text keeps the number of replacements small and so limits the effect on recognition speed.

FIG. 1 is a block diagram showing a configuration example of a voice recognition device in an information terminal according to an embodiment of the present invention.
FIG. 2 is a diagram showing the likelihood of each phoneme obtained by acoustic analysis.
FIG. 3 is a diagram for explaining the function of the replacement word generation unit.
FIG. 4 is an example of the data structure of a proper name registered in the user dictionary.
FIG. 5 is a flowchart of the voice recognition processing in the voice recognition device.
FIG. 6 is a diagram schematically showing how the output text changes through each analysis in the voice recognition device.

Next, embodiments of the present invention will be described in detail with reference to the drawings. The description below takes as an example a voice recognition device in an information terminal having a navigation function for a moving body. The voice recognition device according to the embodiment uses information on the current position and travel history of the moving body to improve the accuracy of voice recognition. Possible forms of such an information terminal include an in-vehicle navigation device installed in the console of a vehicle, a portable information terminal that connects to an in-vehicle navigation device and obtains information on the moving body from it, and a portable information terminal that has its own navigation function.

FIG. 1 is a block diagram showing a configuration example of a voice recognition device in an information terminal according to an embodiment of the present invention. As shown in the figure, the voice recognition device 100 has a voice input unit 102, a voice recognition unit 104, a recognized text correction unit 106, a generated text output unit 108, a storage unit 110, and a control unit 112. Each of these functions can be realized by executing the voice recognition program on a general-purpose computer equipped with a CPU, memory, communication functions and the like.

The voice input unit 102 receives the user's speech from a microphone or the like and converts it into a speech waveform. The voice recognition unit 104 analyzes the input speech waveform and outputs the corresponding text. It recognizes speech in two stages: an acoustic analysis unit 116 performs analysis based on an acoustic model 114, and a language analysis unit 120 performs analysis based on a language model 118. The acoustic analysis unit 116 splits the input speech waveform into small units such as phonemes, syllables, or triphones (phoneme triples) and compares them with the large volume of speech waveform data (speech corpus) stored in the acoustic model 114. It then outputs the closest match as the recognition result. For example, for the input "小菅 (こすげ, Kosuge)", the acoustic analysis unit 116 produces an intermediate output like the one shown schematically in FIG. 2. The vertical ordering in the figure represents how likely each candidate is for each phoneme. For the first phoneme of the input speech, for instance, comparison with the acoustic model selects "こ" (ko), "ほ" (ho), and "か" (ka) as candidates, with likelihoods of 90%, 80%, and 60% respectively. The analysis data from the acoustic analysis unit 116 is recorded in the storage unit 110 and is also used by the recognized text correction unit 106.
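The intermediate output of FIG. 2 can be pictured as a list of ranked candidates per phoneme position. The following is a minimal Python sketch under stated assumptions: the names `lattice`, `best_reading`, and `next_candidates` are hypothetical, and only the 90/80/60% values for the first position come from the description; the other probabilities are invented for illustration.

```python
# Hypothetical per-position candidate lists in the spirit of FIG. 2.
# Only the first position's values (0.90, 0.80, 0.60) are taken from the
# description; the remaining probabilities are made up for illustration.
lattice = [
    [("こ", 0.90), ("ほ", 0.80), ("か", 0.60)],
    [("す", 0.85), ("う", 0.70)],
    [("ぎ", 0.75), ("げ", 0.70)],
]


def best_reading(lattice):
    """Join the highest-probability candidate at each position."""
    return "".join(max(cands, key=lambda c: c[1])[0] for cands in lattice)


def next_candidates(lattice):
    """Return the runner-up candidate at each position (None if there is none)."""
    result = []
    for cands in lattice:
        ranked = sorted(cands, key=lambda c: c[1], reverse=True)
        result.append(ranked[1][0] if len(ranked) > 1 else None)
    return result


recognized = best_reading(lattice)
runners_up = next_candidates(lattice)
```

With these assumed numbers the best path reads "こすぎ" even though the user said "こすげ", which is the kind of acoustic error the later correction stage is meant to catch.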

The language analysis unit 120 linguistically analyzes the text sequence obtained by the acoustic analysis unit 116, using the large volume of word data stored in the language model 118 together with phrase and sentence data that define word-order constraints and parts of speech. Word-order constraints are expressed probabilistically, for example with an n-gram grammar that tabulates the frequency of occurrence of sequences of N words. In the voice recognition unit 104, the small-unit analysis results from the acoustic analysis unit 116 are thus semantically corrected by the linguistic analysis in the language analysis unit 120, and the corrected result is obtained as the voice recognition output.
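As an illustration of how an n-gram table can express the likelihood of a word sequence, here is a small bigram (N = 2) sketch in Python. The toy corpus, the function names, and the unsmoothed maximum-likelihood estimate are all assumptions made for this example; the description does not specify how the language model 118 is trained or stored.

```python
from collections import Counter


def train_bigrams(sentences):
    """Count single words and adjacent word pairs over tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def bigram_prob(unigrams, bigrams, prev, word):
    """Unsmoothed estimate of P(word | prev); 0.0 when prev was never seen."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]


def sequence_score(unigrams, bigrams, words):
    """Multiply the bigram probabilities along a word sequence."""
    score = 1.0
    for prev, word in zip(words, words[1:]):
        score *= bigram_prob(unigrams, bigrams, prev, word)
    return score


# Toy corpus, invented for illustration only.
corpus = [
    ["小杉", "ジャンクション", "到着"],
    ["小杉", "駅", "到着"],
    ["小菅", "ジャンクション", "通過"],
]
unigrams, bigrams = train_bigrams(corpus)
score = sequence_score(unigrams, bigrams, ["小杉", "ジャンクション", "到着"])
```

A production model would normally add smoothing so that unseen word pairs do not force the whole sequence score to zero.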

The recognized text correction unit 106 takes the output of the voice recognition unit 104 and corrects it further. It comprises a proper noun extraction unit 122, a replacement word generation unit 124, a replacement determination unit 126, a user dictionary generation unit 128, a user dictionary 130, and a user dictionary management unit 132. The proper noun extraction unit 122 extracts the text portion relating to a proper noun from the output of the voice recognition unit 104, using the result of the part-of-speech analysis by the language analysis unit 120. The replacement word generation unit 124 generates one or more words by replacing the characters that make up the proper noun extracted by the proper noun extraction unit 122. In doing so, it uses the probability values of the phoneme candidates obtained by the acoustic analysis unit 116 to limit the number of combinations. For example, it limits the candidates by setting a boundary such as "candidates with a probability value of 80% or more" or "only the next candidate for each phoneme". If only the next-candidate phonemes "ほ" (ho), "う" (u), and "げ" (ge) are considered for the input speech "小菅 (こすげ, Kosuge)" shown in FIG. 2, the replacement word generation unit 124 generates eight word combinations, as shown in FIG. 3.
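The eight combinations of FIG. 3 follow from taking the best and the next-best candidate at each of three positions (2 x 2 x 2). The sketch below shows one possible way to enumerate them in Python; `generate_replacements`, its `top_n` and `min_prob` parameters, and all probabilities except the first position's 90% and 80% are assumptions for illustration.

```python
from itertools import product


def generate_replacements(lattice, top_n=2, min_prob=0.0):
    """Combine the top-N candidates at each position into candidate words.

    Candidates below min_prob are dropped, but the best candidate at each
    position is always kept so that every position stays represented.
    """
    pruned = []
    for cands in lattice:
        ranked = sorted(cands, key=lambda c: c[1], reverse=True)[:top_n]
        kept = [ch for ch, prob in ranked if prob >= min_prob]
        pruned.append(kept or [ranked[0][0]])
    return ["".join(chars) for chars in product(*pruned)]


# Illustrative candidate lists; only 0.90 and 0.80 come from the description.
lattice = [
    [("こ", 0.90), ("ほ", 0.80), ("か", 0.60)],
    [("す", 0.85), ("う", 0.70)],
    [("ぎ", 0.75), ("げ", 0.70)],
]
words = generate_replacements(lattice)                    # next candidate only
strict = generate_replacements(lattice, min_prob=0.8)     # 80% threshold
```

Limiting each position to two candidates keeps the search at 2^k words for a k-character name, which matches the stated aim of bounding the effect on recognition speed.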

The replacement determination unit 126 searches the user dictionary 130 for each word generated by the replacement word generation unit 124 and, when a word is found in the dictionary, selects it as the replacement string. For example, if the word "小菅 (こすげ, Kosuge)" is registered in the user dictionary 130 while the word "小杉 (こすぎ, Kosugi)" recognized by the voice recognition unit 104 is not, "小菅 (Kosuge)" is adopted as the recognized text in place of "小杉 (Kosugi)". As described later, each word in the user dictionary 130 has a priority based on predetermined criteria, and when several of the generated words are found in the dictionary, the replacement word is chosen according to that priority.
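The decision made by the replacement determination unit can be sketched as a dictionary lookup followed by a priority comparison. In this Python sketch the dictionary layout (reading mapped to a spelling and a numeric priority, lower number ranking higher) and the name `choose_replacement` are assumptions; the description only says that priorities follow predetermined criteria.

```python
def choose_replacement(recognized, candidates, user_dictionary):
    """Return the spelling to substitute, or None to keep the recognized word.

    user_dictionary maps a reading to (spelling, priority). A lower priority
    number is treated as a higher rank, which is an assumption of this sketch.
    """
    hits = [c for c in candidates if c in user_dictionary]
    if not hits:
        return None  # nothing found: keep the recognizer's output
    best = min(hits, key=lambda reading: user_dictionary[reading][1])
    if best == recognized:
        return None  # the recognized word itself ranks highest
    return user_dictionary[best][0]


candidates = ["こすぎ", "こすげ", "こうぎ", "こうげ",
              "ほすぎ", "ほすげ", "ほうぎ", "ほうげ"]
replacement = choose_replacement("こすぎ", candidates, {"こすげ": ("小菅", 1)})
```

Here `replacement` is the spelling "小菅", mirroring the Kosugi/Kosuge example in the text.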

The user dictionary 130 is a database that accumulates proper names obtained on the basis of the user's usage of a navigation device. The user dictionary generation unit 128 connects to the navigation device 134, obtains the user's usage from it, and extracts proper names. The current and past travel information of the vehicle is used to determine the user's usage of the navigation device. Specifically, the following are obtained from map data 134d: information on areas and facilities around the current vehicle position calculated by a current position calculation unit 134a, information on areas and facilities around the guidance route and destination set up by a guidance route guide unit 134b, and information on facilities found by a facility search unit 134c. In a preferred embodiment, the information obtained includes the "reading", "spelling", "position coordinates", and "registration date and time" of the area or facility. FIG. 4 shows an example of the data structure of a proper name registered in the user dictionary 130.
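One possible in-memory shape for a user dictionary record with the four fields named above is shown below. This is only a sketch: the class and function names are hypothetical, FIG. 4 is not reproduced here so the field layout is an assumption, and the coordinates and timestamps are placeholder values.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class UserDictionaryEntry:
    reading: str             # kana reading, e.g. "こすげ"
    spelling: str            # written form, e.g. "小菅"
    latitude: float          # position coordinates (placeholder values below)
    longitude: float
    registered_at: datetime  # registration date and time


def build_user_dictionary(entries):
    """Index entries by reading, keeping the most recently registered one."""
    dictionary = {}
    for entry in entries:
        current = dictionary.get(entry.reading)
        if current is None or entry.registered_at > current.registered_at:
            dictionary[entry.reading] = entry
    return dictionary


# Placeholder data for illustration; not taken from the description.
entries = [
    UserDictionaryEntry("こすげ", "小菅", 35.74, 139.82, datetime(2012, 5, 1, 9, 0)),
    UserDictionaryEntry("こすげ", "小菅", 35.74, 139.82, datetime(2012, 5, 20, 18, 30)),
    UserDictionaryEntry("こすぎ", "小杉", 35.58, 139.66, datetime(2012, 4, 2, 8, 15)),
]
user_dictionary = build_user_dictionary(entries)
```

Keying the index on the reading rather than the spelling fits the lookup described earlier, where candidate words are generated as kana strings.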

Registered proper names are given priorities based on predetermined criteria. For example, the priority is determined by frequency of occurrence in the source data, registration date and time, distance from the current vehicle position, or a combination of these, and the data is sorted accordingly at registration or at read time. The user dictionary management unit 132 manages this sorting and also periodically deletes old information from the user dictionary 130.
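The ranking and clean-up duties of the user dictionary management unit could look roughly like the following Python sketch. The particular combination (frequency first, then recency, then distance), the one-year retention period, the planar distance in raw coordinate units, and all names and data are assumptions; the description leaves the exact criteria open.

```python
from datetime import datetime, timedelta
from math import hypot


def priority_key(entry, now, vehicle_pos):
    """Sort key: more frequent first, then newer, then nearer."""
    age_days = (now - entry["registered_at"]).days
    # Crude planar distance in coordinate units; a real system would use
    # a proper geodesic distance.
    distance = hypot(entry["pos"][0] - vehicle_pos[0],
                     entry["pos"][1] - vehicle_pos[1])
    return (-entry["frequency"], age_days, distance)


def rank_entries(entries, now, vehicle_pos):
    """Return entries ordered from highest to lowest priority."""
    return sorted(entries, key=lambda e: priority_key(e, now, vehicle_pos))


def purge_old(entries, now, max_age_days=365):
    """Drop entries registered more than max_age_days before now."""
    cutoff = now - timedelta(days=max_age_days)
    return [e for e in entries if e["registered_at"] >= cutoff]


# Placeholder records for illustration only.
now = datetime(2012, 5, 31)
entries = [
    {"reading": "A", "frequency": 3, "pos": (35.7, 139.8), "registered_at": datetime(2012, 5, 1)},
    {"reading": "B", "frequency": 5, "pos": (35.6, 139.7), "registered_at": datetime(2010, 1, 1)},
    {"reading": "C", "frequency": 3, "pos": (35.7, 139.8), "registered_at": datetime(2012, 5, 20)},
]
ranked = [e["reading"] for e in rank_entries(entries, now, (35.7, 139.8))]
kept = [e["reading"] for e in purge_old(entries, now)]
```

Sorting at read time rather than at registration lets the distance term reflect the vehicle's current position, at the cost of a sort per lookup.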

The generated text output unit 108 applies the word replacement adopted by the replacement determination unit 126 to the text recognized by the voice recognition unit 104 and passes the result to the next process as the voice recognition result. For example, when this voice recognition is used with a social text posting service, the result is passed to the application interface program of that service and can be posted through the communication function of the information terminal. When it is used for operations such as a facility search on a navigation device, the result is input to the navigation device so that its facility guidance program can run.

The storage unit 110 stores the temporary data generated at each processing stage in the voice recognition device 100. This data includes the speech waveform from the voice input unit 102, the extracted phonemes and their probability values from the acoustic analysis unit 116, the phrase and sentence data and attribute information such as parts of speech from the language analysis unit 120, the proper nouns extracted by the proper noun extraction unit 122, the word group generated by the replacement word generation unit 124, the replacement word adopted by the replacement determination unit 126, and the generated text output by the generated text output unit 108. The control unit 112 controls each function of the voice recognition device 100.

Next, the voice recognition process in the voice recognition device is described with reference to the flowchart of FIG. 5. The process starts when the user speaks into a voice input device such as a microphone (step S502). The end of voice input is detected from an explicit instruction such as a button operation by the user, or by detecting silence for a certain period (step S504), and the voice input unit 102 converts the input speech into speech waveform data (step S506). The acquired speech waveform is input to the voice recognition unit 104 and first undergoes acoustic analysis based on the acoustic model (step S508). In the acoustic analysis unit 116, the speech waveform is split into small units such as phonemes, each unit is compared with the speech corpus, and the analyzed text is determined according to the probability values. This analysis data is saved in the storage unit for later use. The analyzed text then undergoes language analysis based on the language model (step S510). The language analysis unit 120 linguistically analyzes the sequence of the analyzed text using the word data and the phrase and sentence data in the language model, expresses the likelihood of each sequence as a probability value, and selects a sequence with a high probability value as the text of the speech.

The output text from the language analysis unit 120 is then passed to the proper noun extraction unit 122, which extracts the words relating to proper nouns in the text (step S512). The extraction uses the result of the part-of-speech analysis in the language analysis unit 120. If the text contains no proper noun, the process moves from step S514 to step S524, and the output text from the language analysis unit 120 is output to the next process as the final generated text.

If it is determined in step S514 that the text contains a word relating to a proper noun, one or more replacement words are generated for that word (step S516). As described above, the criterion for replacing each character in the word is the probability value obtained for each phoneme by the acoustic analysis unit 116. The user dictionary 130 is then searched for each generated word (step S518). If a matching word exists in the dictionary, it replaces the proper noun extracted from the original text, and the completed text is output to the next process as the final recognized text (steps S520 to S524). If several of the generated words are found in the user dictionary 130, the word with the highest priority under the priority criteria described above is adopted as the replacement word. If none of the generated words exists in the user dictionary 130, or if only the proper noun before replacement exists there, the process moves from step S520 to step S524 and the output text from the language analysis unit 120 is output as the final recognized text. In this way, speech input by the user is recognized and output as text through analysis with the acoustic model, analysis with the language model, and replacement of proper nouns using the user dictionary.
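The correction stage of the flowchart (steps S512 to S524) can be summarized in a short Python sketch. Everything about the data shapes here is an assumption made for illustration: tokens are (surface, reading, part-of-speech) triples, `lattices` maps a recognized reading to its per-position phoneme candidates, and the user dictionary maps a reading to a spelling and a priority where a lower number ranks higher.

```python
from itertools import product


def correct_proper_nouns(tokens, lattices, user_dictionary):
    """Apply user-dictionary correction to proper nouns in recognized text.

    tokens: list of (surface, reading, part_of_speech) from language analysis.
    lattices: reading -> [[(char, prob), ...], ...] saved from acoustic analysis.
    user_dictionary: reading -> (spelling, priority); lower number ranks higher.
    """
    output = []
    for surface, reading, pos in tokens:
        # S512/S514: only proper nouns with stored candidates are corrected.
        if pos != "proper_noun" or reading not in lattices:
            output.append(surface)
            continue
        # S516: combine the best and next-best candidate at each position.
        options = [
            [ch for ch, _ in sorted(cands, key=lambda c: -c[1])[:2]]
            for cands in lattices[reading]
        ]
        candidates = ["".join(chars) for chars in product(*options)]
        # S518: look each generated word up in the user dictionary.
        hits = [c for c in candidates if c in user_dictionary]
        # S520 onward: substitute the highest-priority hit, if it differs.
        if hits:
            best = min(hits, key=lambda r: user_dictionary[r][1])
            if best != reading:
                surface = user_dictionary[best][0]
        output.append(surface)
    return "".join(output)


tokens = [("小杉", "こすぎ", "proper_noun"),
          ("ジャンクション", "じゃんくしょん", "noun"),
          ("到着", "とうちゃく", "noun")]
lattices = {"こすぎ": [[("こ", 0.90), ("ほ", 0.80)],
                     [("す", 0.85), ("う", 0.70)],
                     [("ぎ", 0.75), ("げ", 0.70)]]}
result = correct_proper_nouns(tokens, lattices, {"こすげ": ("小菅", 1)})
```

When the dictionary is empty, or contains only the originally recognized reading, the function returns the recognizer's text unchanged, which corresponds to the branch from step S520 straight to step S524.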

FIG. 6 schematically shows how the output text changes through each analysis in the voice recognition device. The example assumes that a user posting by voice to a social text posting service says "小菅ジャンクション到着" ("arrived at Kosuge Junction"). The speech waveform for this input is split into individual phonemes, and acoustic analysis recognizes each character by comparison with the speech corpus. In this example, the phoneme "げ" (ge) in the utterance "こすげ" (kosuge) is assumed to have been recognized as "ぎ" (gi), which had a higher probability value. The other characters are assumed to have been analyzed as spoken.

The phonemes of the speech waveform recognized by acoustic analysis, "こ" (ko), "す" (su), "ぎ" (gi), "じゃ" (ja), "ん" (n) ... "く" (ku), are parsed by language analysis as "小杉 (こすぎ, Kosugi)", "ジャンクション" (junction), and "到着 (とうちゃく, arrival)". The proper noun string "小杉 (Kosugi)" is extracted from this text, and character combinations such as "こすぎ" (kosugi), "こすげ" (kosuge), and "ほすぎ" (hosugi) are generated from it. Of these candidates, only the proper noun "小菅 (こすげ, Kosuge)" is found in the user dictionary, and it is selected as the replacement word. The recognized text "小杉ジャンクション到着" ("arrived at Kosugi Junction") is thereby converted into "小菅ジャンクション到着" ("arrived at Kosuge Junction") and output as the final voice recognition result.

The preferred embodiment of the present invention has been described in detail above. However, the present invention is not limited to a specific embodiment, and various modifications and changes are possible within the scope of the gist of the invention described in the claims. In the above embodiment, the unique names managed in the user dictionary are collected from navigation-related data, but they may also be collected from other data stored in the information terminal or the navigation device, such as address data containing personal information (for example, names and addresses stored in the terminal), information on video and music stored for viewing and listening, and other information relating to the user.

DESCRIPTION OF SYMBOLS
100: Speech recognition device
102: Speech input unit
104: Speech recognition unit
106: Recognized text correction unit
108: Generated text output unit
110: Storage unit
112: Control unit
114: Acoustic model
116: Acoustic analysis unit
118: Language model
120: Language analysis unit
122: Proper noun extraction unit
124: Replacement word generation unit
126: Replacement determination unit
128: User dictionary generation unit
130: User dictionary
132: User dictionary management unit

Claims

A speech recognition device in an information terminal, comprising:
means for collecting unique names based on a user's usage status of the terminal and creating a user dictionary;
means for recognizing input speech based on its speech waveform and converting it into text;
means for extracting a text portion related to a proper noun from the speech converted into text;
means for generating one or more words by replacing specific characters in the extracted text portion; and
means for replacing the text portion related to the proper noun with the word when the user dictionary includes the one or more words.

The speech recognition device according to claim 1, wherein
the means for converting the speech into text includes means for determining each phoneme included in the speech waveform with a probability based on an acoustic model, and
the means for generating one or more words from the extracted text portion replaces specific characters in the extracted text portion with reference to probability data based on the acoustic model for the speech waveform.

The speech recognition device according to claim 2, wherein
the means for generating one or more words from the extracted text portion substitutes, for the corresponding phoneme in the extracted text portion, the phoneme that was the next candidate for that phoneme when the speech was converted into text.

The speech recognition device according to claim 1, wherein
the speech recognition device is a speech recognition device in an information terminal having a navigation function for a moving body, and
the means for creating the user dictionary collects unique names related to the navigation based on the movement history of the user and creates the user dictionary.

The speech recognition device according to claim 4, wherein
the means for creating the user dictionary collects unique names, including place names and facility names, acquired from map data based on the current position of the moving body, the destination, or the route from the current position to the destination.

The speech recognition device according to claim 5, wherein
the user dictionary holds, for each unique name, its reading, position coordinates, and registration date and time, and a priority is assigned based on that information, and
the means for replacing the text portion related to the proper noun selects the replacement word according to the priority in the user dictionary when there are a plurality of candidate replacement words.

A speech recognition method in an information terminal, comprising:
collecting unique names based on a user's usage history of the terminal and creating a user dictionary;
recognizing input speech based on its speech waveform and converting it into text;
extracting a text portion related to a proper noun from the speech converted into text;
generating one or more words by replacing specific characters in the extracted text portion; and
when the user dictionary contains the one or more words, replacing the text portion related to the proper noun with the word.

The speech recognition method according to claim 7, wherein
converting the speech into text comprises determining each phoneme included in the speech waveform with a probability based on an acoustic model, and
generating one or more words from the extracted text portion comprises replacing specific characters in the extracted text portion with reference to probability data based on the acoustic model for the speech waveform.

A speech recognition program for an information terminal, comprising:
collecting unique names based on a user's usage history of the terminal and creating a user dictionary;
recognizing input speech based on its speech waveform and converting it into text;
extracting a text portion related to a proper noun from the speech converted into text;
generating one or more words by replacing specific characters in the extracted text portion; and
when the user dictionary contains the one or more words, replacing the text portion related to the proper noun with the word.