JP2005208483A

JP2005208483A - Device and program for speech recognition, and method and device for language model generation

Info

Publication number: JP2005208483A
Application number: JP2004016989A
Authority: JP
Inventors: Kenji Araki; 健治荒木
Original assignee: NEIKUSU KK
Current assignee: NEIKUSU KK
Priority date: 2004-01-26
Filing date: 2004-01-26
Publication date: 2005-08-04

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition device and program whose speech recognition rate is improved by optimizing the recognition candidates, and a language model generation method and device for use with the speech recognition device.

SOLUTION: The speech recognition device 1 comprises a language model storage part 21 that stores a language model generated from fixed-form utterance patterns, an acoustic model storage part 22 that stores an acoustic model containing the acoustic characteristics of speech, and a speech processing part 31 that acoustically analyzes a speech signal with reference to the language model and the acoustic model and converts it into character information. The language model is generated from fixed-form utterance patterns that include character information obtained from the site specified by the URL of the organization to which the speaker belongs. The acoustic model is a model trained on telephone speech. Because this language model and acoustic model are used, speech is converted into character information from appropriate candidates, which improves the speech recognition rate.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to a speech recognition device that recognizes speech and converts it into character information, to a speech recognition program, and to a language model generation method and a language model generation device that generate the language model used by the speech recognition device.

In recent years, speech recognition technology that uses a computer to convert speech into text has come into use in a variety of systems and settings, such as car navigation. Speech recognition devices built on this technology generally perform large-vocabulary continuous speech recognition with reference to an acoustic model and a language model. The acoustic model contains information on the basic units of sound used for recognition, such as consonants and vowels. The language model comprises a word dictionary and an n-gram language model in which the connection relations between words, that is, co-occurrence information, have been computed by morphological analysis of a corpus (see, for example, Non-Patent Documents 1, 2, and 3).
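As an illustration of what "connection relations between words" means for an n-gram model, the following is a minimal sketch of a bigram model estimated by simple counting. It is not taken from the patent: the function names, the English toy corpus, and the use of unsmoothed maximum-likelihood estimates are all assumptions made for the example, and a real system would first segment Japanese text with a morphological analyzer.

```python
from collections import Counter

def bigram_counts(sentences):
    """Count histories and word pairs over tokenized sentences (illustrative helper)."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + list(words) + ["</s>"]
        unigrams.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, prev, word):
    """Unsmoothed estimate of P(word | prev); 0.0 when prev was never seen."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

# Hypothetical, already-tokenized corpus.
corpus = [
    ["thank", "you", "for", "calling"],
    ["thank", "you", "very", "much"],
]
uni, bi = bigram_counts(corpus)
p_you = bigram_prob(uni, bi, "thank", "you")
p_for = bigram_prob(uni, bi, "you", "for")
```

The smaller and more homogeneous the corpus, the fewer distinct successors each word has, which is the sense in which a small, well-matched corpus narrows the set of recognition candidates.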

To improve the recognition rate of such technology, some studies have approached the problem by improving the language model. For example, Non-Patent Document 4 proposes domain adaptation of a language model using word clustering, and Non-Patent Document 2 proposes speech recognition that uses utterance prediction in spoken dialogue. Because language data specific to a target domain is difficult to collect, the technique of Non-Patent Document 4 treats each vocabulary item in a small corpus of the target domain as a class and assigns words that behave similarly in an easily collected large corpus to the same class. The technique of Non-Patent Document 2 predicts utterances using state transitions.

Non-Patent Document 1: Jun Ogata, Yasuo Ariki, "Syllable-Based Acoustic Modeling for Japanese Spoken-Language Speech Recognition", IEICE Transactions D-II, Vol. J86-D-II, No. 11, pp. 1523-1530, Japan, 2003.
Non-Patent Document 2: Takayuki Tamai, Yasuo Horiuchi, Akira Ichikawa, "Speech Recognition Using Utterance Prediction in a Spoken Dialogue System", Information Processing Society of Japan, Research Report "Spoken Language Information Processing", No. 43-1, Japan, 2002.
Non-Patent Document 3: Kiyohiro Shikano, Katsunobu Ito, Tatsuya Kawahara, Kazuya Takeda, Mikio Yamamoto, "IT Text: Speech Recognition System", Ohmsha, 2001.
Non-Patent Document 4: Shinsuke Mori, Nobuyasu Ito, Masafumi Nishimura, "Domain Application of a Stochastic Language Model by Word Clustering", Information Processing Society of Japan, Research Report "Spoken Language Information Processing", No. 45-15, Japan, 2003.

However, the speech recognition devices described above have the following problems. With the technique of Non-Patent Document 4, even when a large amount of domain-adapted data is available, the number of recognition candidates can become large depending on the corpus from which the language model was generated, so the probability of converting speech incorrectly is high. In addition, depending on the content of the utterances to be recognized, the state-transition-based utterance prediction of Non-Patent Document 2 may not be suitable.

An object of the present invention is therefore to provide a speech recognition device and a speech recognition program with an improved speech recognition rate, together with a language model generation method and a language model generation device for the language model used by the speech recognition device.

Before the means for solving the above problems are described, a preliminary experiment is described first. In this experiment, several language models were combined with several acoustic models, and the speech recognition rate of each combination was measured with a speech recognition device.

The following five acoustic models, all included in the fiscal 2001 edition of the Julius consortium release, were used in the preliminary experiment.
・ Standard adult model
・ High-precision adult model (64 mixtures)
・ High-precision adult model (128 mixtures)
・ High-precision adult model (256 mixtures)
・ Telephone acoustic model

The standard adult model was trained on the "newspaper article read-speech corpus". The high-precision adult models were trained on the "ATR multi-speaker speech database" and come in three mixture sizes (64, 128, and 256 mixtures). The telephone acoustic model was trained on telephone speech.

The following language models were used in the preliminary experiment.
・ Web20k
・ MNK100k
・ Self-made language model 1
・ Self-made language model 2

Web20k is a language model included in the fiscal 2001 edition of the Julius consortium release and trained on text with a 20,000-word vocabulary collected from Web pages. MNP100k is a language model included in the same release and trained on a 100,000-word vocabulary collected from newspaper article data. Self-made language model 1 was trained on transcriptions of about 200 minutes of call-center speech. Self-made language model 2 was trained on transcriptions of 154 utterances, namely the utterances in 10 minutes of call-center speech that remain after the portions with overlapping speech are removed. Note that the 154 transcribed utterances used to train self-made language model 2 are also contained in the training data of self-made language model 1.

Each of the 20 combinations of the four language models and the five acoustic models was applied to a speech recognition device, and the device was made to recognize call-center utterances. FIG. 1 shows the results: it lists the combinations of acoustic model and language model together with the correct word recognition rate (speech recognition rate) of each combination. The combination of self-made language model 2, generated from the transcription of the 154 utterances, with the telephone acoustic model gave the highest correct word recognition rate. The next best was the combination of self-made language model 1, generated from about 200 minutes of transcription that includes the 154 utterances, with the telephone acoustic model. This shows that a small, well-matched language model allows recognition at a higher correct word recognition rate than a large language model. Moreover, even if a large language model is used to address the unknown-word problem, in which speech cannot be recognized correctly when it contains words that are absent from the language model, the number of candidates to choose from during recognition increases and conversion errors become more likely. This too indicates that an appropriate language model built from a corpus with a small amount of text is effective for improving recognition accuracy.

In light of these results, the following means are effective for solving the above problems. The speech recognition device according to the present invention converts a speech signal, obtained from uttered speech, into character information, and is characterized by comprising: language model storage means for storing a language model that represents the connection relations between the words of one or more sentences, each sentence being a fixed-form utterance pattern, which has a changeable portion and a fixed portion, with one or more terms that the speaker is presumed to utter placed in the changeable portion; acoustic model storage means for storing an acoustic model containing the acoustic characteristics of speech; and speech processing means for converting the speech signal into character information with reference to the language model and the acoustic model.

In this speech recognition device, the language model is generated from fixed-form utterance patterns, so the speech processing means, referring to this language model and the acoustic model, can convert speech into character information from appropriate candidates. Furthermore, because the language model is generated from sentences produced by fitting terms that the speaker is presumed to utter into the changeable portion of the fixed-form pattern, the speech processing means can convert the speaker's speech into character information from appropriate candidates, which improves the speech recognition rate.
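The idea of a fixed-form pattern with a changeable portion can be sketched as simple template expansion. The code below is only an illustration under stated assumptions: the slot marker `<ORG>`, the English tokens, the candidate terms, and the function name `expand_pattern` are hypothetical and do not appear in the patent.

```python
from itertools import product

def expand_pattern(pattern, slot_terms):
    """Fill each variable slot of a fixed-form pattern with every candidate term.

    pattern    -- list of tokens; tokens that are keys of slot_terms are slots
    slot_terms -- dict mapping a slot marker to its list of candidate terms
    """
    slots = sorted({tok for tok in pattern if tok in slot_terms})
    sentences = []
    for combo in product(*(slot_terms[s] for s in slots)):
        fill = dict(zip(slots, combo))
        # Fixed tokens are copied unchanged; slot tokens are replaced by a term.
        sentences.append([fill.get(tok, tok) for tok in pattern])
    return sentences

# Hypothetical pattern with one changeable portion and three fixed tokens.
pattern = ["this", "is", "<ORG>", "speaking"]
slot_terms = {"<ORG>": ["Acme", "Globex"]}
sentences = expand_pattern(pattern, slot_terms)
```

The generated sentences would then serve as the training text from which word connection relations are counted, so the language model contains only word sequences that follow the observed utterance patterns.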

The terms are preferably obtained from the site specified by a URL related to the speaker. The terms a speaker is presumed to utter often include the names of companies, organizations, and corporations related to the speaker, as well as business information and technical terms, and such character information appears frequently on the site specified by a URL related to the speaker. By including terms obtained from that site in the language model, the speech processing means can convert the speaker's speech into character information from appropriate candidates, which improves the speech recognition rate.

The acoustic model is preferably a model trained on telephone speech. This allows the speech processing means to convert speech from telephone conversations, such as dialogues at a call center, into character information appropriately, which improves the speech recognition rate.

Preferably, the speech recognition device further comprises basic data storage means for storing basic data in which utterances have been transcribed, first corpus storage means for storing a first corpus containing character information related to the speaker, second corpus storage means for storing a second corpus containing a larger amount of character information than the first corpus, and language model generation means for generating the language model. The language model generation means has: first comparison means for comparing an appearance-frequency-related value of a word appearing in the basic data with the appearance-frequency-related value of that word in the second corpus; fixed-form pattern generation means for generating from the basic data, based on the comparison result of the first comparison means, a fixed-form pattern in which words that appear in the basic data at or above a predetermined appearance frequency are turned into variables; second comparison means for comparing an appearance-frequency-related value of a word appearing in the first corpus with the appearance-frequency-related value of that word in the second corpus; term extraction means for extracting, based on the comparison result of the second comparison means, terms that appear in the first corpus at or above a predetermined appearance frequency; sentence generation means for generating sentences by fitting the terms extracted by the term extraction means into the variable portions of the fixed-form pattern; and connection relation generation means for generating the connection relations between the words of those sentences.

With this configuration, the first comparison means compares the appearance-frequency-related value of a word in the basic data with the appearance-frequency-related value of that word in the second corpus, and, based on the comparison result, the fixed-form pattern generation means extracts the words that appear in the basic data at or above a predetermined appearance frequency and generates a fixed-form pattern. A generalized fixed-form pattern is thereby obtained. Likewise, the second comparison means compares the appearance-frequency-related value of a word in the first corpus with the appearance-frequency-related value of that word in the second corpus, and, based on the comparison result, the term extraction means extracts the terms that appear in the first corpus at or above a predetermined appearance frequency. The sentence generation means fits these speaker-related terms into the variable portions of the fixed-form pattern, so that sentences the speaker is expected to utter can be generated. The connection relation generation means then generates, from those sentences, the connection relations between the words that appear in them. The speech processing means can therefore convert the speaker's speech into character information from appropriate candidates, which improves the speech recognition rate.
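The two comparison steps can be pictured as contrasting a word's frequency in a small corpus with its frequency in a large reference corpus. The following is a minimal sketch, not the patent's actual computation: it assumes that the "appearance-frequency-related value" is a relative frequency and that the comparison is a ratio against a threshold. The function names, the threshold of 2.0, the floor value, and the toy data are all hypothetical.

```python
from collections import Counter

def relative_freq(tokens):
    """Relative frequency of each word in a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def characteristic_words(small_tokens, large_tokens, ratio_threshold=2.0, floor=1e-6):
    """Words markedly more frequent in the small corpus than in the large one.

    Assumption: a word qualifies when (small relative frequency) divided by
    (large relative frequency, floored to avoid division by zero) reaches the
    threshold.
    """
    small = relative_freq(small_tokens)
    large = relative_freq(large_tokens)
    return {
        w for w, f in small.items()
        if f / max(large.get(w, 0.0), floor) >= ratio_threshold
    }

def to_pattern(sentence, variable_words, slot="<X>"):
    """Turn the selected words of a transcribed utterance into variable slots."""
    return [slot if w in variable_words else w for w in sentence]

# Hypothetical data: a small domain-specific sample and a larger general sample.
small_tokens = ["acme", "printer", "is", "acme", "is", "ready"]
large_tokens = ["is"] * 5 + ["ready"] * 3 + ["printer"] + ["the"] * 11
variables = characteristic_words(small_tokens, large_tokens)
pattern = to_pattern(["acme", "printer", "is", "ready"], variables)
```

Under these assumptions, the same `characteristic_words` helper could play both roles: applied to the basic data it selects the words to turn into variables, and applied to the first corpus it selects the speaker-related terms to fit into those variables.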

The first corpus is preferably character information obtained from the site specified by a URL related to the speaker, and the second corpus is preferably character information collected from the WWW. Because terms that the speaker is presumed to utter appear frequently on the site specified by a URL related to the speaker, such terms can be extracted easily. And because WWW sites carry a large amount of character information, they are well suited as the language resource for the second corpus. The speech recognition rate of the speech processing means is therefore improved.

The language model generation means preferably fits the terms obtained from the site specified by the URL related to the speaker into the variable portions of the fixed-form pattern on the basis of the co-occurrence information of the second corpus. Because the co-occurrence information of the second corpus, collected from the language resources of the WWW, is used, the terms can be fitted into the variable portions of the fixed-form pattern simply and efficiently.
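One way to picture co-occurrence-guided fitting is to keep, for a given slot, only the terms that the large corpus has seen next to the slot's neighboring words. The sketch below is an assumption about how such a check might look, not the procedure defined in the patent: it uses adjacent-pair counts as the co-occurrence information, and the function names and toy data are hypothetical.

```python
from collections import Counter

def cooccurrence_counts(sentences):
    """Count adjacent word pairs in a tokenized reference corpus."""
    pairs = Counter()
    for words in sentences:
        pairs.update(zip(words, words[1:]))
    return pairs

def rank_terms_for_slot(pattern, slot_index, terms, pairs):
    """Order candidate terms by co-occurrence with the slot's left and right neighbors.

    Terms that never co-occur with either neighbor are dropped.
    """
    left = pattern[slot_index - 1] if slot_index > 0 else None
    right = pattern[slot_index + 1] if slot_index + 1 < len(pattern) else None

    def score(term):
        return pairs[(left, term)] + pairs[(term, right)]

    scored = [(score(t), t) for t in terms]
    ranked = sorted(scored, key=lambda st: (-st[0], st[1]))
    return [t for s, t in ranked if s > 0]

# Hypothetical reference corpus standing in for the second corpus.
reference = [
    ["order", "a", "printer", "today"],
    ["buy", "a", "printer", "now"],
    ["a", "meeting", "today"],
]
pairs = cooccurrence_counts(reference)
ranked = rank_terms_for_slot(
    ["need", "a", "<X>", "today"], 2, ["printer", "meeting", "acme"], pairs
)
```

A filter of this kind keeps the number of generated sentences small, which is in line with the document's aim of limiting the recognition candidates.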

The speech recognition device preferably further comprises telephone/URL correspondence storage means for storing speakers' telephone numbers in association with URLs related to those speakers, and WWW access means which, on receiving information indicating an incoming telephone number, refers to the telephone/URL correspondence storage means, reads out the URL stored in association with that number, accesses the site specified by the URL, obtains character information from the site, and stores it in the first corpus storage means. In this way, the URL related to the calling speaker identified by the incoming telephone number is read out from the telephone/URL correspondence storage means, the site specified by that URL is accessed by the WWW access means, and the first corpus is obtained from the site. Examples of a site specified by a URL related to a speaker include the home page of the company, organization, or corporation to which the speaker belongs. Characteristic words from such a site, such as the company name, business information, and technical terms, are included in the first corpus, so an appropriate first corpus can be generated for each incoming call from the other party's telephone, which improves the speech recognition rate.
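The lookup from an incoming telephone number to a URL, followed by text extraction from the fetched page, can be sketched as follows. This is a minimal illustration, not the patent's implementation: the table contents, the telephone number, the URL, and the function names are hypothetical, and the page download is passed in as a `fetch` callable so the example runs without network access.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping script and style elements."""

    def __init__(self):
        super().__init__()
        self._skip = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def build_first_corpus(caller_number, phone_to_url, fetch):
    """Return the text of the caller's site, or None if the number is unknown.

    fetch is any callable that maps a URL to an HTML string.
    """
    url = phone_to_url.get(caller_number)
    if url is None:
        return None
    parser = _TextExtractor()
    parser.feed(fetch(url))
    parser.close()
    return " ".join(parser.chunks)

# Hypothetical correspondence table and a stand-in for the HTTP download.
PHONE_TO_URL = {"0312345678": "https://www.example.com/"}

def fake_fetch(url):
    return ("<html><head><style>p{}</style></head>"
            "<body><h1>Example Corp</h1><p>Laser printers</p></body></html>")

corpus_text = build_first_corpus("0312345678", PHONE_TO_URL, fake_fetch)
```

In a deployment, `fetch` would be replaced by a real HTTP client, and the returned text would be segmented into words before term extraction.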

Preferably, the speech recognition device further comprises character editing means for correcting the character information converted by the speech processing means in accordance with instructions, given through input means, to enter or change character information, and the character information corrected by the character editing means is used as the basic data stored in the basic data storage means. With this configuration, after the speech processing means has converted speech into character information, that information is proofread with the character editing means using the input means, and the proofread character information serves as basic data, so the language model generation means can generate a more appropriate language model.

The speech recognition program according to the present invention is a speech processing program for converting speech uttered by a speaker into character information, and is characterized by causing a computer to function as: language model storage means for storing a language model that represents the connection relations between the words of one or more sentences, each sentence being a fixed-form utterance pattern, which has a changeable portion and a fixed portion, with one or more terms that the speaker is presumed to utter placed in the changeable portion; acoustic model storage means for storing an acoustic model containing the acoustic characteristics of speech; and speech processing means for converting a speech signal into character information with reference to the language model and the acoustic model.

With this speech recognition program, the language model is generated from fixed-form utterance patterns, so speech can be converted into character information from appropriate candidates, which improves the speech recognition rate. Furthermore, because the language model is generated from sentences produced by fitting terms that the speaker is presumed to utter into the changeable portion of the fixed-form pattern, the speech processing means can convert the speaker's speech into character information from appropriate candidates, which improves the speech recognition rate.

The language model generation method according to the present invention generates a language model that a speech recognition device refers to, together with an acoustic model containing the acoustic characteristics of speech, when it acoustically analyzes a speech signal and converts it into character information. The method is characterized by including: a fixed-form pattern generation step of generating, from basic data obtained by transcribing uttered speech data, a fixed-form utterance pattern that has a changeable portion and a fixed portion; and a language model generation step of generating one or more sentences by fitting one or more terms that the speaker is presumed to utter into the changeable portion of the fixed-form pattern, and generating a language model that represents the connection relations between the words of those sentences.

With this language model generation method, the fixed-form pattern used to generate the language model is extracted from basic data obtained by transcribing uttered speech data, so a language model suited to utterances that follow the fixed-form pattern can be generated. In addition, the fixed-form pattern has a changeable portion and a fixed portion; sentences are generated by fitting terms that the speaker is presumed to utter into the changeable portion, and the language model representing the connection relations between the words of those sentences is then generated. The resulting language model is therefore suited to converting the speaker's speech into appropriate character information.

The terms are preferably obtained from the site specified by a URL related to the speaker. Because terms that the speaker is presumed to utter appear frequently on that site, such terms can be extracted easily.

In this language model generation method, the basic data is preferably character information that was converted by the speech recognition device and then proofread by editing. Using character information that has been proofread after conversion by the speech recognition device as basic data makes it possible to generate a more appropriate language model.

The language model generation device according to the present invention generates a language model that a speech recognition device refers to, together with an acoustic model containing the acoustic characteristics of speech, when it acoustically analyzes a speech signal and converts it into character information. The device is characterized by comprising basic data storage means for storing basic data in which utterances have been transcribed, first corpus storage means for storing a first corpus containing character information related to the speaker, second corpus storage means for storing a second corpus containing a larger amount of character information than the first corpus, and language model generation means for generating the language model. The language model generation means has: first comparison means for comparing an appearance-frequency-related value of a word appearing in the basic data with the appearance-frequency-related value of that word in the second corpus; fixed-form pattern generation means for generating from the basic data, based on the comparison result of the first comparison means, a fixed-form pattern in which words that appear in the basic data at or above a predetermined appearance frequency are turned into variables; second comparison means for comparing an appearance-frequency-related value of a word appearing in the first corpus with the appearance-frequency-related value of that word in the second corpus; term extraction means for extracting, based on the comparison result of the second comparison means, terms that appear in the first corpus at or above a predetermined appearance frequency; sentence generation means for generating sentences by fitting the terms extracted by the term extraction means into the variable portions of the fixed-form pattern; and connection relation generation means for generating the connection relations between the words of those sentences.

With this language model generation device, the first comparison means compares the appearance-frequency-related value of a word in the basic data with the appearance-frequency-related value of that word in the second corpus, and, based on the comparison result, the fixed-form pattern generation means extracts the words that appear in the basic data at or above a predetermined appearance frequency and generates a fixed-form pattern. A generalized fixed-form pattern is thereby obtained. Likewise, the second comparison means compares the appearance-frequency-related value of a word in the first corpus with the appearance-frequency-related value of that word in the second corpus, and, based on the comparison result, the term extraction means extracts the terms that appear in the first corpus at or above a predetermined appearance frequency. The sentence generation means fits these speaker-related terms into the variable portions of the fixed-form pattern, so that sentences the speaker is expected to utter can be generated. The connection relation generation means then generates, from those sentences, the connection relations between the words that appear in them. The language model generation device can therefore generate a language model with which the speaker's speech is converted into character information from appropriate candidates and the speech recognition rate is improved.

According to the present invention, candidates in speech recognition can be optimized and the speech recognition rate can be improved.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. In the description of the drawings, identical or equivalent elements are denoted by the same reference numerals, and redundant description is omitted.

<Configuration of the communication line system>
FIG. 2 is a schematic diagram showing a communication line system 2 that uses the speech recognition device 1 of the present embodiment. A call center 3 is equipped with the speech recognition device 1 and a telephone 4, and the telephone 4 can communicate with a telephone 7 on the other party's side via a communication network 6. The telephone 4 and the speech recognition device 1 are connected by an adapter 9 provided on a communication line 8. The telephone 4 and the telephone 7 generate signals of the speech exchanged between the speakers and send them to the speech recognition device 1. The party calling from the telephone 7 has a server 11 installed, which is connected to the communication network 6 via an adapter 10. The speech recognition device 1 converts the speakers' conversation into character information, and can also access and refer to the site hosted by the server 11 via the communication network 6.

Although FIG. 2 shows only one telephone 4 inside the call center 3 and one external telephone 7, an actual call center 3 has a plurality of telephones 4 staffed by a plurality of operators, which handle calls arriving from a plurality of external telephones 7. In addition, the speech recognition device 1 can access, via the communication network 6, the sites hosted by a large number of servers. Each site is operated by a company, association, corporation, or similar organization, and the organization's name, business information, technical terms, and the like appear frequently on it.

<Configuration of the speech recognition device>
Physically, the speech recognition device 1 comprises hardware such as a CPU, memory such as RAM, a display device, an interface, and input means such as a keyboard and mouse. FIG. 3 is a configuration diagram showing the speech recognition device 1 by functional block. As storage units, the speech recognition device 1 includes a language model storage unit 21, an acoustic model storage unit 22, a basic data storage unit 23, a large-scale corpus storage unit 24, a small-scale corpus storage unit 25, and a telephone/URL correspondence storage unit 27. The speech recognition device 1 also includes an interface I/F 29. The interface I/F 29 receives the speech signal of the conversation held between the speakers via the telephone 4 and the telephone 7 and outputs it to the speech processing unit 31; it also serves as the interface for accessing external sites specified by URLs. As processing units, the speech recognition device 1 includes a speech processing unit 31, a character editing unit 32, a language model generation unit 33, and a WWW access unit 34, and it further includes a display unit 37.

The language model storage unit 21 stores a language model. This language model is generated from predicted utterance expressions, which are themselves generated from fixed patterns of utterances and from keywords acquired from the site specified by the URL of the organization (company, association, etc.) to which a predicted or scheduled speaker belongs, such as the organization name, products handled, business information, and business terminology. Through morphological analysis of the predicted utterance expressions, the language model represents which words often appear at the beginning of a sentence and which words follow a given word. The acoustic model storage unit 22 stores an acoustic model, that is, acoustic information for each phonetic symbol such as vowels and consonants. This acoustic model is trained on telephone speech.

The basic data storage unit 23 stores basic data, which is text data transcribed from call center speech data. The large-scale corpus storage unit 24 stores a large 12 GB corpus (the second corpus) containing character information collected at random from the language resources of the Internet WWW 40 by recursively following links on the WWW. Unnecessary symbols, garbled characters, and the like are removed from this large-scale corpus beforehand. The small-scale corpus storage unit 25 stores, as a small-scale corpus (the first corpus), character information acquired from the site specified by the URL of the organization to which the speaker belongs.

The speech data storage unit 26 stores speech data input via the interface I/F 29. The telephone/URL correspondence storage unit 27 stores each telephone number in association with the URL of the organization to which the speaker calling from that number belongs.

The speech processing unit 31 receives the speech signal output from the interface I/F 29 and performs acoustic analysis on it, such as frequency analysis. The speech processing unit 31 then refers to the language model stored in the language model storage unit 21 to list candidate words that are likely to appear at the beginning of a sentence, and matches the result of the acoustic analysis of the input speech signal against the acoustic model stored in the acoustic model storage unit 22. Next, it lists from the same language model the candidate words that can follow the sentence-initial word, and again matches the acoustic analysis result against the acoustic model. In this way the input speech signal is recognized sequentially and converted into character information, which is then output to the display unit 37.

When the character information output by the speech processing unit 31 contains incorrectly converted characters, the character editing unit 32 accepts input and change instructions from an input unit 36, which consists of a keyboard and mouse, and proofreads and edits the character information. The character editing unit 32 may be a general-purpose word processing application. The character editing unit 32 also stores the edited character information in the basic data storage unit 23.

The language model generation unit 33 generates the language model to be stored in the language model storage unit 21 from the basic data stored in the basic data storage unit 23, the large-scale corpus stored in the large-scale corpus storage unit 24, and the small-scale corpus stored in the small-scale corpus storage unit 25. Specifically, the language model generation unit 33 generates fixed patterns by converting into variables those words for which the difference between their information amount in the basic data and their information amount in the large-scale corpus is at or above a predetermined value. The language model generation unit 33 also extracts, as keywords that appear frequently in the character information acquired from the site specified by the URL related to the speaker, those words for which the difference between their information amount in the small-scale corpus and their information amount in the large-scale corpus is at or above a predetermined value. The language model generation unit 33 then generates a language model from sentences in which these keywords are applied to the variable portions of the fixed patterns, and outputs it to the language model storage unit 21. When applying the character information acquired from the site specified by the URL to the variable portion of a fixed pattern, the language model generation unit 33 does so on the basis of the co-occurrence information of the large-scale corpus.

The information amount I(x) above is a value related to the appearance frequency of a word, and is calculated statistically as follows:
I(x) = −log₂ P(x) (Formula 1)
where
P(x) = F(x) / N (Formula 2)
Here, N is the total number of words and F(x) is the number of occurrences of the word x. The smaller the number of occurrences F(x), the larger the information amount I(x). Because a logarithm is taken, the basic data should be included in the large-scale corpus so that P(x) never becomes zero.
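Formulas 1 and 2 can be illustrated with a minimal Python sketch. The function name `information_amounts` and the toy word list are assumptions made for illustration only; the patent does not prescribe any particular implementation, and the input is assumed to be already split into words.

```python
import math
from collections import Counter


def information_amounts(words):
    """Return I(x) = -log2(F(x) / N) for every word x in `words`.

    F(x) is the number of occurrences of x and N is the total word count.
    """
    counts = Counter(words)          # F(x)
    total = sum(counts.values())     # N
    return {w: -math.log2(c / total) for w, c in counts.items()}


# Toy corpus (hypothetical): "hello" is frequent, "order" is rare.
corpus = ["hello", "hello", "order", "thanks"]
info = information_amounts(corpus)
```

In this toy corpus the more frequent word receives the smaller information amount, matching the remark that I(x) grows as F(x) shrinks.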

When the WWW access unit 34 receives an incoming telephone number via the interface I/F 29, it refers to the telephone/URL correspondence storage unit 27, reads the URL stored in association with that number, and accesses the site specified by the URL. The WWW access unit 34 then acquires character information from the accessed site and stores it in the small-scale corpus storage unit 25.

The display unit 37 receives the character information output by the speech processing unit 31 and displays it on the screen. It also displays the character information being edited by the character editing unit 32 and the sites accessed by the WWW access unit 34.

<Language model generation method>
The method of generating the language model stored in the language model storage unit 21 will now be described with reference to FIGS. 4 and 5. In the speech recognition device 1, the language model is generated by the language model generation unit 33. The language model generation unit 33 has a first comparison unit, a fixed pattern generation unit, a second comparison unit, a term extraction unit, a sentence generation unit, and a connection relation generation unit, which function as described below.

Basic data, that is, text data transcribed from call center speech data, is prepared in advance and stored in the basic data storage unit 23 (S101). Fixed patterns used at the call center are then extracted from this basic data (S102). The extraction uses the large-scale corpus stored in the large-scale corpus storage unit 24, which, as described above, is a 12 GB corpus collected from the language resources of the WWW 40 on the Internet. The fixed patterns are extracted as follows. First, the first comparison unit obtains, for each word appearing in the basic data, the difference between its information amount in the basic data and its information amount in the large-scale corpus, and sorts the words in descending order of that difference. The fixed pattern generation unit then selects the words whose information amount difference is at or above a predetermined value (for example, 8.0). FIG. 6 shows examples of words whose information amount difference is 8.0 or more. The fixed pattern generation unit further converts each selected word into a variable (written "@VAR"), which yields fixed patterns such as those shown in FIG. 7. A fixed pattern thus consists of a variable portion, which can be changed, and a fixed portion.
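The extraction in S102 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names, the toy data, and the demo threshold of 4.0 are hypothetical, the input is assumed to be already tokenized, and the sign convention (information amount in the large-scale corpus minus information amount in the basic data) is an assumption, since the text only speaks of "the difference". The basic data is appended to the large-scale corpus so that P(x) is never zero, as the description recommends.

```python
import math
from collections import Counter


def info_amounts(words):
    """I(x) = -log2(F(x) / N) for each word x."""
    counts = Counter(words)
    total = len(words)
    return {w: -math.log2(c / total) for w, c in counts.items()}


def make_fixed_patterns(sentences, large_corpus, threshold=8.0, var="@VAR"):
    """Replace words that are unusually frequent in the basic data with `var`."""
    basic_words = [w for s in sentences for w in s]
    i_basic = info_amounts(basic_words)
    # Include the basic data in the large corpus so every word has P(x) > 0.
    i_large = info_amounts(large_corpus + basic_words)
    # Assumed direction: rare in the large corpus, frequent in the basic data.
    variable = {w for w in i_basic if i_large[w] - i_basic[w] >= threshold}
    patterns = [[var if w in variable else w for w in s] for s in sentences]
    return patterns, variable


# Toy data (hypothetical). A real run would use the 8.0 threshold from the text;
# the corpus here is far too small for that, so the demo uses 4.0.
basic = [["this", "is", "Acme", "speaking"],
         ["thank", "you", "for", "calling", "Acme"]]
large = ["this", "is", "speaking", "thank", "you", "for", "calling"] * 100
patterns, variable = make_fixed_patterns(basic, large, threshold=4.0)
```

Only "Acme" exceeds the demo threshold here, so both sentences keep their fixed wording and gain a single `@VAR` slot.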

Next, for the comparison, the second comparison unit obtains, for each word appearing in the small-scale corpus, the difference between its information amount in the small-scale corpus and its information amount in the large-scale corpus. The term extraction unit then extracts the words whose information amount difference is at or above a predetermined value as keywords (S103). More appropriate keywords can be extracted by restricting the parts of speech that may fill the variable portion of a fixed pattern. Appropriate keywords can also be obtained by using a thesaurus or the like to select words that share the characteristics of the words that were converted into variables. These keywords are extracted from the character information acquired from the site specified by the URL of the organization to which the predicted or scheduled speaker belongs. Examples include speaker attribute data such as the name of the organization (company, association, etc.) to which the speaker belongs and the speaker's own name, as well as words likely to be uttered, such as the products the organization handles, business-related terms, business information, and technical terms. These keywords are words that are expected to vary from speaker to speaker. The small-scale corpus stored in the small-scale corpus storage unit 25 is character information that the WWW access unit 34 acquires by accessing the server 11 (see FIG. 2) hosting the site at the URL related to the speaker.
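The keyword extraction in S103, including the optional part-of-speech restriction, might be sketched as below. Everything named here is hypothetical: `extract_keywords`, the toy data, and the demo threshold are illustrative, the part-of-speech tags are assumed to come from an external morphological analyzer that is not shown, and appending the small-scale corpus to the large-scale corpus (to keep P(x) non-zero) is an assumption modeled on what the description recommends for the basic data.

```python
import math
from collections import Counter


def info_amounts(words):
    """I(x) = -log2(F(x) / N) for each word x."""
    counts = Counter(words)
    total = len(words)
    return {w: -math.log2(c / total) for w, c in counts.items()}


def extract_keywords(small_tagged, large_words, threshold, allowed_pos):
    """Return words that are frequent in the small corpus relative to the
    large corpus and whose part of speech is in `allowed_pos`."""
    small_words = [w for w, _ in small_tagged]
    pos_of = dict(small_tagged)  # word -> part of speech (one tag per word assumed)
    i_small = info_amounts(small_words)
    i_large = info_amounts(large_words + small_words)
    return sorted(
        w for w in i_small
        if pos_of[w] in allowed_pos and i_large[w] - i_small[w] >= threshold
    )


# Toy small corpus: (word, part of speech) pairs, pre-tagged by assumption.
small_tagged = (
    [("Acme", "noun")] * 2
    + [("widgets", "noun")] * 2
    + [("rapidly", "adverb")] * 2
    + [("sells", "verb"), ("quickly", "adverb")]
)
large_words = ["sells", "quickly", "the", "a"] * 50

keywords = extract_keywords(small_tagged, large_words, 4.0, {"noun"})
unfiltered = extract_keywords(small_tagged, large_words, 4.0,
                              {"noun", "verb", "adverb"})
```

Comparing `keywords` with `unfiltered` shows the effect of the restriction: "rapidly" passes the frequency test but is dropped once only nouns are allowed into the variable portion.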

The sentence generation unit then uses the co-occurrence information of the large-scale corpus to determine whether each keyword selected in S103 can be substituted into the fixed patterns extracted in S102, substitutes the appropriate keywords into the fixed patterns, and generates predicted sentences (S104). The connection relation generation unit then performs morphological (word and part-of-speech) analysis on the generated predicted sentences, obtains the probability of each word appearing at the beginning of an utterance, the connection probability of the word that follows each word, and so on, and from these connection relations generates the language model. The speech recognition device 1 may be configured by storing the language model generated in this way in the language model storage unit 21.
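As a rough illustration of S104 and the subsequent connection-relation step, the Python sketch below fills every `@VAR` slot with each keyword and then estimates sentence-initial and next-word probabilities from the resulting sentences. It is a simplification under stated assumptions: the co-occurrence check against the large-scale corpus and the morphological analysis are omitted (sentences are assumed to be already tokenized), and the function names and toy data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product


def fill_pattern(pattern, keywords, var="@VAR"):
    """Yield every sentence obtained by putting a keyword in each slot."""
    slots = [i for i, w in enumerate(pattern) if w == var]
    for combo in product(keywords, repeat=len(slots)):
        sentence = list(pattern)
        for i, keyword in zip(slots, combo):
            sentence[i] = keyword
        yield sentence


def connection_relations(sentences):
    """Return (start probabilities, next-word probabilities) as plain dicts.

    Assumes every sentence is a non-empty list of words.
    """
    starts = Counter(s[0] for s in sentences)
    follow = defaultdict(Counter)
    for s in sentences:
        for current, nxt in zip(s, s[1:]):
            follow[current][nxt] += 1
    start_p = {w: c / len(sentences) for w, c in starts.items()}
    next_p = {
        current: {w: c / sum(cs.values()) for w, c in cs.items()}
        for current, cs in follow.items()
    }
    return start_p, next_p


patterns = [["this", "is", "@VAR"]]
keywords = ["Acme", "Globex"]  # hypothetical organization names
sentences = [s for p in patterns for s in fill_pattern(p, keywords)]
start_p, next_p = connection_relations(sentences)
```

The resulting dictionaries correspond to a simple bigram model: after "is", each of the two keywords is equally likely, which is the kind of narrowed candidate set the description relies on to improve recognition.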

Furthermore, speech may be recognized by the speech recognition device 1 (S106), the converted character information proofread (S107), and the proofread character information used as the basic data of S101, so that a language model is generated anew through steps S102 to S105 and stored in the language model storage unit 21. In this proofreading, an operator listens again to the speech data stored in the speech data storage unit 26 and uses the character editing unit 32 and the input unit 36 to correct the text into which the speech was converted. Because proofread character information of actual conversations serves as the basic data, the language model generation unit 33 can generate a more appropriate language model.

The small-scale corpus used for keyword extraction in S103 can also be acquired as follows. When a call arrives from the other party's telephone 7, the WWW access unit 34 receives the incoming telephone number of the telephone 7 via the interface I/F 29. The WWW access unit 34 then refers to the telephone/URL correspondence storage unit 27 and reads the URL information stored in association with the incoming telephone number. The WWW access unit 34 further accesses the site specified by that URL, acquires character information from the accessed site, and stores it in the small-scale corpus storage unit 25. The small-scale corpus may be acquired in this way.
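The lookup flow described above can be sketched in a few lines of Python. The names `update_small_corpus`, `phone_to_url`, and `fetch_text`, as well as the telephone number and URL, are hypothetical; the page fetcher is passed in as a parameter so that the sketch stays independent of any particular HTTP library, and a stand-in fetcher is used in the example.

```python
def update_small_corpus(caller_number, phone_to_url, fetch_text, small_corpus):
    """Look up the caller's URL and store that site's text in the small corpus.

    Returns True if a URL was registered for the number, False otherwise.
    """
    url = phone_to_url.get(caller_number)
    if url is None:
        return False  # unknown caller: leave the small corpus unchanged
    small_corpus[url] = fetch_text(url)
    return True


# Hypothetical telephone/URL correspondence table and a stand-in fetcher.
phone_to_url = {"0312345678": "https://example.com/"}


def fake_fetch(url):
    """Stand-in for real site access; returns fixed text for illustration."""
    return "Acme sells widgets"


small_corpus = {}
known = update_small_corpus("0312345678", phone_to_url, fake_fetch, small_corpus)
unknown = update_small_corpus("0000000000", phone_to_url, fake_fetch, small_corpus)
```

Keying the stored text by URL is one possible design; it lets text from several callers' organizations coexist, although the description itself does not say how the small-scale corpus is organized.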

The language model is preferably generated from the most recent predetermined amount of transcription data, without using older data beyond that amount. Alternatively, the language model may be generated from the content in the transcription data that is uttered most frequently. Keeping the basic data used for language model generation limited to recent data or frequently used data in this way also helps optimize the candidates in speech recognition and improves the speech recognition rate.

A speech recognition program according to an embodiment of the present invention will now be described. FIG. 8 is a configuration diagram of a recording medium on which the speech recognition program is recorded. The storage area 210 of the recording medium 200 holds the speech recognition program according to the embodiment (hereinafter "the program"), which comprises a speech processing module 211, a character editing module 212, a language model generation module 213, and a WWW access module 214. The functions realized by running these modules are the same as those of the speech processing unit 31, the character editing unit 32, the language model generation unit 33, and the WWW access unit 34 of the speech recognition device 1 described above. The language model data 215, acoustic model data 216, basic data 217, large-scale corpus data 218, small-scale corpus data 219, and telephone/URL correspondence data 220 stored in the storage area 210 are respectively the same as the language model stored in the language model storage unit 21, the acoustic model stored in the acoustic model storage unit 22, the basic data stored in the basic data storage unit 23, the large-scale corpus stored in the large-scale corpus storage unit 24, the small-scale corpus stored in the small-scale corpus storage unit 25, and the data stored in the telephone/URL correspondence storage unit 27 of the speech recognition device 1.

Preferred embodiments of the present invention have been described in detail above, but the present invention is of course not limited to them. For example, in the above embodiment the fixed patterns are extracted from the basic data transcribed from speech data and from the large-scale corpus; however, when the fixed patterns uttered by operators at the call center are determined in advance, those fixed patterns may instead be written out directly as text data. Likewise, in the above embodiment the speaker-related keywords are extracted from the small-scale corpus and the large-scale corpus; however, the necessary keywords may instead be selected directly by referring to the site specified by the URL of the organization to which the speaker belongs. The speech recognition device 1 of the above embodiment is not limited to Japanese and can also be applied to other languages such as English. The method of calculating the appearance-frequency-related value is not limited to Formulas 1 and 2, and other methods may be used.

Such a speech recognition device 1 can be applied to call centers that take product orders for a company, call centers at securities firms, consumer consultation centers, and the like. It can also be applied to emergency call centers of police and fire departments, voice communication between defense-related facilities, voice communication between air traffic controllers and pilots in the aviation industry, analysis of voice recorder audio, and so on.

FIG. 1 is a diagram showing combinations of acoustic models and language models in a preliminary experiment, together with their correct word recognition rates.
FIG. 2 is a schematic diagram of the communication line system according to the embodiment.
FIG. 3 is a functional block diagram of the speech recognition device shown in FIG. 2.
FIG. 4 is a flowchart of language model generation.
FIG. 5 is a conceptual diagram for explaining the language model generation method.
FIG. 6 is a diagram for explaining differences in information amount between corpora.
FIG. 7 is a diagram showing an example of basic data converted into variables.
FIG. 8 is a configuration diagram of a recording medium on which the speech recognition program is recorded.

Explanation of symbols

1 ... speech recognition device, 21 ... language model storage unit, 22 ... acoustic model storage unit, 23 ... basic data storage unit, 24 ... large-scale corpus storage unit, 25 ... small-scale corpus storage unit, 27 ... telephone/URL correspondence storage unit, 31 ... speech processing unit, 32 ... character editing unit, 33 ... language model generation unit, 34 ... WWW access unit.

Claims

A speech recognition device that converts a speech signal, obtained by converting uttered speech, into character information, comprising:
Language model storage means for storing a language model that represents the connection relations between the words of one or more sentences, each sentence having one or more terms estimated to be uttered by a speaker placed in the changeable part of a fixed pattern of utterances that has a changeable part and a fixed part;
Acoustic model storage means for storing an acoustic model including acoustic characteristics of speech; and
Speech processing means for converting the speech signal into character information with reference to the language model and the acoustic model.

The speech recognition apparatus according to claim 1, wherein the term is acquired from a site specified by a URL related to a speaker.

The speech recognition apparatus according to claim 1, wherein the acoustic model is a model learned by telephone speech.

Basic data storage means for storing basic data in which utterances are transcribed;
First corpus storage means for storing a first corpus including character information relating to a speaker;
Second corpus storage means for storing a second corpus having a character information amount larger than that of the first corpus;
Language model generation means for generating the language model,
The language model generation means includes
First comparison means for comparing an appearance frequency related value related to an appearance frequency of a word appearing in the basic data and an appearance frequency related value related to the appearance frequency of the word appearing in the second corpus;
Fixed pattern generation means for generating, based on the comparison result of the first comparison means, the fixed pattern from the basic data by converting into variables the words that appear in the basic data at a predetermined appearance frequency or higher;
Second comparing means for comparing an appearance frequency related value related to an appearance frequency of a word appearing in the first corpus and an appearance frequency related value related to the appearance frequency of the word appearing in the second corpus;
Term extracting means for extracting the term that appears in the first corpus at a predetermined frequency of appearance based on the comparison result of the second comparing means;
Sentence generating means for generating a sentence in which the term extracted by the term extracting means is applied to the variable part of the fixed pattern;
The speech recognition apparatus according to claim 1, further comprising a connection relationship generation unit configured to generate a connection relationship between words of the sentence.

The first corpus is character information acquired from a site specified by the URL related to a speaker;
The speech recognition apparatus according to claim 4, wherein the second corpus is character information collected from the WWW.

The speech recognition apparatus according to claim 4, wherein the language model generation unit applies the term to a variable portion of the fixed pattern based on the co-occurrence information of the second corpus.

A telephone / URL correspondence storing means for storing the telephone number of the speaker in association with the URL related to the speaker;
The speech recognition apparatus according to claim 5, further comprising WWW access means that receives information indicating an incoming telephone number, refers to the telephone / URL correspondence storing means, reads the URL stored in association with the incoming telephone number, accesses the site specified by the read URL, acquires character information from the accessed site, and stores the character information in the first corpus storage means.

Further comprising character editing means for correcting the character information converted by the voice processing means based on an input instruction or change instruction of the character information from the input means,
8. The speech recognition apparatus according to claim 5, wherein the character data modified by the character editing unit is used as the basic data stored in the basic data storage unit.

A speech recognition program for converting a speech signal, obtained by converting uttered speech, into character information, the program causing a computer to function as:
Language model storage means for storing a language model that represents the connection relations between the words of one or more sentences, each sentence having one or more terms estimated to be uttered by a speaker placed in the changeable part of a fixed pattern of utterances that has a changeable part and a fixed part;
Acoustic model storage means for storing an acoustic model including acoustic characteristics of speech; and
Speech processing means for converting the speech signal into character information with reference to the language model and the acoustic model.

A language model generation method for generating a language model to be referred to together with an acoustic model including an acoustic characteristic of speech when an acoustic signal is acoustically analyzed and converted into character information in a speech recognition device,
A fixed pattern generation step for generating a fixed pattern of an utterance having a changeable part and a fixed part from basic data obtained by transcribing spoken voice data;
A language model generation step for generating one or a plurality of sentences in which one or a plurality of terms estimated to be spoken by a speaker are placed in the changeable part of the fixed pattern, and generating a language model that represents the connection relationship of words in the sentences.
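
A minimal sketch of the language model generation step, under stated assumptions: the fixed pattern is represented as a list of tokens, the changeable part is marked with a hypothetical `<TERM>` token, and the "connection relationship of words" is approximated by bigram counts. The claim itself does not prescribe this representation or the n-gram order, and the names `fill_pattern` and `bigram_counts` are illustrative only.

```python
from collections import Counter
from itertools import product

SLOT = "<TERM>"  # hypothetical marker for the changeable part of a fixed pattern


def fill_pattern(pattern, terms):
    """Return one sentence (list of words) per assignment of terms to the slots."""
    n_slots = pattern.count(SLOT)
    sentences = []
    for combo in product(terms, repeat=n_slots):
        remaining = iter(combo)
        sentences.append(
            [next(remaining) if token == SLOT else token for token in pattern]
        )
    return sentences


def bigram_counts(sentences):
    """Count adjacent word pairs, with sentence boundary markers added."""
    counts = Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        counts.update(zip(padded, padded[1:]))
    return counts


# Usage with a made-up pattern and made-up terms.
pattern = ["this", "is", SLOT, "speaking"]
terms = ["Tanaka", "Suzuki"]
sentences = fill_pattern(pattern, terms)
model = bigram_counts(sentences)
```

Because every generated sentence shares the fixed part, pairs inside the fixed part accumulate counts across sentences, while pairs touching the changeable part are counted once per term.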

The language model generation method according to claim 10, wherein the term is acquired from a site specified by a URL related to a speaker.

The language model generation method according to claim 10 or 11, wherein character information obtained by proofreading the character information converted by the speech recognition device is used as the basic data.

A language model generation device that generates a language model that is referred to, together with an acoustic model including acoustic characteristics of speech, when a speech signal is acoustically analyzed and converted into character information in a speech recognition device, the device comprising:
Basic data storage means for storing basic data in which utterances are transcribed;
First corpus storage means for storing a first corpus including character information related to a speaker;
Second corpus storage means for storing a second corpus having a character information amount larger than the character information amount of the first corpus;
Language model generation means for generating the language model,
Wherein the language model generation means includes:
First comparison means for comparing an appearance frequency related value related to an appearance frequency of a word appearing in the basic data and an appearance frequency related value related to the appearance frequency of the word appearing in the second corpus;
Fixed pattern generation means for generating the fixed pattern from the basic data, based on the comparison result of the first comparison means, by converting words that appear in the basic data at or above a predetermined appearance frequency;
Second comparison means for comparing an appearance frequency related value related to an appearance frequency of a word appearing in the first corpus and an appearance frequency related value related to the appearance frequency of the word appearing in the second corpus;
Term extraction means for extracting, based on the comparison result of the second comparison means, the term that appears in the first corpus at or above a predetermined appearance frequency;
Sentence generation means for generating a sentence in which the term extracted by the term extraction means is applied to the changeable part of the fixed pattern; and
Connection relationship generation means for generating a connection relationship between the words of the sentence.
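
The comparison and extraction means above can be illustrated with a small sketch. It assumes that the "appearance frequency related value" is a relative frequency and that a word is extracted when its relative frequency in the smaller corpus is at least `min_ratio` times its relative frequency in the larger corpus. Both choices, as well as the names `relative_freq`, `characteristic_words`, `min_ratio` and `floor`, are assumptions made for illustration; the claim only requires that some frequency related values be compared against a predetermined threshold.

```python
from collections import Counter


def relative_freq(words):
    """Map each word to its share of all word occurrences in the corpus."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def characteristic_words(small_corpus, large_corpus, min_ratio=2.0, floor=1e-6):
    """Return words over-represented in small_corpus relative to large_corpus.

    floor is an assumed stand-in frequency for words absent from large_corpus,
    so that the ratio stays defined.
    """
    f_small = relative_freq(small_corpus)
    f_large = relative_freq(large_corpus)
    return sorted(
        word
        for word, freq in f_small.items()
        if freq / f_large.get(word, floor) >= min_ratio
    )


# Usage with made-up corpora: the first stands in for the first corpus,
# the second for the larger second corpus.
first = ["acme", "widget", "the", "the"]
second = ["the"] * 50 + ["a"] * 49 + ["widget"]
extracted = characteristic_words(first, second)
```

The same comparison could in principle serve both the first comparison means (basic data against the second corpus) and the second comparison means (first corpus against the second corpus), with only the smaller corpus swapped.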