JP4177195B2

JP4177195B2 - Recognition grammar creation system

Info

Publication number: JP4177195B2
Application number: JP2003277050A
Authority: JP
Inventors: 鏡子奥山
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2003-07-18
Filing date: 2003-07-18
Publication date: 2008-11-05
Anticipated expiration: 2023-07-18
Also published as: JP2005037838A

Description

The present invention relates to automatic response services that use speech recognition, such as telephone response services (for example, voice portals) and car navigation systems with speech recognition. In particular, it relates to a recognition grammar creation system for creating the recognition grammars used by such services.

Typical current speech recognition methods include continuous word recognition and word spotting. Every such method requires the recognition vocabulary to be described separately, and the description of that vocabulary is called a "recognition grammar". A recognition grammar must cover the vocabulary that users actually utter. Because only the vocabulary described in the grammar can be recognized, the accuracy of the grammar (that is, whether it covers user utterances) directly determines the recognition rate. An accurate recognition grammar is therefore required. The difficulty in creating one lies in the diversity of human utterances. Language changes over time and is subject to fashion; abbreviated expressions, for example, have recently become popular. A single concept can be expressed by many wordings, abbreviations included, and the recognition grammar must cover all of them.

For example, suppose a recognition grammar for the names of companies throughout Japan is to be created. For the formal vocabulary (official names), databases of official company names exist, so a recognition grammar can be created from them. However, no database stores the wordings commonly used in conversation, abbreviations included, so such wordings must be collected. Examples of a formal vocabulary item paired with a commonly used conversational wording include "ＪＲ西日本" (JR West) for the formal "株式会社西日本旅客鉄道" (West Japan Railway Company), and "ＡＮＡ" (pronounced "ana") for the formal "全日空" (All Nippon Airways).

For well-known companies such as these, wordings can simply be thought up. Otherwise, a readily available means such as a WWW search engine has been used to collect wordings. Wordings also tend to follow certain rules of thumb; for example, they are often formed by dropping part of a name or by taking initial letters.

FIG. 14 shows the conventional procedure for creating a recognition grammar. Conventionally, wording candidates such as abbreviated names were first created from the source vocabulary (step S91), and a large-scale text corpus was searched to confirm whether each created candidate actually exists (step S92). The WWW (World Wide Web), for example, has been used as such a large-scale text corpus. The search results were then reflected in the vocabulary database, for example by registering only those wording candidates whose existence was confirmed by the search (step S93), after which a recognition grammar was created from the vocabulary database (step S94).

A speech recognition text input device that attempts to improve recognition performance for user-specific utterance variations by adding user-specific vocabulary and expressions to a user language model has also been disclosed in Patent Document 1.
Patent Document 1: JP 2002-229585 A

However, in the conventional procedure described above, a large-scale text corpus such as the WWW was searched indiscriminately to confirm whether a wording candidate exists. As a result, even when the search hit a word that is textually identical to the wording candidate but has a different meaning, the candidate was often wrongly judged to exist. If a recognition grammar is created directly from vocabulary registered in the vocabulary database on the basis of such misjudgments, only a low-accuracy grammar is obtained.

To solve the above problem, an object of the present invention is to provide a recognition grammar creation system that, by selecting the text corpora to be searched, improves the accuracy with which the suitability of wording candidates is judged and can thereby create a highly accurate recognition grammar.

To achieve the above object, a recognition grammar creation system according to the present invention comprises: a vocabulary database that stores formal vocabulary items and their wordings; a search corpus selection unit that selects a corpus to be searched in order to check the suitability of a wording candidate and requests the selected corpus to search for that candidate; a search result analysis unit that analyzes the search results from the corpus selected by the search corpus selection unit; a database update unit that judges the suitability of wording candidates based on the analysis result from the search result analysis unit and a predetermined criterion, and stores the suitable wordings among the candidates in the vocabulary database; and a grammar creation unit that reads vocabulary from the vocabulary database and creates a recognition grammar for that vocabulary in accordance with a grammar specification.

With this configuration, the search corpus selection unit appropriately selects the corpus to be searched when checking the suitability of a wording candidate, which prevents wordings that are not actually used from being registered in the vocabulary database. Using a vocabulary database that contains wordings in actual use makes it possible to create a highly accurate recognition grammar.

In the recognition grammar creation system of the above configuration, the search corpus selection unit preferably comprises: a candidate corpus database that stores information on candidate corpora, that is, corpora that are candidates for being searched; a candidate corpus search unit that receives, as search keywords for candidate corpora, at least one keyword related to the formal vocabulary and searches the Internet for information on candidate corpora according to those keywords; and a candidate corpus registration unit that registers the information obtained by the candidate corpus search unit in the candidate corpus database. The search corpus selection unit then refers to the information registered in the candidate corpus database to select the corpus to be searched.

In the above aspect, "information on candidate corpora" includes, for example, information for accessing a candidate corpus. According to this aspect, information on the candidate corpora from which the search corpus selection unit chooses is registered in the candidate corpus database in advance. By referring to it, the search corpus selection unit can appropriately select the corpus to be searched when checking the suitability of a wording candidate.

In the recognition grammar creation system of the above configuration, it is also preferable that the search result analysis unit comprises an appearance information generation unit that generates appearance information for the wording candidate in the corpus, and that the appearance information generated by that unit is passed to the database update unit as the analysis result. The appearance information of a vocabulary item in a corpus can be regarded as the degree to which that item is used. Deciding whether to store each wording candidate in the vocabulary database according to its degree of use therefore yields a highly accurate vocabulary database and, in turn, a highly accurate recognition grammar.
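As a rough illustration of how appearance information could drive the storage decision, the sketch below counts how many documents of a selected corpus contain a wording candidate and how often the candidate occurs. The source does not define what the appearance information contains or what the criterion is, so the function names, the two counts, the `min_documents` threshold, and the sample texts are all assumptions made for this example.

```python
from typing import Dict, List


def appearance_info(candidate: str, documents: List[str]) -> Dict[str, int]:
    """Count documents containing the candidate and its total occurrences."""
    return {
        "documents": sum(1 for text in documents if candidate in text),
        "occurrences": sum(text.count(candidate) for text in documents),
    }


def is_suitable(info: Dict[str, int], min_documents: int = 2) -> bool:
    """Hypothetical criterion: the candidate must appear in enough documents."""
    return info["documents"] >= min_documents


# Invented sample texts standing in for a selected corpus.
corpus = [
    "メソ研の定例会は火曜日です。",
    "来週、メソ研とメソ研OBの懇親会があります。",
    "総務部からのお知らせ。",
]

for candidate in ["メソ研", "ソ事本"]:
    info = appearance_info(candidate, corpus)
    print(candidate, info, is_suitable(info))
```

Only candidates for which `is_suitable` returns `True` would be handed on for storage; a real criterion could equally use the occurrence count or a ratio.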

In the recognition grammar creation system of the above configuration, it is also preferable that, when storing a wording in the vocabulary database, the database update unit stores the appearance information generated for that wording by the appearance information generation unit in association with the wording, and that the system further comprises a grammar vocabulary determination unit that selects from the vocabulary database, based on the appearance information, the vocabulary for which a recognition grammar should be created and passes it to the grammar creation unit. Because the appearance information of a vocabulary item in a corpus can be regarded as the degree to which the item is used, it also serves as a criterion for deciding whether to add that item to the grammar when the grammar is created.

In the recognition grammar creation system of the above configuration, the grammar creation unit may create, based on the appearance information, a recognition grammar that carries information indicating the priority of each word. Giving the recognition grammar information on the degree to which each vocabulary item is used improves the accuracy of speech recognition that uses the grammar.
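One simple way to turn appearance information into word priorities is to normalize occurrence counts within an entry. The sketch below does only that; the source specifies neither the priority scale nor the grammar format, so the `priorities` function, the relative-frequency weighting, and the counts shown are illustrative assumptions.

```python
from typing import Dict


def priorities(occurrences: Dict[str, int]) -> Dict[str, float]:
    """Map each wording to its share of the total occurrence count."""
    total = sum(occurrences.values())
    if total == 0:
        # No evidence at all: fall back to equal priority.
        return {word: 1.0 / len(occurrences) for word in occurrences}
    return {word: count / total for word, count in occurrences.items()}


# Invented occurrence counts for one vocabulary entry.
counts = {"メディアソリューション研究部": 6, "メソ研": 3, "ＭＳ研": 1}

weights = priorities(counts)
for word, weight in weights.items():
    print(f"{word}\t{weight:.2f}")
```

A grammar creation step could then attach each weight to the corresponding word in whatever notation the target grammar specification provides for priorities.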

According to the present invention, selecting the text corpora to be searched improves the accuracy with which the suitability of wording candidates is judged, so that a recognition grammar creation system capable of creating a highly accurate recognition grammar can be provided.

FIG. 1 is a block diagram showing the schematic configuration of a recognition grammar creation system according to one embodiment of the present invention. As shown in FIG. 1, the system of this embodiment comprises a vocabulary database 1, a vocabulary candidate creation unit 2, a grammar creation unit 3, a database editing unit 4, a synonym database 5, a database update unit 6, a search corpus selection unit 7, a search result analysis unit 9, and a grammar storage unit 10.

In this recognition grammar creation system, the vocabulary candidate creation unit 2 creates wording candidates from a formal vocabulary item, and the system requests the large-scale corpus 8, which is outside the system, to search for those candidates. The large-scale corpus 8 may be, for example, a WWW connected locally to the system or a WWW connected through an external network such as the Internet. A corpus other than the WWW can of course also be used.

The large-scale corpus 8 executes the requested search and sends the search results to the search result analysis unit 9. Based on those results, the search result analysis unit 9 judges the suitability of the wording candidate and passes the judgment to the database update unit 6. The database update unit 6 then registers in the vocabulary database 1 only those wording candidates that the search result analysis unit 9 judged suitable.

The configuration and operation of each unit are described in more detail below.

The vocabulary database 1 stores formal vocabulary items in association with their "wordings", the variants of a formal vocabulary item that are used in conversation. FIG. 2 shows an example of the data registered in the vocabulary database 1. In this example, each entry consists of an ID (identification number), a vocabulary count giving the number of vocabulary items in the entry, and that many vocabulary items (first vocabulary item, second vocabulary item, …). The first vocabulary item holds the source (formal) vocabulary item, and the second and subsequent items hold its wordings. In the example of FIG. 2, the abbreviated form "総務" is registered as a wording of the first (formal) vocabulary item "総務部" (General Affairs Department). Although each entry has an ID and a vocabulary count in this example, these are not essential.
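The entry layout described above (ID, vocabulary count, first vocabulary item, then wordings) can be pictured as a small record type. This is a minimal sketch, not the storage format of the patent: the class name, field names, and the ID value 1 are assumptions, and the vocabulary count is derived rather than stored.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VocabularyEntry:
    entry_id: int
    formal: str  # first vocabulary item (formal vocabulary)
    wordings: List[str] = field(default_factory=list)  # second item onward

    @property
    def vocabulary_count(self) -> int:
        """Number of vocabulary items in the entry, formal item included."""
        return 1 + len(self.wordings)

    def add_wording(self, wording: str) -> None:
        """Append a wording unless it duplicates an existing item."""
        if wording != self.formal and wording not in self.wordings:
            self.wordings.append(wording)


# The ID value is illustrative; the source does not give it for this entry.
entry = VocabularyEntry(entry_id=1, formal="総務部")
entry.add_wording("総務")
print(entry.entry_id, entry.vocabulary_count, entry.formal, entry.wordings)
```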

The database editing unit 4 includes an input device such as a keyboard or mouse and an output device such as a display (neither shown), with which the user can edit entries in the vocabulary database 1 and check its contents. For example, to create a new entry in the vocabulary database 1, the user enters the formal vocabulary item (first vocabulary item) through the database editing unit 4. Any wordings the user can think of can likewise be added as the second and subsequent vocabulary items through the database editing unit 4.

From the formal vocabulary item registered as the first vocabulary item of an entry, the vocabulary candidate creation unit 2 generates, according to predetermined rules, candidates (wording candidates) to be registered in the vocabulary database 1 as wordings of that item. In this embodiment, the unit registers the generated wording candidates as the second and subsequent vocabulary items in the vocabulary database 1. The vocabulary candidate creation unit 2 may refer to the synonym database 5 when creating wording candidates.

The procedure by which the vocabulary candidate creation unit 2 generates wording candidates is described below with specific examples.

As described above, the vocabulary candidate creation unit 2 generates wording candidates according to predetermined rules. Rules of thumb (heuristic rules) such as the following can serve as these rules. The four rules below are only examples; any other rule may be applied provided that it generates appropriate wordings.

(1) A name of the form "○○部" (○○ Department) is shortened to "○○" by dropping "部".

(2) "○○研究部" (○○ Research Department) is shortened to "○○研" by dropping everything after "研".

(3) A name made up of multiple words is referred to by the initial characters of its words.

(4) A name made up of multiple words is referred to by the initial alphabet letters of its words.

For example, suppose the user has created entries such as those shown in FIG. 2 in the vocabulary database 1 using the database editing unit 4. The vocabulary candidate creation unit 2 then generates wording candidates using the rules of thumb above and registers them as the second and subsequent vocabulary items of each entry. FIG. 3 shows an example of the result. In the example of FIG. 3, for the formal vocabulary item "ソフトウェア事業本部" (sofutowea jigyō honbu, Software Business Headquarters), the vocabulary candidate creation unit 2 has generated the wording candidates "ソ事本" (sojihon) and "ＳＪＨ" (esu jei eichi) and registered them as the second and third vocabulary items of the entry. For the formal vocabulary item "メディアソリューション研究部" (media soryūshon kenkyūbu, Media Solution Research Department), it has generated four wording candidates, "メディアソリューション研" (media soryūshon ken), "メソ研" (mesoken), "ＭＳ研究部" (emu esu kenkyūbu), and "ＭＳ研" (emu esu ken), and registered them as the second to fifth vocabulary items of the entry.

Several configuration examples of the vocabulary candidate creation unit 2 for creating wording candidates by abbreviation are described below with reference to FIGS. 4 to 7.

The configuration example in FIG. 4 creates wording candidates from the initial characters of multiple words, such as "ソ事本" from "ソフトウェア事業本部". In this case the vocabulary candidate creation unit 2 is provided with a morphological analysis unit 21, an initial acquisition unit 22, and an initial synthesis unit 23. The morphological analysis unit 21 performs morphological analysis on the input formal vocabulary item and splits it by part of speech. For example, the formal vocabulary item "ソフトウェア事業本部" is split into three nouns: "ソフトウェア", "事業", and "本部". The initial acquisition unit 22 takes the first character of each resulting word, here "ソ", "事", and "本". The initial synthesis unit 23 then joins the characters obtained by the initial acquisition unit 22 in word order, yielding the wording candidate "ソ事本".
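The FIG. 4 flow can be sketched in a few lines. Morphological analysis is not implemented here; the word list is assumed to be the output of the morphological analysis unit 21, and the function names `acquire_initials` and `synthesize_initials` are invented for this illustration.

```python
from typing import List


def acquire_initials(words: List[str]) -> List[str]:
    """Initial acquisition (unit 22): take the first character of each word."""
    return [word[0] for word in words]


def synthesize_initials(initials: List[str]) -> str:
    """Initial synthesis (unit 23): join the characters in word order."""
    return "".join(initials)


# Assumed output of morphological analysis for "ソフトウェア事業本部".
words = ["ソフトウェア", "事業", "本部"]

candidate = synthesize_initials(acquire_initials(words))
print(candidate)  # ソ事本
```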

The configuration example in FIG. 5 creates a wording candidate such as "ＭＳ研究部" from "メディアソリューション研究部". In this case the vocabulary candidate creation unit 2 is provided with an abbreviation rule application unit 24, an abbreviation creation rule storage unit 25, and an alphabet conversion unit 26 in addition to the morphological analysis unit 21, initial acquisition unit 22, and initial synthesis unit 23 described above.

In the configuration example of FIG. 5, when "メディアソリューション研究部" is input as the formal vocabulary item, the morphological analysis unit 21 splits it into four nouns, "メディア", "ソリューション", "研究", and "部", and passes them to the abbreviation rule application unit 24. The abbreviation rule application unit 24 refers to the abbreviation creation rule storage unit 25 and creates abbreviated forms by applying the abbreviation creation rules. The abbreviation creation rule storage unit 25 stores in advance abbreviation rules derived from rules of thumb and the like, for example the following.

(1) "研究所" (research institute) becomes "研".

(2) "研究部" (research department) becomes "研".

(3) "研究センター" (research center) becomes "研".

(4) "○○部" (○○ department) becomes "○○".

Applying these rules, the abbreviation rule application unit 24 creates the abbreviated form "研" from the two nouns "研究" and "部" among the four nouns, and passes it to the alphabet conversion unit 26 together with the unabbreviated "メディア" and "ソリューション". The alphabet conversion unit 26 converts katakana text into alphabet letters. The initial acquisition unit 22 then takes the initial of each word, obtaining "Ｍ", "Ｓ", and "研". Finally, the initial synthesis unit 23 joins the initials in word order, yielding the wording candidate "ＭＳ研".
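The FIG. 5 flow (abbreviation rules, alphabet conversion, initials, synthesis) might look like the sketch below. Several parts are assumptions: the morpheme list stands in for the output of the morphological analysis unit 21, the source does not say how katakana is converted to alphabet letters (a hypothetical lookup table is used), rule (4) is implemented as dropping a final "部", and the output uses ASCII letters where the source writes full-width "ＭＳ".

```python
from typing import Dict, List

# Rules (1) to (3) from the list above; rule (4) is handled separately below.
ABBREVIATION_RULES: Dict[str, str] = {"研究所": "研", "研究部": "研", "研究センター": "研"}

# Hypothetical lookup table standing in for the alphabet conversion unit 26.
KATAKANA_TO_ALPHABET: Dict[str, str] = {"メディア": "Media", "ソリューション": "Solution"}


def apply_abbreviation_rules(morphemes: List[str]) -> List[str]:
    """Replace runs of adjacent morphemes that match a rule (longest run first)."""
    result: List[str] = []
    i = 0
    while i < len(morphemes):
        for j in range(len(morphemes), i, -1):
            span = "".join(morphemes[i:j])
            if span in ABBREVIATION_RULES:
                result.append(ABBREVIATION_RULES[span])
                i = j
                break
        else:
            result.append(morphemes[i])
            i += 1
    if len(result) > 1 and result[-1] == "部":  # rule (4): "○○部" becomes "○○"
        result.pop()
    return result


def to_alphabet(words: List[str]) -> List[str]:
    """Convert katakana words that have a table entry; leave others unchanged."""
    return [KATAKANA_TO_ALPHABET.get(word, word) for word in words]


def initials_joined(words: List[str]) -> str:
    """Take the first character of each word and join in word order."""
    return "".join(word[0] for word in words)


morphemes = ["メディア", "ソリューション", "研究", "部"]
words = apply_abbreviation_rules(morphemes)
candidate = initials_joined(to_alphabet(words))
print(words, candidate)
```

Skipping the abbreviation step and converting only the katakana words would instead give a form like "MS研究部", which is the other candidate mentioned for this configuration.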

In the configuration examples of FIG. 4 and FIG. 5, the initial acquisition unit 22 takes only the first character of each word, but in some cases the first two characters are taken, as when "パーソナルコンピュータ" (personal computer) becomes "パソコン".
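A two-character variant needs one extra decision that the text leaves open: a plain two-character prefix of "パーソナル" is "パー", not "パソ". The sketch below therefore skips the prolonged sound mark "ー" before taking the leading characters. That skipping rule, the function name `head_chars`, and the word split are assumptions made so that the example above comes out; they are not stated in the source.

```python
from typing import List


def head_chars(word: str, n: int, skip: str = "ー") -> str:
    """Take the first n characters of a word, ignoring characters in `skip`."""
    kept = [ch for ch in word if ch not in skip]
    return "".join(kept[:n])


def abbreviate(words: List[str], n: int) -> str:
    """Join the leading n characters of each word in word order."""
    return "".join(head_chars(word, n) for word in words)


# Assumed word split for "パーソナルコンピュータ".
words = ["パーソナル", "コンピュータ"]
print(abbreviate(words, 2))  # パソコン
```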

The configuration example in FIG. 6 creates a wording candidate such as "メディアソリューション研" from "メディアソリューション研究部". In this case the vocabulary candidate creation unit 2 is provided with a word synthesis unit 27 in addition to the morphological analysis unit 21 and the abbreviation rule application unit 24 described above.

In this configuration example, when "メディアソリューション研究部" is input, the morphological analysis unit 21 splits it into four nouns, "メディア", "ソリューション", "研究", and "部", and passes them to the abbreviation rule application unit 24. The abbreviation rule application unit 24 refers to the abbreviation creation rule storage unit 25 and generates the words "メディア", "ソリューション", and "研". The word synthesis unit 27 joins these words in word order, yielding the wording candidate "メディアソリューション研".
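The FIG. 6 flow differs from the initial-based flows only in its last step: whole words are joined instead of initials. A minimal sketch follows, assuming the morpheme list is already available and that rules are matched against the trailing morphemes only (a simplification chosen for this example, as are the function names).

```python
from typing import Dict, List

RULES: Dict[str, str] = {"研究所": "研", "研究部": "研", "研究センター": "研"}


def abbreviate_tail(morphemes: List[str]) -> List[str]:
    """Replace the longest run of trailing morphemes that matches a rule."""
    for start in range(len(morphemes)):
        tail = "".join(morphemes[start:])
        if tail in RULES:
            return morphemes[:start] + [RULES[tail]]
    return list(morphemes)


def synthesize_words(words: List[str]) -> str:
    """Word synthesis (unit 27): join the words in word order."""
    return "".join(words)


morphemes = ["メディア", "ソリューション", "研究", "部"]
candidate = synthesize_words(abbreviate_tail(morphemes))
print(candidate)  # メディアソリューション研
```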

Alternatively, in the configuration example of FIG. 7, the vocabulary candidate creation unit 2 performs no morphological analysis; it creates an abbreviated form solely according to the abbreviation rules stored in advance in the abbreviation creation rule storage unit 25 and uses that form as the wording candidate. For example, if the rule "'遺伝因子' (genetic factor) becomes '遺伝子' (gene)" is stored in advance in the abbreviation creation rule storage unit 25, then inputting the formal vocabulary item "遺伝因子" produces the wording candidate "遺伝子".
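Without morphological analysis, the FIG. 7 configuration reduces to a whole-string lookup in the stored rules. The sketch below assumes the rule store is a plain dictionary and that an item with no matching rule yields no candidate; both choices, and the function name, are illustrative.

```python
from typing import Dict, Optional

# Stand-in for the abbreviation creation rule storage unit 25.
WHOLE_WORD_RULES: Dict[str, str] = {"遺伝因子": "遺伝子"}


def create_candidate(formal: str) -> Optional[str]:
    """Return the stored abbreviated form, or None when no rule applies."""
    return WHOLE_WORD_RULES.get(formal)


print(create_candidate("遺伝因子"))  # 遺伝子
print(create_candidate("総務部"))    # None
```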

Once the vocabulary candidate creation unit 2 has created wording candidates from the formal vocabulary and registered them in the vocabulary database 1 as described above, the recognition grammar creation system of this embodiment requests the large-scale corpus 8 to search for them in order to confirm whether the candidates generated and registered by the vocabulary candidate creation unit 2 actually exist. As shown in FIG. 1, for example, the large-scale corpus 8 has many text corpora 8b₁, 8b₂, … and search units 8a₁, 8a₂, … that execute searches on the respective corpora. A WWW site, for example, can be used as a text corpus 8b, and a WWW search engine can be used as a search unit 8a. FIG. 1 shows only two pairs of search unit 8a and text corpus 8b, but the number is arbitrary.

If the large-scale corpus 8 is searched indiscriminately as in the conventional procedure, the search may hit words that have the same text as a wording candidate but a different meaning. For example, if "ＳＪＨ" is also in use as an abbreviation of "スウェーデンジュニアハイスクール" (Swedish Junior High School) and "ＭＳ研" as an abbreviation of "マルチスタイル研究開発センター" (Multi-Style Research and Development Center), relying on such search results could lead to a wrong judgment of a candidate's suitability.

In the recognition grammar creation system of this embodiment, the search corpus selection unit 7 therefore selects the text corpora to be searched from among the many text corpora 8b in the large-scale corpus 8. For example, specific sites on the WWW may be selected as search targets.

For this purpose, the search corpus selection unit 7 comprises, for example, a candidate corpus registration unit 72 and a candidate corpus database 73, as shown in FIG. 8. The candidate corpus registration unit 72 has an input/output device (not shown) through which the user enters candidate corpora. For example, when the user enters the URL or the like of a WWW site that the user considers appropriate as a search target, the candidate corpus registration unit 72 registers that site in the candidate corpus database 73.

Once text corpora to be searched have been registered in the candidate corpus database 73 of the search corpus selection unit 7 in this way, the search corpus selection unit 7 thereafter requests the registered text corpora to perform the searches that check the suitability of wording candidates. For example, if the text corpus 8b₁ shown in FIG. 1 has been registered in the candidate corpus database 73, the search corpus selection unit 7 requests the search engine 8a₁ corresponding to that text corpus 8b₁ to search for the wording candidate.
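The effect of registering only selected corpora can be illustrated with in-memory stand-ins. In the sketch below, each "search engine" is a simple substring search over a list of texts; the class and function names, the corpus identifiers "8b1" and "8b2", and the sample texts are all invented for this example and do not reflect a real search API.

```python
from typing import Callable, Dict, List

SearchEngine = Callable[[str], List[str]]


def make_engine(documents: List[str]) -> SearchEngine:
    """Stand-in for a search unit 8a over one text corpus 8b."""
    def search(query: str) -> List[str]:
        return [text for text in documents if query in text]
    return search


class CandidateCorpusDatabase:
    """Holds the corpora that were registered as search targets."""

    def __init__(self) -> None:
        self._engines: Dict[str, SearchEngine] = {}

    def register(self, corpus_id: str, engine: SearchEngine) -> None:
        self._engines[corpus_id] = engine

    def search(self, candidate: str) -> Dict[str, List[str]]:
        """Send the candidate only to the registered corpora."""
        return {cid: engine(candidate) for cid, engine in self._engines.items()}


# Invented sample corpora: 8b1 is relevant, 8b2 uses "MS研" in another sense.
all_engines = {
    "8b1": make_engine(["メディアソリューション研究部(MS研)の紹介"]),
    "8b2": make_engine(["MS研はマルチスタイル研究開発センターの略称です"]),
}

database = CandidateCorpusDatabase()
database.register("8b1", all_engines["8b1"])  # only 8b1 is registered

hits = database.search("MS研")
print(hits)
```

Because the second corpus is never registered, its unrelated use of the same string cannot count as evidence for the candidate.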

In the above example the user decides which text corpora are to be searched, but the search corpus selection unit 7 can also make this selection itself. In that case, as shown in FIG. 9, the search corpus selection unit 7 is provided with a candidate corpus search unit 71 in addition to the candidate corpus registration unit 72 and candidate corpus database 73 described above. The candidate corpus search unit 71 searches the large-scale corpus 8 for WWW sites that contain at least one of the formal vocabulary item (first vocabulary item) and a keyword that the user supplies as related to that item. The WWW sites found by this search are registered in the candidate corpus database 73.

Instead of having the user supply keywords as described above, the search corpus selection unit 7 may use a concept hierarchy database (not shown) to automatically extract keywords for selecting the corpus to be searched. With a concept hierarchy database, when the formal vocabulary consists of, for example, "personal computer", "mainframe", and "workstation", their superordinate concept "computer" can be extracted. This concept can then be used as a keyword to select the corpus to be searched from the large-scale corpus 8.
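The keyword extraction described above could be sketched as follows. This is only an illustration: the description does not specify how the concept hierarchy database is structured, so the `CONCEPT_PARENT` table, the function name `extract_keyword`, and the rule of picking the most frequently shared superordinate concept are all assumptions.

```python
from collections import Counter

# Hypothetical concept hierarchy: each term maps to its direct superordinate concept.
CONCEPT_PARENT = {
    "personal computer": "computer",
    "mainframe": "computer",
    "workstation": "computer",
    "computer": "machine",
}


def extract_keyword(formal_terms, parent=CONCEPT_PARENT):
    """Return the superordinate concept shared by most formal terms, or None."""
    parents = Counter(parent[t] for t in formal_terms if t in parent)
    if not parents:
        return None
    return parents.most_common(1)[0][0]


keyword = extract_keyword(["personal computer", "mainframe", "workstation"])
print(keyword)
```

The returned keyword would then serve as the search term for choosing candidate corpora from the large-scale corpus 8.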

Alternatively, when there are multiple wording candidates, it is also effective to select the text corpus used to check the suitability of one wording candidate by a search whose keywords combine the formal vocabulary (first vocabulary) with the other wording candidates. Taking the ID2 entry shown in FIG. 3 as an example, the text corpus for checking the suitability of "Media Solution Lab" (メディアソリューション研), which is registered as the second vocabulary, is selected from the large-scale corpus 8 by the following searches:
(1) "Media Solution Research Department" (メディアソリューション研究部)
(2) An AND search for "Media Solution Research Department" and "Meso Lab" (メソ研)
(3) An AND search for "Media Solution Research Department" and "MS Research Department" (ＭＳ研究部)
(4) An AND search for "Media Solution Research Department" and "MS Lab" (ＭＳ研)
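The four searches listed above follow a simple pattern: the formal vocabulary alone, then the formal vocabulary paired with each wording candidate other than the one being checked. A minimal sketch of that pattern is shown below. The function names and the quoted-phrase `AND` query syntax are assumptions for illustration, and the English strings stand in for the Japanese terms of FIG. 3.

```python
def build_selection_queries(formal, candidates, target):
    """Build the keyword sets used to select corpora for checking `target`.

    The first query is the formal vocabulary alone; each further query is an
    AND search pairing the formal vocabulary with one other wording candidate.
    """
    queries = [(formal,)]
    for other in candidates:
        if other != target:
            queries.append((formal, other))
    return queries


def to_query_string(terms):
    # Assumed query syntax: quoted phrases joined by AND.
    return " AND ".join(f'"{t}"' for t in terms)


formal = "Media Solution Research Department"
candidates = ["Media Solution Lab", "Meso Lab", "MS Research Department", "MS Lab"]
for terms in build_selection_queries(formal, candidates, target="Media Solution Lab"):
    print(to_query_string(terms))
```

Excluding the target from the selection queries keeps the corpus choice independent of the candidate whose suitability is being tested.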

As described above, for each wording candidate, the search corpus selection unit 7 selects, from among the many text corpora 8b in the large-scale corpus 8, one or more text corpora to be searched in order to determine whether that wording candidate is actually used, and requests the search engine 8a corresponding to each selected text corpus 8b to search for the wording candidate. The search results are sent to the search result analysis unit 9.

When a wording candidate is found in the searched text corpus (that is, when the search produces a hit), the search result analysis unit 9 determines that the wording candidate is suitable (that is, that it is actually used). Conversely, when the wording candidate is not found in the searched text corpus, the unit determines that the candidate is not suitable (that is, that it is not actually used). The determination result is sent to the database update unit 6, which deletes the wording candidates determined to be unsuitable from the entries of the vocabulary database 1.

For example, suppose that the wording candidates shown in FIG. 3 are searched for in the text corpus 8b selected by the search corpus selection unit 7, and that "SJH" (ＳＪＨ), the third vocabulary of the second entry, and "Meso Lab" (メソ研), the third vocabulary of the third entry, produce no search hits. In this case, the database update unit 6 deletes these wording candidates from their entries, and the contents of the vocabulary database 1 become as shown in FIG. 10.
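The hit-or-miss pruning in this example can be sketched as below. The dictionary layout, the entry keys, the hit counts, and the function name `prune_unused` are hypothetical; the sketch also assumes that the first item of each entry is the formal vocabulary and is never deleted.

```python
def prune_unused(entries, hit_counts):
    """Drop wording candidates with no search hits.

    `entries` maps an entry key to a list whose first item is the formal
    vocabulary (always kept) and whose remaining items are wording candidates.
    """
    pruned = {}
    for key, words in entries.items():
        formal, candidates = words[0], words[1:]
        kept = [w for w in candidates if hit_counts.get(w, 0) > 0]
        pruned[key] = [formal] + kept
    return pruned


# Hypothetical entries and hit counts; "SJH" and "Meso Lab" have no hits.
entries = {
    "entry2": ["Software Business Headquarters", "Sojihon", "SJH"],
    "entry3": ["Media Solution Research Department", "Media Solution Lab", "Meso Lab"],
}
hits = {"Sojihon": 12, "Media Solution Lab": 30}
print(prune_unused(entries, hits))
```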

The above describes an example in which the search result analysis unit 9 determines the suitability of a wording candidate according to whether there is a search hit, and the database update unit 6 deletes from the entry any wording candidate determined to be unsuitable. However, the method of determining the suitability of wording candidates and the method of reflecting the determination results in the vocabulary database are not limited to this in the present invention.

For example, it is also effective for the search result analysis unit 9 to determine the suitability of a wording candidate according to its appearance information. The appearance information of a vocabulary item in a text corpus can be regarded as the degree to which that item is used. In this sense, appearance information also serves as a criterion for deciding whether to add the item to the grammar when the grammar is created. In general, the recognition rate tends to fall as the number of vocabulary items to be recognized grows, so the vocabulary of the grammar is sometimes restricted, and appearance information can be used in such cases as well.

Alternatively, wording candidates determined to be unsuitable (or to have low suitability) may be left in the vocabulary database 1 rather than deleted, and the grammar creation unit 3 may decide, according to the appearance information, which wording candidates to register in the grammar when it creates the grammar.

It is also possible to write the appearance information directly into the grammar. When a word is written in the grammar with appearance information attached (often called a probability), the recognition engine uses that information when matching recognized words. Such appearance information is preferably registered in the vocabulary database 1 together with the vocabulary. At least the following three ways of expressing the appearance information of each wording candidate are conceivable, although the invention is not limited to them.

(1) The ratio of the number of occurrences of each wording candidate to the total number of search hits for the vocabulary
(2) Whether or not the number of occurrences of each wording candidate exceeds a predetermined threshold (binary representation)
(3) The number of occurrences of each wording candidate itself

For example, suppose that searching the text corpus 8b selected by the search corpus selection unit 7 for "Media Solution Research Department" and its wording candidates yields the hit counts shown in FIG. 11(a).

From the hit counts shown in FIG. 11(a), the search result analysis unit 9 calculates the appearance information of each vocabulary item as a value relative to the total number of hits for the vocabulary (153 hits), as shown in FIG. 11(b). Calculating a relative value in this way makes the computed values easy to handle even when the hit counts vary widely from one vocabulary item to another. The search result analysis unit 9 then determines, for example, that a wording candidate is suitable when the appearance information calculated in this way meets a predetermined threshold (for example, 10% or more in the case of FIG. 11(b)).
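The relative-value calculation and threshold test can be illustrated with a short sketch. The individual hit counts in FIG. 11(a) are not reproduced in the text, so the counts below are invented purely for illustration; the function names are likewise assumptions, and the comparison uses "at or above" to match the "10% or more" example.

```python
def appearance_ratios(hit_counts):
    """Return each wording's share of the total number of hits."""
    total = sum(hit_counts.values())
    if total == 0:
        return {w: 0.0 for w in hit_counts}
    return {w: n / total for w, n in hit_counts.items()}


def suitable(hit_counts, threshold=0.10):
    """Return the wordings whose share of hits is at or above the threshold."""
    ratios = appearance_ratios(hit_counts)
    return [w for w, r in ratios.items() if r >= threshold]


# Hypothetical hit counts (not the values of FIG. 11).
hits = {
    "Media Solution Research Department": 120,
    "Media Solution Lab": 60,
    "MS Research Department": 15,
    "MS Lab": 5,
}
print(appearance_ratios(hits))
print(suitable(hits))
```

Working with shares rather than raw counts lets a single threshold be reused across entries whose total hit counts differ by orders of magnitude.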

Next, the procedure by which the grammar creation unit 3 creates a recognition grammar from the vocabulary database 1 built as described above is explained. The grammar creation unit 3 reads vocabulary data from the vocabulary database 1 and creates a grammar according to a given grammar specification. For example, under the SRGS specification, it creates a recognition grammar such as the following (the rule id 部署名 means "department name"):
<rule id="部署名">
  <one-of>
    <item>そうむぶ</item>
    <item>そうむ</item>
  </one-of>
  <one-of>
    <item>そふとうぇあじぎょうほんぶ</item>
    <item>そじほん</item>
  </one-of>
  <one-of>
    <item>めでぃあそりゅーしょんけんきゅうぶ</item>
    <item>めでぃあそりゅーしょんけん</item>
    <item>えむえすけんきゅうぶ</item>
    <item>えむえすけん</item>
  </one-of>
</rule>
</grammar>
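As a rough illustration of how a grammar creation unit might emit a rule of this shape from vocabulary entries, the sketch below builds the `<rule>` element with Python's standard `xml.etree.ElementTree`. The function name `build_rule` and the list-of-lists input format are assumptions, and the enclosing `<grammar>` root element that a complete SRGS document needs is left out for brevity.

```python
import xml.etree.ElementTree as ET


def build_rule(rule_id, entries):
    """Build an SRGS-style <rule> with one <one-of> per vocabulary entry."""
    rule = ET.Element("rule", {"id": rule_id})
    for readings in entries:
        one_of = ET.SubElement(rule, "one-of")
        for reading in readings:
            ET.SubElement(one_of, "item").text = reading
    return rule


# Readings taken from the first two entries of the example grammar.
entries = [
    ["そうむぶ", "そうむ"],
    ["そふとうぇあじぎょうほんぶ", "そじほん"],
]
rule = build_rule("部署名", entries)
print(ET.tostring(rule, encoding="unicode"))
```

Building the tree programmatically, rather than concatenating strings, leaves escaping and tag balancing to the XML library.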

The recognition grammar created in this way is sent from the grammar creation unit 3 to the grammar storage unit 10 and stored there. The grammar storage unit 10 is implemented by any storage device or storage medium. The recognition grammar stored in the grammar storage unit 10 is copied to a storage device or storage medium in, for example, a telephone response service system such as a voice portal or an automatic response service system such as a car navigation system with a speech recognition function, and is used for speech recognition in those systems. Alternatively, the grammar storage unit 10 itself can be detached from the recognition grammar creation system of this embodiment and built into those systems.

Appearance information may also be used when creating the recognition grammar. In this case, if the appearance frequency of each vocabulary item is as shown in FIG. 12, for example, then for the first entry the ratio of the appearance frequency of そうむ (40) to that of そうむぶ (60) is 1:1.5, so the items become:
<item weight="1.5">そうむぶ</item>
<item weight="1">そうむ</item>
That is, this recognition grammar includes information, obtained from the appearance information, that indicates the priority of each vocabulary item.
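The weight calculation in this example (60 to 40 giving 1.5 to 1) can be reproduced with the sketch below. Scaling so that the least frequent wording receives weight 1 is an assumption inferred from that single example, and the function names are hypothetical.

```python
def weights_from_frequencies(freqs):
    """Scale frequencies so that the least frequent wording gets weight 1."""
    if not freqs or any(n <= 0 for n in freqs.values()):
        raise ValueError("all frequencies must be positive")
    base = min(freqs.values())
    return {word: n / base for word, n in freqs.items()}


def weighted_items(freqs):
    """Render one weighted <item> line per wording."""
    weights = weights_from_frequencies(freqs)
    # The "g" format drops a trailing ".0", so 1.0 is written as "1".
    return [f'<item weight="{w:g}">{word}</item>' for word, w in weights.items()]


for line in weighted_items({"そうむぶ": 60, "そうむ": 40}):
    print(line)
```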

When appearance information is also registered in the vocabulary database 1, it is also preferable, in order to further improve the accuracy of the recognition grammar, to set a threshold and create the recognition grammar only from vocabulary whose appearance information is at or above the threshold. In this case, as shown in FIG. 13, a grammar vocabulary determination unit 11 may be provided between the vocabulary database 1 and the grammar creation unit 3. This unit selects the vocabulary to be passed to the grammar creation unit 3 by comparing the appearance information in the vocabulary database 1 with the threshold.
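A filtering stage of this kind, placed between the stored vocabulary and grammar creation, might look like the following sketch. The record layout of `(wording, appearance)` pairs, the function name `select_vocabulary`, and the choice to apply the threshold uniformly to every wording are assumptions; the description does not say whether the formal vocabulary is exempt from the threshold.

```python
def select_vocabulary(entries, threshold):
    """Keep, per entry, only the wordings whose stored appearance value
    is at or above the threshold; entries left empty are dropped."""
    selected = []
    for entry in entries:
        kept = [word for word, appearance in entry if appearance >= threshold]
        if kept:
            selected.append(kept)
    return selected


# Hypothetical stored records: (wording, appearance information as a share).
stored = [
    [("そうむぶ", 0.60), ("そうむ", 0.40)],
    [("そふとうぇあじぎょうほんぶ", 0.95), ("そじほん", 0.05)],
]
print(select_vocabulary(stored, threshold=0.10))
```

The output lists can be handed directly to whatever routine builds the grammar, so the threshold can be tuned without touching the stored data.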

As described above, according to this embodiment, selecting the text corpus to be searched improves the accuracy with which the suitability of wording candidates is determined, making it possible to create a highly accurate recognition grammar. In addition, using the appearance information of each wording candidate calculated from the search results makes it possible to create an even more accurate recognition grammar.

In the above embodiment, the wording candidates generated by the vocabulary candidate creation unit 2 are first registered in the entries of the vocabulary database 1, and the database update unit 6 then deletes from those entries the wording candidates that the search result analysis unit 9 determines to be unsuitable on the basis of the search results in the large-scale corpus 8. However, the present invention is not limited to this. The wording candidates generated by the vocabulary candidate creation unit 2 may instead be stored temporarily in a memory within the system, and the database update unit 6 may retrieve from that memory only the wording candidates that the search result analysis unit 9 determines to be suitable on the basis of the search results in the large-scale corpus 8, and register them in the entries of the vocabulary database 1.

The recognition grammar creation system according to the present invention can be configured as a single piece of hardware or as a combination of multiple pieces of hardware. In the latter case, the communication lines connecting the components may be wired or wireless, and the components may be located at sites remote from one another.

In addition to the recognition grammar creation systems described in Supplementary notes 1 to 6 below, the recognition grammar creation method and the recognition grammar creation program described in Supplementary notes 7 and 8, respectively, are also embodiments of the present invention.

(Supplementary note 1) A recognition grammar creation system comprising:
a vocabulary database that stores formal vocabulary and its wordings;
a search corpus selection unit that selects a corpus to be searched in order to check the suitability of a wording candidate, and requests the selected corpus to search for the wording candidate;
a search result analysis unit that analyzes the search results in the corpus selected by the search corpus selection unit;
a database update unit that determines the suitability of the wording candidate on the basis of the analysis result from the search result analysis unit and a predetermined criterion, and stores the suitable wordings among the wording candidates in the vocabulary database; and
a grammar creation unit that reads vocabulary from the vocabulary database and creates a recognition grammar for that vocabulary according to a grammar specification.

(Supplementary note 2) The recognition grammar creation system according to Supplementary note 1, wherein the search corpus selection unit comprises:
a candidate corpus database that stores information on candidate corpora, which are corpora that are candidates for searching;
a candidate corpus search unit that receives at least one keyword related to the formal vocabulary as a search keyword for candidate corpora, and searches the Internet for information on candidate corpora according to the search keyword; and
a candidate corpus registration unit that registers the information obtained by the candidate corpus search unit in the candidate corpus database,
and wherein the search corpus selection unit refers to the information registered in the candidate corpus database to select the corpus to be searched.

(Supplementary note 3) The recognition grammar creation system according to Supplementary note 1 or 2, wherein the search result analysis unit includes an appearance information generation unit that generates appearance information of the wording candidate in the corpus, and passes the appearance information generated by the appearance information generation unit to the database update unit as the analysis result.

(Supplementary note 4) The recognition grammar creation system according to Supplementary note 3, wherein, when storing a wording in the vocabulary database, the database update unit stores the appearance information generated for that wording by the appearance information generation unit in the vocabulary database in association with the wording,
the recognition grammar creation system further comprising a grammar vocabulary determination unit that selects, from the vocabulary database and on the basis of the appearance information, the vocabulary for which a recognition grammar is to be created, and passes the selected vocabulary to the grammar creation unit.

(Supplementary note 5) The recognition grammar creation system according to Supplementary note 3 or 4, wherein the grammar creation unit creates, on the basis of the appearance information, a recognition grammar that has information indicating the priority of each word.

(Supplementary note 6) A recognition grammar creation system comprising:
a vocabulary database that stores formal vocabulary and its wordings;
a vocabulary candidate creation unit that creates wording candidates from the formal vocabulary;
a search corpus selection unit that selects a corpus to be searched in order to check the suitability of a wording candidate, and requests the selected corpus to search for the wording candidate;
a search result analysis unit that analyzes the search results in the corpus selected by the search corpus selection unit;
a database update unit that determines the suitability of the wording candidate on the basis of the analysis result from the search result analysis unit, and stores the suitable wordings among the wording candidates in the vocabulary database; and
a grammar creation unit that reads vocabulary from the vocabulary database and creates a recognition grammar according to a grammar specification.

(Supplementary note 7) A recognition grammar creation method comprising:
creating wording candidates from formal vocabulary;
selecting a corpus to be searched in order to check the suitability of a wording candidate, and requesting the selected corpus to search for the wording candidate;
receiving and analyzing the search results in the corpus selected as the search target;
determining the suitability of the wording candidate on the basis of the result of the analysis, and storing the suitable wordings among the wording candidates in a vocabulary database; and
reading vocabulary from the vocabulary database and creating a recognition grammar according to a grammar specification.

(Supplementary note 8) A computer program comprising instructions that cause a computer to execute processing of:
creating wording candidates from formal vocabulary;
selecting a corpus to be searched in order to check the suitability of a wording candidate, and requesting the selected corpus to search for the wording candidate;
receiving and analyzing the search results in the corpus selected as the search target;
determining the suitability of the wording candidate on the basis of the result of the analysis, and storing the suitable wordings among the wording candidates in a vocabulary database; and
reading vocabulary from the vocabulary database and creating a recognition grammar according to a grammar specification.

As described above, the present invention can provide a recognition grammar creation system that, by selecting the text corpus to be searched, improves the accuracy with which the suitability of wording candidates is determined and can therefore create a highly accurate recognition grammar.

FIG. 1 is a block diagram showing the schematic configuration of a recognition grammar creation system according to an embodiment of the present invention.
FIG. 2 is an explanatory diagram showing an example of data stored in the vocabulary database.
FIG. 3 is an explanatory diagram showing an example of data stored in the vocabulary database.
FIG. 4 is a block diagram showing a configuration example of the vocabulary candidate creation unit.
FIG. 5 is a block diagram showing a configuration example of the vocabulary candidate creation unit.
FIG. 6 is a block diagram showing a configuration example of the vocabulary candidate creation unit.
FIG. 7 is a block diagram showing a configuration example of the vocabulary candidate creation unit.
FIG. 8 is a block diagram showing a configuration example of the search corpus selection unit.
FIG. 9 is a block diagram showing a configuration example of the search corpus selection unit.
FIG. 10 is an explanatory diagram showing an example of data stored in the vocabulary database.
FIG. 11 is an explanatory diagram showing the relationship between the number of hits and the appearance information for each vocabulary item.
FIG. 12 is an explanatory diagram showing an example of data stored in the vocabulary database.
FIG. 13 is a block diagram showing a configuration example for creating a grammar in consideration of the appearance information of each vocabulary item.
FIG. 14 is a flowchart showing a conventional recognition grammar creation procedure.

Explanation of symbols

1 Vocabulary database
2 Vocabulary candidate creation unit
3 Grammar creation unit
4 Database editing unit
5 Synonym database
6 Database update unit
7 Search corpus selection unit
8 Large-scale corpus
8a Search unit
8b Text corpus
9 Search result analysis unit
10 Grammar storage unit
11 Grammar vocabulary determination unit
21 Morphological analysis unit
22 Initial letter acquisition unit
23 Initial letter synthesis unit
24 Abbreviation rule application unit
25 Abbreviation creation rule storage unit
26 Alphabet conversion unit

Claims

A recognition grammar creation system comprising:
a vocabulary candidate creation unit that creates wording candidates from formal vocabulary on the basis of a predetermined rule;
a vocabulary database that stores the formal vocabulary and its wordings;
a search corpus selection unit that selects a corpus to be searched in order to check the suitability of a wording candidate, and requests the selected corpus to search for the wording candidate;
a search result analysis unit that analyzes the search results in the corpus selected by the search corpus selection unit;
a database update unit that determines the suitability of the wording candidate on the basis of the analysis result from the search result analysis unit and a predetermined criterion, and stores the suitable wordings among the wording candidates in the vocabulary database; and
a grammar creation unit that reads vocabulary from the vocabulary database and creates a recognition grammar for that vocabulary according to a grammar specification.

The recognition grammar creation system according to claim 1, wherein the search corpus selection unit comprises:
a candidate corpus database that stores information on candidate corpora, which are corpora that are candidates for searching;
a candidate corpus search unit that receives at least one keyword related to the formal vocabulary as a search keyword for candidate corpora, and searches the Internet for information on candidate corpora according to the search keyword; and
a candidate corpus registration unit that registers the information obtained by the candidate corpus search unit in the candidate corpus database,
and wherein the search corpus selection unit refers to the information registered in the candidate corpus database to select the corpus to be searched.

The recognition grammar creation system according to claim 1 or 2, wherein the search result analysis unit includes an appearance information generation unit that generates appearance information of the wording candidate in the corpus, and passes the appearance information generated by the appearance information generation unit to the database update unit as the analysis result.

The recognition grammar creation system according to any one of claims 1 to 3, wherein the predetermined rule includes at least one of an empirical rule and an omission rule.

The recognition grammar creation system according to claim 3 or 4, wherein, when storing a wording in the vocabulary database, the database update unit stores the appearance information generated for that wording by the appearance information generation unit in the vocabulary database in association with the wording,
the recognition grammar creation system further comprising a grammar vocabulary determination unit that selects, from the vocabulary database and on the basis of the appearance information, the vocabulary for which a recognition grammar is to be created, and passes the selected vocabulary to the grammar creation unit.

The recognition grammar creation system according to any one of claims 3 to 5, wherein the grammar creation unit creates a recognition grammar having information indicating a word priority based on the appearance information.

A recognition grammar creation method for creating a recognition grammar using a computer that can access a vocabulary database storing formal vocabulary and its wordings, the method comprising:
  a step in which a vocabulary candidate creation unit of the computer creates wording candidates from the formal vocabulary on the basis of a predetermined rule;
  a step in which a search corpus selection unit of the computer selects a corpus to be searched in order to check the suitability of a wording candidate, and requests the selected corpus to search for the wording candidate;
  a step in which a search result analysis unit of the computer receives and analyzes the search results in the corpus selected as the search target;
  a step in which a database update unit of the computer determines the suitability of the wording candidate on the basis of the result of the analysis, and stores the suitable wordings among the wording candidates in the vocabulary database; and
  a step in which a grammar creation unit of the computer reads vocabulary from the vocabulary database and creates a recognition grammar according to a grammar specification.

A computer program for causing a computer that can access a vocabulary database storing formal vocabulary and its wordings to execute recognition grammar creation processing, the program causing the computer to execute:
  a procedure for creating wording candidates from the formal vocabulary on the basis of a predetermined rule;
  a procedure for selecting a corpus to be searched in order to check the suitability of a wording candidate, and requesting the selected corpus to search for the wording candidate;
  a procedure for receiving and analyzing the search results in the corpus selected as the search target;
  a procedure for determining the suitability of the wording candidate on the basis of the result of the analysis, and storing the suitable wordings among the wording candidates in the vocabulary database; and
  a procedure for reading vocabulary from the vocabulary database and creating a recognition grammar according to a grammar specification.