JP2016164724A

JP2016164724A - Vocabulary knowledge acquisition device, vocabulary knowledge acquisition method, and vocabulary knowledge acquisition program

Info

Publication number: JP2016164724A
Application number: JP2015044661A
Authority: JP
Inventors: Kyoko Makino; Kazuyuki Goto; Akio Furuhata; Atsuhiro Yoshida; Yasunari Miyabe
Original assignee: Toshiba Corp; Toshiba Solutions Corp
Current assignee: Toshiba Corp; Toshiba Digital Solutions Corp
Priority date: 2015-03-06
Filing date: 2015-03-06
Publication date: 2016-09-08
Anticipated expiration: 2035-03-06
Also published as: JP6584795B2

Abstract

PROBLEM TO BE SOLVED: To present reading candidates for a notation, as knowledge about notations to be added to a dictionary, without being restricted to preset information. SOLUTION: According to an embodiment, a vocabulary knowledge acquisition device has morphological analysis means, compound word extraction means, unknown word extraction means, unknown word related information adding means, abbreviation estimation means, formal notation candidate adding means, and result output means. The unknown word extraction means extracts unknown words that are not registered in a built dictionary. The unknown word related information adding means extracts reading candidates for the unknown words from externally acquired data and adds them to the unknown words as unknown word related information. When an abbreviation generated from a compound word matches an unknown word, the formal notation candidate adding means adds the compound word from which the abbreviation was generated to the unknown word as a formal notation candidate. The result output means outputs the unknown words, the unknown word related information, and the formal notation candidates together, in descending order of the effect of additionally registering them in the dictionary. SELECTED DRAWING: Figure 2

Description

The present invention relates to a vocabulary knowledge acquisition device, a vocabulary knowledge acquisition method, and a vocabulary knowledge acquisition program that support the user's work when new vocabulary is added to a dictionary used for applications such as speech recognition.

In recent years, speech recognition systems that recognize human speech and record it in a document or display it on a screen have come into use. To improve recognition accuracy in a speech recognition system, it is effective to add to the dictionary the reading, notation, part of speech, and so on of vocabulary that cannot be recognized correctly. Here, correct speech recognition means assigning the correct reading and notation to the input speech signal.

Vocabulary that cannot be recognized correctly is not identified only by examining speech recognition results. When a new speech recognition system is created, frequently used vocabulary that is absent from the built dictionary for speech recognition may also be anticipated, according to the setting in which the system will be used, and added to the speech recognition dictionary in advance.

In this case, the speech recognition system selects vocabulary that is not in the built dictionary, that is, the notations of unknown words, from documents and term lists related to the setting in which speech recognition is used, adds part-of-speech and reading information to them, and registers them in the dictionary.

Conventionally, the following methods of extracting readings for unknown words have been used to give correct reading information to the notation of an unknown word. For example, there is a technique that estimates a reading by referring to a reading determination dictionary and rules prepared in advance. For an unknown word, an unknown word reading dictionary is consulted, and the notation "ABC" is given, as its reading, the concatenation of the readings "えー" (ee), "びー" (bii), and "しー" (shii) that correspond to the registered notations "A", "B", and "C". In another technique, two tables recorded in a reading probability storage unit are consulted to generate the reading candidates with the highest conditional probability, the Web is searched with each pair of notation and reading, the relative merit of the reading candidates is judged from the resulting hit counts, and a reading candidate with high priority is selected.
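The dictionary-lookup approach described above can be sketched as follows. This is a minimal illustration of the conventional technique, not code from the cited literature; the table `CHAR_READINGS` and the function `concatenated_reading` are hypothetical names, and the table holds only the three entries mentioned in the text.

```python
# Hypothetical reading table holding only the entries named in the text.
CHAR_READINGS = {"A": "えー", "B": "びー", "C": "しー"}


def concatenated_reading(notation, table=CHAR_READINGS):
    """Concatenate the registered readings of each character.

    Returns None when any character has no registered reading.
    """
    parts = []
    for ch in notation:
        reading = table.get(ch)
        if reading is None:
            return None  # nothing registered for this character
        parts.append(reading)
    return "".join(parts)


print(concatenated_reading("ABC"))
print(concatenated_reading("ABD"))
```

A lookup of this kind can only return readings built from entries that were registered beforehand, so a notation containing an unregistered part yields no reading at all.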

There is also a technique that searches Web search results for pairs consisting of an abbreviation, which is a kind of synonym, and its original word (formal name). In this technique, when an abbreviation is input, sentence patterns that differ only in the abbreviation part are extracted, and the expression used in the same position as the abbreviation is presumed to be the original word (formal name).

Japanese Patent No. 4941495
JP 2009-204732 A
Japanese Patent No. 5355537

As described above, the conventional techniques estimate readings on the basis of information provided in the system (information registered in advance in a dictionary or the like). They therefore cannot add special readings that the system cannot estimate, or readings for notations that are not set in the system.

In addition, the technique that searches Web search results for pairs of an abbreviation and its original word (formal name) narrows down the candidates by estimating abbreviations from the original-word candidates only when a plurality of original-word candidates have been extracted. Therefore, when few expressions share the same sentence pattern, a near-synonym that is not actually synonymous with the abbreviation may be extracted as the sole original-word candidate.

The problem to be solved by the present invention is to provide a vocabulary knowledge acquisition device, a vocabulary knowledge acquisition method, and a vocabulary knowledge acquisition program that can present reading candidates for a notation, as knowledge about notations to be added to a dictionary, without being restricted to preset information.

According to the embodiment, the vocabulary knowledge acquisition device has morphological analysis means, compound word extraction means, unknown word extraction means, unknown word related information adding means, abbreviation estimation means, formal notation candidate adding means, and result output means. The morphological analysis means performs morphological analysis, which divides the text included in a plaintext corpus into words and assigns a part of speech to each word. The compound word extraction means extracts compound words based on the result of the morphological analysis. The unknown word extraction means compares the words obtained by the morphological analysis and the compound words obtained by the compound word extraction with the words registered in a built dictionary, and extracts unknown words that are not registered in the built dictionary. The unknown word related information adding means extracts reading candidates for the unknown words from externally acquired data and adds them to the unknown words as unknown word related information. The abbreviation estimation means generates abbreviations from compound words. When an abbreviation generated by the abbreviation estimation means matches an unknown word, the formal notation candidate adding means adds the compound word from which the abbreviation was generated to the unknown word as a formal notation candidate. The result output means combines the unknown words, the unknown word related information, and the formal notation candidates, and outputs them as a vocabulary list arranged in descending order of the effect of additionally registering them in the dictionary.
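The matching step performed by the formal notation candidate adding means can be sketched as follows. The passage above does not say how abbreviations are generated, so the rule used here (taking the initial letters of an English gloss of each compound word) is purely an assumption for illustration, as are the names `initials_abbreviation` and `formal_notation_candidates` and the sample glosses.

```python
def initials_abbreviation(english_gloss):
    """Hypothetical abbreviation rule: upper-cased initials of an English gloss."""
    return "".join(word[0].upper() for word in english_gloss.split())


def formal_notation_candidates(unknown_words, compounds_with_gloss):
    """Map each unknown word to the compound words whose abbreviation matches it."""
    result = {word: [] for word in unknown_words}
    for compound, gloss in compounds_with_gloss.items():
        abbreviation = initials_abbreviation(gloss)
        if abbreviation in result:
            # The compound that produced the abbreviation becomes a
            # formal notation candidate for the matching unknown word.
            result[abbreviation].append(compound)
    return result


candidates = formal_notation_candidates(
    ["LN", "葛根湯"],
    {"リンパ節": "lymph node", "葛根湯": "kakkonto"},  # illustrative glosses
)
print(candidates)
```

Only the comparison between generated abbreviations and unknown words is taken from the text; any other abbreviation rule could be substituted for `initials_abbreviation` without changing the matching logic.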

FIG. 1 is a block diagram showing the configuration of a system that uses the vocabulary knowledge acquisition device according to the present embodiment.
FIG. 2 is a block diagram showing the functions realized by the vocabulary knowledge acquisition program according to the present embodiment.
FIG. 3 is a flowchart showing the operation of the vocabulary knowledge acquisition processing of the vocabulary knowledge acquisition device according to the present embodiment.
FIG. 4 is a diagram showing an example of data registered in the built dictionary according to the present embodiment.
FIG. 5 is a diagram showing an example of a morphological analysis result according to the present embodiment.
FIG. 6 is a diagram showing an example of the unknown word related information output by the unknown word related information adding function 44 according to the present embodiment.
FIG. 7 is a diagram showing an example of the vocabulary list output from the result output function according to the present embodiment.
FIG. 8 is a flowchart showing the unknown word related information adding processing according to the present embodiment.
FIG. 9 is a diagram showing an example of a reliability evaluation list according to the present embodiment.
FIG. 10 is a flowchart showing the result output processing according to the present embodiment.
FIG. 11 is a flowchart showing the compound word extraction processing according to the present embodiment.
FIG. 12 is a diagram showing an example of a list used in the compound word extraction processing according to the present embodiment.

Hereinafter, an embodiment will be described with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of a system that uses a vocabulary knowledge acquisition device 10 according to the present embodiment. In the system shown in FIG. 1, the vocabulary knowledge acquisition device 10 can communicate with Web servers 14-1, 14-2, ..., 14-n and various electronic devices through a network 12 such as the Internet to transmit and receive various data.

The vocabulary knowledge acquisition device 10 in the present embodiment is realized by a computer such as a personal computer. As shown in FIG. 1, the vocabulary knowledge acquisition device 10 has a processor 20, a memory 21, a storage device 24, an input unit 25, a display unit 26, a voice input unit 27, a voice output unit 28, and a communication unit 29.

The processor 20 realizes various functions by executing various programs (software) read from the storage device 24 into the memory 21. For example, the processor 20 executes programs stored in the memory 21, such as an OS (Operating System) and application programs. By executing a vocabulary knowledge acquisition program 21a, the processor 20 realizes functions that support the user's work when new vocabulary is added to the speech recognition dictionary (built dictionary 24e) used in a speech recognition system. The functions realized by the vocabulary knowledge acquisition program 21a are shown in FIG. 2. The processor 20 also realizes the speech recognition system by executing a speech recognition program 21b.

The memory 21 stores the programs executed by the processor 20 and the data they use.

The storage device 24 stores, on a nonvolatile storage medium, various programs (software) such as the OS (Operating System) and application programs, together with the data needed to execute them. The data stored in the storage device 24 includes, for example, a plaintext corpus 24a, a formal name list 24b, a Japanese-English machine translation dictionary 24c, Web crawling data 24d, the built dictionary 24e, a temporary construction dictionary 24f, a vocabulary list 24g, and an audio file 24h. Details of each item are described later.

The input unit 25 controls, under the control of the processor 20, input from input devices operated by the user (for example, a keyboard, a mouse, or a tablet).

The display unit 26 controls, under the control of the processor 20, display on a display such as an LCD (Liquid Crystal Display).

The voice input unit 27 controls, under the control of the processor 20, voice input from a microphone.

The voice output unit 28 controls, under the control of the processor 20, voice output from a speaker, headphones, or the like.

The communication unit 29 controls communication with the Web servers 14 and electronic devices through the network 12.

The vocabulary knowledge acquisition device 10 can be implemented either as a hardware configuration or as a combination of hardware resources and software (programs). The software is installed on a computer in advance from the network 12 or from a non-transitory computer-readable storage medium and, when executed by the processor 20 of that computer, causes the computer to perform the functions of each device.

FIG. 2 is a block diagram showing the functional configuration of the vocabulary knowledge acquisition device 10 in the present embodiment. The processor 20 realizes each function included in a function unit 30 by executing the vocabulary knowledge acquisition program 21a. Each function included in the function unit 30 processes the data included in a storage unit 32.

Based on the vocabulary knowledge acquisition program 21a, the vocabulary knowledge acquisition device 10 executes processing by a morphological analysis function 41, a compound word extraction function 42, an unknown word extraction function 43, an unknown word related information adding function 44, an abbreviation estimation function 45, a formal notation candidate adding function 46, a result output function 47, and a dictionary editing function 48.

A speech recognition system 49 is a function realized by the processor 20 executing the speech recognition program 21b. The speech recognition system 49 is independent of the functions of the vocabulary knowledge acquisition device 10 and is used in the processing performed by the dictionary editing function 48. However, the speech recognition system 49 may instead be one of the functions realized by the vocabulary knowledge acquisition program 21a.

The storage unit 32 (storage device 24) contains the resources needed for the processing of each function of the function unit 30: the plaintext corpus 24a, the formal name list 24b, the Japanese-English machine translation dictionary 24c, the Web crawling data 24d, the built dictionary 24e, the temporary construction dictionary 24f, and the audio file 24h. The storage unit 32 also stores the vocabulary list 24g, which is the processing result of the functions.

The built dictionary 24e is a dictionary used, for example, for speech recognition processing by the speech recognition system 49. As shown in FIG. 4, for example, the built dictionary 24e holds, for each of a plurality of headwords, a set of data indicating the notation (headword), part of speech, and reading. New vocabulary (part of speech, notation, reading) can be added to the built dictionary 24e by user operation, using the processing results of the vocabulary knowledge acquisition device 10.
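The entry structure described above (notation, part of speech, reading) can be modelled as in the following sketch. The class names `DictEntry` and `BuiltDictionary`, the part-of-speech strings, and the sample entries are illustrative assumptions; the actual contents of FIG. 4 are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DictEntry:
    notation: str  # headword as written
    pos: str       # part of speech
    reading: str   # reading in kana


class BuiltDictionary:
    """Minimal in-memory stand-in for the built dictionary 24e."""

    def __init__(self, entries=()):
        self._entries = {}
        for entry in entries:
            self.add(entry)

    def add(self, entry):
        # Entries are keyed by the pair (notation, part of speech).
        self._entries[(entry.notation, entry.pos)] = entry

    def contains(self, notation, pos):
        return (notation, pos) in self._entries


dictionary = BuiltDictionary([
    DictEntry("風邪", "名詞-一般", "かぜ"),          # illustrative entry
    DictEntry("症状", "名詞-一般", "しょうじょう"),  # illustrative entry
])
print(dictionary.contains("風邪", "名詞-一般"))
print(dictionary.contains("葛根湯", "名詞"))
```

Keying on the pair of notation and part of speech mirrors the way registration is checked for pairs of notation and part of speech when unknown words are extracted.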

The plaintext corpus 24a is a collection of documents (for example, text data) used to add new vocabulary to the built dictionary 24e. For example, unknown words that are not registered in the built dictionary 24e are extracted from the plaintext corpus 24a, and these unknown words become candidates for vocabulary to be added to the built dictionary 24e. To improve the quality of speech recognition in the field where the speech recognition system 49 is used, documents related to that field are used as the plaintext corpus 24a. In the medical and pharmaceutical field, for example, package inserts of drugs would be used.

The formal name list 24b is a term list in which notations (terms and the like) related to the setting where the speech recognition system 49 is used are registered. In the medical and pharmaceutical field, for example, a list of formal names such as disease names (a medical term dictionary) or a drug list would be used. When speech recognition is performed on personal names, a list of personal names (which may include stage names and the like as well as ordinary personal names) is used. Similarly, a list suited to the field targeted by speech recognition is used, such as a place name list for place names or a trademark list for product names.

The Japanese-English machine translation dictionary 24c is a list in which Japanese notations and their English translations are registered. For example, for the Japanese notation "リンパ節" (reading: りんぱせつ; part of speech: noun-general), the English translation "lymph node" is registered.

The Web crawling data 24d is data acquired from outside through the network 12 (the Internet) by Web crawling. It consists of Web pages published on Web sites (Web servers 14) that have been saved as static files. The Web crawling data 24d is used to obtain reading information for the unknown words (notations) extracted from the plaintext corpus 24a. The files may be in HTML (Hyper Text Markup Language) format, which is the source format of pages published on the Internet, or may be the HTML converted into a general document format with the same appearance as the published page. The Web crawling data 24d may be collected from the Web servers 14 through the network 12 by a function of the vocabulary knowledge acquisition program 21a of the vocabulary knowledge acquisition device 10, or data created on an electronic device other than the vocabulary knowledge acquisition device 10 may be input. The Web crawling data 24d is not data fixedly recorded in the vocabulary knowledge acquisition device 10 but data that is continuously updated. Therefore, as Web pages published on the Internet are updated, new reading information for a notation can be obtained from the Web crawling data 24d.

The temporary construction dictionary 24f is a copy of the built dictionary 24e and is used for speech recognition processing by the speech recognition system 49. Candidate notations for addition to the built dictionary 24e are added to the temporary construction dictionary 24f, which is then used to run speech recognition processing by the speech recognition system 49. The vocabulary knowledge acquisition device 10 extracts the difference between the result of speech recognition using the built dictionary 24e and the result (analysis result) of speech recognition using the temporary construction dictionary 24f, and treats it as knowledge about the notations to be added to the built dictionary 24e.
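The comparison of the two recognition results can be sketched as follows. The text does not specify how the difference is computed, so this example simply assumes that each result is a list of recognized words and uses Python's standard `difflib.SequenceMatcher` to find the segments that changed. The function name `recognition_diff` and the sample word lists are hypothetical.

```python
from difflib import SequenceMatcher


def recognition_diff(baseline, trial):
    """Return (baseline_segment, trial_segment) pairs where the results differ."""
    matcher = SequenceMatcher(None, baseline, trial, autojunk=False)
    return [
        (baseline[i1:i2], trial[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical results for the same audio: one obtained with the built
# dictionary, one with the temporary construction dictionary.
with_built = ["かっこん", "等", "を", "処方"]
with_temporary = ["葛根湯", "を", "処方"]
print(recognition_diff(with_built, with_temporary))
```

In this sketch, each returned pair shows what the recognizer produced before and after the candidate notation was added, which is the kind of evidence a user could inspect when deciding whether the addition helps.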

The vocabulary list 24g is data presented to the user to support the user's work when adding new notations to the built dictionary 24e. For each candidate notation (unknown word) for addition to the built dictionary 24e, the vocabulary list 24g presents data (knowledge) that the user can refer to when deciding whether to add the notation to the built dictionary 24e. Details are described later (see FIG. 7).

The audio file 24h is audio data used as input speech to the speech recognition system 49 so that the speech recognition system 49 can execute speech recognition processing using the built dictionary 24e and the temporary construction dictionary 24f. The audio file 24h is, for example, audio data associated one-to-one with the text data of the plaintext corpus 24a, that is, audio data of the text of the plaintext corpus 24a read aloud. The audio file 24h may instead be an audio data file prepared by the user for testing whose content differs from the text of the plaintext corpus 24a.

Next, the operation of the vocabulary knowledge acquisition processing of the vocabulary knowledge acquisition device 10 in the present embodiment will be described with reference to the flowchart shown in FIG. 3.
First, the morphological analysis function 41 executes morphological analysis processing on the plaintext corpus 24a (step A1). Through the morphological analysis processing, the morphological analysis function 41 divides the Japanese text data included in the plaintext corpus 24a into words and assigns a part of speech to each word.

For example, when the morphological analysis function 41 executes morphological analysis processing on the Japanese text "風邪の初期症状の訴えがあったため、葛根湯を処方しました。ＬＮの腫れはありません。" ("Because the patient complained of early cold symptoms, kakkonto was prescribed. There is no swelling of the LN.") in the plaintext corpus 24a, a morphological analysis result such as that shown in FIG. 5 is obtained.

Next, the compound word extraction function 42 receives the output of the morphological analysis function 41 (the morphological analysis result) and executes compound word extraction processing, which extracts compound words based on the morphological analysis result (step A2).

The compound word extraction function 42 extracts character strings that can be presumed to form a compound word when adjacent morphemes are concatenated, and outputs them as compound words. To decide which character strings form a compound word, a rule such as "a run of consecutive 'noun-general' morphemes is presumed to be a compound word (compound noun)" is used.

In the morphological analysis result shown in FIG. 5, "初期" (initial) and "症状" (symptom) both have the part of speech "noun-general" and appear consecutively, so "初期症状" (initial symptoms) can be presumed to be a compound word (compound noun). A technique that draws on a large number of morphological analysis results, rather than a single one, and presumes that sequences of morphemes frequently appearing adjacent to each other are compound words can also be used. If the rule is not limited to "noun-general" and any run of consecutive nouns or of consecutive alphabetic characters is presumed to be a compound word (compound noun), then "初期症状", "葛根湯" (kakkonto), and "LN" are extracted as compound words (compound nouns) from the morphological analysis result shown in FIG. 5.
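The run-based rule described above (consecutive nouns, or consecutive alphabetic characters, form a compound) can be sketched as follows. The token list is a hand-written approximation of the analysis in FIG. 5, which is not reproduced in this text; in particular, the part-of-speech labels for the particles, auxiliaries, and the letters "L" and "N" are assumptions, as is the function name `extract_compounds`.

```python
from itertools import groupby

# (surface, part of speech); an assumed approximation of FIG. 5.
TOKENS = [
    ("風邪", "名詞-一般"), ("の", "助詞"), ("初期", "名詞-一般"),
    ("症状", "名詞-一般"), ("の", "助詞"), ("訴え", "名詞-一般"),
    ("が", "助詞"), ("あっ", "動詞-自立"), ("た", "助動詞"),
    ("ため", "名詞-非自立-副詞可能"), ("、", "記号-読点"),
    ("葛根", "名詞-固有名詞-地域-一般"), ("湯", "名詞-一般"), ("を", "助詞"),
    ("処方", "名詞-サ変接続"), ("し", "動詞-自立"), ("まし", "助動詞"),
    ("た", "助動詞"), ("。", "記号-句点"),
    ("L", "記号-アルファベット"), ("N", "記号-アルファベット"),  # assumed tags
    ("の", "助詞"), ("腫れ", "名詞-一般"), ("は", "助詞"),
    ("あり", "動詞-自立"), ("ませ", "助動詞"), ("ん", "助動詞"),
    ("。", "記号-句点"),
]


def run_kind(pos):
    """Classify a part of speech as a noun run, an alphabet run, or neither."""
    if pos.split("-")[0] == "名詞":
        return "noun"
    if pos == "記号-アルファベット":
        return "alpha"
    return None


def extract_compounds(tokens):
    """Join every run of two or more adjacent tokens of the same kind."""
    compounds = []
    for kind, group in groupby(tokens, key=lambda t: run_kind(t[1])):
        surfaces = [surface for surface, _ in group]
        if kind is not None and len(surfaces) >= 2:
            compounds.append("".join(surfaces))
    return compounds


print(extract_compounds(TOKENS))
```

A single noun standing between non-noun tokens, such as "風邪" or "処方", is not reported because a compound needs at least two adjacent morphemes.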

Next, the unknown word extraction function 43 executes unknown word extraction processing, which extracts unknown words (vocabulary) that are not registered in the built dictionary 24e from the morphological analysis result of the morphological analysis function 41 and the compound words extracted by the compound word extraction function 42 (step A3).

Based on the morphological analysis result output from the morphological analysis function 41, the unknown word extraction function 43 extracts the base forms of morphemes that have been assigned a part of speech corresponding to an independent word. An independent word is a word that can form a phrase (bunsetsu) by itself. The parts of speech corresponding to independent words are nouns, pronouns, verbs, adjectives, adjectival verbs (keiyō-dōshi), adverbs, adnominals, conjunctions, and interjections.

The base forms (notations) extracted from the morphological analysis result shown in FIG. 5 are the following 11 words: "風邪" (cold; noun-general), "初期" (initial; noun-general), "症状" (symptom; noun-general), "訴え" (complaint; noun-general), "ある" (to be; verb-independent), "ため" (because; noun-non-independent-adverbial), "葛根" (kakkon; noun-proper noun-region-general), "湯" (yu; noun-general), "処方" (prescription; noun-suru verb stem), "する" (to do; verb-independent), and "腫れ" (swelling; noun-general).

Further, the unknown word extraction function 43 adds the output of the compound word extraction function 42 (the compound words) to the 11 words extracted from the morphological analysis result of the morphological analysis function 41. The notations (compound words) added here are the three notations "初期症状" (noun), "葛根湯" (noun), and "LN" (noun), which brings the extracted notations to 14 words (14 notations). These 14 notations are the unknown word candidates extracted from the plaintext corpus 24a.
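The candidate list described above can be assembled as in the following sketch. The (base form, part of speech) pairs are a hand-written approximation of FIG. 5, and the set `INDEPENDENT_POS` assumes a tag set in which pronouns and adjectival verb stems are subcategories of nouns; both are assumptions made for illustration, as is the function name `independent_base_forms`.

```python
# Top-level tags treated as independent words (assumed tag set).
INDEPENDENT_POS = {"名詞", "動詞", "形容詞", "副詞", "連体詞", "接続詞", "感動詞"}

# (base form, part of speech); an assumed approximation of FIG. 5.
MORPHEMES = [
    ("風邪", "名詞-一般"), ("の", "助詞"), ("初期", "名詞-一般"),
    ("症状", "名詞-一般"), ("の", "助詞"), ("訴え", "名詞-一般"),
    ("が", "助詞"), ("ある", "動詞-自立"), ("た", "助動詞"),
    ("ため", "名詞-非自立-副詞可能"), ("、", "記号-読点"),
    ("葛根", "名詞-固有名詞-地域-一般"), ("湯", "名詞-一般"), ("を", "助詞"),
    ("処方", "名詞-サ変接続"), ("する", "動詞-自立"), ("ます", "助動詞"),
    ("た", "助動詞"), ("。", "記号-句点"),
    ("L", "記号-アルファベット"), ("N", "記号-アルファベット"),
    ("の", "助詞"), ("腫れ", "名詞-一般"), ("は", "助詞"),
    ("ある", "動詞-自立"), ("ます", "助動詞"), ("ん", "助動詞"),
    ("。", "記号-句点"),
]


def independent_base_forms(morphemes):
    """Collect distinct (base form, pos) pairs of independent words, in order."""
    found = []
    for base, pos in morphemes:
        if pos.split("-")[0] in INDEPENDENT_POS and (base, pos) not in found:
            found.append((base, pos))
    return found


base_forms = independent_base_forms(MORPHEMES)
compounds = [("初期症状", "名詞"), ("葛根湯", "名詞"), ("LN", "名詞")]
candidates = base_forms + compounds
print(len(base_forms), len(candidates))
```

Using base forms means that the inflected occurrences of "ある" in the two sentences collapse into a single candidate.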

Next, the unknown word extraction function 43 compares the list of unknown word candidates (notations) with the built dictionary 24e and extracts the unknown words that are not registered in the built dictionary 24e. That is, from the pairs of notation and part of speech included in the list of unknown word candidates, the unknown word extraction function 43 extracts and outputs those that are not registered in the built dictionary 24e.

Since "風邪" (noun-general), "初期" (noun-general), "症状" (noun-general), and "処方" (noun-suru verb stem) are registered in the built dictionary 24e, the unknown word extraction function 43 extracts the following 10 notations as unknown words: "訴え" (noun-general), "ある" (verb-independent), "ため" (noun-non-independent-adverbial), "葛根" (noun-proper noun-region-general), "湯" (noun-general), "する" (verb-independent), "腫れ" (noun-general), "初期症状" (noun), "葛根湯" (noun), and "LN" (noun).
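The comparison against the built dictionary amounts to keeping the candidates whose pair of notation and part of speech is not registered, as the following sketch shows. The part-of-speech strings are written in an assumed tag format, and `REGISTERED` contains only the four entries named in the text; the function name `extract_unknown` is hypothetical.

```python
CANDIDATES = [
    ("風邪", "名詞-一般"), ("初期", "名詞-一般"), ("症状", "名詞-一般"),
    ("訴え", "名詞-一般"), ("ある", "動詞-自立"),
    ("ため", "名詞-非自立-副詞可能"),
    ("葛根", "名詞-固有名詞-地域-一般"), ("湯", "名詞-一般"),
    ("処方", "名詞-サ変接続"), ("する", "動詞-自立"), ("腫れ", "名詞-一般"),
    ("初期症状", "名詞"), ("葛根湯", "名詞"), ("LN", "名詞"),
]

# (notation, part of speech) pairs registered in the built dictionary.
REGISTERED = {
    ("風邪", "名詞-一般"), ("初期", "名詞-一般"),
    ("症状", "名詞-一般"), ("処方", "名詞-サ変接続"),
}


def extract_unknown(candidates, registered):
    """Keep the candidates that are absent from the built dictionary."""
    return [pair for pair in candidates if pair not in registered]


unknown = extract_unknown(CANDIDATES, REGISTERED)
print(len(unknown))
```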

Note that when the notations extracted as unknown words overlap, the unknown word extraction function 43 may delete one of them. In the example above, the compound word extraction function 42 extracted "葛根湯" (noun) as a compound word, while its components "葛根" (noun-proper noun-region-general) and "湯" (noun-general) were extracted from the output of the morphological analysis function 41. In this case, the unknown word extraction function 43 deletes "葛根" and "湯", which were extracted from the morphological analysis result.

As a result, the unknown word extraction function 43 outputs 8 notations: "訴え" (noun-general), "ある" (verb-independent), "ため" (noun-non-independent-adverbial), "する" (verb-independent), "腫れ" (noun-general), "初期症状" (noun), "葛根湯" (noun), and "LN" (noun).

Further, the unknown word extraction function 43 restricts the notations (vocabulary) that are candidates for registration in the built dictionary 24e based on part of speech. For example, it keeps as registration candidates only notations whose part of speech is a noun other than a non-independent noun.

As a result, the unknown word extraction function 43 outputs 5 notations: "訴え" (noun-general), "初期症状" (noun), "腫れ" (noun-general), "葛根湯" (noun), and "LN" (noun). The following description assumes that the output of the unknown word extraction function 43 is these 5 notations.
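The three filtering steps described above (dictionary lookup, removal of morphemes covered by a compound word, and part-of-speech restriction) can be sketched as follows. This is only an illustrative sketch: the function name, the data layout (notation and part-of-speech pairs held in Python lists and sets), and the `compound_parts` mapping are assumptions made for the example, not structures defined in the specification.

```python
def extract_unknown_words(candidates, dictionary, compound_parts):
    """Return registration candidates that are absent from the dictionary.

    candidates:     list of (notation, part_of_speech) pairs (hypothetical layout)
    dictionary:     set of (notation, part_of_speech) pairs already registered
    compound_parts: dict mapping a compound notation to its component notations
    """
    # Step 1: drop pairs that are already registered in the dictionary.
    unknown = [c for c in candidates if c not in dictionary]

    # Step 2: drop morphemes that are components of an extracted compound word.
    extracted = {notation for notation, _ in unknown}
    covered = set()
    for compound, parts in compound_parts.items():
        if compound in extracted:
            covered.update(parts)
    unknown = [c for c in unknown if c[0] not in covered]

    # Step 3: keep nouns only, excluding non-independent nouns.
    return [
        (notation, pos)
        for notation, pos in unknown
        if pos.startswith("名詞") and "非自立" not in pos
    ]


candidates = [
    ("風邪", "名詞-一般"), ("初期", "名詞-一般"), ("症状", "名詞-一般"),
    ("訴え", "名詞-一般"), ("ある", "動詞-自立"), ("ため", "名詞-非自立-副詞可能"),
    ("葛根", "名詞-固有名詞-地域-一般"), ("湯", "名詞-一般"),
    ("処方", "名詞-サ変接続"), ("する", "動詞-自立"), ("腫れ", "名詞-一般"),
    ("初期症状", "名詞"), ("葛根湯", "名詞"), ("LN", "名詞"),
]
dictionary = {
    ("風邪", "名詞-一般"), ("初期", "名詞-一般"),
    ("症状", "名詞-一般"), ("処方", "名詞-サ変接続"),
}
compound_parts = {"初期症状": ["初期", "症状"], "葛根湯": ["葛根", "湯"]}

result = extract_unknown_words(candidates, dictionary, compound_parts)
print([notation for notation, _ in result])
```

With the 14 candidates of the running example, the sketch leaves the same 5 notations as the text, although in candidate-list order rather than the order used in the description.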

Next, the unknown word related information adding function 44 executes the unknown word related information adding process (step A4). For each notation output from the unknown word extraction function 43, that is, each unknown word that is a candidate for registration in the built dictionary 24e, it obtains and attaches data (unknown word related information) that the user can consult when deciding whether to add the notation to the built dictionary 24e.

Here, the unknown word related information adding function 44 obtains and attaches unknown word related information for each of the 5 notations output by the unknown word extraction function 43.

The unknown word related information includes at least one of the following: the estimated part of speech ("estimated part of speech"); the frequency of appearance found when the plaintext corpus 24a (text data) is processed ("appearance frequency"); a reading of the unknown word extracted from the Web crawling data 24d ("reading"); the snippet and information source from which the reading was extracted ("snippet/information source"); notations already registered in the built dictionary 24e whose reading, notation, or part of speech is similar to that of the unknown word (similar registered words), together with their frequency of use; and the difference between the speech recognition results (analysis results) obtained before and after a notation (headword) is added to or deleted from the dictionary.

FIG. 6 is a diagram illustrating an example of the unknown word related information output by the unknown word related information adding function 44 in the present embodiment.
The example assumes that the plaintext corpus 24a contains the Japanese text "風邪の初期症状の訴えがあったため、葛根湯を処方しました。ＬＮの腫れはありません。" ("Kakkonto was prescribed because the patient complained of the early symptoms of a cold. There is no swelling of the LN.") and that notations such as "初期症状" and "葛根湯" each appear multiple times in a large amount of other text.

The "estimated part of speech" is the part of speech obtained by morphological analysis and attached to the notation output by the unknown word extraction function 43.

The "appearance frequency" is the number of times the notation output by the unknown word extraction function 43 appears in the plaintext corpus 24a.

The "reading" is either the reading obtained by morphological analysis and attached to the notation output by the unknown word extraction function 43, or a reading for the notation (unknown word) extracted from the Web crawling data 24d. Based on the notation output by the unknown word extraction function 43, the unknown word related information adding function 44 extracts from the Web crawling data 24d a character string that corresponds to its reading.

For example, the unknown word related information adding function 44 extracts from the Web crawling data 24d a portion in which an unknown word and its reading are written together. For instance, when a hiragana or katakana string enclosed in parentheses "（）" appears immediately after the unknown word, that portion is extracted as a description of the unknown word paired with its reading.
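The parenthesized-reading pattern described above could be matched with a regular expression along the following lines. This is a minimal sketch under stated assumptions: the function name `extract_readings` is hypothetical, both full-width and ASCII parentheses are accepted, and the kana ranges are a simplification chosen for the example.

```python
import re

# Hiragana (U+3041-U+3096), katakana (U+30A1-U+30FA) and the long-vowel mark.
KANA = r"[\u3041-\u3096\u30A1-\u30FAー]+"


def extract_readings(text, unknown_word):
    """Return kana strings written in parentheses right after unknown_word."""
    pattern = re.compile(re.escape(unknown_word) + r"[（(](" + KANA + r")[）)]")
    return pattern.findall(text)


sample = "葛根湯（かっこんとう）は漢方薬です。葛根湯(カッコントウ)とも書きます。"
print(extract_readings(sample, "葛根湯"))
```

The sample sentence is invented for illustration and is not taken from the Web crawling data 24d.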

Alternatively, in a table within the Web crawling data 24d, when one column contains the unknown word and another column contains a hiragana or katakana string associated with it, the unknown word related information adding function 44 judges the two to be an unknown word and its reading and extracts them.

The "snippet/information source" consists of, for example, a snippet (possibly a partial one) in the Web crawling data 24d that contains the reading of the unknown word, and, for example, the URL (uniform resource locator) of the Web site (Web page) that contains the unknown word. When multiple snippet/information source pairs containing readings of the unknown word are extracted, the unknown word related information adding function 44 may keep all of them, or may adopt only the snippets for the reading that was given most often. The function may also consolidate the information to be used as unknown word related information, for example by giving priority to Web sites to which the user has assigned a high reliability in advance.

The registered notation and its frequency of use are a notation, extracted from the built dictionary 24e, whose reading, notation, or part of speech is similar to that of the unknown word (at least the reading matches), and the number of times that registered notation appears in the plaintext corpus 24a.

In the unknown word related information shown in FIG. 6, the registered word "晴れ" (clear weather; reading: はれ, part of speech: noun-general, appearance frequency: 1) is added for the unknown word "腫れ" (swelling), which has the same reading.

The "difference" is information on how the speech recognition result obtained with the temporary dictionary 24f, in which the unknown word has been registered, differs from the result obtained with the built dictionary 24e, in which it has not. The unknown word related information adding function 44 obtains the "difference" information as follows.

The unknown word related information adding function 44 has the dictionary editing function 48 add the combination of notation, part of speech, and reading of the unknown word to the temporary dictionary 24f, which is a copy of the built dictionary 24e. Next, the unknown word related information adding function 44 instructs the dictionary editing function 48 to run speech recognition with both the temporary dictionary 24f and the built dictionary 24e. In response, the dictionary editing function 48 has the speech recognition system 49 run speech recognition using the temporary dictionary 24f, in which the unknown word is registered, and using the built dictionary 24e. The dictionary editing function 48 supplies the audio file 24h to the speech recognition system 49 as the speech data to be recognized.

The dictionary editing function 48 outputs the speech recognition result obtained with the temporary dictionary 24f and the result obtained with the built dictionary 24e to the unknown word related information adding function 44. From these two results, the unknown word related information adding function 44 creates the information on the difference (how the speech recognition results differ).
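One way to derive such a difference is to align the two recognition results token by token and keep only the spans that changed. The sketch below uses Python's standard `difflib.SequenceMatcher` for the alignment; the specification does not state how the comparison is performed, so the alignment method, the function name, and the token lists are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def recognition_diff(before_tokens, after_tokens):
    """List (before, after) token spans that differ between two results."""
    matcher = SequenceMatcher(a=before_tokens, b=after_tokens, autojunk=False)
    return [
        (before_tokens[i1:i2], after_tokens[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical recognition results for the same audio with each dictionary.
with_built_dictionary = ["LN", "の", "晴れ", "は", "ありません"]
with_temporary_dictionary = ["LN", "の", "腫れ", "は", "ありません"]

print(recognition_diff(with_built_dictionary, with_temporary_dictionary))
```

Each returned pair shows what the built dictionary 24e produced and what the temporary dictionary 24f produced at the same position, which is the kind of before-and-after view a user could inspect.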

Note that when a registered word whose reading, notation, or part of speech is similar to that of the unknown word exists in the built dictionary 24e, the unknown word related information adding function 44 can also delete that registered word from the temporary dictionary 24f and attach the resulting new analysis result to the unknown word related information.

Extracting the difference between the speech recognition results obtained before and after an unknown word is added to or deleted from the dictionary lets the user confirm the effect of registering the unknown word before editing the dictionary. This improves the efficiency of dictionary editing and allows adverse effects of an edit to be checked and prevented in advance.

In the unknown word related information shown in FIG. 6, "difference" information is added for the unknown words "腫れ" and "葛根湯" (indicated by A and B in the figure).

In the unknown word related information shown in FIG. 6, fields for information that could not be extracted are left blank. For example, a field is blank when no snippet/information source was extracted from the Web crawling data 24d, or when no registered word with a similar reading, notation, or part of speech was extracted from the built dictionary 24e. The "difference" information may also be attached only when the reading differs from the one assigned by the morphological analysis function 41.

Next, the abbreviation estimation function 45 executes the abbreviation estimation process, which estimates abbreviations that may be contained in the unknown word related information, so that formal notations can be assigned to unknown words that are abbreviations (step A5).

The abbreviation estimation function 45 generates abbreviations from the notations registered in the formal name list 24b, the notations obtained from the morphological analysis result of the morphological analysis function 41, and the notations extracted by the compound word extraction function 42 as parts of formal names. Here, the function generates abbreviations for notations made up of multiple English words or of multiple Japanese morphemes.

For example, when the Japanese notation "リンパ節" (lymph node) is registered in the formal name list 24b and its English translation "lymph node" is registered in the Japanese-English machine translation dictionary 24c, the abbreviation estimation function 45 generates "LN" as an abbreviation of "リンパ節" by capitalizing and concatenating the initial letters of the English words.

Similarly, for the Japanese formal name "動脈注射" (arterial injection), the abbreviation estimation function 45 generates the abbreviation "動注" by concatenating the first character of each morpheme in the morphological analysis result "動脈 (noun-general) 注射 (noun-sahen connection)".
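The two abbreviation rules in these examples, capitalized initials of the English words and the first character of each Japanese morpheme, can be sketched as below. The function names are hypothetical, and the inputs are assumed to be already split into words or morphemes by an upstream step.

```python
def abbreviate_english(words):
    """Join the capitalized first letter of each English word."""
    return "".join(word[0].upper() for word in words if word)


def abbreviate_japanese(morphemes):
    """Join the first character of each Japanese morpheme."""
    return "".join(morpheme[0] for morpheme in morphemes if morpheme)


print(abbreviate_english(["lymph", "node"]))  # initials of the English translation
print(abbreviate_japanese(["動脈", "注射"]))   # first character of each morpheme
```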

Next, the formal notation candidate assigning function 46 executes the formal notation candidate assigning process, which assigns formal notation candidates and readings to those unknown words in the unknown word related information that correspond to abbreviations generated by the abbreviation estimation function 45 (step A6).

First, the formal notation candidate assigning function 46 compares the notations (unknown words) in the unknown word related information output by the unknown word related information adding function 44 with the abbreviations generated by the abbreviation estimation function 45.

When an abbreviation matches a notation (unknown word) in the unknown word related information, the formal notation candidate assigning function 46 assigns to that notation the formal name from which the abbreviation was generated, together with its reading and part of speech.

For example, in the unknown word related information shown in FIG. 6, the unknown word "LN" matches the abbreviation "LN" that the abbreviation estimation function 45 generated from the formal notation "リンパ節". In this case, the formal notation candidate assigning function 46 assigns to the unknown word "LN" the formal notation candidate "リンパ節", the reading "りんぱせつ" (rinpasetsu), and the part of speech "noun-general". The reading and part of speech of the formal notation candidate are treated as candidates for the reading and part of speech of the abbreviation "LN".
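The matching step can be pictured as a table lookup keyed by the generated abbreviation. The sketch below is illustrative only; the table layout, field names, and function name are assumptions, not structures described in the specification.

```python
def attach_formal_candidates(unknown_words, abbreviation_table):
    """Map each unknown word to the formal entries whose abbreviation matches it.

    abbreviation_table: dict mapping an abbreviation to a list of entries,
    each a dict with hypothetical keys "formal", "reading" and "pos".
    """
    return {word: abbreviation_table.get(word, []) for word in unknown_words}


abbreviation_table = {
    "LN": [{"formal": "リンパ節", "reading": "りんぱせつ", "pos": "名詞-一般"}],
}
matched = attach_formal_candidates(["腫れ", "LN"], abbreviation_table)
print(matched["LN"])
```

Holding a list per abbreviation leaves room for several formal names that shorten to the same string, in which case all of them would be offered as candidates.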

Next, the result output function 47 executes the result output process, which edits the unknown word related information output from the formal notation candidate assigning function 46 into a form for presentation to the user and outputs it (step A7). The result output function 47 arranges the unknown words contained in the unknown word related information in descending order of dictionary additional registration effect to generate the vocabulary list 24g, and displays the list on the display unit 26.

Instead of only displaying the vocabulary list 24g as a list, the result output function 47 may display the unknown word related information for each unknown word (notation) in turn.

FIG. 7 is a diagram illustrating an example of the vocabulary list 24g output from the result output function 47 in the present embodiment. The vocabulary list 24g shown in FIG. 7 is an example in which the unknown words of the unknown word related information shown in FIG. 6 have been reordered in descending order of appearance frequency.

In the vocabulary list 24g shown in FIG. 7, the formal notation candidate assigning function 46 has added, for the notation "LN", the formal notation candidate "リンパ節" (reading: りんぱせつ, part of speech: noun-general) (indicated by D in the figure) and its reading "りんぱせつ" (indicated by C in the figure).

In the description above, the unknown words (notations) in the unknown word related information are sorted in descending order of appearance frequency, but the list can also be edited based on other conditions.
For example, a notation extracted as a compound word, or a notation whose reading extracted from the Web crawling data 24d differs from the analysis result of the morphological analysis function 41, may be judged to have a high dictionary additional registration effect and placed near the top of the vocabulary list 24g. When several notations receive the same judgment on one criterion (for example, high appearance frequency, being a compound word, or having a reading from the Web crawling data 24d that differs from the analysis result of the morphological analysis function 41), they may be further ordered by another criterion for the dictionary additional registration effect.

The vocabulary list 24g shown in FIG. 7 is an example of tabular output with multiple rows per notation, but other formats are possible. For example, the unknown word related information for one notation may be shown in a single table row. For long text such as the "snippet/information source" information, the vocabulary list 24g may present only a link to the information.

In this way, the vocabulary knowledge acquisition device 10 of the present embodiment acquires readings for unknown words from the Web crawling data 24d. It is therefore not restricted to the information preset in the device, and can obtain readings of unknown words that are not registered in the built dictionary 24e even when morphological analysis and estimation rules cannot supply them. By presenting the correspondence between an abbreviation and its formal name for an unknown word, the device can also assign an appropriate reading to an abbreviation, for which the reading of the formal name is likely to apply as is. Presenting this correspondence also makes it easier to support spoken dialogue, in which the meaning of each recognized word must be understood. Because the notations that are candidates for registration in the built dictionary 24e are extracted from the plaintext corpus 24a and their readings are estimated mechanically, the work time needed for manual processing is reduced, and fewer unknown words missing from the built dictionary 24e are overlooked. For each notation (unknown word) presented in the vocabulary list 24g as a candidate for registration in the built dictionary 24e, the user can decide whether to register it based on the unknown word related information attached to it.

Note that the "difference" information for an unknown word, which is extracted by the unknown word related information adding function 44, may be generated only for the notations (unknown words) that the user selects from the vocabulary list 24g after the list has been presented. The "difference" information is extracted in the same manner as described above, so a detailed description is omitted.

Generating the "difference" information only for the notations (unknown words) selected by the user reduces the processing load of presenting the vocabulary list 24g, so the list can be presented to the user in a short time.

Next, an application example of the unknown word related information adding process performed by the unknown word related information adding function 44 in the present embodiment will be described. FIG. 8 is a flowchart showing the unknown word related information adding process in the present embodiment.

Here, the unknown word related information adding function 44 creates the unknown word related information using a reliability evaluation list that indicates the reliability of the information published by each Web site (Web server 14).

Web sites are a mixture of those that publish reliable information edited by experts and those that publish less reliable information edited by non-experts.

FIG. 9 is a diagram illustrating an example of the reliability evaluation list in the present embodiment. In the example shown in FIG. 9, the reliability evaluation list can record, for each Web site (URL), an evaluation value indicating its reliability, for example one of the three levels "○", "△", and "×". The list can also record, for each Web site (URL), whether the user adopted the information extracted from that site (a reading for an unknown word) when it was presented, as the "number of readings adopted" and the "number of readings rejected".

When multiple "snippet/information source" pairs are extracted from the Web crawling data 24d (step B1, Yes), the unknown word related information adding function 44 refers to the reliability evaluation list shown in FIG. 9 and selects the "snippet/information source" information to present to the user (step B2).

For example, the unknown word related information adding function 44 gives priority to information from sites whose reliability is "○", whose number of readings adopted is large, and whose number of readings rejected is small, and attaches that information as unknown word related information.
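The priority rule described above can be expressed as a sort key: rating first, then the number of readings adopted (more is better), then the number rejected (fewer is better). The sketch below is one possible reading of that rule. The relative weight of the three criteria, the treatment of unlisted sites as lowest priority, the field names, and the example URLs are all assumptions.

```python
RATING_ORDER = {"○": 0, "△": 1, "×": 2}


def select_source(pairs, reliability):
    """Pick the (snippet, url) pair from the most trusted site.

    reliability: dict mapping a URL to a dict with hypothetical keys
    "rating", "adopted" and "rejected".
    """
    def sort_key(pair):
        _, url = pair
        # Assumption: a site missing from the list is treated as lowest priority.
        entry = reliability.get(url, {"rating": "×", "adopted": 0, "rejected": 0})
        return (RATING_ORDER[entry["rating"]], -entry["adopted"], entry["rejected"])

    return min(pairs, key=sort_key)


reliability = {
    "https://medical.example.com": {"rating": "○", "adopted": 12, "rejected": 1},
    "https://blog.example.org": {"rating": "△", "adopted": 20, "rejected": 5},
}
pairs = [
    ("葛根湯（かっこんとう）を飲んだ", "https://blog.example.org"),
    ("葛根湯（かっこんとう）は漢方薬", "https://medical.example.com"),
]
best = select_source(pairs, reliability)
print(best[1])
```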

When information attached to the unknown word related information is adopted (step B3, Yes), that is, when a reading presented in the vocabulary list 24g is registered together with the unknown word, the unknown word related information adding function 44 increments the "number of readings adopted" in the reliability evaluation list for the Web site from which that information was extracted (step B4). The "number of readings rejected" in the reliability evaluation list is incremented, for example, when the user explicitly marks a reading as rejected, or when a reading is not selected although multiple readings were presented for one notation.

The reliability evaluation value may be set in the reliability evaluation list by the user after checking the content of the Web site, or may be set automatically according to a preset rule based on the "number of readings adopted" and the "number of readings rejected". For example, the reliability may be set to "○" when the "number of readings adopted" is at or above a reference value and the "number of readings rejected" is 0, or it may be set based on the ratio between the two counts.
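An automatic rating rule of the kind mentioned above could look like the following sketch. The reference value of 5, the 0.5 ratio threshold, the way the two criteria are combined, and the decision to rate a site with no history as "×" are hypothetical choices made for the example; the specification leaves the concrete rule open.

```python
def rate_site(adopted, rejected, min_adopted=5, min_ratio=0.5):
    """Derive a three-level rating from adoption counts (illustrative rule)."""
    # Rule from the text: enough adoptions and no rejections gives the top rating.
    if adopted >= min_adopted and rejected == 0:
        return "○"
    # Assumed ratio rule: a mostly adopted site gets the middle rating.
    total = adopted + rejected
    if total > 0 and adopted / total >= min_ratio:
        return "△"
    return "×"


for counts in [(5, 0), (3, 1), (1, 4), (0, 0)]:
    print(counts, rate_site(*counts))
```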

When the reliability evaluation list is used, only Web sites with a reliability of "○" may be used, or Web sites with a reliability of "△" may be used when the necessary information cannot be extracted from sites rated "○". Other ways of using the list are also possible. The evaluation value is not limited to three levels and may have any number of levels.

Selecting information by referring to the reliability of Web sites evaluated in advance in this way makes it possible to present highly reliable information to the user. Accumulating a history of whether the user adopted the presented information, and updating the evaluation accordingly, can further improve the reliability of the information presented to the user.

Next, an application example of the result output process performed by the result output function 47 in the present embodiment will be described. FIG. 10 is a flowchart showing the result output process in the present embodiment.

The result output function 47 sorts the unknown words in the unknown word related information (the list of unknown words) created by the formal notation candidate assigning function 46 in descending order of dictionary additional registration effect and presents them to the user.

To judge the order of dictionary additional registration effect, the result output function 47 can use, for example, the following seven indexes.
First index: the appearance frequency in the plaintext corpus 24a is high.
Second index: the appearance frequency in the formal name list 24b is high.
Third index: many notations registered in the built dictionary 24e have the same part of speech as the notation.
Fourth index: the reading extracted from the Web crawling data 24d differs from the reading estimated from the morphological analysis result.
Fifth index: the number of distinct morphemes that appear immediately before and after the notation in the plaintext corpus 24a is large.
Sixth index: the weight evaluation value tf-idf of the notation is large.
Seventh index: an index that evaluates the independence of a compound word (C-value, MC-value, etc.) is high.

Using the first index gives priority to notations with a high appearance frequency when presenting registration candidates. Using the second index gives priority to correct notations contained in the formal name list 24b, which are likely to appear in the target field, even when the plaintext corpus 24a is insufficient (for example, when the amount of data is small). Using the third index gives priority to notations with parts of speech that are likely to be needed in the built dictionary 24e (for example, adjectives that are useful for speech recognition, or proper nouns that are useful for recognizing place names and personal names). Using the fourth index gives priority to notations that are likely to be new (new words, names of entertainers, and so on), likely to be hard to read (uncommon), and therefore likely to be worth registering. Using the fifth index gives priority to independent words. Using the sixth index gives priority to notations that appear disproportionately in documents of a specific field and are therefore likely to be important words in that field. Using the seventh index gives priority to the compound-word notation when the words contained in a compound word have low independence (that is, they are always used within the compound word).

The notation weight evaluation value tf-idf is an index calculated by multiplying two indices: "tf" (term frequency, the appearance frequency of a word) and "idf" (inverse document frequency). The value of "idf" is low for words that appear in many documents, that is, general words, and high for words that appear only in specific documents. Consequently, "tf-idf", the product of "tf" and "idf", takes a high value for notations that appear frequently only in specific documents, so it can serve as an index for identifying important words specific to a given specialized field.
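As a rough illustration of the tf-idf weighting described above, the following Python sketch computes tf multiplied by idf for one notation. The function name `tf_idf`, the sample documents, and the choice idf = log(N / df) are assumptions made for this example; the text does not specify a particular idf formula.

```python
import math
from collections import Counter

def tf_idf(term, doc, docs):
    """tf-idf of `term` in `doc`, given the whole collection `docs`.

    Each document is a list of already-segmented notations (words).
    Assumption: idf = log(N / df); the source does not fix a formula.
    """
    tf = Counter(doc)[term]                  # occurrences in this document
    df = sum(1 for d in docs if term in d)   # number of documents containing the term
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)

# Hypothetical sample data.
docs = [
    ["葛根湯", "を", "処方", "葛根湯"],
    ["風邪", "を", "ひく"],
    ["本", "を", "読む"],
]
specific = tf_idf("葛根湯", docs[0], docs)  # frequent in one document only
general = tf_idf("を", docs[0], docs)       # appears in every document
```

In this toy collection the particle "を" appears in every document, so its idf and tf-idf are 0, while "葛根湯" appears twice in a single document and receives a positive weight.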

また、複合語の独立性を評価する指標Ｃ−ｖａｌｕｅは、文書における単語間の結合度を示す。 In addition, an index C-value for evaluating the independence of compound words indicates the degree of coupling between words in a document.

C-value(w) = (length(w) - 1) × (n(w) - t(w) / c(w))

w: the word of interest
length(w): length of w (number of words constituting w)
n(w): number of occurrences of w
t(w): number of occurrences of longer compound words that contain w
c(w): number of distinct longer compound words that contain w

When the word of interest is used only as part of longer compound words, the C-value is close to 0. A word with a large C-value is highly independent. Because the C-value is always 0 when w consists of a single word, a modified formula such as MC-value, which yields a nonzero evaluation value even for a single word, can be used instead.
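The C-value formula can be sketched in Python as follows. This is a minimal sketch under stated assumptions: compounds are represented as tuples of words, the hypothetical `counts` table gives n(w) for each compound with nested occurrences included, and the t(w)/c(w) term is treated as 0 when no longer compound contains w (the text does not cover that case).

```python
def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous sub-sequence of `longer`."""
    k = len(shorter)
    return any(longer[i:i + k] == shorter for i in range(len(longer) - k + 1))

def c_value(w, counts):
    """C-value(w) = (length(w) - 1) * (n(w) - t(w) / c(w)).

    `counts` maps each compound (a tuple of words) to its occurrence count,
    nested occurrences included. Assumption: if no longer compound contains
    `w`, the t(w)/c(w) term is taken to be 0.
    """
    n = counts.get(w, 0)
    longer = [x for x in counts if len(x) > len(w) and contains(x, w)]
    t = sum(counts[x] for x in longer)   # t(w): occurrences of longer compounds
    c = len(longer)                      # c(w): distinct longer compounds
    penalty = t / c if c else 0.0
    return (len(w) - 1) * (n - penalty)

# Hypothetical counts.
counts = {
    ("初期", "症状"): 6,
    ("風邪", "の", "初期", "症状"): 2,
    ("初期", "症状", "の", "訴え"): 2,
}
independent = c_value(("初期", "症状"), counts)   # (2 - 1) * (6 - 4 / 2)
single = c_value(("風邪",), counts)               # length 1, so the value is 0
```

With these counts, "初期症状" occurs more often than its enclosing compounds can explain, so its C-value is positive, while any single word evaluates to 0, which is the limitation that motivates MC-value.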

The result output function 47 determines the degree of the dictionary additional registration effect using one or a combination of the seven indices and sorts the results accordingly. Which indices to use may be selectable by the user or may be set automatically by the system. When the system sets them automatically, the choice can be based on, for example, the content (length, field) of the plaintext corpus 24a to be processed. When multiple indices are used, priorities may be assigned to the indices.

Further conditions can also be set for each index. For example, a specification of the range of values to be presented to the user can be accepted to limit the output range of the results. Specifying "appearance frequency in the plaintext corpus is 10 or more" limits the output range by frequency, and specifying "the estimated part of speech is a noun" limits the output to notations estimated to be nouns.

For the list of unknown words (notations that are registration candidates) output from the formal notation candidate giving function 46, the result output function 47 determines the degree of the dictionary additional registration effect based on the preset indices (step C1) and reorders the unknown words according to the determination result (step C2).

結果出力機能４７は、指標に基づいて表記の順番を並べ替えた語彙リスト２４ｇを出力する（ステップＣ３）。 The result output function 47 outputs the vocabulary list 24g in which the notation order is rearranged based on the index (step C3).
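Steps C1 to C3, together with the optional output-range conditions, might be sketched as below. The record fields (`notation`, `pos`, `frequency`, `tf_idf`), the function name `rank_candidates`, and the sample values are all hypothetical; the sketch only shows one way to combine prioritized indices with filtering.

```python
def rank_candidates(candidates, index_keys, min_frequency=None, pos=None):
    """Filter the candidates, then sort them by the selected indices.

    `index_keys` lists index names in priority order; larger values rank first.
    """
    selected = [
        c for c in candidates
        if (min_frequency is None or c["frequency"] >= min_frequency)
        and (pos is None or c["pos"] == pos)
    ]
    # Tuples compare element by element, so earlier keys take priority.
    return sorted(selected, key=lambda c: tuple(-c[k] for k in index_keys))

# Hypothetical unknown-word records.
candidates = [
    {"notation": "葛根湯", "pos": "noun", "frequency": 12, "tf_idf": 3.1},
    {"notation": "初期症状", "pos": "noun", "frequency": 12, "tf_idf": 4.0},
    {"notation": "新チューハイ", "pos": "noun", "frequency": 4, "tf_idf": 5.2},
    {"notation": "処方する", "pos": "verb", "frequency": 30, "tf_idf": 1.0},
]
ranked = rank_candidates(candidates, ["frequency", "tf_idf"],
                         min_frequency=10, pos="noun")
vocabulary_list = [c["notation"] for c in ranked]
```

Here the conditions "frequency is 10 or more" and "estimated part of speech is a noun" remove two records, and the tie on frequency is broken by the lower-priority tf-idf index.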

In this way, multiple evaluation indices are provided and can be combined flexibly, and the output range can be limited, which improves accuracy so that the top of the output vocabulary list 24g contains more of the content the user is looking for.

次に、本実施形態における複合語抽出機能４２による複合語抽出処理の応用例について説明する。図１１は、本実施形態における複合語抽出処理を示すフローチャートである。 Next, an application example of the compound word extraction process by the compound word extraction function 42 in this embodiment will be described. FIG. 11 is a flowchart showing compound word extraction processing in the present embodiment.

In general, there are techniques that decide whether adjacent morphemes form a compound word from their parts of speech. For example, a sequence of "noun-general" morphemes is known to be judged a compound noun. It is also common to apply only high-precision rules such as "a prefix connects to a noun", "a noun connects to a suffix", and "nouns joined by the case particle "no" (の) are connected together with the particle". In such techniques, punctuation marks such as "、" and "。" and the space " " are usually treated as delimiters and not as elements of a compound word.

In recent years, however, proper nouns such as product names, titles of various content (books, movies, animation, etc.), and person names such as stage names have, in many fields, come to include characters (punctuation marks, spaces, symbols, etc.) and parts of speech that high-precision rules treat as definite word boundaries.

Therefore, the compound word extraction function 42 in this embodiment extracts compound words from the morphological analysis result output by the morphological analysis function 41 according to the procedure shown in FIG. 11, thereby extracting as compound word candidates all combinations of spans in which parts of speech that may constitute a compound word appear adjacently.

That is, for the output of the morphological analysis function 41 (shown in FIG. 5), the compound word extraction function 42 determines whether a preset character or part of speech that marks a definite word boundary is included (step D1). If none is included (step D2, No), the compound word extraction function 42 determines whether the concatenation of morphemes begins with a character or part of speech that cannot begin a compound word (step D3). If it does not (step D4, No), the compound word extraction function 42 determines whether it ends with a character or part of speech that cannot end a compound word (step D5). If it does not, the compound word extraction function 42 sets the entire notation as a compound word candidate (step D7).

The compound word extraction function 42 can perform compound word extraction by referring to, for example, the list shown in FIG. 12. In each row of the list shown in FIG. 12, when both "part of speech" and "expression" are given, a morpheme matches if both its part of speech and its expression match; when only one is given, the other is treated as unconditional. Note that not only the maximum-length character string that does not match the list of FIG. 12 but also its substrings are taken as compound word candidates.

When the compound word extraction function 42 extracts compound words based on the list shown in FIG. 12, the compound word candidates "風邪 (cold), 風邪の初期 (early stage of a cold), 風邪の初期症状 (initial symptoms of a cold), 風邪の初期症状の訴え (complaint of initial symptoms of a cold), 初期症状 (initial symptoms), 初期症状の訴え (complaint of initial symptoms), 葛根湯 (kakkonto), 葛根湯を処方 (prescribe kakkonto)" can be extracted from the morphological analysis result shown in FIG. 5.
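The checks in steps D1 to D7 can be sketched as an enumeration over all contiguous morpheme spans. The rule lists below are hypothetical stand-ins for the list of FIG. 12, whose actual contents are not reproduced here, and the sample morphemes are likewise illustrative rather than the analysis result of FIG. 5, so the output is not meant to reproduce the candidate list given above.

```python
# Hypothetical stand-ins for the FIG. 12 list. Each rule is
# (part_of_speech, expression); None means "no condition" on that field.
BREAKS = [("particle", "に"), (None, "。"), (None, "、")]  # definite word boundaries
BAD_START = [("particle", None)]                         # cannot begin a compound
BAD_END = [("particle", None)]                           # cannot end a compound

def matches(morpheme, rules):
    """A morpheme (surface, pos) matches if every specified field of a rule agrees."""
    surface, pos = morpheme
    return any((rp is None or rp == pos) and (rs is None or rs == surface)
               for rp, rs in rules)

def extract_candidates(morphemes):
    """Return every contiguous span that passes the D1-D7 checks."""
    candidates = []
    for i in range(len(morphemes)):
        for j in range(i + 1, len(morphemes) + 1):
            span = morphemes[i:j]
            if any(matches(m, BREAKS) for m in span):   # steps D1/D2
                break   # longer spans from i contain the same boundary
            if matches(span[0], BAD_START):             # steps D3/D4
                break   # every span from i starts with the same morpheme
            if matches(span[-1], BAD_END):              # step D5
                continue
            candidates.append("".join(s for s, _ in span))  # step D7
    return candidates

morphemes = [("風邪", "noun"), ("の", "particle"), ("初期症状", "noun"),
             ("に", "particle"), ("葛根湯", "noun")]
candidates = extract_candidates(morphemes)
```

Because every span is tested rather than only the longest one, substrings such as "初期症状" are kept alongside "風邪の初期症状", matching the note that substrings of the maximum-length string are also candidates.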

If the entries for periods and commas are deleted from the list of FIG. 12, then from the original sentence "新チューハイ「○○○。」を発表した。" (roughly, "Announced the new chuhai '○○○。'"), for example, the compound word candidates "新チューハイ" (new chuhai), "○○○。", and "発表" (announcement) can be extracted.

The compound word extraction function 42 is not limited to creating concatenations of morphemes from the output of the morphological analysis function 41. For example, word candidates may be cut out by N-gram from the text of the plaintext corpus 24a or the original text of the formal name list 24b, and notations whose boundary positions coincide with those of the morphological analysis result and that do not match the list of FIG. 6 may be used as compound word candidates.
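One possible reading of this N-gram variant is sketched below: character N-grams are cut from the raw text, and only those whose start and end offsets fall on morpheme boundaries are kept. The function name `ngram_candidates` and the parameter `max_n` are assumptions, and the subsequent filtering against the rule list is omitted.

```python
def ngram_candidates(text, morphemes, max_n):
    """Character N-grams of `text` whose both ends lie on morpheme boundaries.

    `morphemes` is the list of surface strings from morphological analysis;
    their concatenation is assumed to equal `text`.
    """
    boundaries = {0}
    offset = 0
    for surface in morphemes:
        offset += len(surface)
        boundaries.add(offset)

    result = []
    for i in range(len(text)):
        for n in range(1, max_n + 1):
            j = i + n
            if j > len(text):
                break
            if i in boundaries and j in boundaries:
                result.append(text[i:j])
    return result

ngrams = ngram_candidates("葛根湯を処方", ["葛根湯", "を", "処方"], max_n=6)
```

N-grams such as "根湯" that cut through a morpheme are discarded; spans such as "を" or "を処方" survive this step and would be removed only by the rule-list check.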

By extracting compound word candidates flexibly in this way, omissions in compound word extraction can be reduced compared with the conventional approach of extracting a limited set of candidates by applying high-precision rules.

The above description takes as an example the case of supporting the addition of vocabulary to the speech recognition dictionary (constructed dictionary 24e) of the speech recognition system 49, but the vocabulary knowledge acquisition device 10 in this embodiment can also be used to add notations to dictionaries used in systems other than speech recognition. For example, it can target the kana-kanji conversion dictionary of a Japanese input system (word processor), or a term dictionary for classifying information distributed on the Internet (blogs, microblogs, corporate announcements, etc.) by content.

また、語彙知識獲得装置１０は、日本語の表記だけでなく、他の言語の表記を対象とすることも可能である。 Moreover, the vocabulary knowledge acquisition apparatus 10 can target not only Japanese notation but also other language notations.

In the above description, readings of unknown words are extracted from the Web crawling data 24d acquired from websites, but other data acquired from outside the vocabulary knowledge acquisition device 10 can also be used. For example, data from a database system whose data is continuously updated, or data recorded on a specific electronic device, can be acquired through a recording medium or the network 12 and used for the vocabulary knowledge acquisition process.

The method described in the embodiment can also be stored, as a program executable by a computer, in a storage medium such as a magnetic disk (flexible disk, hard disk, etc.), an optical disk (CD-ROM, DVD, etc.), a magneto-optical disk (MO), or a semiconductor memory, and distributed.

また、記憶媒体としては、プログラムを記憶でき、かつコンピュータが読み取り可能な記憶媒体であれば、その記憶形式は何れの形態であっても良い。 In addition, as long as the storage medium can store a program and can be read by a computer, the storage format may be any form.

また、記憶媒体からコンピュータにインストールされたプログラムの指示に基づきコンピュータ上で稼働しているＯＳ（オペレーティングシステム）や、データベース管理ソフト、ネットワークソフト等のＭＷ（ミドルウェア）等が上記実施形態を実現するための各処理の一部を実行しても良い。 In addition, an OS (operating system) running on a computer based on an instruction of a program installed in the computer from a storage medium, MW (middleware) such as database management software, network software, and the like realize the above-described embodiment. A part of each process may be executed.

さらに、実施形態における記憶媒体は、コンピュータと独立した媒体に限らず、ＬＡＮやインターネット等により伝送されたプログラムをダウンロードして記憶または一時記憶した記憶媒体も含まれる。 Furthermore, the storage medium in the embodiment is not limited to a medium independent of the computer, but also includes a storage medium in which a program transmitted via a LAN, the Internet, or the like is downloaded and stored or temporarily stored.

また、記憶媒体は１つに限らず、複数の媒体から上記の各実施形態における処理が実行される場合も本発明における記憶媒体に含まれ、媒体構成は何れの構成であっても良い。 Further, the number of storage media is not limited to one, and the case where the processing in each of the above embodiments is executed from a plurality of media is also included in the storage media in the present invention, and the media configuration may be any configuration.

なお、実施形態におけるコンピュータは、記憶媒体に記憶されたプログラムに基づき、実施形態における各処理を実行するものであって、パーソナルコンピュータ等の１つからなる装置、複数の装置がネットワーク接続されたシステム等の何れの構成であっても良い。 The computer in the embodiment executes each process in the embodiment based on a program stored in a storage medium. The computer includes a single device such as a personal computer, and a system in which a plurality of devices are connected to a network. Any configuration may be used.

また、実施形態におけるコンピュータとは、情報処理機器に含まれる演算処理装置、マイコン等も含み、プログラムによって本発明の機能を実現することが可能な機器、装置を総称している。 In addition, the computer in the embodiment includes an arithmetic processing device, a microcomputer, and the like included in the information processing device, and is a generic term for devices and devices that can realize the functions of the present invention by a program.

Although several embodiments of the present invention have been described, these embodiments are presented as examples and are not intended to limit the scope of the invention. These novel embodiments can be implemented in various other forms, and various omissions, replacements, and changes can be made without departing from the gist of the invention. These embodiments and their modifications are included in the scope and gist of the invention, and in the invention described in the claims and its equivalents.

Description of symbols: 10 ... vocabulary knowledge acquisition device, 12 ... network, 14 ... Web server, 20 ... processor, 21 ... memory, 21a ... vocabulary knowledge acquisition program, 21b ... speech recognition program, 24 ... storage device, 24a ... plaintext corpus, 24b ... formal name list, 24c ... Japanese-English machine translation dictionary, 24d ... Web crawling data, 24e ... constructed dictionary, 24f ... provisionally constructed dictionary, 24g ... vocabulary list, 25 ... input unit, 26 ... display unit, 27 ... voice input unit, 28 ... voice output unit, 29 ... communication unit, 41 ... morphological analysis function, 42 ... compound word extraction function, 43 ... unknown word extraction function, 44 ... unknown word related information giving function, 45 ... abbreviation estimation function, 46 ... formal notation candidate giving function, 47 ... result output function, 48 ... dictionary editing function, 49 ... speech recognition system.

Claims

A vocabulary knowledge acquisition device comprising:
morphological analysis means for dividing text contained in a plaintext corpus into words and assigning a part of speech to each word;
compound word extraction means for extracting compound words based on the result of the morphological analysis;
unknown word extraction means for comparing the words obtained by the morphological analysis and the compound words obtained by the compound word extraction with the registered words of a constructed dictionary, and extracting unknown words not registered in the constructed dictionary;
unknown word related information giving means for extracting reading candidates for the unknown word from externally acquired data and giving them to the unknown word as unknown word related information;
abbreviation estimation means for generating abbreviations from compound words;
formal notation candidate giving means for giving, when an abbreviation generated by the abbreviation estimation means matches the unknown word, the compound word from which the abbreviation was generated to the unknown word as a formal notation candidate; and
result output means for combining the unknown word, the unknown word related information, and the formal notation candidates, arranging them in descending order of dictionary additional registration effect, and outputting them as a vocabulary list.

The vocabulary knowledge acquisition device according to claim 1, wherein the unknown word related information giving means extracts and gives, as the unknown word related information, at least one of: an estimated part of speech; an appearance frequency; a reading/snippet/information source extracted from Web crawling data; registered words of the constructed dictionary with a similar reading/notation/part of speech; the usage frequency of registered words similar to the unknown word; and information on the difference between analysis results when registered words are added to or deleted from the dictionary.

The vocabulary knowledge acquisition device according to claim 2, wherein the unknown word related information giving means has a Web site reliability evaluation list and, when giving the reading/snippet information extracted from the Web crawling data, selects the information based on the evaluation values of Web sites set in the reliability evaluation list.

The vocabulary knowledge acquisition device according to claim 1, wherein the result output means sorts the results based on one or a combination of a plurality of indices serving as criteria for determining the dictionary additional registration effect.

The vocabulary knowledge acquisition device according to claim 1, wherein the compound word extraction means extracts, from the result of the morphological analysis, all combinations of spans in which parts of speech that may constitute a compound word appear adjacently as compound word candidates.

The vocabulary knowledge acquisition device according to claim 1, further comprising dictionary editing means for acquiring information on the difference between the results of analysis using the constructed dictionary before and after an unknown word included in the vocabulary list is added to the constructed dictionary, wherein the unknown word related information giving means gives the difference information of the analysis results to the unknown word.

A vocabulary knowledge acquisition method comprising:
performing morphological analysis that divides text contained in a plaintext corpus into words and assigns a part of speech to each word;
extracting compound words based on the result of the morphological analysis;
comparing the words obtained by the morphological analysis and the compound words obtained by the compound word extraction with the registered words of a constructed dictionary, and extracting unknown words not registered in the constructed dictionary;
extracting reading candidates for the unknown word from externally acquired data and giving them to the unknown word as unknown word related information;
generating abbreviations from compound words;
giving, when an abbreviation matches the unknown word, the compound word from which the abbreviation was generated to the unknown word as a formal notation candidate; and
combining the unknown word, the unknown word related information, and the formal notation candidates, arranging them in descending order of dictionary additional registration effect, and outputting them as a vocabulary list.

A vocabulary knowledge acquisition program for causing a computer to function as:
morphological analysis means for dividing text contained in a plaintext corpus into words and assigning a part of speech to each word;
compound word extraction means for extracting compound words based on the result of the morphological analysis;
unknown word extraction means for comparing the words obtained by the morphological analysis and the compound words obtained by the compound word extraction with the registered words of a constructed dictionary, and extracting unknown words not registered in the constructed dictionary;
unknown word related information giving means for extracting reading candidates for the unknown word from externally acquired data and giving them to the unknown word as unknown word related information;
abbreviation estimation means for generating abbreviations from compound words;
formal notation candidate giving means for giving, when an abbreviation generated by the abbreviation estimation means matches the unknown word, the compound word from which the abbreviation was generated to the unknown word as a formal notation candidate; and
result output means for combining the unknown word, the unknown word related information, and the formal notation candidates, arranging them in descending order of dictionary additional registration effect, and outputting them as a vocabulary list.