JP2010140107A

JP2010140107A - Method, apparatus, program, and computer readable recording medium for registering unknown word

Info

Publication number: JP2010140107A
Application number: JP2008313676A
Authority: JP
Inventors: Chihiro Yamamoto; 千尋山本; Katsuto Bessho; 克人別所; Toshiro Uchiyama; 俊郎内山; Masashi Uchiyama; 匡内山
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2008-12-09
Filing date: 2008-12-09
Publication date: 2010-06-24

Abstract

PROBLEM TO BE SOLVED: To extract, with high accuracy, unknown words for use in a morphological analysis apparatus. SOLUTION: A query log is used, because its queries are often entered as sets of search keywords each consisting of a single compound-word phrase. Each query of the query log is divided at spaces, and unknown word candidates are extracted from the resulting set of distinct search keywords. Unknown word candidates that are frequently used are extracted as unknown words and registered in a reference dictionary. One method of extracting the unknown words merges the set of unknown word candidates with the reference dictionary to create a temporary dictionary, morphologically analyzes a corpus using the temporary dictionary, counts the occurrence frequency of each unknown word candidate, and takes as unknown words those candidates whose occurrence frequency is greater than a predetermined threshold. Alternatively, each unknown word candidate is submitted to a search system as a search query, and candidates whose number of search results is greater than a predetermined threshold are taken as unknown words. COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to an unknown word registration method, apparatus, program, and computer-readable recording medium, and more particularly to an unknown word registration method, apparatus, program, and computer-readable recording medium for automatically collecting unknown words for use in a morphological analyzer and registering them in the dictionary of the morphological analyzer.

One conventional unknown word extraction technique uses a probabilistic model of words. This technique judges how word-like a character string is by using probability models of the character strings that make up words and of the sentences in which words appear (see, for example, Non-Patent Document 1).

Another technique uses morphological analysis to extract, as unknown words, items that are not included in the word dictionary. This technique extracts as an unknown word a character string that cannot be analyzed, or that string combined with the words around it (see, for example, Patent Document 1).
Patent Document 1: JP 11-338863 A
Non-Patent Document 1: Masaaki Nagata, "Obtaining Vocabulary from Text Based on Probability Model of Unknown Words and Expected Frequency of Words", IPSJ Journal, vol. 40, no. 9, 1999

However, the unknown word extraction technique based on a probability model requires a word-segmented corpus for training the model, and creating such a corpus is costly. Moreover, it cannot extract new types of unknown words that do not appear in that corpus.

The unknown word extraction technique based on morphological analysis mainly extracts unknown words on the basis of character type information, for example by treating a run of characters of the same character type as one word. With this technique it is difficult to extract words composed of diverse character types.

The present invention has been made in view of the above points, and an object thereof is to provide an unknown word registration method, apparatus, program, and computer-readable recording medium that can extract unknown words with high accuracy.

FIG. 1 is a diagram for explaining the principle of the present invention.

The present invention (Claim 1) is an unknown word registration method for registering unknown words in a reference dictionary used for morphological analysis, the method performing:
an unknown word candidate acquisition step (step 1) in which, when a query log is input, unknown word candidate acquisition means extracts unknown word candidates from a search keyword set, which is the set of distinct search keywords obtained by dividing each query of the query log at spaces, and stores them in storage means;
an unknown word extraction step (step 2) in which unknown word extraction means extracts, as unknown words, those unknown word candidates stored in the storage means that are frequently used, and stores them in the storage means; and
an unknown word registration step (step 3) in which unknown word registration means registers the unknown words stored in the storage means in the reference dictionary.

Further, in the present invention (Claim 2), in the unknown word candidate acquisition step (step 1),
a search keyword of the search keyword set is made an unknown word candidate only when it is not included in the reference dictionary,
or
a search keyword of the search keyword set is not made an unknown word candidate when it consists of a verb stem and an inflectional ending, or of a single morpheme only, and is made an unknown word candidate otherwise.

Further, in the present invention (Claim 3), in the unknown word extraction step (step 2),
a temporary dictionary is generated by merging the set of unknown word candidates stored in the storage means with the reference dictionary, morphological analysis of a corpus is performed using the temporary dictionary,
the occurrence frequency of each unknown word candidate is counted, and unknown word candidates whose occurrence frequency is greater than a predetermined threshold are stored in the storage means as unknown words.

Further, in the present invention (Claim 4), in the unknown word extraction step (step 2),
a search is performed by a search system using each unknown word candidate stored in the storage means as a search query, and
unknown word candidates whose number of search results is greater than a predetermined threshold are stored in the storage means as unknown words.

FIG. 2 is a principle configuration diagram of the present invention.

The present invention (Claim 5) is an unknown word registration device for registering unknown words in a reference dictionary 2 used for morphological analysis, the device comprising:
unknown word candidate acquisition means 10 for, when a query log is input, extracting unknown word candidates from a search keyword set, which is the set of distinct search keywords obtained by dividing each query of the query log at spaces, and storing them in storage means;
unknown word extraction means 20 for extracting, as unknown words, those unknown word candidates stored in the storage means that are frequently used, and storing them in the storage means; and
unknown word registration means 30 for registering the unknown words stored in the storage means in the reference dictionary 2.

Further, in the present invention (Claim 6), the unknown word candidate acquisition means 10 includes either
means for making a search keyword of the search keyword set an unknown word candidate only when it is not included in the reference dictionary,
or
means for not making a search keyword of the search keyword set an unknown word candidate when it consists of a verb stem and an inflectional ending, or of a single morpheme only, and for making it an unknown word candidate otherwise.

Further, in the present invention (Claim 7), the unknown word extraction means 20 includes
means for generating a temporary dictionary by merging the set of unknown word candidates stored in the storage means with the reference dictionary 2, performing morphological analysis of a corpus using the temporary dictionary, counting the occurrence frequency of each unknown word candidate, and storing in the storage means, as unknown words, the unknown word candidates whose occurrence frequency is greater than a predetermined threshold.

Further, in the present invention (Claim 8), the unknown word extraction means 20 includes
means for performing a search by a search system using each unknown word candidate stored in the storage means as a search query, and storing in the storage means, as unknown words, the unknown word candidates whose number of search results is greater than a predetermined threshold.

The present invention (Claim 9) is an unknown word registration program for causing a computer to function as each of the means constituting the unknown word registration device according to any one of Claims 5 to 8.

The present invention (Claim 10) is a computer-readable recording medium storing the unknown word registration program according to Claim 9.

In the present invention, in order to extract unknown words with correct morpheme boundaries, a query log is used, because its queries are often entered as sets of search keywords each consisting of a single compound-word phrase. Each input query is divided at spaces to reduce it to minimal units. The search keywords obtained in this way rarely consist of multiple compound-word phrases and, when frequently used, can be regarded as having the character of a single morpheme. The search keywords obtained by the division are taken as unknown word candidates, and whether a candidate is used frequently enough to be judged an unknown word is determined by either of two methods.

Thus, unlike unknown word extraction based on a probability model, the present invention does not require a word-segmented corpus and does not suffer from the problem that new types of unknown words that do not fit the probability model cannot be extracted. In addition, unlike unknown word extraction based on morphological analysis, it can extract unknown words composed of diverse character types.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

[First Embodiment]
FIG. 3 shows the configuration of the unknown word registration device according to the first embodiment of the present invention.

The unknown word registration device shown in the figure comprises a query log storage unit 1, a reference dictionary 2, a temporary dictionary 3, a corpus storage unit 4, a keyword set memory 5, an unknown word candidate set memory 6, an unknown word set memory 7, an unknown word candidate acquisition unit 10, an unknown word extraction unit 20, and an unknown word registration unit 30. Of these, the query log storage unit 1, the reference dictionary 2, the temporary dictionary 3, and the corpus storage unit 4 are storage media such as hard disk devices, while the keyword set memory 5, the unknown word candidate set memory 6, and the unknown word set memory 7 are memories that temporarily store processing results.

The unknown word candidate acquisition unit 10 refers to the reference dictionary 2 on the basis of the query log acquired from the query log storage unit 1, collects unknown word candidates that are not included in the reference dictionary 2, and stores them in the unknown word candidate set memory 6.

The unknown word extraction unit 20 acquires the unknown word candidates from the unknown word candidate set memory 6, extracts unknown words from among them, and stores the unknown words in the unknown word set memory 7.

The unknown word registration unit 30 registers the unknown words acquired from the unknown word set memory 7 in the reference dictionary 2.

The operation of each component is described below.

The operation of the unknown word candidate acquisition unit 10 is as follows. The unknown word candidate acquisition unit 10 extracts unknown word candidates from the query log in the query log storage unit 1.

FIG. 4 is a flowchart of the operation of the unknown word candidate acquisition unit in the first embodiment of the present invention.

First, the unknown word candidate acquisition unit 10 acquires the query log from the query log storage unit 1, generates a search keyword set, which is the set of distinct search keywords obtained by dividing each query at spaces, and stores it in the keyword set memory 5 (step 101).
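
As a non-limiting illustration, step 101 could be sketched in Python as follows. The function name `build_keyword_set` and the choice of delimiters (ASCII whitespace and the full-width space U+3000) are assumptions made for this sketch; the embodiment specifies only that a query is divided at spaces.

```python
import re

# Delimiter pattern (an assumption of this sketch): runs of whitespace.
# For Python str patterns, \s already matches the full-width space U+3000;
# it is listed explicitly only to make the intent visible.
_SPACE = re.compile(r"[\s\u3000]+")


def build_keyword_set(query_log):
    """Return the set of distinct search keywords in an iterable of queries."""
    keywords = set()
    for query in query_log:
        for keyword in _SPACE.split(query.strip()):
            if keyword:  # skip the empty string produced by an empty query
                keywords.add(keyword)
    return keywords
```

For a log containing the queries `東京 ラーメン` and `ラーメン　つけ麺`, this sketch yields the three distinct keywords 東京, ラーメン, and つけ麺.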

Next, the search keywords are acquired from the keyword set memory 5, an unknown word candidate set is generated, and the set is stored in the unknown word candidate set memory 6 (step 102). Two methods are conceivable for doing so.

The first method makes a search keyword an unknown word candidate only when it is not included in the reference dictionary 2. The second method performs morphological analysis on the search keyword using the reference dictionary 2; if the search keyword consists of a verb stem and an inflectional ending, or of a single morpheme, it is not made an unknown word candidate, and otherwise it is made an unknown word candidate.
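
The two methods of step 102 might be sketched as follows. This is an illustrative sketch only: the function names, the part-of-speech labels `verb_stem` and `inflection`, and the greedy longest-match `make_toy_analyzer` are all hypothetical stand-ins, since the embodiment does not name a particular morphological analyzer or tag set.

```python
def candidates_by_lookup(keywords, dictionary):
    """Method 1: keep only keywords absent from the reference dictionary."""
    return {k for k in keywords if k not in dictionary}


def candidates_by_analysis(keywords, analyze):
    """Method 2: drop keywords analyzed as a single morpheme or as
    verb stem + inflectional ending; keep everything else."""
    result = set()
    for k in keywords:
        morphemes = analyze(k)  # list of (surface, part_of_speech) pairs
        tags = [pos for _, pos in morphemes]
        if len(morphemes) == 1 or tags == ["verb_stem", "inflection"]:
            continue
        result.add(k)
    return result


def make_toy_analyzer(dictionary):
    """Hypothetical analyzer: greedy longest match over {surface: pos}.
    Characters not covered by the dictionary become one-character
    morphemes tagged "unknown"."""
    max_len = max(map(len, dictionary), default=1)

    def analyze(text):
        out, i = [], 0
        while i < len(text):
            n = min(max_len, len(text) - i)
            while n > 1 and text[i:i + n] not in dictionary:
                n -= 1
            piece = text[i:i + n]
            out.append((piece, dictionary.get(piece, "unknown")))
            i += n
        return out

    return analyze
```

Method 1 is the cheaper test, while method 2 additionally filters out keywords that the existing dictionary already analyzes cleanly as an inflected verb.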

Next, the unknown word extraction unit 20 will be described.

The unknown word extraction unit 20 acquires the unknown word candidates from the unknown word candidate set memory 6, performs morphological analysis on the corpus in the corpus storage unit 4 with those candidates registered in the temporary dictionary 3 as single words, and obtains an unknown word set on the basis of the counted occurrence frequencies of the candidates.

FIG. 5 is a flowchart of the operation of the unknown word extraction unit in the first embodiment of the present invention.

The unknown word extraction unit 20 generates the temporary dictionary 3 by merging the unknown word candidates contained in the unknown word candidate set memory 6, assumed to be nouns, with the contents of the reference dictionary 2 (step 201). Next, it performs morphological analysis of the corpus in the corpus storage unit 4 using the temporary dictionary 3 (step 202) and, from the result, counts the occurrence frequency of each unknown word candidate contained in the unknown word candidate set memory 6 that was output as a single morpheme (step 203). Unknown word candidates whose occurrence frequency is equal to or greater than a threshold are extracted as unknown words and stored in the unknown word set memory 7 (step 204).
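
Steps 201 to 204 could be sketched as below. The sketch is given under stated assumptions: `segment` is a hypothetical greedy longest-match segmenter used in place of a real morphological analyzer, the dictionaries are plain `{surface: part_of_speech}` mappings, and the function names are illustrative rather than taken from the embodiment.

```python
from collections import Counter


def segment(text, lexicon):
    """Hypothetical stand-in for morphological analysis: greedy longest
    match against the lexicon, falling back to single characters."""
    max_len = max(map(len, lexicon), default=1)
    morphemes, i = [], 0
    while i < len(text):
        n = min(max_len, len(text) - i)
        while n > 1 and text[i:i + n] not in lexicon:
            n -= 1
        morphemes.append(text[i:i + n])
        i += n
    return morphemes


def extract_unknown_words(candidates, reference_dict, corpus, threshold):
    # Step 201: merge the candidates, assumed to be nouns, with the
    # reference dictionary to form the temporary dictionary.
    temporary = dict(reference_dict)
    for c in candidates:
        temporary.setdefault(c, "noun")
    # Steps 202-203: analyze the corpus with the temporary dictionary and
    # count each candidate that is output as a single morpheme.
    counts = Counter()
    for sentence in corpus:
        for m in segment(sentence, temporary):
            if m in candidates:
                counts[m] += 1
    # Step 204: keep candidates whose frequency is at or above the threshold.
    return {c for c in candidates if counts[c] >= threshold}
```

Counting only morphemes emitted by the analyzer, rather than raw substring matches, is what lets the corpus act as a check that the candidate actually behaves as one word in running text.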

Next, the unknown word registration unit 30 will be described.

The unknown word registration unit 30 registers, in the reference dictionary 2, the unknown words that were extracted by the unknown word extraction unit 20 and stored in the unknown word set memory 7.
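
The registration step might be sketched as follows. The embodiment does not specify the format of a dictionary entry, so the `{surface: part_of_speech}` mapping, the default part of speech `noun`, and the function name `register_unknown_words` are assumptions of this sketch.

```python
def register_unknown_words(unknown_words, reference_dict, pos="noun"):
    """Add each unknown word to the reference dictionary in place.

    Existing entries are left untouched. Returns the number of entries added.
    """
    added = 0
    for word in sorted(unknown_words):  # sorted only for a stable order
        if word not in reference_dict:
            reference_dict[word] = pos
            added += 1
    return added
```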

[Second Embodiment]
The present embodiment differs from the first embodiment in the processing of the unknown word extraction unit.

FIG. 6 shows the configuration of the unknown word registration device according to the second embodiment of the present invention. In the figure, components identical to those in FIG. 3 are given the same reference numerals, and their description is omitted.

The unknown word registration device shown in the figure comprises a query log storage unit 1, a reference dictionary 2, a keyword set memory 5, an unknown word candidate set memory 6, an unknown word set memory 7, an unknown word candidate acquisition unit 10, an unknown word extraction unit 25, and an unknown word registration unit 30, and the unknown word extraction unit 25 is connected to an external search system 40.

The processing of the unknown word candidate acquisition unit 10 and the unknown word registration unit 30 is the same as in the first embodiment.

The unknown word extraction unit 25 searches the search system 40 using each unknown word candidate stored in the unknown word candidate set memory 6 as a search query, and determines from the number of search results whether the unknown word candidate is an unknown word.

FIG. 7 is a flowchart of the operation of the unknown word extraction unit in the second embodiment of the present invention.

The unknown word extraction unit 25 selects an unknown word candidate X to be processed from among the unknown word candidates stored in the unknown word candidate set memory 6 (step 301). A search is performed by the search system 40 using the unknown word candidate X as the search query (step 302). A search engine targeting the Web may be used as the search system 40. The number of search results obtained from the search system 40 is compared with a predetermined threshold; if the number of search results is equal to or greater than the threshold (step 303, Yes), the unknown word candidate is extracted as an unknown word and stored in the unknown word set memory 7. If, on the other hand, the number of search results is smaller than the threshold (step 303, No), the next unknown word candidate X to be processed is selected and the above processing is repeated (step 301).
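
The loop of steps 301 to 303 could be sketched as below. The callable `hit_count` is a hypothetical interface standing in for the search system 40; no particular search engine API is implied, and the usage shown substitutes a fixed table of made-up counts for a real search.

```python
def extract_by_hit_count(candidates, hit_count, threshold):
    """Return the candidates whose number of search results is at or
    above the threshold.

    hit_count: callable taking a query string and returning the number
    of search results (an assumed interface, not a real API).
    """
    unknown_words = set()
    for candidate in candidates:       # step 301: pick the next candidate X
        hits = hit_count(candidate)    # step 302: search with X as the query
        if hits >= threshold:          # step 303: compare with the threshold
            unknown_words.add(candidate)
    return unknown_words


# Illustrative usage with fabricated counts in place of a search engine.
fake_index = {"つけ麺": 12000, "ググ": 3}
result = extract_by_hit_count({"つけ麺", "ググ"}, lambda q: fake_index.get(q, 0), 1000)
```

Passing the search function in as a parameter keeps the thresholding logic independent of whichever search system is actually connected.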

The components of FIG. 3 of the first embodiment and of FIG. 6 of the second embodiment can also be built as a program, which can be installed on a computer used as the unknown word registration device and executed by its CPU, or distributed via a network.

The program so built can also be stored on a hard disk or on a portable storage medium such as a flexible disk or CD-ROM, and installed on a computer or distributed.

The present invention is not limited to the above embodiments, and various modifications and applications are possible within the scope of the claims.

The present invention is applicable to techniques for registering unknown words in a dictionary used for morphological analysis.

FIG. 1 is a diagram for explaining the principle of the present invention.
FIG. 2 is a principle configuration diagram of the present invention.
FIG. 3 is a configuration diagram of the unknown word registration device in the first embodiment of the present invention.
FIG. 4 is a flowchart of the operation of the unknown word candidate acquisition unit in the first embodiment of the present invention.
FIG. 5 is a flowchart of the operation of the unknown word extraction unit in the first embodiment of the present invention.
FIG. 6 is a configuration diagram of the unknown word registration device in the second embodiment of the present invention.
FIG. 7 is a flowchart of the operation of the unknown word extraction unit in the second embodiment of the present invention.

Explanation of symbols

1 Query log storage unit
2 Reference dictionary
3 Temporary dictionary
4 Corpus storage unit
5 Keyword set memory
6 Unknown word candidate set memory
7 Unknown word set memory
10 Unknown word candidate acquisition means, unknown word candidate acquisition unit
20 Unknown word extraction means, unknown word extraction unit
25 Unknown word extraction unit
30 Unknown word registration means, unknown word registration unit
40 Search system (search engine)

Claims

An unknown word registration method for registering unknown words in a reference dictionary used for morphological analysis, characterized by performing:
an unknown word candidate acquisition step in which, when a query log is input, unknown word candidate acquisition means extracts unknown word candidates from a search keyword set, which is the set of distinct search keywords obtained by dividing each query of the query log at spaces, and stores them in storage means;
an unknown word extraction step in which unknown word extraction means extracts, as unknown words, those unknown word candidates stored in the storage means that are frequently used, and stores them in the storage means; and
an unknown word registration step in which unknown word registration means registers the unknown words stored in the storage means in the reference dictionary.

The unknown word registration method according to claim 1, wherein, in the unknown word candidate acquisition step,
a search keyword of the search keyword set is made an unknown word candidate only when it is not included in the reference dictionary,
or
a search keyword of the search keyword set is not made an unknown word candidate when it consists of a verb stem and an inflectional ending, or of a single morpheme only, and is made an unknown word candidate otherwise.

The unknown word registration method according to claim 1, wherein, in the unknown word extraction step,
a temporary dictionary is generated by merging the set of unknown word candidates stored in the storage means with the reference dictionary, morphological analysis of a corpus is performed using the temporary dictionary,
the occurrence frequency of each unknown word candidate is counted, and unknown word candidates whose occurrence frequency is greater than a predetermined threshold are stored in the storage means as unknown words.

The unknown word registration method according to claim 1, wherein, in the unknown word extraction step,
a search is performed by a search system using each unknown word candidate stored in the storage means as a search query, and
unknown word candidates whose number of search results is greater than a predetermined threshold are stored in the storage means as unknown words.

An unknown word registration device that registers unknown words in a reference dictionary used for morphological analysis, characterized by comprising:
unknown word candidate acquisition means for, when a query log is input, extracting unknown word candidates from a search keyword set, which is the set of distinct search keywords obtained by dividing each query of the query log at spaces, and storing them in storage means;
unknown word extraction means for extracting, as unknown words, those unknown word candidates stored in the storage means that are frequently used, and storing them in the storage means; and
unknown word registration means for registering the unknown words stored in the storage means in the reference dictionary.

The unknown word registration device according to claim 5, wherein the unknown word candidate acquisition means includes either
means for making a search keyword of the search keyword set an unknown word candidate only when it is not included in the reference dictionary,
or
means for not making a search keyword of the search keyword set an unknown word candidate when it consists of a verb stem and an inflectional ending, or of a single morpheme only, and for making it an unknown word candidate otherwise.

The unknown word registration device according to claim 5, wherein the unknown word extraction means includes
means for generating a temporary dictionary by merging the set of unknown word candidates stored in the storage means with the reference dictionary, performing morphological analysis of a corpus using the temporary dictionary, counting the occurrence frequency of each unknown word candidate, and storing in the storage means, as unknown words, the unknown word candidates whose occurrence frequency is greater than a predetermined threshold.

The unknown word registration device according to claim 5, wherein the unknown word extraction means includes
means for performing a search by a search system using each unknown word candidate stored in the storage means as a search query, and storing in the storage means, as unknown words, the unknown word candidates whose number of search results is greater than a predetermined threshold.

An unknown word registration program for causing a computer to function as each of the means constituting the unknown word registration device according to any one of claims 5 to 8.

A computer-readable recording medium storing the unknown word registration program according to claim 9.