JP2008276561A

JP2008276561A - Morpheme analysis device, morpheme analysis method, morpheme analysis program, and recording medium with computer program recorded thereon

Info

Publication number: JP2008276561A
Application number: JP2007119982A
Authority: JP
Inventors: Takeshi Masuyama; 毅司増山; Shigeo Makinoda; 成男牧野田
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2007-04-27
Filing date: 2007-04-27
Publication date: 2008-11-13
Anticipated expiration: 2027-04-27
Also published as: JP4953440B2

Abstract

PROBLEM TO BE SOLVED: To provide a morpheme analysis device that obtains an appropriate morpheme analysis result even when an undefined word exists.

SOLUTION: The morpheme analysis device includes: a search result requesting means that, when a character string contains an undefined word not stored in a word dictionary storage means 140, requests a search result from an internal or external search device 50 using the undefined word as the search condition; a document vector calculation means that calculates a document vector treating all or part of the search result as one document; a similarity calculation means that calculates the similarity between the document vector of the undefined word and the document vector of each known word; a similar word specification means that specifies, as a similar word, the known word corresponding to a document vector with high similarity; and an attribute assignment means that associates the part of speech and cost of the similar word with the undefined word. A division means divides the input character string into units using the part of speech and cost that the attribute assignment means has associated with the undefined word.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a morphological analysis device that automatically divides Japanese text into words, a morphological analysis method, a morphological analysis program, and a storage medium storing a computer program.

For example, when a character string is entered into a Japanese word processor, the string is divided into appropriate linguistic units, and kanji are assigned to those units as necessary. Morphological analysis is performed to divide the string into appropriate linguistic units. In morphological analysis, the input string may be divided into morphemes, the smallest meaningful linguistic units; however, when the word dictionary described later contains compound words composed of several morphemes, the string may instead be divided into compound words. In this specification, "morphological analysis" is therefore defined as dividing a character string into entries (words) of the word dictionary.

Such morphological analysis also plays an important role as the first processing stage in machine translation, natural language interfaces, and the like. Hereinafter, the Japanese terms 単語 and 語 are used interchangeably; both are rendered as "word".

Morphological analysis uses a word dictionary and a connectability dictionary. The word dictionary specifies the part of speech, reading, conjugation type, and so on of each word. The connectability dictionary specifies the "types" of two words that can be connected. A word's "type" in the connectability dictionary may be a specific word, a part of speech, or a conjugated form. Words that can occur at the beginning or end of a sentence are treated as connectable to the special types "beginning of sentence" and "end of sentence", respectively.

The result of morphological analysis is expressed as a graph in which the words making up the input string are nodes (rectangles) and the links between words that can be connected both positionally and grammatically are edges. Because of segmentation ambiguity and homograph ambiguity, the number of paths from the beginning-of-sentence node to the end-of-sentence node is enormous. Conventionally, therefore, a cost assigned to each word (hereinafter "word cost") and a cost assigned to each pair of adjacent words (hereinafter "connection cost") are used to extract n paths, giving priority, for example, to paths with a low total cost from the beginning to the end of the sentence. Here, "segmentation ambiguity" is ambiguity arising from different ways of dividing the surface string. For example, the string 「その日本人」 ("that Japanese person") can be divided as (a) 「その」 + 「日」 + 「本人」 ("that" + "day" + "the person in question") or (b) 「その」 + 「日本人」 ("that" + "Japanese person"). "Homograph ambiguity" is ambiguity arising when words with the same surface form have different readings or parts of speech. For example, the surface form 「工夫」 can be read either "kufuu" (ingenuity) or "koufu" (laborer). The "word cost" is an index of how readily the word appears. The "connection cost" is an index of how readily the two words appear next to each other. Both costs are set by statistical methods.

In morphological analysis, a path from the beginning to the end of the sentence is selected and the solution is generated as an expanded word sequence, which makes segmentation ambiguity and homograph ambiguity hard to grasp. To address this, a technique has been proposed that groups homographs and thereby generates a morphological analysis result in which segmentation ambiguity and homograph ambiguity are separated (see, for example, Patent Document 1).

Most words not listed in the word dictionary (hereinafter "undefined words") are proper nouns such as personal names, place names, and company names. For this reason, they are generally handled by expedient processing, for example by assuming that runs of consecutive kanji, katakana, or symbols are nouns and assigning them a uniform cost and part of speech.

Patent Document 1: JP2004-30289
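The cost-based path selection described in the background can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any cited document: the cost values, the `WORDS` and `CONNECTION_COST` tables, and the function name `best_path` are all hypothetical, and only the single lowest-cost path is returned rather than n paths.

```python
# Hypothetical toy dictionary: surface form -> list of (part of speech, word cost).
WORDS = {
    "その": [("adnominal", 10)],
    "日": [("noun", 40)],
    "本人": [("noun", 30)],
    "日本人": [("noun", 35)],
}

# Hypothetical connection costs between the parts of speech of adjacent words.
# A missing pair means the two types cannot be connected.
CONNECTION_COST = {
    ("BOS", "adnominal"): 5,
    ("adnominal", "noun"): 5,
    ("noun", "noun"): 20,
    ("noun", "EOS"): 5,
}


def best_path(text):
    """Return (total cost, word list) of the lowest-cost segmentation, or None."""
    # best[i] maps the part of speech of a word ending at position i
    # to the cheapest (cost, words) that reaches it.
    best = [dict() for _ in range(len(text) + 1)]
    best[0]["BOS"] = (0, [])
    for start in range(len(text)):
        for prev_pos, (prev_cost, prev_words) in list(best[start].items()):
            for end in range(start + 1, len(text) + 1):
                surface = text[start:end]
                for pos, word_cost in WORDS.get(surface, []):
                    conn = CONNECTION_COST.get((prev_pos, pos))
                    if conn is None:
                        continue  # the two words cannot be connected
                    cost = prev_cost + conn + word_cost
                    if pos not in best[end] or cost < best[end][pos][0]:
                        best[end][pos] = (cost, prev_words + [surface])
    # Close every surviving path with the end-of-sentence node.
    finals = [
        (cost + CONNECTION_COST[(pos, "EOS")], words)
        for pos, (cost, words) in best[len(text)].items()
        if (pos, "EOS") in CONNECTION_COST
    ]
    return min(finals) if finals else None
```

With these made-up costs, division (b) 「その」 + 「日本人」 is cheaper than division (a) because (a) pays for an extra word and an extra noun-to-noun connection.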

However, when undefined words are processed by assigning a uniform cost and part of speech based on some assumption, an appropriate morphological analysis result may not be obtained, for example when the undefined word is not a noun.

Accordingly, an object of the present invention is to provide a morphological analysis device, a morphological analysis method, a morphological analysis program, and a storage medium storing a computer program that can obtain an appropriate morphological analysis result even when an undefined word exists.

(1) A morphological analysis device having: a word dictionary storage means that stores a plurality of Japanese words, each associated with a part of speech and a cost; a connectability dictionary storage means that stores the conditions under which adjacent words can be connected grammatically; and a dividing means that divides an input character string into predetermined units by referring to the word dictionary storage means and the connectability dictionary storage means, the device further having: a known word search result requesting means that requests a search result from an internal or external search device using, as the search condition, a known word, that is, a word stored in the word dictionary storage means; a known word document vector calculating means that calculates a document vector treating all or part of the search result for each known word as one document; a known word document vector associating means that associates the document vector generated for a known word with that known word; a search result requesting means that, when the character string contains an undefined word, that is, a word not stored in the word dictionary storage means, requests a search result from the internal or external search device using the undefined word as the search condition; a document vector calculating means that calculates a document vector treating all or part of that search result as one document; a similarity calculating means that calculates the similarity between the document vector of the undefined word and the document vector of each known word; a similar word specifying means that specifies a similar word, that is, the known word corresponding to a document vector with high similarity; and an attribute assigning means that associates the part of speech and cost of the similar word with the undefined word, wherein the dividing means divides the input character string into the units using the part of speech and cost associated with the undefined word by the attribute assigning means.

According to the invention of (1), the morphological analysis device can assign the part of speech and cost of a similar word to an undefined word.
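The similarity calculation and attribute assignment of (1) can be sketched as follows. This is a minimal illustration under stated assumptions: the text above says only that a "similarity" between document vectors is calculated, so the use of cosine similarity here is an assumption, and the `KNOWN` entries, their vectors, and the names `cosine_similarity` and `assign_attributes` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def assign_attributes(undefined_vector, known_words):
    """Copy the part of speech and cost of the most similar known word."""
    best = max(
        known_words,
        key=lambda k: cosine_similarity(undefined_vector, k["vector"]),
    )
    return {
        "similar_word": best["surface"],
        "pos": best["pos"],
        "cost": best["cost"],
    }


# Hypothetical known words with made-up document vectors.
KNOWN = [
    {"surface": "トラックバック", "pos": "noun", "cost": 50,
     "vector": {"ブログ": 0.8, "記事": 0.5, "リンク": 0.3}},
    {"surface": "美しい", "pos": "adjective", "cost": 30,
     "vector": {"景色": 0.9, "花": 0.4}},
]
```

An undefined word whose search results mention blogs and links would then inherit the part of speech and cost of 「トラックバック」 rather than a uniform default.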

(2) The morphological analysis device according to (1), wherein the attribute assigning means associates with the undefined word the part of speech and cost of at least one similar word whose similarity falls within a predetermined range.

According to the configuration of (2), multiple parts of speech and costs can be associated with an undefined word. Thus, when morphological analysis results covering the input string from the beginning to the end of the sentence are output, for example, in ascending order of total cost, multiple paths can be output more appropriately because the ambiguity of the undefined word is also taken into account.

(3) The morphological analysis device according to (1) or (2), further having: a group document vector generating means that classifies the known words into predetermined groups and generates a document vector for each group based on the document vectors of the known words; and a group document vector storage means that stores each group in association with its document vector, wherein the attribute assigning means associates with the undefined word the part of speech and cost of the group corresponding to a document vector with high similarity to the document vector of the undefined word.

According to the configuration of (3), the part of speech and cost of a group, rather than of a single word, are associated with the undefined word, so a reasonable part of speech and cost can be associated with the undefined word.
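One way to generate a group's document vector from the vectors of its member words is sketched below. The text above states only that the group vector is generated "based on" the member vectors; taking the component-wise average (centroid) is an assumption made for illustration, and the `group` field and the function name `group_vectors` are hypothetical.

```python
from collections import defaultdict


def group_vectors(known_words):
    """Average the sparse {term: weight} vectors of the words in each group."""
    members = defaultdict(list)
    for word in known_words:
        members[word["group"]].append(word["vector"])

    centroids = {}
    for group, vectors in members.items():
        total = defaultdict(float)
        for vector in vectors:
            for term, weight in vector.items():
                total[term] += weight
        # Divide by the number of member words, so a term missing from a
        # member's vector counts as weight 0 for that member.
        centroids[group] = {t: w / len(vectors) for t, w in total.items()}
    return centroids
```

The resulting group vectors can be compared with the vector of an undefined word in the same way as the vectors of individual known words.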

(4) A morphological analysis method in which a morphological analysis device having a word dictionary storage means that stores a plurality of Japanese words, each associated with a part of speech and a cost, a connectability dictionary storage means that stores the conditions under which adjacent words can be connected grammatically, and a dividing means that divides an input character string into predetermined units by referring to the word dictionary storage means and the connectability dictionary storage means, performs: a known word search result requesting step of requesting a search result from an internal or external search device using, as the search condition, a known word, that is, a word stored in the word dictionary storage means; a known word document vector calculating step of calculating a document vector treating all or part of the search result for each known word as one document; a known word document vector associating step of associating the document vector generated for a known word with that known word; a search result requesting step of, when the character string contains an undefined word, that is, a word not stored in the word dictionary storage means, requesting a search result from the internal or external search device using the undefined word as the search condition; a document vector calculating step of calculating a document vector treating all or part of that search result as one document; a similarity calculating step of calculating the similarity between the document vector of the undefined word and the document vector of each known word; a similar word specifying step of specifying a similar word, that is, the known word corresponding to a document vector with high similarity; an attribute assigning step of associating the part of speech and cost of the similar word with the undefined word; and a dividing step of dividing the input character string into the units using the part of speech and cost associated with the undefined word in the attribute assigning step.

According to the invention of (4), as with the invention of (1), the part of speech and cost of a similar word can be assigned to an undefined word.

(5) A morphological analysis program that causes a computer, as a morphological analysis device having a word dictionary storage means that stores a plurality of Japanese words, each associated with a part of speech and a cost, a connectability dictionary storage means that stores the conditions under which adjacent words can be connected grammatically, and a dividing means that divides an input character string into predetermined units by referring to the word dictionary storage means and the connectability dictionary storage means, to execute: a known word search result requesting step of requesting a search result from an internal or external search device using, as the search condition, a known word, that is, a word stored in the word dictionary storage means; a known word document vector calculating step of calculating a document vector treating all or part of the search result for each known word as one document; a known word document vector associating step of associating the document vector generated for a known word with that known word; a search result requesting step of, when the character string contains an undefined word, that is, a word not stored in the word dictionary storage means, requesting a search result from the internal or external search device using the undefined word as the search condition; a document vector calculating step of calculating a document vector treating all or part of that search result as one document; a similarity calculating step of calculating the similarity between the document vector of the undefined word and the document vector of each known word; a similar word specifying step of specifying a similar word, that is, the known word corresponding to a document vector with high similarity; an attribute assigning step of associating the part of speech and cost of the similar word with the undefined word; and a dividing step of dividing the input character string into the units using the part of speech and cost associated with the undefined word in the attribute assigning step.

According to the present invention, an appropriate morphological analysis result can be obtained even when an undefined word exists.

An example of a preferred embodiment of the present invention is described below with reference to the drawings.

[First Embodiment]
(System overview)
FIG. 1 is a schematic diagram showing a morphological analysis system 10 (hereinafter "system 10") according to the first embodiment of the present invention.

As shown in FIG. 1, the system 10 has a morphological analysis server 20 (hereinafter "server 20") and a search server 50. The server 20 and the search server 50 can communicate via a communication line 60, which is, for example, the Internet. The server 20 is a device for morphologically analyzing an input character string and is an example of a morphological analysis device. The server 20 also receives data representing a character string from an external personal computer (PC) via the communication line 60, performs processing such as translation, and returns the processed data to that PC or the like.

The search server 50 is a device that receives a search condition (also called a "search term" or "query") via the communication line 60, uses it to search the website information it stores, and outputs as the search result the URLs of websites related to the search condition together with each website's title and snippet (descriptive text); it is an example of a search device. The search server 50 receives search conditions from the server 20 and from external PCs.

In this embodiment the search server 50 is configured as a device external to the server 20; however, the two may be integrated so that the search server 50 serves as a search device inside the server 20.

(Main hardware configuration of server 20)
FIG. 2 is a schematic diagram showing the main hardware configuration of the server 20. The server 20 is a computer and has a bus 22. Connected to the bus 22 are a CPU (Central Processing Unit) 24, a RAM (Random Access Memory) 26, a ROM (Read Only Memory) 28, an HDD (Hard Disk Drive) 30, a power supply device 32, an input device 34, a communication device 36, and a display device 38. The CPU 24 reads and executes the various programs stored in the ROM 28 as appropriate, so that the hardware described above and those programs cooperate to realize the functions of this embodiment. The RAM 26 is local memory used for program execution. The input device 34 accepts input of various data and may include a keyboard, a pointing device, and the like. The display device 38 displays screens that accept data input from the user and screens showing the results of computation by the computer, and includes a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD).

(Main software configuration of server 20)
FIG. 3 is a schematic diagram showing the main software configuration of the server 20. As shown in FIG. 3, the server 20 has a known word document vector generation unit 100, a character string reception unit 110, a word division unit 112, a best path search unit 114, an application unit 116, and a word clustering unit 120. The server 20 also has a word dictionary DB 140 and a connectability dictionary DB 142. Each of the units above is realized by the CPU 24 and the various programs stored in the ROM 28.

The known word document vector generation unit 100 generates document vectors for the words stored in the word dictionary DB 140, as described later. The character string reception unit 110 accepts character strings received from outside by the communication device 36. The word division unit 112 morphologically analyzes the character string accepted by the character string reception unit 110 and generates an analysis result; it is an example of a dividing means. The word clustering unit 120 associates a part of speech and a cost with an undefined word. The best path search unit 114 identifies at least one path through the morphologically analyzed character string based on predetermined conditions. The word dictionary DB 140 stores a plurality of Japanese words, each associated with a part of speech and a cost, and is an example of a word dictionary storage means. The connectability dictionary DB 142 stores the conditions under which adjacent words can be connected grammatically, and is an example of a connectability dictionary storage means.

FIG. 4 is a diagram showing an example of the word dictionary DB 140.
As shown in FIG. 4, the word dictionary DB 140 stores a word dictionary.
In the word dictionary, each entry associates a surface form (headword), reading, part of speech, cost, and document vector. In practice, the word dictionary is converted into a form that supports fast lookup, such as a trie structure.
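A trie lets the analyzer find every dictionary word that begins at a given position in one left-to-right pass over the text. The following is a minimal sketch of that idea, not the storage format actually used by the word dictionary DB 140; the class names, the method `common_prefix_search`, and the entry contents are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # next character -> TrieNode
        self.entries = []   # dictionary entries whose surface form ends here


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, surface, entry):
        """Register a dictionary entry under its surface form."""
        node = self.root
        for ch in surface:
            node = node.children.setdefault(ch, TrieNode())
        node.entries.append(entry)

    def common_prefix_search(self, text, start=0):
        """Yield (surface, entry) for every word that begins at `start`."""
        node = self.root
        for end in range(start, len(text)):
            node = node.children.get(text[end])
            if node is None:
                return  # no dictionary word continues with this character
            for entry in node.entries:
                yield text[start:end + 1], entry
```

Each entry can carry the reading, part of speech, cost, and document vector; here a plain dict stands in for such a record.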

FIG. 5 is a diagram showing an example of the connectability dictionary. FIG. 6 is a diagram showing an example of the connectability dictionary DB 142.
Before the configuration of the connectability dictionary DB 142 is explained, the connectability dictionary itself is described with reference to FIG. 5. The connectability dictionary records whether two adjacent words (for example, a left-hand word and a right-hand word) can be connected grammatically. Because modern Japanese written horizontally normally runs from left to right, the connectability dictionary indicates, for each preceding (left-hand) word, what type (part of speech or specific word) the following (right-hand) word may be.

So that the server 20 can process it at high speed, the connectability dictionary is stored in the connectability dictionary DB 142 after being converted into a connectability matrix, as shown in FIG. 6. In the connectability matrix, each row represents the type of the word appearing on the left and each column represents the type of the word appearing on the right. If the left-hand type and the right-hand type are connectable, the matrix element is set to 1; if not, it is set to 0.
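The conversion from a list of connectable type pairs to a 0/1 matrix can be sketched as follows. The type names and the connectable pairs are hypothetical examples, not the contents of FIG. 5 or FIG. 6.

```python
# Hypothetical word types and connectable (left, right) pairs.
TYPES = ["BOS", "adnominal", "noun", "particle", "EOS"]
CONNECTABLE_PAIRS = [
    ("BOS", "adnominal"),
    ("BOS", "noun"),
    ("adnominal", "noun"),
    ("noun", "particle"),
    ("particle", "noun"),
    ("noun", "EOS"),
]


def build_matrix(types, pairs):
    """Rows are left-hand types, columns are right-hand types; 1 = connectable."""
    index = {t: i for i, t in enumerate(types)}
    matrix = [[0] * len(types) for _ in types]
    for left, right in pairs:
        matrix[index[left]][index[right]] = 1
    return index, matrix


INDEX, MATRIX = build_matrix(TYPES, CONNECTABLE_PAIRS)


def is_connectable(left, right):
    """Constant-time lookup of whether `right` may follow `left`."""
    return MATRIX[INDEX[left]][INDEX[right]] == 1
```

Once the matrix is built, checking a pair of adjacent words costs two index lookups, which is what makes the matrix form suitable for fast processing.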

The known word document vector generation unit 100 associates a document vector with each word stored in the word dictionary DB 140 described above (hereinafter also a "known word"). A "document vector" represents a document (or passage of text) as a single vector based on the words that occur in it. In other words, a document vector is a vector whose components are the importance (frequency or the like) of the words occurring in one document; in this specification the term is used in its ordinary sense in the field of natural language processing. As shown in FIG. 3, the known word document vector generation unit 100 has a known word search result request unit 102, a known word document vector calculation unit 104, and a known word document vector association unit 106. The known word search result request unit 102 requests search results from the search server 50 using each known word as the search condition (also called a "search term" or "query"), and is an example of a known word search result requesting means. The known word document vector calculation unit 104 calculates a document vector treating a predetermined K pages (K is a natural number) of the search results received from the search server 50 as one document, and is an example of a known word document vector calculating means. For the purpose of calculating the document vector, one pair of a title and a snippet in the search results counts as one page. A snippet is the text displayed after the title of a search result. The known word document vector association unit 106 associates each known word with its document vector, and is an example of a known word document vector associating means.

FIG. 7 is an explanatory diagram of the known word document vector generation unit 100.
The known word search result request unit 102 requests search results from the search server 50 using known words such as 「トラックバック」 (trackback) and 「車」 (car) as search conditions. As shown in FIG. 7(a), it receives the search results from the search server 50.

The known word document vector calculation unit 104 identifies the top K pages of the search results. Then, as shown in FIG. 7(b), it extracts the words other than the query from the titles and snippets of those top K pages and generates one document. It then calculates a document vector using Equation 1.

In Equation 1, each word is weighted by the TF-IDF method, as shown in FIG. 7(c). Under the TF-IDF method, a word that appears frequently in a given document and is concentrated in particular documents among all documents is regarded as characterizing that document, and is given a heavier weight. Specifically, for example, a plurality of document data stored in a document storage DB (not shown) in the server 20 is referenced, the number of occurrences of each word in the document and the number of documents in which the word appears are calculated, and the importance is computed from them.
The known word document vector associating unit 106 then associates each known word with its document vector.
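The document construction and weighting described above can be sketched as follows. Equation 1 is not reproduced in this text, so the sketch assumes a common TF-IDF form, tf × log(N / df); the function names, the pre-tokenized input format, and the document-frequency table are all hypothetical.

```python
import math
from collections import Counter

def build_document(results, query, k):
    """Merge the title and snippet terms of the top-k results into one document,
    dropping the query term itself. Each result is (title_terms, snippet_terms),
    assumed to be already tokenized into lists of words."""
    terms = []
    for title_terms, snippet_terms in results[:k]:
        terms.extend(t for t in title_terms + snippet_terms if t != query)
    return terms

def tfidf_vector(doc_terms, doc_freq, n_docs):
    """Weight each term by tf * log(n_docs / df). This is an assumed TF-IDF
    variant and not necessarily Equation 1 of the source. Terms with no known
    document frequency are skipped."""
    tf = Counter(doc_terms)
    return {
        term: count * math.log(n_docs / doc_freq[term])
        for term, count in tf.items()
        if doc_freq.get(term, 0) > 0
    }
```

Representing the vector as a sparse dictionary keeps only the terms that actually occur in the top K pages, which suits short title and snippet text.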

FIGS. 8, 9, and 10 are explanatory diagrams of the word dividing unit 112.
For example, as shown in FIG. 8(a), assume that the character string “このひとことで元気になった” (roughly, “This one remark cheered me up”) is input to the word dividing unit 112.
As shown in FIG. 8(b), the word dividing unit 112 sets a pointer indicating a position in the character string (between two characters, to the left of the first character at the beginning of the sentence, or to the right of the last character at the end of the sentence). In the initial state, the pointer is set to position 0 (the left side of the first character “こ”). A virtual node called “beginning of sentence” is also set.
Next, the word dividing unit 112 searches the word dictionary DB 140 for words that start at the pointer position (hereinafter the “start point”). From start point 0 in FIG. 8(b), “この” (adnominal) and “こ” (suffix: 個) are found. The word dividing unit 112 shifts the start point backward (to the right) one character at a time and exhaustively extracts words from the word dictionary DB 140. The process of extracting from the word dictionary DB 140 the words that start at start point n is called the “word extraction process for start point n” (n is 0 or a natural number).
For each pair of a word ending at the start point (“beginning of sentence” for position 0; hereinafter the “preceding word”) and a word starting at the start point (“この” and “こ” for position 0; hereinafter the “succeeding word”), the word dividing unit 112 refers to the connection possibility dictionary DB 142 and, as shown in FIG. 8(c), creates a link between the two if they are connectable. Succeeding words that cannot be connected to any preceding word are excluded. In the example of FIG. 8(c), “こ” is excluded.
When the pointer reaches the end-of-sentence position (position 13 in the example of FIG. 8(b)), a virtual node called “end of sentence” is set, the connectability between each word ending at the end-of-sentence position (“た” in the example of FIG. 8(b)) and “end of sentence” is checked, only the connectable words are linked to the “end of sentence” node, and the process ends.
Finally, a path from the “beginning of sentence” node to the “end of sentence” node is the morphological analysis result for the input character string.
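The word extraction and linking procedure can be illustrated with a minimal sketch. The toy dictionary, the part-of-speech labels, and the set of connectable pairs below are hypothetical stand-ins for the word dictionary DB 140 and the connection possibility dictionary DB 142; the node and link representations are likewise assumptions made for illustration.

```python
def extract_words(text, dictionary):
    """For every start point n, list the dictionary words that begin at n
    as (start, end, surface, pos) nodes."""
    nodes = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            for pos in dictionary.get(text[start:end], []):
                nodes.append((start, end, text[start:end], pos))
    return nodes

def link_nodes(text, nodes, connectable):
    """Link each preceding word (ending at a start point) to each succeeding
    word (beginning there) when their POS pair is connectable. A succeeding
    word that gains no link is excluded and cannot precede anything later."""
    bos = (0, 0, "BOS", "BOS")                     # virtual beginning-of-sentence node
    eos = (len(text), len(text), "EOS", "EOS")     # virtual end-of-sentence node
    ending_at = {0: [bos]}
    links = []
    for node in sorted(nodes) + [eos]:             # ascending start point
        preds = [p for p in ending_at.get(node[0], [])
                 if (p[3], node[3]) in connectable]
        if preds:
            links.extend((p, node) for p in preds)
            ending_at.setdefault(node[1], []).append(node)
    return links

# Hypothetical toy data for the first six characters of the example string.
dictionary = {"この": ["adnominal"], "こ": ["suffix"],
              "の": ["particle"], "ひとこと": ["noun"]}
connectable = {("BOS", "adnominal"), ("adnominal", "noun"), ("noun", "EOS"),
               ("suffix", "particle"), ("particle", "noun")}
text = "このひとこと"
links = link_nodes(text, extract_words(text, dictionary), connectable)
```

With this toy data, “こ” at start point 0 is excluded because a suffix cannot follow the beginning of the sentence, which mirrors the exclusion shown in FIG. 8(c).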

The word dividing unit 112 described above assigns an appropriate cost to each node and each link. The cost of each node is stored in the word dictionary DB 140, and the cost of each link is stored in the connection possibility dictionary DB 142 (not shown).
When a preceding word and a succeeding word are connectable, a special mark is placed, as shown in FIG. 9, between the succeeding word and the preceding word for which the sum of the partial minimum cost from the beginning of the sentence to the preceding word, the connection cost between the preceding word and the succeeding word, and the word cost of the succeeding word is smallest. In FIG. 9, for example, the special mark is shown as a thick line.
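The marking rule above amounts to a dynamic-programming search for the minimum-cost path. The following is a minimal sketch under stated assumptions: the node tuple layout, the cost values, and the function name are hypothetical, and a missing entry in the link-cost table is taken to mean that the pair is not connectable.

```python
def best_path(n_chars, nodes, link_cost):
    """Return (total_cost, surfaces) of the cheapest path from the beginning
    to the end of the sentence, or None if no path exists.
    nodes: (start, end, surface, pos, word_cost) tuples.
    link_cost: {(left_pos, right_pos): cost}; absent pairs are not connectable."""
    bos = (0, 0, "BOS", "BOS", 0)
    eos = (n_chars, n_chars, "EOS", "EOS", 0)
    best = {bos: (0, None)}   # node -> (partial minimum cost, marked predecessor)
    ending_at = {0: [bos]}
    for node in sorted(nodes) + [eos]:
        candidates = [
            (best[p][0] + link_cost[(p[3], node[3])] + node[4], p)
            for p in ending_at.get(node[0], [])
            if (p[3], node[3]) in link_cost
        ]
        if candidates:
            # The chosen predecessor plays the role of the thick line in FIG. 9.
            best[node] = min(candidates, key=lambda c: c[0])
            ending_at.setdefault(node[1], []).append(node)
    if eos not in best:
        return None
    path, node = [], eos
    while node is not None:   # follow the marks back to the beginning
        path.append(node[2])
        node = best[node][1]
    return best[eos][0], path[::-1][1:-1]   # drop the BOS and EOS markers

# Hypothetical costs: one long noun competes with two shorter nouns.
nodes = [(0, 2, "この", "adnominal", 10), (2, 6, "ひとこと", "noun", 20),
         (2, 4, "ひと", "noun", 15), (4, 6, "こと", "noun", 15)]
link_cost = {("BOS", "adnominal"): 1, ("adnominal", "noun"): 2,
             ("noun", "noun"): 8, ("noun", "EOS"): 1}
result = best_path(6, nodes, link_cost)   # (34, ["この", "ひとこと"])
```

Because each node stores only its cheapest predecessor, the search visits every link once instead of enumerating every path.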

The best path search unit 114 identifies a predetermined number of paths, giving priority to low cost. The best path search unit 114 uses, for example, the minimum cost method. Unlike in the present embodiment, the best path may instead be identified by the longest match method, the two-bunsetsu longest match method, the minimum morpheme count method, the minimum bunsetsu count method, or the like (see, for example, “Iwanami Koza Software Science 15: Natural Language Processing”, edited by Makoto Nagao, Iwanami Shoten).
The application unit 116 is, for example, a word processor unit that receives the morphological analysis result (path) and converts it to kanji as necessary. Since the application unit 116 is made up of general word processing software, translation software, and the like, its description is omitted.

If the character string received by the word dividing unit 112 contains no undefined word, the morphological analysis can be completed by the processing described above. If the character string does contain an undefined word, the word clustering unit 120 is activated.

As shown in FIG. 10(a), assume that the character string input to the word dividing unit 112 is, for example, “面白いと思った記事をどんどんトラバしていく” (roughly, “I keep sending trackbacks to articles I find interesting”). “トラバ” is an undefined word. As shown in FIG. 10(c), the word dividing unit 112 extracts from the word dictionary DB 140 the words that start at each start point. For convenience of explanation, FIG. 10(c) shows only one path, with all of its nodes linked.

No word starting at start point 14 in FIG. 10(b) can be extracted from the word dictionary DB 140. Continuing with the word extraction process for start points 15 and 16 likewise yields no word from the word dictionary DB 140. When the word extraction process is then performed for start point 17, the word “して” can be extracted from the word dictionary DB 140. In this case, the character string “トラバ”, which lies between start point 14, where word extraction failed, and start point 17, where word extraction succeeded, is an undefined word. The word dividing unit 112 sends the undefined word to the word clustering unit 120, which is activated upon receiving it.
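A minimal sketch of the undefined word detection follows. The source describes it in terms of start points from which no dictionary word can be extracted; the sketch instead flags maximal runs of characters not covered by any dictionary match, an assumption made so that start points lying inside a known word are not reported. The function name, the toy dictionary, and the `max_len` limit are hypothetical.

```python
def find_undefined_spans(text, dictionary, max_len=8):
    """Return (start, end, surface) for each maximal run of characters that no
    dictionary word covers. Positions are start points: 0 is the left edge of
    the first character."""
    covered = [False] * len(text)
    for start in range(len(text)):
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            if text[start:end] in dictionary:
                for i in range(start, end):
                    covered[i] = True
    spans, run_start = [], None
    for i, is_covered in enumerate(covered + [True]):   # sentinel closes a trailing run
        if not is_covered and run_start is None:
            run_start = i
        elif is_covered and run_start is not None:
            spans.append((run_start, i, text[run_start:i]))
            run_start = None
    return spans

# Hypothetical toy dictionary covering everything except the undefined word.
known = {"面白い", "と", "思っ", "た", "記事", "を", "どんどん", "して", "いく"}
spans = find_undefined_spans("面白いと思った記事をどんどんトラバしていく", known)
# spans == [(14, 17, "トラバ")]
```

The reported span runs from start point 14 to start point 17, matching the positions given for FIG. 10(b).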

As shown in FIG. 3, the word clustering unit 120 has a search result request unit 122, a document vector generation unit 124, a similarity calculation unit 126, a similar word specifying unit 128, and an attribute assigning unit 130. The search result request unit 122 requests search results from the search server 50 using the undefined word as a search condition, and is an example of the search result requesting means. The document vector generation unit 124 calculates a document vector by treating a predetermined K pages of the search results received from the search server 50 as one document, and is an example of the document vector calculating means. The similarity calculation unit 126 evaluates the similarity between the document vector of the undefined word and the document vectors of the known words, and is an example of the similarity calculating means. The similar word specifying unit 128 specifies the similar word, which is the known word corresponding to the document vector with the highest similarity, and is an example of the similar word specifying means. The attribute assigning unit 130 associates the part of speech and cost of the similar word with the undefined word, and is an example of the attribute assigning means.

FIGS. 11 and 12 are explanatory diagrams of the word clustering unit 120.
The search result request unit 122 requests search results from the search server 50 using the undefined word “トラバ” as a search condition. The search result request unit 122 receives from the search server 50, for example, the search results shown in FIG. 11(a).

The document vector generation unit 124 identifies the top K pages of the search results. Then, as shown in FIG. 11(b), it extracts the words other than the query from the titles and snippets of those top K pages and generates one document. It then calculates a document vector for this one document using Equation 1 described above.

The similarity calculation unit 126 uses Equation 2 to calculate the similarity between the document vector of “トラバ” and the document vector of each known word.

For example, as shown in FIG. 12(a), the similarity between the document vector of the undefined word “トラバ” and the document vector of the known word “トラックバック” is 0.126, and the similarity between the document vector of the undefined word “トラバ” and the document vector of the known word “車” is 0.011.

If, among all known words, the document vector of the known word “トラックバック” has the highest similarity to the document vector of “トラバ”, the similar word specifying unit 128 specifies “トラックバック” as the similar word of “トラバ”.

The attribute assigning unit 130 assigns the part of speech and cost of the similar word “トラックバック” to the undefined word “トラバ”.
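The similarity calculation, similar word specification, and attribute assignment can be sketched together. Equation 2 is not reproduced in this text, so cosine similarity is assumed here; the vectors, part-of-speech labels, and cost values are hypothetical and are not the figures of FIG. 12.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts
    (an assumed stand-in for Equation 2)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def assign_attributes(unknown_vec, known_vectors, dictionary):
    """Pick the known word whose document vector is most similar to the
    undefined word's vector and return (similar_word, pos, cost), copying the
    part of speech and cost recorded for that word."""
    similar = max(known_vectors,
                  key=lambda w: cosine(unknown_vec, known_vectors[w]))
    pos, cost = dictionary[similar]
    return similar, pos, cost

# Hypothetical vectors and dictionary entries.
known_vectors = {
    "トラックバック": {"ブログ": 1.0, "記事": 1.0, "リンク": 1.0},
    "車": {"道路": 3.0, "記事": 0.5},
}
dictionary = {"トラックバック": ("noun", 3000), "車": ("noun", 2500)}
attrs = assign_attributes({"ブログ": 2.0, "記事": 1.0}, known_vectors, dictionary)
# attrs == ("トラックバック", "noun", 3000)
```

The undefined word then enters the lattice as an ordinary node carrying the borrowed part of speech and cost.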

Having assigned a part of speech and a cost to the undefined word in this way, the word clustering unit 120 sends the undefined word together with its part of speech and cost to the word dividing unit 112. Using the received undefined word, part of speech, and cost, the word dividing unit 112 continues the morphological analysis process, taking the position between the end of the undefined word and the next character as the start point.

The above is the configuration of the server 20. An operation example of the server 20 is described below, mainly with reference to FIGS. 13, 14, and 15. FIGS. 13, 14, 15, and 16 are schematic flowcharts showing an operation example of the server 20.

(Server operation example)
First, the server 20 receives an input character string (step S1 in FIG. 13). Next, the server 20 refers to the word dictionary DB 140 and the connection possibility dictionary DB 142 and performs morphological analysis of the character string (step S2). Step S2 is an example of the dividing step. The server 20 then outputs a predetermined number of paths, giving priority to low total cost (step S3).
The server 20 also generates and updates the document vectors of the known words periodically, for example every 24 hours. Specifically, the server 20 requests search results from the search server 50 using a known word as the query and acquires the search results (step S11 in FIG. 14). Step S11 is an example of the known word search result request step. Next, for the top K pages, the server 20 extracts the terms (words) other than the query from the titles and snippets (step S12), weights each term, and generates a document vector (step S13). Steps S12 and S13 are an example of the known word document vector calculation step. The server 20 then associates the known word with its document vector and stores them in the word dictionary DB 140 (step S14). Step S14 is an example of the known word document vector associating step.
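Steps S11 to S14 can be sketched as a single refresh routine. The search function, the vectorizer, and the storage dictionary are injected as parameters because the source does not specify their interfaces; all names, the default `k`, and the use of raw term counts in place of the Equation 1 weighting are assumptions.

```python
from collections import Counter

def refresh_known_word_vectors(known_words, search, store, vectorize=Counter, k=10):
    """Rebuild the document vector of every known word.
    search(word) is assumed to return a ranked list of (title_terms, snippet_terms);
    vectorize maps a term list to a vector (raw counts by default, standing in
    for the weighting of Equation 1); store plays the role of the dictionary DB."""
    for word in known_words:
        results = search(word)                              # S11: query the search server
        terms = [t for title, snippet in results[:k]        # S12: terms from the top K pages,
                 for t in title + snippet if t != word]     #      excluding the query itself
        store[word] = vectorize(terms)                      # S13 and S14: weight and store
    return store

# Hypothetical canned search results in place of a real search server.
fake_results = {"車": [(["車", "中古"], ["車", "販売"]), (["ドライブ"], ["車"])]}
store = refresh_known_word_vectors(["車"], lambda w: fake_results[w], {}, k=1)
# store == {"車": Counter({"中古": 1, "販売": 1})}
```

Running such a routine on a schedule, as in the 24-hour example, keeps the known word vectors in step with current usage on the web.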

Step S2 described above is now explained with reference to FIGS. 15 and 16.
First, the server 20 searches the word dictionary DB 140 for words that start at the start point (succeeding words) (step S101 in FIG. 15). Next, it determines whether the start point has reached the end of the sentence (step S102); if not, steps S101 and S102 are repeated. If it determines in step S102 that the start point has reached the end of the sentence, it determines whether a word starting at each start point was found in the word dictionary DB 140 (step S103). If it determines in step S103 that a word was found, it associates the part of speech and cost of the known word with the succeeding word (step S104). If it determines in step S103 that no word was found, it performs undefined word processing (step S104A).

The details of the undefined word processing (step S104A) are explained with reference to FIG. 16.
First, the server 20 requests search results from the search server 50 using the undefined word as the query and acquires the search results (step S201 in FIG. 16). Step S201 is an example of the search result request step. Next, for the top K pages of the search results, it extracts the words other than the query from the titles and snippets (step S202), weights each word, and generates a document vector by treating the top K pages as one document (step S203). Steps S202 and S203 are an example of the document vector calculation step. It then calculates the similarity between the document vector of the undefined word used as the query and the document vectors of all known words stored in the word dictionary DB 140 (step S204). Step S204 is an example of the similarity calculation step. Next, for the undefined word used as the query, it specifies as the similar word the known word whose document vector has the highest similarity to the document vector of the undefined word (step S205). Step S205 is an example of the similar word specifying step. The server 20 then associates the part of speech and cost of the similar word with the undefined word used as the query (step S206). Step S206 is an example of the attribute assigning step.

Following step S104 or step S104A, the server 20 determines whether the word ending at the start point (the preceding word) and the succeeding word are connectable (step S105). For a word ending at the end-of-sentence position, it determines whether the word can be connected to the end of the sentence. If it determines in step S105 that they are connectable, it creates a link between the preceding word and the succeeding word (step S106). If it does not determine in step S105 that they are connectable, the succeeding word is excluded (step S106A).

As described above, the server 20 can assign the part of speech and cost of a similar word to an undefined word. An appropriate morphological analysis result can therefore be obtained even when an undefined word is present.

[Modification]
Next, a modification of the first embodiment described above is explained.
FIG. 17 is an explanatory diagram of the modification of the first embodiment.

In the modification, a similarity threshold t for specifying similar words is set in the similar word specifying unit 128 (see FIG. 3) of the server 20. For example, as shown in FIG. 17, assume that the document vector generation unit 124 (see FIG. 3) generates a document vector a for the word “スタンばる”. The similarity calculation unit 126 calculates the similarity between the undefined word “スタンばる” and known words such as “スタンバイ” (standby), “待機する” (to stand by), and “待つ” (to wait). The similar word specifying unit 128 specifies as similar words those words whose similarity is greater than the threshold t. For example, if the similarity x1 of “スタンバイ”, the similarity x2 of “待機する”, and the similarity x3 of “待つ” are all greater than the threshold t, these three words are specified as similar words.

This makes it possible to associate several parts of speech and costs with one undefined word. As a result, when morphological analysis results for the input character string, from the beginning to the end of the sentence, are output in ascending order of total cost, for example, the ambiguity of the undefined word is taken into account and a plurality of paths can be output more appropriately.
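The threshold rule of the modification can be sketched as follows. The similarity scores, the threshold, the part-of-speech labels, and the costs are hypothetical values chosen for illustration; the source gives only the symbols x1, x2, x3, and t.

```python
def candidates_above_threshold(similarities, dictionary, t):
    """Return one (word, pos, cost) candidate for every known word whose
    similarity to the undefined word exceeds t, most similar first."""
    hits = sorted(((score, word) for word, score in similarities.items() if score > t),
                  reverse=True)
    return [(word, *dictionary[word]) for _, word in hits]

# Hypothetical similarity scores and dictionary entries.
similarities = {"スタンバイ": 0.31, "待機する": 0.27, "待つ": 0.22, "車": 0.01}
dictionary = {"スタンバイ": ("noun", 3200), "待機する": ("verb", 2900),
              "待つ": ("verb", 2500), "車": ("noun", 2500)}
candidates = candidates_above_threshold(similarities, dictionary, t=0.2)
# candidates == [("スタンバイ", "noun", 3200), ("待機する", "verb", 2900), ("待つ", "verb", 2500)]
```

Each candidate would add its own node for the undefined word to the lattice, so paths that differ only in the part of speech chosen for that word can be ranked against each other by total cost.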

[Second Embodiment]
Next, the second embodiment is described, focusing on its differences from the first embodiment.
FIG. 18 is an explanatory diagram of the second embodiment.

As shown in FIG. 18(a), in the second embodiment the word dictionary DB 140 of the server 20 stores document vectors of “groups” in addition to the document vectors of words. A “group” is a set of words with mutually similar concepts, such as “スタンバイ” (standby), “待機する” (to stand by), and “待つ” (to wait), and is also called a “cluster”. In addition to calculating the document vector of each known word, the known word document vector calculation unit 104 forms groups of known words whose similarity to one another falls within a predetermined range and calculates the document vector of each group (cluster). The document vector of a group is calculated as the sum of the document vectors of the words that make up the group. Specifically, the known word document vector calculation unit 104 calculates the document vector of a group using Equation 3.

The known word document vector calculation unit 104 also serves as the group document vector generating means. The groups and their document vectors are stored in the word dictionary DB 140. The word dictionary DB 140 is also an example of the group document vector storage means.

Assume that the document vector generation unit 124 (see FIG. 3) generates a document vector a for the undefined word “スタンばる”, as shown in FIG. 18(b). The similarity calculation unit 126 uses Equation 4 to calculate the similarity between the document vector of the undefined word “スタンばる” and the document vector of each group of known words.

The similar word specifying unit 128 specifies the group with the highest similarity as the similar word.

As described above, the server 20 of the second embodiment associates with the undefined word the part of speech and cost of a group rather than those of a single word, and can therefore associate an even more appropriate part of speech and cost with the undefined word.
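The group vector and group selection of the second embodiment can be sketched as follows. The summation follows the description of Equation 3 in the text; Equation 4 is not reproduced, so cosine similarity is assumed for the comparison. The group names, member vectors, and function names are hypothetical, and the sketch does not cover how a group's part of speech and cost are derived, since the text does not say.

```python
import math
from collections import Counter

def group_vector(member_vectors):
    """Sum the sparse document vectors of a group's member words."""
    total = Counter()
    for vec in member_vectors:
        total.update(vec)          # Counter.update adds weights term by term
    return dict(total)

def cosine(u, v):
    """Cosine similarity of two sparse vectors (assumed stand-in for Equation 4)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def most_similar_group(unknown_vec, groups):
    """groups maps a group name to the list of its members' document vectors;
    return the name of the group whose summed vector is most similar."""
    return max(groups, key=lambda g: cosine(unknown_vec, group_vector(groups[g])))

# Hypothetical groups and vectors.
groups = {
    "wait": [{"待機": 2.0}, {"準備": 1.0, "待機": 1.0}],
    "vehicle": [{"道路": 1.0}],
}
best = most_similar_group({"待機": 1.0, "準備": 0.5}, groups)   # "wait"
```

Summing member vectors pools the context terms of several near-synonyms, so a match can succeed even when no single member shares many terms with the undefined word.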

(Programs and computer-readable recording media)
A server control program or the like can be provided that causes a computer to execute the known word search result request step, the known word document vector calculation step, the known word document vector associating step, the search result request step, the document vector calculation step, the similarity calculation step, the similar word specifying step, the attribute assigning step, the dividing step, and so on of the operation example described above.

The program storage medium used to install these server control programs and the like on a computer and make them executable by the computer can be realized not only by package media such as a flexible disk like a floppy (registered trademark) disk, a CD-ROM (Compact Disc Read Only Memory), a CD-R (Compact Disc-Recordable), a CD-RW (Compact Disc-ReWritable), or a DVD (Digital Versatile Disc), but also by a semiconductor memory, a magnetic disk, a magneto-optical disk, or the like in which the program is stored temporarily or permanently.

Embodiments of the present invention have been described above, but the present invention is not limited to those embodiments. The effects described in the embodiments are merely a list of the most preferable effects arising from the present invention, and the effects of the present invention are not limited to those described in the embodiments.

FIG. 1 is a schematic diagram showing a morphological analysis system according to an example of the first embodiment.
FIG. 2 is a schematic diagram showing the main hardware configuration of the server.
FIG. 3 is a schematic diagram showing the main software configuration of the server.
FIG. 4 is a diagram showing an example of the word dictionary DB.
FIG. 5 is a diagram showing an example of the connection possibility dictionary.
FIG. 6 is a diagram showing an example of the connection possibility dictionary DB.
FIG. 7 is an explanatory diagram of the known word document vector generation unit.
FIG. 8 is an explanatory diagram of the word dividing unit.
FIG. 9 is an explanatory diagram of the word dividing unit.
FIG. 10 is an explanatory diagram of the word dividing unit.
FIG. 11 is an explanatory diagram of the word clustering unit.
FIG. 12 is an explanatory diagram of the word clustering unit.
FIG. 13 is a schematic flowchart showing an operation example of the server.
FIG. 14 is a schematic flowchart showing an operation example of the server.
FIG. 15 is a schematic flowchart showing an operation example of the server.
FIG. 16 is a schematic flowchart showing an operation example of the server.
FIG. 17 is an explanatory diagram of the modification of the first embodiment.
FIG. 18 is an explanatory diagram of the second embodiment.

Explanation of symbols

10 Morphological analysis system
20 Morphological analysis server
50 Search server
100 Known word document vector generation unit
102 Known word search result request unit
104 Known word document vector calculation unit
106 Known word document vector associating unit
110 Character string reception unit
112 Word dividing unit
114 Best path search unit
116 Application unit
120 Word clustering unit
122 Search result request unit
124 Document vector generation unit
126 Similarity calculation unit
128 Similar word specifying unit
130 Attribute assigning unit
140 Word dictionary DB
142 Connection possibility dictionary DB

Claims

A morphological analyzer comprising:
word dictionary storage means for storing a plurality of Japanese words, each associated with a part of speech and a cost;
connection possibility dictionary storage means for storing conditions under which adjacent words can be grammatically connected; and
dividing means for dividing an input character string into predetermined units with reference to the word dictionary storage means and the connection possibility dictionary storage means,
the morphological analyzer further comprising:
known word search result requesting means for requesting search results from an internal or external search device, using as a search condition a known word, which is a word stored in the word dictionary storage means;
known word document vector calculating means for calculating, for each known word, a document vector by treating all or part of the search results as one document;
known word document vector associating means for associating the document vector generated for a known word with that known word;
search result requesting means for requesting, when the character string contains an undefined word that is not a word stored in the word dictionary storage means, search results from an internal or external search device using the undefined word as a search condition;
document vector calculating means for calculating a document vector by treating all or part of the search results as one document;
similarity calculating means for calculating the similarity between the document vector of the undefined word and the document vector of the known word;
similar word specifying means for specifying a similar word, which is the known word corresponding to a document vector having a high similarity; and
attribute assigning means for associating the part of speech and cost of the similar word with the undefined word,
wherein the dividing means is configured to divide the input character string into the units using the part of speech and cost associated with the undefined word by the attribute assigning means.

The morphological analyzer according to claim 1, wherein the attribute assigning means is configured to associate with the undefined word the part of speech and cost of at least one similar word whose similarity falls within a predetermined range.

The morphological analyzer according to claim 2, further comprising:
group document vector generating means for classifying the known words into predetermined groups and generating a document vector of each group based on the document vectors of the known words; and
group document vector storage means for storing each group in association with the document vector corresponding to that group,
wherein the attribute assigning means is configured to associate with the undefined word the part of speech and cost of the group corresponding to a document vector having a high similarity to the document vector of the undefined word.

Possibility of connection that stores a word dictionary storage means that stores a plurality of Japanese words in a state in which parts of speech and costs are associated with each other, and a condition that allows grammatical connection between adjacent words A morpheme analyzer comprising: a dictionary storage unit; and a dividing unit that divides the character string into predetermined units with reference to the word dictionary storage unit and the connectability dictionary storage unit with respect to the input character string. A known word search result requesting step for requesting a search result to an internal or external search device using a known word that is the word stored in the word dictionary storage means as a search condition;
A known word document vector calculation step of calculating a document vector by using all or part of the search results for each of the known words as one document;
A known word document vector associating step for associating a document vector generated for the known word with the known word;
When an undefined word that is not a word stored in the word dictionary storage means exists in the character string, a search result is requested from an internal or external search device using the undefined word as a search condition. A search result request step;
A document vector calculation step of calculating a document vector by using all or part of the search results as one document;
A similarity calculation step of calculating a similarity between the document vector of the undefined word and the document vector of the known word;
A similar word specifying step of specifying a similar word that is the known word corresponding to a document vector having a high similarity;
An attribute assigning step of associating the part of speech and cost of the similar word with the undefined word;
A dividing step of dividing the input character string into the units using the part of speech and cost associated with the undefined word in the attribute assigning step;
A morphological analysis method characterized by comprising the above steps.
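The document vector, similarity, similar word, and attribute assigning steps of this claim can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: search results are assumed to arrive as already tokenized, whitespace-separated snippets, document vectors are assumed to be raw term counts, and similarity is assumed to be cosine similarity. The function names and the dictionary layout `{word: (part_of_speech, cost)}` are hypothetical.

```python
import math
from collections import Counter


def document_vector(snippets):
    """Treat all search-result snippets for one word as a single document."""
    return Counter(term for snippet in snippets for term in snippet.split())


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(weight * v.get(term, 0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def assign_attributes(undefined_vector, known_vectors, dictionary):
    """Pick the most similar known word and reuse its part of speech and cost."""
    similar = max(known_vectors, key=lambda w: cosine(undefined_vector, known_vectors[w]))
    return similar, dictionary[similar]


# Hypothetical data standing in for results returned by a search device.
known_snippets = {
    "sushi": ["restaurant menu fish rice", "fish rice shop"],
    "run": ["marathon race training", "race shoes"],
}
dictionary = {"sushi": ("noun", 3000), "run": ("verb", 4500)}
known_vectors = {w: document_vector(s) for w, s in known_snippets.items()}

undefined_vector = document_vector(["rice fish bowl restaurant"])
similar_word, attributes = assign_attributes(undefined_vector, known_vectors, dictionary)
```

In this example the undefined word shares terms only with "sushi", so it inherits that word's part of speech and cost.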

On a computer, in a morphological analyzer comprising: word dictionary storage means that stores a plurality of Japanese words, each in association with a part of speech and a cost; connectability dictionary storage means that stores conditions under which adjacent words can be grammatically connected; and dividing means that divides an input character string into predetermined units with reference to the word dictionary storage means and the connectability dictionary storage means:
A known word search result requesting step of requesting a search result from an internal or external search device using, as a search condition, a known word, which is a word stored in the word dictionary storage means;
A known word document vector calculation step of calculating a document vector by using all or part of the search results for each of the known words as one document;
A known word document vector associating step for associating a document vector generated for the known word with the known word;
A search result requesting step of requesting, when an undefined word that is not stored in the word dictionary storage means exists in the character string, a search result from an internal or external search device using the undefined word as a search condition;
A document vector calculation step of calculating a document vector by using all or part of the search results as one document;
A similarity calculation step of calculating a similarity between the document vector of the undefined word and the document vector of the known word;
A similar word specifying step of specifying a similar word that is the known word corresponding to a document vector having a high similarity;
An attribute assigning step of associating the part of speech and cost of the similar word with the undefined word;
A dividing step of dividing the input character string into the units using the part of speech and cost associated with the undefined word in the attribute assigning step;
A morpheme analysis program characterized by causing the computer to execute the above steps.
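The dividing step, which uses the cost assigned to the undefined word, can be illustrated with a simplified minimum-cost segmentation. This sketch assumes that only per-word costs are summed; the connectability conditions between adjacent words and the parts of speech are deliberately left out, and the function name `divide` and the toy costs are hypothetical.

```python
def divide(text, costs):
    """Return the segmentation of text with the lowest total word cost, or None.

    costs: {word: cost}; an undefined word can be added with the cost
    inherited from its similar word.
    """
    inf = float("inf")
    best = [0.0] + [inf] * len(text)  # best[i]: lowest cost to segment text[:i]
    back = [0] * (len(text) + 1)      # back[i]: start index of the last word in text[:i]
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            if word in costs and best[start] + costs[word] < best[end]:
                best[end] = best[start] + costs[word]
                back[end] = start
    if best[-1] == inf:
        return None  # no segmentation is possible with the given dictionary
    words, pos = [], len(text)
    while pos > 0:
        words.append(text[back[pos]:pos])
        pos = back[pos]
    return words[::-1]


# Without an entry for "xy" the string is split into two words; once "xy"
# receives a cost lower than the combined cost of "x" and "y", it is kept whole.
before = divide("xy", {"x": 6, "y": 6})
after = divide("xy", {"x": 6, "y": 6, "xy": 10})
```

The point of the sketch is that giving the undefined word a usable cost changes which segmentation is cheapest, which is how the assigned attributes affect the division result.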