JP2020004157A

JP2020004157A - Classification method, apparatus, and program

Info

Publication number: JP2020004157A
Application number: JP2018123998A
Authority: JP
Inventors: 淳真工藤; Jumma Kudo; 大紀塙; Daiki Hanawa; 俊秀宮城; Toshihide Miyagi; 幸太山越; Kota Yamakoshi; 佳祐廣田; Keisuke Hirota
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2018-06-29
Filing date: 2018-06-29
Publication date: 2020-01-09
Anticipated expiration: 2038-06-29
Also published as: JP7131130B2

Abstract

To improve the accuracy of classifying texts. SOLUTION: A receiving/analysis section 12 acquires multiple pieces of analysis result information each including a set of morphemes included in a text and attribute information of the morphemes, for any text out of received multiple texts. A dividing section 14 refers to a storage section storing morpheme information including a specific morpheme and attribute information of the specific morpheme, to determine whether a set of the specific morpheme included in the morpheme information and the attribute information of the specific morpheme is included in the acquired multiple pieces of analysis result information. When a result of determination is affirmative, the dividing section divides the text in a position, corresponding to an appearance position of a morpheme included in one piece of the analysis result information in any text, to generate multiple texts. A classification section 16 classifies the generated texts and other texts of the other received texts into multiple clusters. SELECTED DRAWING: Figure 9

Description

The disclosed technology relates to a classification method, a classification device, and a classification program.

Conventionally, documents (text data) written in natural language have been classified based on their content.

For example, an information retrieval system has been proposed in which questions and answers associated with each other are stored in a document storage unit, and the answers are classified into clusters based on the feature vector of each answer in the document storage unit.

In addition, an FAQ candidate extraction system has been proposed that takes discourse data and discourse semantics as input, and extracts and outputs question sentences that are FAQ candidates from the discourse data. In this system, the discourse semantics include flow information for each statement, and the system extracts from the discourse data question/request statements, that is, statements uttered by a customer and assigned a flow indicating that they are a question or a request. The system then extracts the question/request statements that contain a specified keyword, clusters them, and outputs the representative question/request statement of each cluster as an FAQ candidate.

An apparatus has also been proposed that includes a viewpoint list storage means storing a viewpoint list containing tree-structured viewpoints and attribute words, and a learning text information storage means storing a large amount of learning text information related to each attribute word. This apparatus extracts a plurality of keywords from shared content and derives a first vector whose elements are the keywords and whose values are their appearance frequencies. Then, for each keyword, it derives a second vector from the learning text information of the attribute word that matches the keyword, with the words contained in that learning text information as elements and their appearance frequencies as values. It further calculates the similarity between the two vectors, generates a viewpoint list with the similarities attached, and derives, for each layer of the viewpoint list, the viewpoint and attribute word with the largest variance in similarity.

JP 2002-41573 A
JP 2012-3704 A
JP 2012-70036 A

However, when each text contains fixed (boilerplate) expressions, for example, those expressions can interfere, so that appropriate features cannot be extracted from each document and the documents may not be classified properly.

In one aspect, the disclosed technology aims to improve the accuracy of text classification.

In one embodiment, the disclosed technology receives a plurality of texts and, for any one of the received texts, acquires a plurality of pieces of analysis result information, each of which includes a pair of a morpheme contained in the text and attribute information of that morpheme. The disclosed technology also refers to a storage unit that stores morpheme information including a specific morpheme and attribute information of the specific morpheme. It then determines whether any piece of the acquired analysis result information includes the pair of the specific morpheme and its attribute information contained in the morpheme information. When the determination result is affirmative, the disclosed technology divides the text at a position corresponding to the appearance position, in that text, of the morpheme included in that piece of analysis result information, thereby generating a plurality of texts. Furthermore, the disclosed technology classifies the other received texts and the generated texts into a plurality of clusters.

In one aspect, the disclosed technology has the effect of improving the accuracy of text classification.

FIG. 1 is a diagram for explaining the classification of documents.
FIG. 2 is a diagram for explaining the influence of fixed expressions on document classification.
FIG. 3 is a diagram for explaining the influence of fixed expressions on document classification.
FIG. 4 is a diagram for explaining the classification of texts for extracting fixed expressions.
FIG. 5 is a diagram for explaining a problem with the classification of texts for extracting fixed expressions.
FIG. 6 is a functional block diagram of the classification device according to the present embodiment.
FIG. 7 is a diagram showing an example of an analysis result produced by the reception analysis unit.
FIG. 8 is a diagram showing an example of the division dictionary.
FIG. 9 is a diagram for explaining the division of a text.
FIG. 10 is a diagram showing an example of the word model.
FIG. 11 is a diagram showing an example of the classification result screen.
FIG. 12 is a block diagram showing the schematic configuration of a computer that functions as the classification device according to the present embodiment.
FIG. 13 is a flowchart showing an example of the classification process in the present embodiment.
FIG. 14 is a flowchart showing an example of the division process.
FIG. 15 is a flowchart showing an example of the clustering process.
FIG. 16 is a flowchart showing an example of the display control process.

An example of an embodiment of the disclosed technology is described below with reference to the drawings.

The classification device according to the present embodiment classifies each text in a text set into a plurality of clusters in order to extract fixed expressions.

Before describing the embodiment in detail, the reason for classifying texts in order to extract fixed expressions is explained. Consider, for example, a case in which documents such as e-mails exchanged during system incident response are classified in order to identify what kind of case the incident represented by each document concerns.

For example, as shown in FIG. 1, the set of sentences for one incident is treated as one document, and each document in a document set covering multiple incidents is vectorized by the appearance frequencies of the words it contains, using a technique such as BoW (Bag of Words). The documents are then classified by grouping into clusters those documents whose vectors have a high cosine similarity ("0.7", "0.0", and "0.4" in FIG. 1).
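The vectorization and similarity step can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function names and the toy token lists are assumptions, and tokenization is taken as already done.

```python
import math
from collections import Counter


def bow_vector(tokens):
    """Bag-of-Words vector: maps each word to its number of occurrences."""
    return Counter(tokens)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical, already tokenized documents (one per incident).
doc1 = bow_vector(["server", "restart", "error", "server"])
doc2 = bow_vector(["server", "error", "log"])
doc3 = bow_vector(["network", "timeout"])

print(cosine_similarity(doc1, doc2))  # shares "server" and "error"
print(cosine_similarity(doc1, doc3))  # no shared words
```

Documents that share many words score close to 1 and documents with no words in common score 0, which is the property the clustering step relies on.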

Then, characteristic words contained in the documents belonging to each cluster (the underlined words in FIG. 1) are extracted using TF (Term Frequency)-IDF (Inverse Document Frequency) or the like and associated with that cluster, which makes it possible to grasp what kind of case the incidents in each cluster's documents concern. The TF value and the IDF value are defined as follows.

TF value of word w = (number of occurrences of word w in the document) / (total number of occurrences of all words in the document)
IDF value of word w = log((total number of documents) / (number of documents containing word w))
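The two definitions translate directly into code. The sketch below assumes documents are given as lists of tokens; the function names and sample data are illustrative only, and raising an error for a word that appears in no document is a choice made here, since the definition leaves that case undefined.

```python
import math


def tf(word, doc_tokens):
    """TF: occurrences of the word divided by all word occurrences in the document."""
    return doc_tokens.count(word) / len(doc_tokens)


def idf(word, docs):
    """IDF: log of (total number of documents / documents containing the word)."""
    containing = sum(1 for d in docs if word in d)
    if containing == 0:
        raise ValueError(f"{word!r} does not appear in any document")
    return math.log(len(docs) / containing)


# Hypothetical tokenized documents.
docs = [
    ["server", "error", "server", "down"],
    ["network", "timeout", "error"],
    ["server", "disk", "full"],
]

print(tf("server", docs[0]))                         # 2 of 4 words
print(tf("server", docs[0]) * idf("server", docs))   # TF-IDF score
```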

However, this vectorization also vectorizes the fixed expressions contained in each document, and the fixed expressions act as noise that affects the cosine similarity. For example, as shown in FIG. 2, question 1 is an inquiry about a server and question 2 is an inquiry about a network, so the two questions differ in content. Nevertheless, because both contain a fixed expression such as "Thank you for your hard work. This is Kudo from the First Service Development Department.", their cosine similarity becomes high.

Also, as shown in FIG. 3, when a document contains fixed expressions, the number of words appearing in the document increases, so the TF value of a word that should be a characteristic word decreases.

In the field of system incident response in particular, fixed expressions such as greetings and closing phrases tend to appear in the text of customer inquiries sent by e-mail and the like. To reduce the influence of fixed expressions on document classification, one option is to delete them from each document. However, fixed expressions can contain proper nouns, as in "This is Kudo from the First Service Development Department.", as well as expressions specific to each customer, so it is difficult to define them in advance.

One approach, as shown in FIG. 4, is to create a set of single-sentence texts by dividing each document in the document set at points that mark a sentence boundary, such as "。" (the Japanese full stop) or "\n" (a line feed code). Each sentence is then vectorized and clustered, and by inspecting the texts in each cluster, the clusters into which fixed expressions have been classified are identified and the fixed expressions are extracted. The extracted fixed expressions are then deleted from each document.
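A minimal sketch of the sentence-splitting step, assuming the two delimiters named above. Dropping the delimiter and discarding empty fragments are choices made here for illustration; the source does not say whether the full stop is kept.

```python
import re


def split_into_sentences(document):
    """Split a document at the Japanese full stop or a line feed."""
    parts = re.split(r"[。\n]", document)
    # Consecutive delimiters produce empty fragments; drop them.
    return [p.strip() for p in parts if p.strip()]


# Hypothetical inquiry text.
document = "お疲れさまです。\nサーバが停止しました。対処方法を教えて下さい"
for sentence in split_into_sentences(document):
    print(sentence)
```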

Classifying texts and extracting fixed expressions in this way makes it possible to extract even fixed expressions that contain unique wording. For this reason, the present embodiment classifies texts in order to extract fixed expressions.

However, when a sentence created from a document is a complex sentence, it may contain a fixed expression as one part and yet not be regarded as a fixed expression as a whole, and so may not be classified into the cluster for fixed expressions. For example, as shown in FIG. 5, the sentence classified into cluster 2 contains the fixed expression "please tell me how to deal with it", the same as the fixed expression classified into cluster 1, but it is not classified into cluster 1 because of the influence of the part "because an unexpected message is being output,".

The present embodiment therefore classifies texts in such a way that fixed expressions contained in complex sentences can also be extracted. The details of the embodiment are described below.

As shown in FIG. 6, the classification device 10 according to the present embodiment functionally includes a reception analysis unit 12, a division unit 14, a classification unit 16, and a display control unit 18. A division dictionary 22 and a word model 24 are stored in a predetermined storage area of the classification device 10. The reception analysis unit 12 is an example of the reception unit and the acquisition unit of the disclosed technology, and the division unit 14 is an example of the determination unit and the generation unit of the disclosed technology.

The reception analysis unit 12 receives the text set input to the classification device 10. For example, the reception analysis unit 12 receives a text set obtained as follows: sentences such as e-mails exchanged during system incident response are collected, the set of sentences for one incident is treated as one document, and each document in the document set is shaped into single-sentence texts. This shaping is done, for example, by dividing the document at points that mark a sentence boundary, such as "。" (the Japanese full stop) or "\n" (a line feed code).

The reception analysis unit 12 performs morphological analysis on each text in the received text set, dividing the text into morphemes and assigning attribute information such as part of speech and morpheme information to each morpheme. Using the result of the morphological analysis, the reception analysis unit 12 also performs dependency analysis on each text to analyze the dependency relations between phrases (bunsetsu).

FIG. 7 shows an example of the morphological analysis result and the dependency analysis result produced by the reception analysis unit 12 for the text "because an unexpected message is being output, please tell me how to deal with it" (予想外のメッセージが出力されている為、対処方法を教えて下さい). A in FIG. 7 is the morphological analysis result. In the example of FIG. 7, the morphemes in the text are numbered in order from the beginning of the text, and the attribute information obtained by the morphological analysis is associated with each morpheme. That is, the result of the morphological analysis is a list of morpheme information, each item of which includes a morpheme contained in the text and the attribute information of that morpheme. B in FIG. 7 is the dependency analysis result. In this example, the text is divided into phrases based on the morphological analysis result and the dependency relations between the phrases are analyzed; each phrase is shown as a box and each dependency relation as an arrow.

As shown in FIG. 8, the division dictionary 22 is a dictionary that stores morpheme information including a specific morpheme, contained in a phrase that serves as a break point when a text is divided, and the attribute information of that specific morpheme. For example, morpheme information contained in the phrases that make up predetermined adverbial clauses can be defined in the division dictionary 22 in advance.

Referring to the division dictionary 22, the division unit 14 determines, for each text, whether the morpheme information in the morphological analysis result produced by the reception analysis unit 12 includes a pair of a specific morpheme and its attribute information that is contained in the division dictionary 22. When the determination result is affirmative, the division unit 14 divides the text at a position corresponding to the position at which the specific morpheme appears in that text.
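The determination can be pictured as a membership test of (morpheme, attribute) entries against the dictionary. The sketch below is an assumption about representation only: the `Morpheme` type, its field names, and the hand-written analysis result are hypothetical, and the single dictionary entry is the one the description gives for FIG. 9.

```python
from typing import NamedTuple


class Morpheme(NamedTuple):
    surface: str  # the morpheme itself
    pos: str      # part of speech
    info: str     # further morpheme information


# Division dictionary with the one entry named in the description of FIG. 9.
DIVISION_DICTIONARY = {Morpheme("為", "名詞", "副詞可能")}


def contains_split_morpheme(morphemes, dictionary=DIVISION_DICTIONARY):
    """True if any analyzed morpheme matches a dictionary entry,
    comparing the morpheme together with its attribute information."""
    return any(m in dictionary for m in morphemes)


# Hypothetical, abbreviated analysis result for one text.
analysis = [
    Morpheme("メッセージ", "名詞", "一般"),
    Morpheme("為", "名詞", "副詞可能"),
]
print(contains_split_morpheme(analysis))
```

Matching on the pair rather than the surface form alone means that the same character with a different part of speech or attribute does not trigger a division.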

More specifically, as shown in FIG. 9, the division unit 14 starts from the last phrase of the text and, for each phrase in turn, identifies a phrase that modifies (depends on) it, and determines whether the identified phrase contains a morpheme and attribute information that match a pair defined in the division dictionary 22. In the example of FIG. 9, a phrase that modifies the last phrase (A in FIG. 9) contains "morpheme: 為 (tame, 'because'), part of speech: noun, morpheme information: adverbial-capable", which is defined in the division dictionary 22 (the location marked with a dashed circle in FIG. 9), so the text is divided immediately after this phrase. When a text contains an adverbial clause or the like, it is likely to be a complex sentence, and dividing it immediately after the phrase that contains the morpheme marking the adverbial clause splits the text into simple sentences.

The division unit 14 then treats the first half of the divided text as a new text and, as shown in B in FIG. 9, repeats the above processing from the last phrase of the new text. In this way, even a complex sentence containing three or more pieces of content can be divided into simple sentences.

When the phrase that modifies the phrase being processed does not contain morpheme information matching the morpheme information defined in the division dictionary 22, the division unit 14 does not divide the text and continues processing from that modifying phrase. When no phrase modifies the phrase being processed, the division unit 14 does not divide the text and moves the phrase being processed one position toward the beginning of the text.

Processing from the end of the text makes it possible to efficiently identify phrases that modify the predicate, such as adverbial clauses.

The division unit 14 adds to a simple sentence set, as simple sentences, the texts resulting from division for texts that were divided, and the original text for texts that were not divided.
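The end-to-start traversal described above can be sketched as follows. This is an illustration under stated assumptions, not the flowchart of FIG. 14: the dependency structure is written by hand instead of coming from a parser, the `Phrase` type and function names are hypothetical, morphemes other than the dictionary match are omitted, and the way several phrases modifying the same phrase are handled (nearest to the end first) is a choice made here.

```python
from typing import List, NamedTuple, Optional, Tuple

Morpheme = Tuple[str, str, str]  # (morpheme, part of speech, morpheme information)


class Phrase(NamedTuple):
    text: str                  # surface text of the phrase (bunsetsu)
    morphemes: List[Morpheme]  # morphemes in the phrase (abbreviated here)
    head: Optional[int]        # index of the phrase this one modifies


DIVISION_DICTIONARY = {("為", "名詞", "副詞可能")}


def has_split_morpheme(phrase, dictionary):
    return any(m in dictionary for m in phrase.morphemes)


def split_text(phrases, dictionary):
    """Walk from the last phrase toward the first, dividing the text right
    after any modifying phrase that contains a dictionary entry."""
    if not phrases:
        return []
    segments = []
    end = len(phrases) - 1  # last phrase of the text currently being processed
    i = end
    while i >= 0:
        # Phrases that modify phrase i, nearest to the end first (assumption).
        dependents = [j for j in range(i - 1, -1, -1) if phrases[j].head == i]
        match = next(
            (j for j in dependents if has_split_morpheme(phrases[j], dictionary)),
            None,
        )
        if match is not None:
            # Divide immediately after the matching phrase; the first half
            # becomes the new text and processing restarts from its end.
            segments.append("".join(p.text for p in phrases[match + 1:end + 1]))
            end = match
            i = match
        elif dependents:
            i = dependents[0]  # no match: continue from the modifying phrase
        else:
            i -= 1             # nothing modifies phrase i: move toward the start
    segments.append("".join(p.text for p in phrases[:end + 1]))
    segments.reverse()
    return segments


# Hand-written dependency structure for the example text of FIG. 9.
PHRASES = [
    Phrase("予想外の", [], 1),
    Phrase("メッセージが", [], 2),
    Phrase("出力されている為、", [("為", "名詞", "副詞可能")], 4),
    Phrase("対処方法を", [], 4),
    Phrase("教えて下さい", [], None),
]

for simple_sentence in split_text(PHRASES, DIVISION_DICTIONARY):
    print(simple_sentence)
```

A text with no matching phrase is returned as a single element, which corresponds to adding the original text to the simple sentence set unchanged.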

FIG. 10 shows an example of the word model 24. The word model 24 includes a word vector table 24A and an IDF value table 24B. The word vector table 24A stores each word in association with a word vector that represents the word as a vector obtained by TF-IDF, word2vec, or the like. The IDF value table 24B stores each word in association with its IDF value. The IDF values may be generated in advance from an arbitrary document set, or from the document set from which the input text set was derived.

The classification unit 16 classifies the other texts in the text set and the newly generated texts, that is, each of the simple sentences in the simple sentence set, into a plurality of clusters.

Specifically, the classification unit 16 vectorizes each simple sentence in the simple sentence set by referring to the word vector table 24A in the word model 24. The classification unit 16 then clusters the simple sentences with a known clustering method such as k-means or simple linkage, using a measure such as the cosine similarity of the simple sentences' word vectors.
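One way to picture this step is sketched below. Several points are assumptions rather than details from the source: a sentence vector is taken as the average of its word vectors, clustering is a single-linkage style grouping that joins any two sentences whose cosine similarity reaches a threshold, and the word vector table and sentences are toy data.

```python
import math

# Hypothetical toy word vector table (stands in for table 24A).
WORD_VECTORS = {
    "please": [1.0, 0.0],
    "advise": [0.9, 0.1],
    "server": [0.0, 1.0],
    "down": [0.1, 0.9],
}


def sentence_vector(words, table):
    """Average the vectors of the known words (zero vector if none are known)."""
    dim = len(next(iter(table.values())))
    vecs = [table[w] for w in words if w in table]
    if not vecs:
        return [0.0] * dim
    return [sum(v[k] for v in vecs) / len(vecs) for k in range(dim)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def single_linkage(vectors, threshold):
    """Group indices so that any pair with similarity >= threshold is linked."""
    parent = list(range(len(vectors)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a in range(len(vectors)):
        for b in range(a + 1, len(vectors)):
            if cosine(vectors[a], vectors[b]) >= threshold:
                parent[find(a)] = find(b)
    clusters = {}
    for idx in range(len(vectors)):
        clusters.setdefault(find(idx), []).append(idx)
    return sorted(clusters.values())


sentences = [["please", "advise"], ["please"], ["server", "down"], ["server"]]
vectors = [sentence_vector(s, WORD_VECTORS) for s in sentences]
print(single_linkage(vectors, threshold=0.8))
```

A library implementation of k-means or hierarchical clustering could be substituted for `single_linkage`; the threshold-based grouping is used here only to keep the sketch self-contained.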

The classification unit 16 also extracts characteristic words from each cluster based on how the words contained in the simple sentences classified into that cluster appear, and associates the extracted characteristic words with the cluster. TF-IDF or the like can be used as the measure of word appearance. The characteristic words are an example of the feature information and the representative morphemes of the disclosed technology.
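A minimal sketch of characteristic word extraction, assuming that TF is computed over all simple sentences in a cluster and that IDF values come from a precomputed table. The function name, the tie-breaking rule, and the sample values are illustrative assumptions.

```python
from collections import Counter


def characteristic_words(cluster_sentences, idf_table, top_n=1):
    """Return the top_n words of a cluster ranked by TF-IDF.

    TF is taken over every word occurrence in the cluster; words missing
    from the IDF table get a score of 0 (an assumption of this sketch).
    """
    counts = Counter(w for sentence in cluster_sentences for w in sentence)
    total = sum(counts.values())
    scores = {w: (c / total) * idf_table.get(w, 0.0) for w, c in counts.items()}
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


# Hypothetical cluster and IDF values.
cluster = [["server", "down"], ["server", "restart", "please"]]
idf_table = {"server": 1.2, "down": 1.5, "restart": 1.5, "please": 0.1}
print(characteristic_words(cluster, idf_table, top_n=2))
```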

Based on an index of how each simple sentence appears in the text set, the display control unit 18 arranges the clusters in order, starting with the cluster whose simple sentences have index values indicating the highest appearance frequency, and displays them on a display device (not shown).

For example, the display control unit 18 refers to the IDF value table 24B in the word model 24 to obtain the IDF value of each word in each simple sentence, and calculates the norm of each simple sentence's IDF value vector. For each cluster, the display control unit 18 calculates the average of the norms of the IDF value vectors of the simple sentences in that cluster. The display control unit 18 then sorts the clusters in ascending order of this average and displays them on the display device. A small average norm indicates that the simple sentences in the cluster appear across the whole text set, so such a cluster is regarded as one into which fixed expressions have been classified.
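The ordering rule can be sketched as follows. The Euclidean norm and the default IDF of 0 for unknown words are assumptions of this sketch, and the IDF values and clusters are toy data.

```python
import math


def idf_norm(words, idf_table):
    """Euclidean norm of the IDF value vector of one simple sentence."""
    return math.sqrt(sum(idf_table.get(w, 0.0) ** 2 for w in words))


def sort_clusters(clusters, idf_table):
    """Sort clusters in ascending order of the mean IDF-vector norm,
    so that clusters of widely recurring sentences come first."""
    def mean_norm(cluster):
        return sum(idf_norm(s, idf_table) for s in cluster) / len(cluster)

    return sorted(clusters, key=mean_norm)


# Hypothetical IDF values: greeting words are common, incident words are rare.
idf_table = {"server": 2.0, "down": 2.0, "please": 0.3, "advise": 0.4}
clusters = [
    [["server", "down"]],                  # incident-specific content
    [["please", "advise"], ["please"]],    # fixed expressions
]
for cluster in sort_clusters(clusters, idf_table):
    print(cluster)
```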

FIG. 11 shows an example of the classification result screen 30 displayed on the display device. In the example of FIG. 11, each cluster is enclosed in a frame, and the simple sentences in that cluster are displayed inside the frame. Each cluster is displayed together with the characteristic words that the classification unit 16 has associated with it. FIG. 11 shows an example in which the clusters containing fixed expressions are displayed above the clusters containing simple sentences that describe the specifics of incidents involving servers, networks, and the like.

The classification result screen 30 is not limited to the example of FIG. 11. For example, only the characteristic words associated with the clusters may be displayed on the display device in sorted order, and selecting a characteristic word on the screen may display the simple sentences in the cluster that the word represents.

The classification device 10 can be implemented by, for example, the computer 40 shown in FIG. 12. The computer 40 includes a CPU (Central Processing Unit) 41, a memory 42 serving as a temporary storage area, and a nonvolatile storage unit 43. The computer 40 also includes an input/output device 44 such as an input device and a display device, and an R/W (Read/Write) unit 45 that controls reading data from and writing data to a storage medium 49. The computer 40 further includes a communication I/F 46 connected to a network such as the Internet. The CPU 41, the memory 42, the storage unit 43, the input/output device 44, the R/W unit 45, and the communication I/F 46 are connected to one another via a bus 47.

The storage unit 43 can be implemented by an HDD (Hard Disk Drive), an SSD (Solid State Drive), a flash memory, or the like. The storage unit 43, serving as a storage medium, stores a classification program 50 that causes the computer 40 to function as the classification device 10. The classification program 50 has a reception analysis process 52, a division process 54, a classification process 56, and a display control process 58. The storage unit 43 also has an information storage area 60 that stores the information constituting the division dictionary 22 and the word model 24.

The CPU 41 reads the classification program 50 from the storage unit 43, loads it into the memory 42, and sequentially executes the processes of the classification program 50. By executing the reception analysis process 52, the CPU 41 operates as the reception analysis unit 12 shown in FIG. 6. By executing the division process 54, the CPU 41 operates as the division unit 14 shown in FIG. 6. By executing the classification process 56, the CPU 41 operates as the classification unit 16 shown in FIG. 6. By executing the display control process 58, the CPU 41 operates as the display control unit 18 shown in FIG. 6. The CPU 41 also reads information from the information storage area 60 and loads the division dictionary 22 and the word model 24 into the memory 42. The computer 40 executing the classification program 50 thereby functions as the classification device 10. The CPU 41 that executes the program is hardware.

The functions implemented by the classification program 50 can also be implemented by, for example, a semiconductor integrated circuit, more specifically an ASIC (Application Specific Integrated Circuit) or the like.

Next, the operation of the classification device 10 according to the present embodiment is described.

For example, a text set is input to the classification device 10, obtained by collecting sentences such as e-mails exchanged during system incident response, treating the set of sentences for one incident as one document, and shaping each document in the document set into single-sentence texts. The classification device 10 then executes the classification process shown in FIG. 13. The classification process is an example of the classification method of the disclosed technology.

In step S10 of the classification process shown in FIG. 13, the reception analysis unit 12 receives the text set S input to the classification device 10. The text set S includes text 1, text 2, ..., text N (N is the number of texts included in the text set S).

Next, in step S20, the dividing process shown in FIG. 14 is executed.

In step S22 of the dividing process shown in FIG. 14, an empty set is prepared as the simple sentence set P, and in the next step S24, the variable s, which identifies a text, is set to 1.

Next, in step S26, the reception analysis unit 12 performs morphological analysis on text s, dividing the text into morphemes and attaching attribute information to each morpheme, to obtain morpheme information for each morpheme included in text s. The reception analysis unit 12 also uses the result of the morphological analysis to perform dependency analysis on text s and analyzes the dependency relationships between phrases. The phrases of text s are numbered 0, 1, ..., m in order from the beginning of text s (m is the number assigned to the last phrase of text s).

Next, in step S28, the dividing unit 14 sets the variable i, which identifies a phrase of text s, to m.

Next, in step S30, based on the dependency analysis result obtained in step S26, the dividing unit 14 determines whether there is a phrase j that depends on (modifies) phrase i. If phrase j exists, the process proceeds to step S34. If phrase j does not exist, the process proceeds to step S32, where the dividing unit 14 sets phrase i-1, that is, the phrase immediately before phrase i, as the new phrase i, and the process proceeds to step S42.

In step S34, the dividing unit 14 determines whether phrase j, which depends on phrase i, includes morpheme information that matches the morpheme information defined in the division dictionary 22. If it does, the process proceeds to step S36; otherwise, the process proceeds to step S40.

In step S36, the dividing unit 14 divides text s into a part s_1 that follows phrase j and a part s_2 that runs up to and including phrase j. Next, in step S38, the dividing unit 14 adds the part s_1 to the simple sentence set P and sets the part s_2 as the new text s. Next, in step S40, the dividing unit 14 sets phrase j as the new phrase i.

Next, in step S42, the dividing unit 14 determines whether i is 0, thereby determining whether processing has reached the beginning of text s. If i = 0, the process proceeds to step S44; if i has not yet reached 0, the process returns to step S30.

In step S44, the dividing unit 14 adds text s to the simple sentence set P. As a result, for a text that was divided, the leading part remaining after division is added to the simple sentence set P, and for a text that was not divided, the original text is added to the simple sentence set P as it is.

Next, in step S46, the reception analysis unit 12 determines whether s equals N, thereby determining whether the processing of steps S26 to S44 has been completed for all texts included in the received text set S. If s has not yet reached N, the process proceeds to step S48, where the reception analysis unit 12 increments s by 1, and the process returns to step S26. If s = N, the dividing process ends and control returns to the classification process.
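The loop of steps S28 to S44 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the embodiment: the `Phrase` structure, the function name `split_text`, and the dictionary entry used in the example are hypothetical. The description does not say what happens when several phrases depend on the same phrase i, so the sketch assumes that the right-most dependent matching the dictionary is chosen, and otherwise the right-most dependent. It also assumes that dependents always precede the phrase they modify, and it loops while i > 0.

```python
from dataclasses import dataclass


@dataclass
class Phrase:
    text: str        # surface string of the phrase
    head: int        # index of the phrase this one depends on (-1 for the root)
    morphemes: list  # list of (surface, attribute) tuples


def split_text(phrases, split_dict):
    """Split one text into simple sentences by walking back from the last phrase."""
    simple_sentences = []
    end = len(phrases)        # exclusive end of the remaining text s
    i = len(phrases) - 1      # S28: start from the last phrase m
    while i > 0:              # S42: stop once the head of the text is reached
        # S30: phrases j that depend on phrase i, right-most first
        modifiers = [j for j in range(i - 1, -1, -1) if phrases[j].head == i]
        if not modifiers:
            i -= 1            # S32: move to the preceding phrase
            continue
        # S34: does a dependent contain a morpheme listed in the dictionary?
        hits = [j for j in modifiers
                if any(m in split_dict for m in phrases[j].morphemes)]
        if hits:
            j = hits[0]
            # S36/S38: the part after phrase j becomes a simple sentence
            simple_sentences.append("".join(p.text for p in phrases[j + 1:end]))
            end = j + 1
        else:
            j = modifiers[0]  # assumption: follow the right-most dependent
        i = j                 # S40
    # S44: whatever remains is added as it is
    simple_sentences.append("".join(p.text for p in phrases[:end]))
    return simple_sentences


# Hypothetical example: the dictionary entry ("為", "名詞") is an assumption.
example = [
    Phrase("障害が", 1, [("障害", "名詞"), ("が", "助詞")]),
    Phrase("発生した為、", 3, [("発生", "名詞"), ("し", "動詞"), ("た", "助動詞"), ("為", "名詞")]),
    Phrase("ログを", 3, [("ログ", "名詞"), ("を", "助詞")]),
    Phrase("確認して下さい。", -1, [("確認", "名詞"), ("し", "動詞"), ("て", "助詞"), ("下さい", "動詞")]),
]
result = split_text(example, {("為", "名詞")})
# result == ["ログを確認して下さい。", "障害が発生した為、"]
```

Because the walk starts at the end of the text, the trailing part is emitted first and the leading part last, which mirrors the order in which steps S38 and S44 add entries to the simple sentence set P.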

Next, in step S50 of the classification process shown in FIG. 13, the clustering process shown in FIG. 15 is executed.

In step S52 of the clustering process shown in FIG. 15, the classification unit 16 vectorizes each simple sentence included in the simple sentence set P using the word vector table 24A of the word model 24.

Next, in step S54, the classification unit 16 clusters the simple sentences by a conventionally known clustering method such as k-means or simple linkage, using, for example, the cosine similarity of the word vectors of the simple sentences.
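Steps S52 and S54 can be illustrated with the following minimal Python sketch. Several points are assumptions rather than details given in the description: a sentence vector is formed by averaging its word vectors, sentences are linked when their cosine similarity reaches a hypothetical `threshold`, and the grouping is a single-linkage-style merge, which may or may not be what the description means by "simple linkage". The function names and the toy word vectors are invented for illustration.

```python
import math


def sentence_vector(words, word_vectors):
    """Average the vectors of the known words (assumed vectorization scheme)."""
    dim = len(next(iter(word_vectors.values())))
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def cluster(vectors, threshold):
    """Link every pair whose similarity >= threshold and return a label per vector."""
    parent = list(range(len(vectors)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a in range(len(vectors)):
        for b in range(a + 1, len(vectors)):
            if cosine(vectors[a], vectors[b]) >= threshold:
                parent[find(a)] = find(b)

    roots, labels = {}, []
    for k in range(len(vectors)):
        labels.append(roots.setdefault(find(k), len(roots)))
    return labels


# Toy word vector table (hypothetical values).
table = {"restart": [1.0, 0.0], "reboot": [0.9, 0.1],
         "thanks": [0.0, 1.0], "regards": [0.1, 0.9]}
sentences = [["restart"], ["reboot"], ["thanks"], ["regards"]]
labels = cluster([sentence_vector(s, table) for s in sentences], threshold=0.9)
# labels == [0, 0, 1, 1]
```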

Next, in step S56, the classification unit 16 extracts feature words from each of the clusters based on an index indicating appearance status, such as the TF-IDF of the words included in the simple sentences classified into each cluster, and associates the extracted feature words with each cluster. The clustering process then ends and control returns to the classification process.
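One way to picture the feature word extraction of step S56 is the sketch below. The description only names TF-IDF as an example index, so the exact weighting here is an assumption: term frequency is counted within a cluster, and document frequency is counted over all simple sentences. The function name `feature_words` and the sample clusters are hypothetical.

```python
import math
from collections import Counter


def feature_words(clusters, top_n=1):
    """clusters maps a cluster id to a list of tokenized simple sentences."""
    all_sentences = [s for sents in clusters.values() for s in sents]
    n = len(all_sentences)
    # Document frequency: number of simple sentences containing each word
    df = Counter(w for s in all_sentences for w in set(s))
    result = {}
    for cid, sents in clusters.items():
        tf = Counter(w for s in sents for w in s)  # term frequency in the cluster
        scores = {w: tf[w] * math.log(n / df[w]) for w in tf}
        # Highest score first; ties are broken alphabetically for determinism
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[cid] = ranked[:top_n]
    return result


sample = {
    0: [["please", "restart"], ["please", "restart"], ["restart"]],
    1: [["please", "check", "log"], ["check", "log"]],
}
features = feature_words(sample)
# features == {0: ["restart"], 1: ["check"]}
```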

Next, in step S60 of the classification process shown in FIG. 13, the display control process shown in FIG. 16 is executed.

In step S62 of the display control process shown in FIG. 16, the display control unit 18 refers to the IDF value table 24B included in the word model 24, acquires the IDF value of each word included in each simple sentence, and calculates the norm of the IDF value vector of each simple sentence.

Next, in step S64, the display control unit 18 calculates, for each cluster, the average of the norms of the IDF value vectors of the simple sentences included in that cluster.

Next, in step S66, the display control unit 18 sorts the clusters in ascending order of the average norm of the IDF value vectors and displays, for example, the classification result screen 30 shown in FIG. 11 on the display device. The display control process then ends, and the classification process also ends.
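Steps S62 to S66 amount to a small ranking computation, sketched below. The description does not say which norm is used, so the Euclidean norm is assumed here, and words missing from the IDF table are treated as having an IDF value of 0.0, which is also an assumption. The names `idf_norm` and `sort_clusters` and the sample IDF values are hypothetical.

```python
import math


def idf_norm(words, idf_table):
    """Euclidean norm of the IDF value vector of one simple sentence (S62)."""
    return math.sqrt(sum(idf_table.get(w, 0.0) ** 2 for w in words))


def sort_clusters(clusters, idf_table):
    """Return cluster ids in ascending order of the average norm (S64, S66)."""
    average = {
        cid: sum(idf_norm(s, idf_table) for s in sents) / len(sents)
        for cid, sents in clusters.items()
    }
    return sorted(average, key=average.get)


idf_values = {"please": 0.1, "check": 0.2, "disk": 3.0, "error": 4.0}
grouped = {
    "a": [["disk", "error"]],
    "b": [["please", "check"], ["please"]],
}
order = sort_clusters(grouped, idf_values)
# order == ["b", "a"]
```

Clusters made of frequently occurring, low-IDF wording come first in this ordering, which is where fixed expressions would be expected to appear.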

As described above, the classification device according to the present embodiment divides each text included in the text set immediately after a phrase containing predetermined morpheme information, such as an adverbial clause, and then clusters and classifies the resulting texts based on, for example, the cosine similarity of word vectors. Thereby, even when a text is a compound sentence and part of it contains a fixed expression, the accuracy of classifying texts for the purpose of extracting fixed expressions can be improved.

In addition, because the dependency relationships are traced in order from the last phrase of the text to identify a phrase containing the predetermined morpheme information, such as an adverbial clause, the division position can be identified efficiently.

In the above embodiment, the clusters are sorted in ascending order of the average norm of the IDF value vectors of the simple sentences included in each cluster, but the sorting is not limited to this. For example, the clusters may be sorted in descending order of the number of simple sentences classified into each cluster. Because fixed expressions are assumed to appear frequently in the text set, a cluster containing many simple sentences is regarded as a cluster into which fixed expressions have been classified.

In the above embodiment, the same morpheme defined in the division dictionary may appear more than once in one text (for example, "〜の為、〜の為、〜下さい。", that is, "because of ..., because of ..., please ..."). Even in such a case, the number assigned to each morpheme during morphological analysis makes it possible to identify which phrase contains which numbered morpheme. Therefore, even when one text contains the same morpheme more than once, the individual occurrences can be distinguished when determining a match with the morpheme information defined in the division dictionary.
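The role of the morpheme numbers can be illustrated with the short sketch below. The data layout (a list of phrases, each a list of (surface, attribute) pairs) and the function names are assumptions made for illustration; the description only states that a number is assigned to each morpheme during morphological analysis.

```python
def number_morphemes(phrases):
    """Assign a running number to every morpheme and record its phrase index."""
    numbered = []
    n = 0
    for phrase_index, morphemes in enumerate(phrases):
        for surface, attribute in morphemes:
            numbered.append((n, phrase_index, surface, attribute))
            n += 1
    return numbered


def find_matches(numbered, entry):
    """Return (morpheme number, phrase index) for each occurrence of a dictionary entry."""
    return [(n, p) for n, p, surface, attribute in numbered
            if (surface, attribute) == entry]


# Hypothetical analysis result for a text in which 為 appears twice.
analysis = [
    [("障害", "名詞"), ("の", "助詞"), ("為", "名詞")],
    [("保守", "名詞"), ("の", "助詞"), ("為", "名詞")],
    [("確認", "名詞"), ("下さい", "動詞")],
]
occurrences = find_matches(number_morphemes(analysis), ("為", "名詞"))
# occurrences == [(2, 0), (5, 1)]
```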

In the above embodiment, a text set obtained by shaping documents relating to incident response for a system is input, but the disclosed technology is not limited to this and can be applied to various documents. The disclosed technology is particularly effective for documents that contain many fixed expressions.

In the above embodiment, the classification program is stored (installed) in the storage unit in advance, but the disclosed technology is not limited to this. The program according to the disclosed technology can also be provided in a form stored in a storage medium such as a CD-ROM, a DVD-ROM, or a USB memory.

Regarding the above embodiments, the following appendices are further disclosed.

(Appendix 1)
A classification method in which a computer executes a process comprising:
receiving a plurality of texts;
acquiring, for any one of the received plurality of texts, a plurality of pieces of analysis result information, each including a pair of a morpheme included in the text and attribute information of the morpheme;
determining, with reference to a storage unit that stores morpheme information including a specific morpheme and attribute information of the specific morpheme, whether any piece of the acquired analysis result information includes the pair of the specific morpheme and the attribute information of the specific morpheme included in the morpheme information;
when the determination result is affirmative, dividing the one text at a position corresponding to the appearance position, in the one text, of the morpheme included in that piece of analysis result information, to generate a plurality of texts; and
classifying the other texts among the received plurality of texts and the generated plurality of texts into a plurality of clusters.

(Appendix 2)
The classification method according to Appendix 1, wherein the process further comprises:
performing dependency analysis on the one text to generate dependency information indicating dependency relationships among a plurality of phrases included in the one text;
identifying, based on the generated dependency information, a phrase among the plurality of phrases that has a specific dependency relationship with any of the phrases;
identifying, among the acquired plurality of pieces of analysis result information, a plurality of pieces of analysis result information each including a pair of a morpheme included in the identified phrase and attribute information of the morpheme; and
dividing the one text to generate the plurality of texts when any of the identified pieces of analysis result information includes the pair of the specific morpheme and the attribute information of the specific morpheme included in the morpheme information.

(Appendix 3)
The classification method according to Appendix 1 or Appendix 2, wherein the process further comprises:
generating feature information for each of the plurality of clusters based on the texts classified into each of the plurality of clusters; and
displaying the generated feature information on a display unit in association with each of the plurality of clusters.

(Appendix 4)
The classification method according to Appendix 3, wherein the process further comprises:
determining, for the plurality of clusters, a plurality of representative morphemes each representing one of the clusters, based on the texts classified into each of the clusters and the appearance status, in the received plurality of texts, of the morphemes included in those texts; and
displaying each of the determined representative morphemes on the display unit in association with the cluster that it represents.

(Appendix 5)
The classification method according to Appendix 4, wherein the determined representative morphemes are displayed on the display unit arranged in an order according to the number of texts classified into each of the plurality of clusters.

(Appendix 6)
The classification method according to any one of Appendices 3 to 5, wherein, based on an index relating to the appearance status of each text in the plurality of texts, the clusters are displayed on the display unit arranged in the order in which the index for the texts included in each cluster indicates a higher appearance frequency.

(Appendix 7)
A classification device comprising:
a reception unit that receives a plurality of texts;
an acquisition unit that acquires, for any one of the plurality of texts received by the reception unit, a plurality of pieces of analysis result information, each including a pair of a morpheme included in the text and attribute information of the morpheme;
a determination unit that determines, with reference to a storage unit that stores morpheme information including a specific morpheme and attribute information of the specific morpheme, whether any piece of the analysis result information acquired by the acquisition unit includes the pair of the specific morpheme and the attribute information of the specific morpheme included in the morpheme information;
a generation unit that, when the determination result by the determination unit is affirmative, divides the one text at a position corresponding to the appearance position, in the one text, of the morpheme included in that piece of analysis result information, to generate a plurality of texts; and
a classification unit that classifies the other texts among the plurality of texts received by the reception unit and the plurality of texts generated by the generation unit into a plurality of clusters.

(Appendix 8)
The classification device according to Appendix 7, wherein:
the acquisition unit performs dependency analysis on the one text to acquire dependency information indicating dependency relationships among a plurality of phrases included in the one text;
the determination unit identifies, based on the dependency information acquired by the acquisition unit, a phrase among the plurality of phrases that has a specific dependency relationship with any of the phrases, and identifies, among the plurality of pieces of analysis result information acquired by the acquisition unit, a plurality of pieces of analysis result information each including a pair of a morpheme included in the identified phrase and attribute information of the morpheme; and
the generation unit divides the one text to generate the plurality of texts when any of the pieces of analysis result information identified by the determination unit includes the pair of the specific morpheme and the attribute information of the specific morpheme included in the morpheme information.

(Appendix 9)
The classification device according to Appendix 7 or Appendix 8, wherein:
the classification unit generates feature information for each of the plurality of clusters based on the texts classified into each of the plurality of clusters; and
the classification device further comprises a display control unit that displays the feature information generated by the classification unit on a display unit in association with each of the plurality of clusters.

(Appendix 10)
The classification device according to Appendix 9, wherein:
the classification unit determines, for the plurality of clusters, a plurality of representative morphemes each representing one of the clusters, based on the texts classified into each of the clusters and the appearance status, in the received plurality of texts, of the morphemes included in those texts; and
the display control unit displays each of the representative morphemes determined by the classification unit on the display unit in association with the cluster that it represents.

(Appendix 11)
The classification device according to Appendix 10, wherein the display control unit displays the representative morphemes determined by the classification unit on the display unit arranged in an order according to the number of texts classified into each of the plurality of clusters.

(Appendix 12)
The classification device according to any one of Appendices 9 to 11, wherein, based on an index relating to the appearance status of each text in the plurality of texts, the display control unit displays the clusters on the display unit arranged in the order in which the index for the texts included in each cluster indicates a higher appearance frequency.

(Appendix 13)
A classification program that causes a computer to execute a process comprising:
receiving a plurality of texts;
acquiring, for any one of the received plurality of texts, a plurality of pieces of analysis result information, each including a pair of a morpheme included in the text and attribute information of the morpheme;
determining, with reference to a storage unit that stores morpheme information including a specific morpheme and attribute information of the specific morpheme, whether any piece of the acquired analysis result information includes the pair of the specific morpheme and the attribute information of the specific morpheme included in the morpheme information;
when the determination result is affirmative, dividing the one text at a position corresponding to the appearance position, in the one text, of the morpheme included in that piece of analysis result information, to generate a plurality of texts; and
classifying the other texts among the received plurality of texts and the generated plurality of texts into a plurality of clusters.

(Appendix 14)
The classification program according to Appendix 13, wherein the process further comprises:
performing dependency analysis on the one text to generate dependency information indicating dependency relationships among a plurality of phrases included in the one text;
identifying, based on the generated dependency information, a phrase among the plurality of phrases that has a specific dependency relationship with any of the phrases;
identifying, among the acquired plurality of pieces of analysis result information, a plurality of pieces of analysis result information each including a pair of a morpheme included in the identified phrase and attribute information of the morpheme; and
dividing the one text to generate the plurality of texts when any of the identified pieces of analysis result information includes the pair of the specific morpheme and the attribute information of the specific morpheme included in the morpheme information.

(Appendix 15)
The classification program according to Appendix 13 or Appendix 14, wherein the process further comprises:
generating feature information for each of the plurality of clusters based on the texts classified into each of the plurality of clusters; and
displaying the generated feature information on a display unit in association with each of the plurality of clusters.

(Appendix 16)
The classification program according to Appendix 15, wherein the process further comprises:
determining, for the plurality of clusters, a plurality of representative morphemes each representing one of the clusters, based on the texts classified into each of the clusters and the appearance status, in the received plurality of texts, of the morphemes included in those texts; and
displaying each of the determined representative morphemes on the display unit in association with the cluster that it represents.

(Appendix 17)
The classification program according to Appendix 16, wherein the determined representative morphemes are displayed on the display unit arranged in an order according to the number of texts classified into each of the plurality of clusters.

(Appendix 18)
The classification program according to any one of Appendices 15 to 17, wherein, based on an index relating to the appearance status of each text in the plurality of texts, the clusters are displayed on the display unit arranged in the order in which the index for the texts included in each cluster indicates a higher appearance frequency.

(Appendix 19)
A storage medium storing a classification program that causes a computer to execute a process comprising:
receiving a plurality of texts;
acquiring, for any one of the received plurality of texts, a plurality of pieces of analysis result information, each including a pair of a morpheme included in the text and attribute information of the morpheme;
determining, with reference to a storage unit that stores morpheme information including a specific morpheme and attribute information of the specific morpheme, whether any piece of the acquired analysis result information includes the pair of the specific morpheme and the attribute information of the specific morpheme included in the morpheme information;
when the determination result is affirmative, dividing the one text at a position corresponding to the appearance position, in the one text, of the morpheme included in that piece of analysis result information, to generate a plurality of texts; and
classifying the other texts among the received plurality of texts and the generated plurality of texts into a plurality of clusters.

10 Classification device
12 Reception analysis unit
14 Dividing unit
16 Classification unit
18 Display control unit
24 Word model
24A Word vector table
24B IDF value table
30 Classification result screen
40 Computer
41 CPU
42 Memory
43 Storage unit
49 Storage medium
50 Classification program

Claims

A classification method in which a computer executes a process comprising:
receiving a plurality of texts;
acquiring, for any one of the received plurality of texts, a plurality of pieces of analysis result information, each including a pair of a morpheme included in the text and attribute information of the morpheme;
determining, with reference to a storage unit that stores morpheme information including a specific morpheme and attribute information of the specific morpheme, whether any piece of the acquired analysis result information includes the pair of the specific morpheme and the attribute information of the specific morpheme included in the morpheme information;
when the determination result is affirmative, dividing the one text at a position corresponding to the appearance position, in the one text, of the morpheme included in that piece of analysis result information, to generate a plurality of texts; and
classifying the other texts among the received plurality of texts and the generated plurality of texts into a plurality of clusters.

The classification method according to claim 1, wherein the process further comprises:
performing dependency analysis on the one text to generate dependency information indicating dependency relationships among a plurality of phrases included in the one text;
identifying, based on the generated dependency information, a phrase among the plurality of phrases that has a specific dependency relationship with any of the phrases;
identifying, among the acquired plurality of pieces of analysis result information, a plurality of pieces of analysis result information each including a pair of a morpheme included in the identified phrase and attribute information of the morpheme; and
dividing the one text to generate the plurality of texts when any of the identified pieces of analysis result information includes the pair of the specific morpheme and the attribute information of the specific morpheme included in the morpheme information.

The classification method according to claim 1 or 2, wherein the process further comprises:
generating feature information for each of the plurality of clusters based on the texts classified into each of the plurality of clusters; and
displaying the generated feature information on a display unit in association with each of the plurality of clusters.

The classification method according to claim 3, wherein the process further comprises:
determining, for the plurality of clusters, a plurality of representative morphemes each representing one of the clusters, based on the texts classified into each of the clusters and the appearance status, in the received plurality of texts, of the morphemes included in those texts; and
displaying each of the determined representative morphemes on the display unit in association with the cluster that it represents.

The classification method according to claim 4, wherein the determined representative morphemes are displayed on the display unit arranged in an order according to the number of texts classified into each of the plurality of clusters.

The classification method according to any one of claims 3 to 5, wherein, based on an index relating to the appearance status of each text in the plurality of texts, the clusters are displayed on the display unit arranged in the order in which the index for the texts included in each cluster indicates a higher appearance frequency.

A reception unit for receiving a plurality of texts,
An acquisition unit that acquires, for any one of the plurality of texts received by the reception unit, a plurality of pieces of analysis result information each including a set of a morpheme included in the text and attribute information of the morpheme,
A determination unit that refers to a storage unit that stores morpheme information including a specific morpheme and attribute information of the specific morpheme, and determines whether any piece of the plurality of pieces of analysis result information acquired by the acquisition unit includes the set of the specific morpheme included in the morpheme information and the attribute information of the specific morpheme,
A generation unit that, when the determination result by the determination unit is affirmative, divides the one of the texts at a position corresponding to the appearance position, in the one of the texts, of the morpheme included in the one piece of analysis result information, to generate a plurality of texts,
A classification unit that classifies the other texts of the plurality of texts received by the reception unit and the plurality of texts generated by the generation unit into a plurality of clusters,
A classification device comprising:
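
The determination and generation units can be sketched as follows, for illustration only. The `Analysis` record, the contents of `SPLIT_MORPHEMES`, and the choice to cut immediately before the matching morpheme are all assumptions of this sketch; the claim only requires division at a position corresponding to the appearance position of the morpheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    """One piece of analysis result information (hypothetical layout)."""
    surface: str    # the morpheme as it appears in the text
    attribute: str  # attribute information, e.g. a part-of-speech label
    start: int      # appearance position (character offset) in the text


# Stand-in for the storage unit: (specific morpheme, attribute) pairs.
SPLIT_MORPHEMES = {("but", "conjunction"), ("and", "conjunction")}


def split_text(text, analyses, stored=SPLIT_MORPHEMES):
    """Divide text before each morpheme whose (surface, attribute) pair is stored."""
    # Determination: both the morpheme and its attribute must match a stored pair.
    cuts = sorted(a.start for a in analyses if (a.surface, a.attribute) in stored)
    if not cuts:
        return [text]  # negative determination: the text is kept whole
    parts, prev = [], 0
    for cut in cuts:
        parts.append(text[prev:cut].strip())
        prev = cut
    parts.append(text[prev:].strip())
    return [part for part in parts if part]  # drop empty fragments


text = "the screen froze but the sound kept playing"
analyses = [
    Analysis("froze", "verb", text.index("froze")),
    Analysis("but", "conjunction", text.index("but")),
]
print(split_text(text, analyses))
```

Matching on the pair rather than on the surface form alone means that the same string with a different attribute does not trigger a division.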

Receiving a plurality of texts,
For any one of the received plurality of texts, acquiring a plurality of pieces of analysis result information each including a set of a morpheme included in the text and attribute information of the morpheme,
With reference to a storage unit that stores morpheme information including a specific morpheme and attribute information of the specific morpheme, determining whether any piece of the acquired plurality of pieces of analysis result information includes the set of the specific morpheme included in the morpheme information and the attribute information of the specific morpheme,
When the determination result is affirmative, dividing the one of the texts at a position corresponding to the appearance position, in the one of the texts, of the morpheme included in the one piece of analysis result information, to generate a plurality of texts,
Classifying the other texts of the received plurality of texts and the generated plurality of texts into a plurality of clusters,
A classification program for causing a computer to execute the processing described above.