JP2007293685A

JP2007293685A - Device, program, and method for mining text

Info

Publication number: JP2007293685A
Application number: JP2006122164A
Authority: JP
Inventors: Yasuhiro Takayama; 泰博高山; Takeyuki Aikawa; 勇之相川; Makoto Imamura; 誠今村
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2006-04-26
Filing date: 2006-04-26
Publication date: 2007-11-08
Anticipated expiration: 2026-04-26
Also published as: JP4876692B2

Abstract

PROBLEM TO BE SOLVED: Conventional techniques map two-word dependency (modification) relations onto a two-dimensional plane, so the user's analysis is limited to relations between two words, and dependency relations involving three or more words cannot be reviewed freely and interactively from a broader analytical viewpoint. SOLUTION: A text mining device includes: a dependency relation extracting means for extracting dependency relations among two or more words; a dependency relation structure constituting means for classifying the words that make up the inter-word dependency relations by category and constructing a dependency relation network structure among the words in the classified categories; a display means for displaying the inter-word dependency relation network structure; an input means for receiving input of display condition settings; and a display object data operating means for instructing the display condition settings for the displayed inter-word dependency relation network structure. COPYRIGHT: (C)2008,JPO&INPIT

Description

The present invention relates to a technique for analyzing the linguistic information contained in a large volume of text data, extracting associations among words and dependency relationships, and visualizing and displaying the text mining results for those associations.

With the spread of the Internet, corporate advertising and consumer opinions about products are published as Web pages, and large volumes of customer feedback now arrive as document data via e-mail and the Web. Within companies, documents are increasingly digitized, and documents needed for quality control work, such as design specifications and failure investigation reports, continue to accumulate. Text mining devices that extract useful information from the text of these accumulated documents and put it to use for business improvement are therefore becoming increasingly important.

Among such text mining devices, some conventional devices let the user define groups by conditions such as text attribute data or full-text search. When the user designates one of those groups, the device extracts words or dependency information from the texts in that group and displays a list that ranks them by appearance frequency or by weights based on statistical indices, giving an overview of the information in the texts.

For example, Patent Document 1 below discloses a technique in which, when the user designates a group defined by conditions such as attribute data or full-text search, a list is displayed that ranks the words or dependency information extracted from the texts in that group by appearance frequency or IDF (Inverse Document Frequency) weight. It also discloses a technique that uses these statistics with widely used statistical analysis methods, such as correspondence analysis and principal component analysis, to place words and dependency information on a two-dimensional map for visualization.

As a method of arranging and displaying relationships between words on a map, Patent Document 2 discloses an information map creation technique. Patent Document 2 displays the relationships between words in network form and thins out the arcs.

Patent Document 1: Japanese Patent Laid-Open No. 2004-021445 (page 15, FIG. 1)
Patent Document 2: Japanese Patent Laid-Open No. 2004-21913 (page 17, FIGS. 1 to 2; page 18, FIG. 5)

The prior art disclosed in Patent Documents 1 and 2 performs two-dimensional mapping of two-word dependency relationships, which focus on the relationship between two words, so the scope of the user's analysis is limited to relationships of at most two words. As a result, the user cannot obtain a free, interactive overview of dependency relationships involving three or more words from a broader analytical viewpoint.

In addition, in the prior art, when many dependency relationships extracted from the text are displayed at once, the content of the text is hard to grasp, and the words that deserve attention in the analysis are buried among unimportant words and become difficult to find.

In addition, in the prior art, when dependency relationships involving three or more words are displayed as a network without considering the importance of the links that make up the network, only a single network structure is obtained. The user therefore cannot freely switch between an overview display and a detailed display, and cannot easily browse the results at different levels of analytical granularity.

In addition, in the prior art, the dependency relationships of three or more words that the analyst wants to focus on are buried among other unnecessary links, making it difficult to narrow the view to the nodes and links of interest.

In addition, in the prior art, when low-frequency or unimportant words are mixed into the dependency network structure, it is difficult to analyze the dependency relationships centered on important words.

In addition, in the prior art, when similar words each become separate nodes of the network structure, links are drawn to each of them as distinct nodes even though the words are similar. The number of links therefore grows, the display becomes tangled, and the content is hard to grasp.

In addition, in the prior art, a display of dependency relationships between two words makes it difficult to grasp the detailed content that surrounds a high-frequency dependency relationship.

In addition, in the prior art, a display of dependency relationships between two words cannot show dependency relationships of three or more words, so the content of such a relationship cannot be grasped as a single unit.

In addition, in the prior art, the user cannot designate an arbitrary set of dependency relationships as a compound concept. It is therefore not possible to extract new relationships between compound concepts from user-designated sets of dependency relationships, or to analyze important relational information latent in the analysis target.

In addition, in the prior art, when the words constituting dependency relationships are classified into groups, no knowledge source specific to the field under analysis is used as the classification criterion. Dependency relationships specific to that field therefore cannot be grasped, and analysis that is useful for the target business is difficult.

In addition, in the prior art, the user cannot interactively and freely edit or correct the category to which each word belongs while referring to the displayed dependency relationships of words between categories, so the content of the dependency relationships extracted from the target data is not easy to grasp.

The present invention has been made to solve the problems described above. By providing a dependency relationship structure configuration means, it enables a network display in which the dependency relationships between words extracted from the analysis target text are classified into multiple categories, and it allows the display information on categories, nodes, and arcs of the displayed network to be adjusted freely. Its object is to enable free, interactive analysis ranging from an overall overview of wide-ranging dependency relationships spanning two or more words to a detailed, narrowed-down display.

To solve these problems, the text mining device, text mining program, and text mining method according to the present invention comprise: registration means for analyzing analysis target data and registering word information extracted from the text of the analysis target data together with attribute values of the analysis target data; dependency extraction means for analyzing the extracted word information and extracting dependency relationships between two or more words; dependency frequency counting means for counting the frequency of the dependency relationships between the words; attribute value frequency counting means for counting the frequency of the attribute values; category-specific word counting means for counting, for each predefined category, the word information that was extracted by the registration means and whose attribute value frequencies were counted by the attribute value frequency counting means; dependency relationship structure configuration means for classifying the words that constitute the dependency relationships by category and constructing a dependency relationship network structure between the words in the classified categories; display means for displaying the dependency relationship network structure between the words; input means for receiving input of display condition settings; and display target data operation means for instructing the display condition settings for the displayed dependency relationship network structure between the words.

According to the present invention, providing the dependency relationship structure configuration means enables a network display in which the dependency relationships between words extracted from the analysis target text are classified into three or more categories, and the display information on categories, nodes, and arcs of the displayed network can be adjusted freely. This enables free, interactive analysis ranging from an overall overview of wide-ranging dependency relationships spanning three or more words to a detailed, narrowed-down display.

Embodiment 1.
FIG. 1 is a block diagram showing Embodiment 1 of the present invention. Reference numeral 101 denotes a registration unit, which analyzes the analysis target data and registers the word information extracted from its text and its attribute values in the word database 113 and the attribute database 112, respectively, both described later. Hereinafter, "database" is abbreviated as DB. 102 denotes a dependency extraction unit, which analyzes the extracted word information and extracts dependency relationships between two or more words. 103 denotes an input unit, which accepts input from the user. 104 denotes a display target data operation unit, which instructs changes to the data settings for the displayed dependency relationship network structure between words. 105 denotes a control unit, which controls the flow of the series of processes from registration of the analysis target data to display. 106 denotes a dependency frequency totaling unit, which counts the frequency of the dependency relationships between words extracted by the dependency extraction unit 102. 107 denotes an attribute value frequency totaling unit, which counts the frequency of attribute values in the analysis target data. 108 denotes a category-specific word totaling unit, which counts the words extracted by the registration unit 101 for each predefined category. 109 denotes a dependency relationship structure configuration unit, which classifies the dependency relationships between words extracted from the analysis target text. 110 denotes a display unit, which displays the dependency relationship network structure between words constructed by the dependency relationship structure configuration unit 109. 111 denotes the document data to be analyzed. 112 denotes the attribute DB, which stores information on attributes extracted from the document data. 113 denotes the word DB, which stores information on words extracted from the document data. 114 denotes the dependency DB, which stores information on dependency relationships extracted from the document data. 115 denotes a semantic category dictionary, which stores semantic categories and the words belonging to each semantic category. 116 denotes the output information from the display unit, which is a visual display of the dependency relationship structure.

The basic processing in Embodiment 1 of the present invention is outlined below with reference to FIGS. 1 to 4 as appropriate. The details of the basic processing are described with reference to the flowchart of the basic processing in Embodiment 1 shown in FIG. 2.

First, in step S201 of FIG. 2, the registration unit 101 reads the document data 111 to be analyzed and extracts the words that appear in the document data 111 together with attribute information such as bibliographic information. Information on the extracted words is stored in the word DB 113, and the extracted attribute information is stored in the attribute DB 112. FIG. 3 shows an example of the word DB 113 in the present invention: after the words appearing in the document data 111 are extracted, word information such as the word ID, headword, part of speech, and conjugation form is stored in the word DB 113. The words to be stored are selected from, for example, the result of morphological analysis, taking nouns, adjectives, adjectival nouns, and the like as the parts of speech to extract. FIG. 4 shows an example of the attribute DB 112 in the present invention, in which the document ID, reception date, type, target department, and so on are stored as document attributes. The kinds of attributes are designed according to the viewpoint from which the document data 111 is to be analyzed, and each document is given values for those attributes.
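
The registration step can be pictured with a small sketch. It assumes that morphological analysis has already been run and that its output is available as (headword, part of speech, conjugation form) tuples; the class name `WordRecord`, the function `register_words`, and the part-of-speech tag strings are hypothetical names chosen for illustration, not taken from the patent.

```python
from dataclasses import dataclass

# Parts of speech kept for registration (tag names are assumptions).
TARGET_POS = {"noun", "adjective", "adjectival noun"}


@dataclass(frozen=True)
class WordRecord:
    """One row of a word DB like the one in FIG. 3 (hypothetical layout)."""
    word_id: int
    headword: str
    pos: str
    conjugation: str


def register_words(analyzed_tokens):
    """Build word-DB rows from (headword, pos, conjugation) tuples.

    The tuples stand in for morphological-analysis output, which this
    sketch does not compute. Each distinct (headword, pos) pair is
    registered once and numbered in order of first appearance.
    """
    records = []
    seen = set()
    for headword, pos, conjugation in analyzed_tokens:
        if pos not in TARGET_POS or (headword, pos) in seen:
            continue
        seen.add((headword, pos))
        records.append(WordRecord(len(records) + 1, headword, pos, conjugation))
    return records
```

Registering each word only once is a design choice of this sketch; the text only states that a word ID is stored with each word.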

Next, in step S202, the registration unit 101 calls the dependency extraction unit 102, which extracts the dependency relationships between words appearing in the document data to be analyzed and stores them in the dependency DB 114. FIG. 5 shows an example of dependency relationships in the present invention. In the figure, for the example sentence "It is store A, and the unfriendly attitude is infuriating." (Ａ店だけど、不親切な態度が頭にくる。; sentence ID = 1), morphological analysis yields the words "store A" (Ａ店), "unfriendly" (不親切), "attitude" (態度), and "is infuriating" (頭にくる), and the two-word dependency relationships "unfriendly → attitude" and "attitude → is infuriating" are extracted. For the sentence with sentence ID = 1, these two-word relationships can also be combined so that the three-word dependency relationship "unfriendly → attitude → is infuriating" is extracted at the same time. The dependency relationships obtained in this way are stored, for example as in the dependency DB example of FIG. 6, with a dependency ID together with the source word, the destination word, the sentence ID, and so on.
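
As a minimal sketch of how two-word dependencies from the same sentence might be combined into a three-word dependency, the following joins pairs whose destination word is the source word of another pair. The function name `chain_three_word` and the tuple layout are assumptions made for illustration; the patent does not specify the procedure.

```python
from collections import defaultdict


def chain_three_word(pairs):
    """Combine two-word dependencies into three-word chains.

    pairs: list of (sentence_id, source_word, destination_word).
    Two pairs are joined only when they come from the same sentence and
    the destination of the first is the source of the second.
    """
    by_source = defaultdict(list)
    for sentence_id, source, destination in pairs:
        by_source[(sentence_id, source)].append(destination)

    chains = []
    for sentence_id, source, middle in pairs:
        # .get avoids creating empty entries while we iterate.
        for destination in by_source.get((sentence_id, middle), []):
            chains.append((sentence_id, source, middle, destination))
    return chains
```

With the example sentence above, the pairs (1, "unfriendly", "attitude") and (1, "attitude", "is infuriating") produce the single chain (1, "unfriendly", "attitude", "is infuriating").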

Next, in step S203, the control unit 105 receives the attribute information stored in the attribute DB 112 and sends it to the attribute value frequency totaling unit 106, which counts the frequency of the attribute values for each attribute.

Next, in step S204, the control unit 105 receives the word information stored in the word DB 113 and sends it to the category-specific word frequency totaling unit 108. The category-specific word frequency totaling unit 108 refers to the word information received from the word DB 113 and to the semantic category dictionary 115, which stores the correspondence between word classifications and semantic categories, and counts the frequency of words for each semantic category.
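
The category-wise counting of step S204 could look roughly like the following. The sketch assumes that the semantic category dictionary can be flattened into a word-to-category mapping with one category per word; that simplification, and the name `count_by_category`, are assumptions for illustration only.

```python
from collections import Counter


def count_by_category(words, word_category):
    """Count word occurrences separately for each semantic category.

    words: iterable of word occurrences (one entry per occurrence).
    word_category: dict mapping word -> category (assumed form of the
    semantic category dictionary). Words without a category are skipped.
    """
    counts = {}
    for word in words:
        category = word_category.get(word)
        if category is None:
            continue
        counts.setdefault(category, Counter())[word] += 1
    return counts
```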

Next, in step S205, the control unit 105 receives the dependency information stored in the dependency DB 114, and the dependency frequency totaling unit 107 counts the frequency of each dependency relationship.

Next, in step S206, the dependency relationship structure configuration processing is performed by the dependency relationship structure configuration unit 109. The operation of the dependency relationship structure configuration unit 109 is described below with reference to the detailed flowchart in FIG. 7. First, in step S701, the dependency relationship structure configuration unit 109 acquires the dependency pairs and the frequency of each dependency pair from the dependency frequency totaling unit 107.

Next, in step S702, it acquires the category classification of the words and the frequency of the words belonging to each category from the category-specific word frequency totaling unit 108.

Next, in step S703, from the information acquired in steps S701 and S702, it generates, for each pair of categories, link information between the words belonging to the source category and the words belonging to the destination category. A link with a higher dependency frequency is given a larger value as the value attached to the link.
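
A sketch of the link generation in step S703 follows, under the assumption that dependency pair frequencies are held in a `Counter` keyed by (source, destination) and that each word has a single category. The link weight here is simply the raw frequency; the text only requires that a higher frequency give a larger value. All names are illustrative.

```python
from collections import Counter, defaultdict


def build_links(pair_counts, word_category):
    """Group weighted links by (source category, destination category).

    pair_counts: Counter mapping (source_word, destination_word) -> frequency.
    word_category: dict mapping word -> category (assumed one per word).
    """
    links = defaultdict(list)
    for (source, destination), frequency in pair_counts.items():
        source_cat = word_category.get(source)
        destination_cat = word_category.get(destination)
        if source_cat is None or destination_cat is None:
            continue  # words outside every category produce no link
        links[(source_cat, destination_cat)].append(
            # Assumption: weight = raw frequency.
            {"source": source, "target": destination, "weight": frequency}
        )
    return dict(links)
```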

Next, in step S704, in order to arrange and display the dependency relationships as a network organized by category, the display position of each category is determined and the display information is generated. For example, when general parts of speech and case relations such as subject and object are used as categories, "verb" is placed at the center; "noun as subject" and "noun as object", which are in a case relation with the verb, are placed to its left; and "adjective", which modifies the nouns, is placed further to the left. "Adverb", which modifies the verb, is placed to the right of "verb". The arrangement may be specified by the user in a settings file, for example so that "adjective", "noun", "verb", and "adverb" are lined up in that order from the left. Alternatively, the categories may be arranged from left to right in descending order of the total frequency of their words as dependency destinations. This concludes the description of the operation of the dependency relationship structure configuration unit 109.
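
The two ways of ordering categories described for step S704 can be sketched as below. The frequency-based rule is implemented under one possible reading, namely that a category's score is the total frequency with which its words appear as dependency destinations; that reading, the alphabetical tie-break, and the function name `order_categories` are assumptions.

```python
def order_categories(pair_counts, word_category, fixed_order=None):
    """Return category names in left-to-right display order.

    fixed_order: optional user-specified order (e.g. read from a settings
    file); when given, it is used as-is.
    Otherwise categories are sorted by the total frequency with which
    their words occur as dependency destinations, highest first.
    """
    if fixed_order is not None:
        return list(fixed_order)

    totals = {category: 0 for category in word_category.values()}
    for (_source, destination), frequency in pair_counts.items():
        category = word_category.get(destination)
        if category is not None:
            totals[category] += frequency
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(totals, key=lambda c: (-totals[c], c))
```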

Next, in step S207, the display unit 110 visualizes the dependency relationship structure, as illustrated in FIG. 8, based on the information generated by the dependency relationship structure configuration unit 109. For example, 801 "adjective", 802 "noun: subject", 803 "noun: object", 804 "verb", and 805 "adverb" in FIG. 8 represent categories, and 806, 807, 808, 809, and 810 are the lists of words belonging to the categories 801 to 805, respectively. The link 811 connects the word "hard to understand" (わかりにくい) in the "adjective" category with the word "display" (表示) in the "noun: subject" category, which are in a direct dependency relationship. Because their dependency frequency is high, the value assigned to this link in step S703 of FIG. 7 is large, and the link is therefore displayed in bold.

As described above, according to the present invention, as in the display example of FIG. 8, the dependency relationships between words extracted from the analysis target text can be surveyed from the viewpoint of categories, and the dependency relationships between individual words can also be grasped easily.

In addition, each word shown in FIG. 8 carries the frequency counted by the category-specific word frequency totaling unit 108. With the display target data operation unit 104, the user can therefore specify, for the words displayed in each category (806 to 810 in FIG. 8), that they be listed from top to bottom in order of frequency, or that only a specified number of the most frequent words be displayed. Although not shown in FIG. 8, the frequency of each word can also be displayed alongside the word.
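
The per-category ordering and truncation described above might be sketched as follows; `top_words` and its parameters are hypothetical names, and the alphabetical tie-break is an assumption.

```python
def top_words(word_freq, limit=None):
    """Rank the words of one category by frequency, highest first.

    word_freq: dict mapping word -> frequency within the category.
    limit: if given, keep only that many top-ranked words.
    Returns a list of (word, frequency) pairs so that the frequency can
    be shown next to each word if desired.
    """
    ranked = sorted(word_freq.items(), key=lambda item: (-item[1], item[0]))
    return ranked if limit is None else ranked[:limit]
```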

In addition, with the display target data operation unit 104, the links to be displayed can be restricted to those whose dependency frequency is at or above a threshold. Changing this threshold lets the user move smoothly between an overview display and a detailed display.
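
A minimal sketch of the threshold filter follows, assuming each link is a dictionary with a numeric "weight" entry (a representation chosen here for illustration). Raising the threshold leaves only the strongest links, which corresponds to the overview; lowering it shows more links, which corresponds to the detailed view.

```python
def visible_links(links, min_weight):
    """Keep only links whose weight is at or above the threshold."""
    return [link for link in links if link["weight"] >= min_weight]
```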

In addition, with the display target data operation unit 104, the user can designate a node (a pair consisting of a category name, such as 801 in FIG. 8, and the group of words belonging to that category, such as 806) and, based on the dependency relationships of the words in the designated node's category, show or hide the dependency destinations or dependency sources of that node.

In addition, the display target data operation unit 104 can insert a work box into the display screen as a new node. Low-frequency words whose frequency is at or below a threshold can then be moved automatically into the work box as unimportant words, or user-designated unimportant words can be moved into it, so that the dependency network structure is rebuilt around the important words. The unimportant words may also be determined automatically using commonly used statistical indices such as the chi-square value or the tf*idf value.
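
The work box behavior can be sketched as below. The sketch uses raw frequency as the importance measure (the text also allows chi-square or tf*idf values, which are not implemented here), and it assumes links are dictionaries with "source" and "target" entries. The function names are hypothetical.

```python
def split_work_box(word_freq, threshold, manual=()):
    """Separate unimportant words into a work box.

    Words whose frequency is at or below the threshold, plus any words
    the user designated manually, go into the work box.
    Returns (kept_words, work_box).
    """
    work_box = {w for w, f in word_freq.items() if f <= threshold} | set(manual)
    kept = {w: f for w, f in word_freq.items() if w not in work_box}
    return kept, work_box


def prune_links(links, work_box):
    """Drop links that touch a word currently held in the work box."""
    return [
        link for link in links
        if link["source"] not in work_box and link["target"] not in work_box
    ]
```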

In addition, the dependency relationship structure configuration unit 109 can define a similarity measure and group similar words into a single concept. The similarity used here can be, for example: (1) surface string similarity based on the proportion of matching common substrings or on edit distance (the minimum number of character insertions, deletions, and substitutions needed to transform one string into another); (2) similarity based on manually defined synonym or near-synonym dictionaries; or (3) similarity based on the vector distance between word occurrence frequencies in the analysis target data. A group formed with such a similarity measure consists of a set of similar words, so using these groups as the display unit for the nodes shown by the display means 110 makes it possible to show similar content together in one place.
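
Option (1) above can be illustrated with a small sketch that computes edit distance by the standard dynamic-programming recurrence and then groups words greedily. Normalizing the distance by the longer string's length and comparing each word only with the first member of a group are assumptions of this sketch, not methods stated in the patent.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b)  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Map edit distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def group_similar(words, min_similarity):
    """Greedy grouping: a word joins the first group whose
    representative (first member) is similar enough."""
    groups = []
    for word in words:
        for group in groups:
            if similarity(word, group[0]) >= min_similarity:
                group.append(word)
                break
        else:
            groups.append([word])
    return groups
```

The greedy pass depends on input order; a clustering method that compares all pairs would avoid that at higher cost.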

As described above, because the text mining device according to the present invention includes the dependency relationship structure configuration unit 109 and the display target data operation unit 104, it can display a network in which the dependency relationships between words are classified into three or more categories, which was difficult with the conventional two-dimensional mapping of dependency information focused on two-word relationships, and it allows a wide-ranging overview.

In addition, because the display target data operation unit 104 allows the number of displayed words, their order, and so on to be specified for the words displayed in each category, the content of the word set that makes up each category is easier to grasp, and the words that deserve attention in the analysis are easier to find.

In addition, the display target data operation unit 104 controls the number of links displayed by setting a threshold on the strength of association, based on dependency frequency or the like, in units of dependency relationships involving three or more words. This makes it easy to browse at different levels of analytical granularity, from an overview display restricted to highly relevant, important links to a detailed display that increases the number of links and shows the associations comprehensively.

In addition, because the display target data operation unit 104 allows display, non-display, or reordering to be specified for the dependency-target and dependency-source links of a designated node, the user can narrow the view to the nodes and links that deserve attention.

In addition, by filtering out low-frequency words or non-important words in the display target data operation unit 104, the dependency network structure can be reconstructed around the important words.

In addition, by grouping similar words into a single concept in the dependency relationship structure configuration unit 109, similar content can be displayed together in one place, which makes the content easier to grasp.

As described above, according to the present invention, the dependency relationships between categories and the words belonging to them are displayed as a network, and the display information on categories, nodes, and links can be freely adjusted for the displayed network. This enables interactive analysis ranging from an overall overview of the dependency relationships to a detailed, narrowed-down display.

Embodiment 2.
In Embodiment 1 above, the dependency relationships between categories and the words belonging to them are displayed as a network, and the display information of the displayed network can be freely adjusted. Embodiment 2, described next, extracts core dependency relationships and peripheral dependency relationships.

FIG. 9 is a configuration diagram showing Embodiment 2 of the present invention. In this configuration, in addition to the configuration of Embodiment 1, the dependency relationship structure configuration unit 109 includes a core dependency relationship extraction unit 901, which extracts dependency relationships whose dependency frequency is at or above a threshold, and a peripheral dependency relationship extraction unit 902, which extracts dependency relationships that involve each word constituting the core dependency relationships extracted by the core dependency relationship extraction unit 901 and that have a frequency at or above a threshold. The rest of the configuration is the same as in Embodiment 1, so its description is omitted.

In the dependency relationship network described in Embodiment 1, the core dependency relationship extraction unit 901 extracts dependency relationships for which the number of incoming links of a single word (the number of times it is a dependency target) is at or above a threshold and the frequency of the dependency pairs in a direct dependency relationship with that word is at or above another threshold.

Next, the peripheral dependency relationship extraction unit 902 extracts dependency relationships whose frequency with each word constituting the core dependency relationships is at or above a predetermined threshold. The display unit 110 then displays a dependency relationship network like that of Embodiment 1, restricted to the dependency relationships extracted by the core dependency relationship extraction unit 901 and the peripheral dependency relationship extraction unit 902.
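One possible reading of the two extraction stages is sketched below. It assumes that a word's incoming-link count is the number of distinct dependency sources pointing at it and that a pair is "directly related" to a word when the word is at either end. Those interpretations, the function names, and the sample frequencies are assumptions, not details stated in the description.

```python
from collections import defaultdict


def extract_core(pair_freq, in_link_threshold, pair_threshold):
    """Frequent pairs that touch a word with many distinct incoming links."""
    sources = defaultdict(set)
    for src, dst in pair_freq:
        sources[dst].add(src)
    hubs = {w for w, srcs in sources.items() if len(srcs) >= in_link_threshold}
    return {pair: n for pair, n in pair_freq.items()
            if n >= pair_threshold and (pair[0] in hubs or pair[1] in hubs)}


def extract_peripheral(pair_freq, core, threshold):
    """Non-core pairs that touch a core word and meet a (lower) threshold."""
    core_words = {w for pair in core for w in pair}
    return {pair: n for pair, n in pair_freq.items()
            if pair not in core and n >= threshold
            and (pair[0] in core_words or pair[1] in core_words)}


# Hypothetical (source, target) dependency pair frequencies.
pair_freq = {
    ("clerk", "attitude"): 12, ("unkind", "attitude"): 9,
    ("attitude", "angry"): 15, ("guard", "attitude"): 2,
    ("attitude", "bad"): 3, ("confusing", "attitude"): 2,
    ("price", "high"): 1,
}

core = extract_core(pair_freq, in_link_threshold=3, pair_threshold=5)
peripheral = extract_peripheral(pair_freq, core, threshold=2)
```

With these sample values, the low-frequency pairs around "attitude" survive as peripheral relationships, while the unrelated pair ("price", "high") is dropped.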

FIG. 10 is an example of the dependency relationship network display in Embodiment 2. Peripheral dependency relationships such as "guard (no) - attitude (ga) - bad" and "confusing - attitude (ga) - get angry", which in FIG. 2 were buried among high-frequency dependency relationships such as "attitude (ni) - get angry", "clerk (no) - attitude", and "unkind (na) - attitude", can now be extracted.

FIG. 11 is an example of the index used by the dependency relationship structure configuration unit 109 in Embodiment 2. In FIG. 11, 1101 is the word ID uniquely assigned to each word, 1102 is the list of IDs of the records (registration units extracted from the analysis target) in which the word appears, 1103 is the list of sentence IDs assigned to the sentences extracted from the records, and 1104 is the list of dependency IDs extracted from the records. The record IDs, sentence IDs, and dependency IDs form a hierarchy: in a dependency ID written A-B-C in FIG. 11, A is the record ID, B is the ID of a sentence in record A, and C is the ID of a dependency in sentence B of record A.

In general, holding three-word dependency relationships in an index requires an enormous amount of data. With the dependency index organized as in FIG. 11, the dependency relationship structure configuration unit 109 can extract two-word dependency relationships and then select only those that share a common sentence ID, thereby extracting only pseudo three-word dependency relationships. As a result, only links in a strongly connected pseudo three-word dependency relationship can be selected and displayed on the display unit 110. This approach picks up not only cases such as "the unkind attitude makes me angry", where "unkind - attitude - get angry" are all in direct dependency relationships, but also cases such as "when there is an unkind display, it makes me angrier than the attitude does". For visualization aimed at grasping the core dependency relationships, however, this is not a problem given the goal of a high-level overview.
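The pseudo three-word extraction can be sketched as follows. The sketch assumes the index maps each word to its list of hierarchical dependency IDs in the "A-B-C" form and treats two chained two-word dependencies as a pseudo three-word relationship when all three words share a record-and-sentence prefix. The function names and the sample index are hypothetical.

```python
def sentence_key(dep_id):
    """'A-B-C' -> (A, B): the record ID and sentence ID that contain the dependency."""
    record, sentence, _dependency = dep_id.split("-")
    return (record, sentence)


def sentences_of(index, word):
    return {sentence_key(dep_id) for dep_id in index.get(word, ())}


def pseudo_triples(index, pairs):
    """Chain (a, b) and (b, c); keep the triple if a, b, c share a sentence.
    Returns {(a, b, c): number of shared sentences}."""
    pairs = list(pairs)
    result = {}
    for a, b in pairs:
        for b2, c in pairs:
            if b == b2 and a != c:
                shared = (sentences_of(index, a) & sentences_of(index, b)
                          & sentences_of(index, c))
                if shared:
                    result[(a, b, c)] = len(shared)
    return result


# Hypothetical index: word -> dependency IDs in which the word takes part.
index = {
    "unkind": ["1-1-1", "2-1-1"],
    "attitude": ["1-1-1", "1-1-2", "2-1-2", "3-1-1", "3-2-1"],
    "angry": ["1-1-2", "2-1-2", "3-2-1"],
    "display": ["2-1-1"],
    "clerk": ["3-1-1"],
}
pairs = [("unkind", "attitude"), ("attitude", "angry"),
         ("clerk", "attitude"), ("unkind", "display")]

triples = pseudo_triples(index, pairs)
```

In this sample, record 2 plays the role of the indirect case mentioned in the text: "unkind" modifies "display" there, yet the sentence still counts toward "unkind - attitude - angry" because all three words occur in it. "clerk - attitude - angry" is rejected because "clerk" and "angry" never share a sentence.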

In addition, the display target data operation unit 104 lets the user designate an arbitrary set of dependencies as a composite concept. The index shown in FIG. 11 is then searched for other composite concepts that have a dependency relationship with words belonging to the designated composite concept, and these are displayed. For example, when "bad - mood - wait" is designated as a composite concept, other composite concepts such as "bad - mood - make a mistake" and "mistype - cash register - wait" are retrieved and displayed. In this way, relationships between composite concepts can be extracted and displayed, and important relational information latent in the analysis target can be analyzed.
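A minimal sketch of the composite-concept lookup follows. It assumes composite concepts are represented as word tuples and that "related" simply means sharing at least one word with the designated concept. The ranking by overlap and the function name are additions made for illustration.

```python
def related_concepts(target, candidates):
    """Other composite concepts sharing at least one word with the target,
    ordered by how many words they share (most shared first)."""
    target_words = set(target)
    hits = [c for c in candidates if c != target and target_words & set(c)]
    return sorted(hits, key=lambda c: -len(target_words & set(c)))


candidates = [
    ("bad", "mood", "wait"),
    ("bad", "mood", "make a mistake"),
    ("mistype", "cash register", "wait"),
    ("clerk", "attitude", "angry"),
]

found = related_concepts(("bad", "mood", "wait"), candidates)
```

Here the candidate list stands in for the result of searching the index; the lookup returns the two concepts named in the example above and ignores the unrelated one.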

In the prior art, it was difficult to grasp the detailed content surrounding high-frequency dependency relationships in the displayed dependency relationships. In the present invention, configuring the dependency relationship structure configuration unit 109 as described above not only classifies the inter-word dependency relationships extracted from the analysis target text into three or more categories for network display, but also makes relationships among three or more words easy to grasp and allows the detailed dependency relationships surrounding high-frequency ones to be analyzed.

In addition, in the present invention, configuring the dependency relationship structure configuration unit 109 as described above allows the inter-word dependency relationships extracted from the analysis target text to be displayed restricted to dependency relationships of three or more words, so that displaying each dependency relationship as a coherent unit makes the content easier to grasp.

In addition, in the present invention, configuring the display target data operation unit 104 as described above makes it possible to extract new relationships between composite concepts from sets of dependency relationships arbitrarily designated as composite concepts, and thus to analyze important relational information latent in the analysis target.

Embodiment 3.
Embodiment 2 above uses the index structure to classify inter-word dependency relationships into three or more categories for network display and to make the detailed dependency relationships surrounding high-frequency ones easier to grasp. Embodiment 3, described next, has the same configuration as Embodiment 1 but stores semantic categories specific to the field under analysis in the semantic category dictionary 115, making field-dependent dependency relationships easier to grasp and analyze.

For the category-specific word frequency totaling unit 108 of Embodiment 1, an example was shown in which totaling is based on parts of speech from general linguistic analysis, with semantic categories such as "adjective", "noun: subject", "noun: object", and "verb" described in the semantic category dictionary 115. In Embodiment 3, correspondence information between field-dependent semantic categories and words is stored in the semantic category dictionary 115. This allows the display unit 110 to display the dependency network structure by field-dependent semantic category.

The configuration diagram for Embodiment 3 of the invention is the same as FIG. 1 used in Embodiment 1. FIG. 12 is an example of the dependency relationship network display in Embodiment 3 of the present invention. In this example, the purpose is to analyze failure reports for a certain machine product, and the semantic category dictionary 115 stores the following as semantic categories: the "device name" of the target machine product, the "device attribute" relating to the device name, the "phenomenon" of a failure occurring in the device, the "condition" under which the phenomenon occurred, and the "action" taken in response to the failure.

Words stored with the semantic category "device name" 1202 include "Z unit" and "main control unit", and words stored with the semantic category "device attribute" 1201 include "group A" and "primary side". Similarly, words such as "slip" and "display" are stored with the semantic category "phenomenon" 1203, words such as "inspection" and "replacement" with the semantic category "action" 1204, and words such as "rainy weather" and "running" with the semantic category "condition" 1205.

Because such words are stored in the semantic category dictionary 115, the display unit 110 can display the dependency relationship network by field-dependent semantic category, such as "device name" and "phenomenon", as shown in FIG. 12.
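A field-dependent semantic category dictionary of this kind can be pictured as a word-to-category mapping that decides which column of the network each node belongs to. In the sketch below, the words and categories come from the example above, while the link frequencies, the data layout, and the function name `nodes_by_category` are hypothetical.

```python
from collections import Counter, defaultdict

# Word -> semantic category, following the failure-report example.
SEMANTIC_CATEGORIES = {
    "Z unit": "device name", "main control unit": "device name",
    "group A": "device attribute", "primary side": "device attribute",
    "slip": "phenomenon", "display": "phenomenon",
    "inspection": "action", "replacement": "action",
    "rainy weather": "condition", "running": "condition",
}


def nodes_by_category(links, categories):
    """Group the words appearing in links by category, most frequent first.
    Words missing from the dictionary are skipped."""
    columns = defaultdict(Counter)
    for (src, dst), n in links.items():
        for word in (src, dst):
            category = categories.get(word)
            if category is not None:
                columns[category][word] += n
    return {cat: [w for w, _ in counts.most_common()]
            for cat, counts in columns.items()}


# Hypothetical dependency links with frequencies.
links = {
    ("Z unit", "slip"): 4,
    ("rainy weather", "slip"): 3,
    ("Z unit", "replacement"): 2,
}

columns = nodes_by_category(links, SEMANTIC_CATEGORIES)
```

Swapping in a different dictionary changes the columns without touching the extracted dependency data, which is the point of keeping the category assignment separate.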

Embodiment 3 has shown an example in which the category-specific word frequency totaling unit 108 totals word frequencies based on the categories stored in the semantic category dictionary 115, and the display unit 110 displays the dependency relationship network for each category stored in the semantic category dictionary 115. Alternatively, totaling may be performed for each attribute stored in the attribute DB 112, with the display means 110 displaying the dependency network structure by attribute.

By displaying the dependency network structure by semantic category on the display means 110, dependency relationships specific to the field under analysis can be grasped, and the content described in the document data becomes easier to analyze.

In addition, the display target data operation unit 104 allows the category to which each word belongs to be edited interactively with reference to the inter-category dependency relationships of the words displayed on the display means 110. The display can thus use semantic category names changed to suit the analysis target data, which makes the dependency relationships extracted from that data easier to grasp.

FIG. 1 is a basic configuration diagram of Embodiment 1 of the present invention.
FIG. 2 is a flowchart of the basic processing in Embodiment 1 of the present invention.
FIG. 3 is a diagram of an example of the word DB in Embodiment 1 of the present invention.
FIG. 4 is a diagram of an example of the attribute DB in Embodiment 1 of the present invention.
FIG. 5 is a diagram of an example of dependency relationships in Embodiment 1 of the present invention.
FIG. 6 is a diagram of an example of the dependency DB in Embodiment 1 of the present invention.
FIG. 7 is a flowchart of the dependency relationship structure configuration processing in Embodiment 1 of the present invention.
FIG. 8 is a diagram of an example of the dependency relationship network display in Embodiment 1 of the present invention.
FIG. 9 is a configuration diagram of Embodiment 2 of the present invention.
FIG. 10 is a diagram of an example of the dependency relationship network display in Embodiment 2 of the present invention.
FIG. 11 is a diagram of an example of the index of the dependency relationship structure configuration unit in Embodiment 2 of the present invention.
FIG. 12 is a diagram of an example of the dependency relationship network display in Embodiment 3 of the present invention.

Explanation of symbols

101 registration unit; 102 dependency extraction unit; 103 input unit; 104 display target data operation unit; 105 control unit; 106 attribute value frequency totaling unit; 107 dependency frequency totaling unit; 108 category-specific word frequency totaling unit; 109 dependency relationship structure configuration unit; 110 display unit; 111 analysis target; 112 attribute DB; 113 word DB; 114 dependency DB; 115 semantic category dictionary; 116 visualized display of the dependency relationship structure; 801 category "adjective"; 802 category "noun: subject"; 803 category "noun: object"; 804 category "verb"; 805 category "adverb"; 806 to 810 lists of words belonging to the categories; 811 link information; 901 core dependency relationship extraction unit; 902 peripheral dependency relationship extraction unit; 1101 word ID uniquely assigned to each word; 1102 list of IDs of the records (registration units extracted from the analysis target) in which the word appears; 1103 list of sentence IDs assigned to the sentences extracted from the records; 1104 list of dependency IDs extracted from the records; 1201 semantic category "device attribute"; 1202 semantic category "device name"; 1203 semantic category "phenomenon"; 1204 semantic category "action"; 1205 semantic category "condition".

Claims

A text mining device comprising:
registration means for analyzing analysis target data and registering word information extracted from the text of the analysis target data and attribute values of the analysis target data;
dependency extraction means for analyzing the extracted word information and extracting dependency relationships between two or more words;
dependency frequency totaling means for totaling the frequencies of the dependency relationships between the words;
attribute value frequency totaling means for totaling the frequencies of the attribute values;
category-specific word totaling means for totaling, for each predefined category, the word information that was extracted by the registration means and for which the attribute value frequencies were totaled by the attribute value frequency totaling means;
dependency relationship structure configuration means for classifying the words constituting the dependency relationships between the words by category and configuring a dependency relationship network structure among the words included in the classified categories; and
display means for displaying the dependency relationship network structure between the words.

The text mining device according to claim 1, further comprising: input means for receiving input of display condition settings; and display target data operation means for instructing setting of the display conditions for the displayed dependency relationship network structure between the words, wherein the display means displays the dependency relationship network structure between the words according to the display conditions thus set.

The text mining device according to claim 2, wherein the display target data operation means designates the number of display words and the display order for the display word group in the category.

The text mining device according to claim 2, wherein the display target data operation means limits the links to be displayed by setting a threshold on the relationship strength based on the frequency of the dependency relationship or on conceptual similarity, so that the dependency relationship network structure between the words is displayed at levels ranging from a summary display to a detailed display.

3. The display target data operation means designates a word in the category as a node, and designates display, non-display, or rearrangement regarding a dependency and a link of a dependency source for the specified node. The text mining device described in 1.

The text mining device according to claim 2, wherein the display target data operation means moves non-important words and low-frequency words from the dependency relationship network structure between the words to a work box in the display screen, and the dependency network structure between the words is reconstructed around the important words.

7. The text mining device according to claim 1, wherein the dependency relationship structure constituting unit groups similar words as one concept for the node.

The text mining device according to claim 1, wherein the dependency relationship structure configuration means comprises: a core dependency relationship extraction unit that extracts dependency relationships having a frequency of a predetermined number or more; and a peripheral dependency relationship extraction unit that extracts dependency relationships having a frequency of a certain number or more with each word constituting the dependency relationships extracted by the core dependency relationship extraction unit.

9. The text mining device according to claim 8, wherein the peripheral dependency relationship extracting unit extracts only links having a three-word dependency relationship that is a dependency relationship between three words.

The text mining device according to claim 2, wherein, when an arbitrary set of dependency groups is designated as a composite concept through the display target data operation means, another composite concept co-occurring with the designated composite concept is searched for, a new co-occurrence relationship between the composite concepts is extracted, and an important co-occurrence relationship latent in the analysis target data is displayed.

The text mining device according to claim 1, wherein the category-specific word frequency totaling means performs totaling for each semantic category described in a semantic category dictionary, and the display means displays, for each semantic category, the dependency network structure of the words between the categories.

3. The category to which each word belongs can be interactively edited by the display target data operating means with reference to a dependency network structure of each word between the categories displayed on the display means. The text mining device according to any one of 11 to 11.

A registration step of registering word information extracted from the text of the analysis target data and attribute values of the analysis target data by analyzing the analysis target data;
A dependency extraction step of analyzing the extracted word information and extracting a dependency relationship between two or more words;
Aggregating the frequency of the dependency relationship between the words, summing up the frequency of the attribute value,
A word summarizing step by category for summarizing the word information extracted by the registration step and summed up the frequency of the attribute value for each predefined category;
Classifying the words constituting the dependency relationship between the words by category, and configuring a dependency relationship network structure between the words included in the classified category;
A display step for displaying a dependency relationship network structure between the words;
An input step for receiving display condition setting input;
A display step of displaying the dependency relationship network structure between the words according to the display condition set by a display target data operation step that instructs setting of the display condition for the displayed dependency relationship network structure between the words;
A text mining program for causing a computer to execute the above steps.

Analyzing the analysis target data, registering the word information extracted from the text of the analysis target data and the attribute value of the analysis target data,
Analyzing the extracted word information to extract a dependency relationship between two or more words,
Aggregate the frequency of dependency relationships between the words and the frequency of the attribute values;
After the word information extracted by the registration and the frequency of attribute values is totalized for each predefined category,
Classifying the words constituting the dependency relationship between the words by the category, configuring a dependency relationship network structure between the words included in the classified category,
Display the dependency relationship network structure between the words,
Accept input of display condition setting,
A text mining method comprising instructing setting of the display condition to the displayed dependency relation network structure between words and displaying the dependency relation network structure between words according to the display condition.