JP2009015394A

JP2009015394A - Dictionary construction support device

Info

Publication number: JP2009015394A
Application number: JP2007173411A
Authority: JP
Inventors: Saori Kurata; 早織倉田; Kyoko Makino; 恭子牧野; Shigeru Matsumoto; 茂松本; Rumi Hayakawa; ルミ早川; Toshiyuki Kano; 敏行加納
Original assignee: Toshiba Corp; Toshiba Solutions Corp
Current assignee: Toshiba Corp; Toshiba Digital Solutions Corp
Priority date: 2007-06-29
Filing date: 2007-06-29
Publication date: 2009-01-22
Anticipated expiration: 2027-06-29
Also published as: JP4956298B2

Abstract

PROBLEM TO BE SOLVED: To support the construction of high-precision dictionary information without requiring the user to perform a selection operation.

SOLUTION: A characteristic expression extraction part 31 of a registration candidate expression generation part 3 extracts characteristic expressions from an analysis object data group stored in an analysis object data storage part 21. A similar expression extraction part 32 extracts similar expressions from the analysis object data group. An expression generation condition setting part 33 sets the generation conditions of expressions to be registered in a dictionary. An expression generation part 34 extracts co-occurrence expressions from the analysis object data group under the various expression generation conditions set by the expression generation condition setting part 33, and generates registration candidate expressions for the dictionary information. A dictionary registration decision part 51 of a dictionary registration part 5 decides which expressions in the dictionary registration candidate generation result obtained by the expression generation part 34 are to be registered in the dictionary information. The dictionary registration part 5 registers the dictionary information by storing those expressions, together with the group names associated with them in the dictionary registration candidate generation result, in a dictionary information storage part 22 of a storage part 2.

COPYRIGHT: (C)2009,JPO&INPIT

Description

The present invention relates to a dictionary construction support apparatus that supports the construction of dictionary information required, for example, for text mining of a document data group.

Conventionally, as an apparatus for supporting the construction of a dictionary for a document analysis system, there has been an apparatus that allows expressions extracted from document data to be registered in a dictionary with a simple operation (see, for example, Patent Document 1). In this apparatus, for example, words are extracted from sentence data by morphological analysis and syntax analysis, and a word list is then created by enumerating the extracted words. The dictionary is constructed by having the user select, from this word list, the words to be registered in the dictionary.
Patent Document 1: JP 2002-140338 A

However, selecting words by visual inspection is time-consuming for the user because it involves trial and error. Moreover, to create a dictionary with good classification accuracy, it is not sufficient merely to select and register words from a word list sorted by frequency of appearance.

Therefore, an object of the present invention is to provide a dictionary construction support apparatus that can support the construction of accurate dictionary information without requiring a selection operation by the user.

That is, the dictionary construction support apparatus according to the present invention comprises storage means for storing dictionary information for analyzing a document data group, input means for inputting a document data group for supporting the construction of the dictionary information, first extraction means for extracting feature expressions from the input document data group, and second extraction means for extracting similar expressions from the input document data group. Based on the extracted feature expressions and similar expressions, the apparatus generates expressions that differ from the expressions contained in the input document data group, and stores dictionary information including the generated expressions in the storage means.

According to the present invention, it is possible to support the construction of accurate dictionary information without requiring a selection operation by the user.

Embodiments of the present invention will be described below with reference to the drawings.
(First embodiment)
First, a first embodiment of the present invention will be described.
FIG. 1 is a block diagram showing a configuration example of a dictionary construction support apparatus according to the first embodiment of the present invention.
As shown in FIG. 1, the dictionary construction support apparatus according to the first embodiment includes an input device 1 as input means, a storage device 2 as storage means, a registration candidate expression generation unit 3, an expression editing unit 4 as expression editing means, a dictionary registration unit 5, a display device 6, and a classification processing unit 7 as classification processing means, all of which are connected to one another via a bus 8.

The storage device 2 is a storage medium such as a hard disk drive or a non-volatile memory. It stores the control programs to be executed by the registration candidate expression generation unit 3, the expression editing unit 4, the dictionary registration unit 5, which is the storage control means for the dictionary information, and the classification processing unit 7.
The storage device 2 also includes an analysis target data storage unit 21 that stores the document data group used to construct the dictionary information, and a dictionary information storage unit 22 that stores the dictionary information. In the present embodiment, the dictionary information is classification dictionary information in which expressions used to classify a document data group are associated with classification item names. The storage device 2 also functions as a work memory for the registration candidate expression generation unit 3, the expression editing unit 4, the dictionary registration unit 5, and the classification processing unit 7.

The registration candidate expression generation unit 3 includes a feature expression extraction unit 31, which is the first extraction means and extracts feature expressions from the document data group; a similar expression extraction unit 32, which is the second extraction means and extracts similar expressions from the document data group; an expression generation condition setting unit 33, which sets the generation conditions for expressions to be registered in the dictionary information; and an expression generation unit 34, which is the generation means and generates the expressions to be registered in the dictionary information. The input device 1 is, for example, a keyboard or a mouse. The display device 6 is a display.

The expression editing unit 4 edits the expressions extracted by the feature expression extraction unit 31 and the similar expression extraction unit 32 of the registration candidate expression generation unit 3 in accordance with input operations on the input device 1.
The dictionary registration unit 5 registers the expressions generated by the expression generation unit 34 of the registration candidate expression generation unit 3 in the dictionary information. The dictionary registration unit 5 includes a dictionary registration determination unit 51 that determines which of the expressions generated by the expression generation unit 34 are to be registered in the dictionary information.
The classification processing unit 7 classifies a document data group that requires classification, based on the dictionary information stored in the dictionary information storage unit 22 of the storage device 2.

Next, the operation of the dictionary construction support apparatus configured as shown in FIG. 1 will be described.
This embodiment aims to construct dictionary information for analyzing a travel agency's questionnaires, and assumes questionnaire response document data as the analysis target data group used to construct the dictionary information. However, the analysis target data is not particularly limited as long as it is text data.
Each piece of data in the analysis target data group also includes attribute information for its body, the analysis target text. In this embodiment, the attribute information distinguishes whether the trip covered by the questionnaire is domestic or overseas, whether the opinion in the questionnaire is favorable or unfavorable, and which facility or course was used.

FIG. 2 is a flowchart showing an example of the processing operation of the dictionary construction support apparatus according to the first embodiment of the present invention.
It is assumed that the document data group serving as the analysis target data group is already stored in the storage device 2. This document data group may be newly created by operating the input device 1 and stored in the storage device 2, or may be stored in the storage device 2 from an external device via an input/output interface (not shown).

First, when designation of an analysis target data group and construction of dictionary information are instructed by a predetermined input operation on the input device 1, the registration candidate expression generation unit 3 reads the designated analysis target data group from the document data groups stored in the storage device 2. For each piece of questionnaire response document data in that group, it stores, in the analysis target data storage unit 21, information in which the response is associated with the attribute information contained in that response document data (step S1).

FIG. 3 is a diagram showing, in table form, an example of the analysis target data group stored in the analysis target data storage unit of the dictionary construction support apparatus according to the first embodiment of the present invention.
Each row of the analysis target data group shown in FIG. 3 corresponds to one piece of questionnaire response document data. The analysis target data in each row associates the analysis target text, which is the body of the questionnaire response, with attribute A, which distinguishes whether the trip covered by the questionnaire is domestic or overseas; attribute B, which distinguishes whether the opinion is favorable or unfavorable; and attribute C, which distinguishes the facility or course used.
For example, in the analysis target data in the first row of FIG. 3, attribute A is "domestic", attribute B is "favorable opinion", attribute C is "hotel", and the analysis target text is "The hotel meals are good; I was satisfied" (ホテルの食事が良い、満足した).

Next, the feature expression extraction unit 31 of the registration candidate expression generation unit 3 extracts feature expressions from the analysis target data group stored in the analysis target data storage unit 21 (step S2).
Specifically, the feature expression extraction unit 31 extracts feature expressions from the analysis target data by a known feature word extraction method. For example, it performs morphological analysis and syntax analysis, calculates the word frequency or the product of TF (Term Frequency) and IDF (Inverse Document Frequency), and treats words as characteristic on the basis of these values (scores).
Here, TF is the sum, over all documents, of the number of times the expression appears in each document. DF is the number of documents in which the expression appears. IDF is the logarithm of the total number of documents divided by DF.
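The definitions of TF, DF, and IDF given above can be sketched in Python as follows. This is a minimal illustration rather than the apparatus's implementation: the function name `tf_idf_scores`, the pre-tokenized input (standing in for the output of morphological analysis), the natural logarithm, and the sample documents are all assumptions made for the example, since the description specifies neither the logarithm base nor a tokenizer.

```python
import math
from collections import Counter


def tf_idf_scores(docs):
    """Return {token: TF * IDF} for a list of token lists (one list per document)."""
    n_docs = len(docs)
    tf = Counter()  # TF: occurrences summed over all documents
    df = Counter()  # DF: number of documents containing the token
    for tokens in docs:
        tf.update(tokens)
        df.update(set(tokens))
    # IDF: logarithm of (total number of documents / DF)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf}


# Hypothetical, already tokenized questionnaire responses.
docs = [
    ["hotel", "meal", "good", "satisfied"],
    ["meal", "good"],
    ["room", "dirty"],
    ["meal", "satisfied", "satisfied"],
]
scores = tf_idf_scores(docs)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Sorting by the returned score, as in `ranked`, gives a candidate ordering of feature expressions.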

When attribute information is attached to the analysis target data, it is also possible to divide the analysis target data into several sets based on the attribute values and to extract group-specific expressions or feature expressions for each group using a measure such as mutual information. Mutual information expresses the degree of association between words, based for example on how often two words appearing in a sentence or document occur together. If the two items are taken to be a group and a word belonging to that group, mutual information also expresses the degree to which the word belongs to the group.
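One common way to realize such a group-to-word association score is pointwise mutual information estimated from document counts. The sketch below is an assumption for illustration only: the description names mutual information but does not give a formula, and the function name `pmi`, the base-2 logarithm, and the sample data are hypothetical.

```python
import math


def pmi(docs, group, word):
    """Pointwise mutual information log2(P(w, g) / (P(w) * P(g))).

    docs is a list of (group_label, token_list) pairs. Returns None when the
    value is undefined because one of the counts is zero.
    """
    n = len(docs)
    n_g = sum(1 for g, _ in docs if g == group)
    n_w = sum(1 for _, tokens in docs if word in tokens)
    n_gw = sum(1 for g, tokens in docs if g == group and word in tokens)
    if n_g == 0 or n_w == 0 or n_gw == 0:
        return None
    return math.log2((n_gw * n) / (n_g * n_w))


# Hypothetical responses labeled with attribute B.
labeled_docs = [
    ("favorable", ["meal", "good"]),
    ("favorable", ["scenery", "excellent"]),
    ("unfavorable", ["room", "dirty"]),
    ("unfavorable", ["meal", "dirty"]),
]
dirty_in_unfavorable = pmi(labeled_docs, "unfavorable", "dirty")
meal_in_favorable = pmi(labeled_docs, "favorable", "meal")
```

A positive value means the word appears in the group more often than chance would predict; a value near zero, as for "meal" here, means the word is not specific to the group.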

The feature expression extraction unit 31 extracts feature information for each group in accordance with the feature information extraction conditions stored in the storage device 2. Here, the extraction conditions for feature expressions belonging to the first group are "group division designated", "extract expressions for each group", and "attribute B is favorable opinion". The extraction conditions for feature expressions belonging to the second group are "group division designated", "extract expressions for each group", and "attribute B is unfavorable opinion".
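The group division by attribute B can be pictured as a simple partition of the records. In the sketch below, the dictionary-based record layout and the function name `split_by_attribute` are assumptions; the first record mirrors the first row of FIG. 3 as described in the text, and the second record is hypothetical.

```python
from collections import defaultdict


def split_by_attribute(records, attribute):
    """Partition analysis target texts by the value of one attribute."""
    groups = defaultdict(list)
    for record in records:
        groups[record[attribute]].append(record["text"])
    return dict(groups)


records = [
    # First row of FIG. 3 as described in the text.
    {"A": "domestic", "B": "favorable opinion", "C": "hotel",
     "text": "ホテルの食事が良い、満足した"},
    # Hypothetical second row, for illustration only.
    {"A": "domestic", "B": "unfavorable opinion", "C": "hotel",
     "text": "部屋が汚い"},
]
groups = split_by_attribute(records, "B")
```

Feature expressions would then be extracted separately from each list in `groups`.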

Based on the analysis target data group shown in FIG. 3, the feature expression extraction unit 31 generates processing data in which each feature expression is associated with its score, with the name of the attribute B that corresponds, in the table of FIG. 3, to the analysis target text from which the feature expression was extracted, and with a temporary group name unique to each attribute name. The entries are collected by group and sorted in order of score.

FIG. 4 is a diagram showing, in table form, an example of the processing data for feature expression extraction by the registration candidate expression generation unit of the dictionary construction support apparatus according to the first embodiment of the present invention.
In the processing data shown in FIG. 4, the feature expressions belonging to the group name "Group A" are "satisfied", "excellent", "good", and "interesting", whose source analysis target texts have the attribute B value "favorable opinion" in the table of FIG. 3. The feature expressions belonging to the group name "Group B" are "dirty" and others, whose source analysis target texts have the attribute B value "unfavorable opinion" in the table of FIG. 3.

From the processing data shown in FIG. 4, the feature expression extraction unit 31 generates data in which the temporary group name in the group name field is replaced with the attribute name given in the extraction condition field, and displays this data on the display device 6 as the feature expression extraction result.

FIG. 5 is a diagram showing, in table form, an example of the feature expression extraction result produced by the registration candidate expression generation unit of the dictionary construction support apparatus according to the first embodiment of the present invention.
In the feature expression extraction result shown in FIG. 5, "Group A" in the processing data shown in FIG. 4 is replaced with "favorable opinion", and "Group B" is replaced with "unfavorable opinion". As a result, the feature expressions associated with the group name "favorable opinion" are "satisfied", "excellent", "good", and "interesting", and the feature expressions associated with the group name "unfavorable opinion" are "dirty" and others.
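The step from the processing data of FIG. 4 to the extraction result of FIG. 5 amounts to grouping, sorting by score, and renaming the temporary groups. The following sketch shows one way to do this; the function names and the score values are hypothetical, since the text does not state the scores in FIG. 4.

```python
from collections import defaultdict


def build_processing_data(rows):
    """Group (expression, score, temporary_group) rows and sort each group by score."""
    grouped = defaultdict(list)
    for expression, score, group in rows:
        grouped[group].append((expression, score))
    return {
        group: sorted(items, key=lambda item: item[1], reverse=True)
        for group, items in grouped.items()
    }


def rename_groups(processing_data, attribute_names):
    """Replace each temporary group name with the attribute name it stands for."""
    return {attribute_names[g]: items for g, items in processing_data.items()}


# Expressions follow FIG. 4 as described; the scores are made up.
rows = [
    ("good", 0.6, "Group A"),
    ("satisfied", 0.9, "Group A"),
    ("dirty", 0.7, "Group B"),
    ("excellent", 0.8, "Group A"),
    ("interesting", 0.5, "Group A"),
]
processing_data = build_processing_data(rows)
extraction_result = rename_groups(
    processing_data,
    {"Group A": "favorable opinion", "Group B": "unfavorable opinion"},
)
```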

When an input operation for expression editing is performed on the input device 1 while the feature expression extraction result shown in FIG. 5 is displayed on the display device 6, the expression editing unit 4 edits the displayed feature expression extraction result and overwrites the result stored in the storage device 2 with the edited version.

Next, the similar expression extraction unit 32 of the registration candidate expression generation unit 3 extracts similar expressions from the analysis target data group stored in the analysis target data storage unit 21 (step S3).
Specifically, the similar expression extraction unit 32 divides the analysis target data group into several sets by a known document classification method, for example clustering, which generates sets (clusters) of documents whose contents are similar to one another.

For each cluster, the similar expression extraction unit 32 calculates, word by word, a score indicating the degree of membership, that is, the degree to which the expression belongs to the cluster. Using these word scores as a reference, it extracts, for each cluster, similar expressions, which are words with similar meanings.
Based on the analysis target data group shown in FIG. 3, the similar expression extraction unit 32 generates processing data in which each similar expression is associated with its score and with a temporary cluster name unique to the cluster to which the expression belongs. The entries are collected by cluster and sorted in order of score.

FIG. 6 is a diagram showing, in table form, an example of the processing data for similar expression extraction by the registration candidate expression generation unit of the dictionary construction support apparatus according to the first embodiment of the present invention.
In the processing data shown in FIG. 6, the similar expressions belonging to the cluster name "Cluster A" are "ミュージアム" (museum, loanword), "美術館" (art museum), and "博物館" (museum), which are contained in the analysis target texts of the analysis target data group shown in FIG. 3. The similar expressions belonging to the cluster name "Cluster B" are "お風呂" (bath) and "浴室" (bathroom), which are likewise contained in those analysis target texts.

From the processing data shown in FIG. 6, the similar expression extraction unit 32 generates data in which each entry in the cluster name field is replaced with the highest-scoring similar expression belonging to that cluster, and displays this data on the display device 6 as the similar expression extraction result.

FIG. 7 is a diagram showing, in table form, an example of the similar expression extraction result produced by the registration candidate expression generation unit of the dictionary construction support apparatus according to the first embodiment of the present invention.
In the similar expression extraction result shown in FIG. 7, the cluster name "Cluster A" in the processing data shown in FIG. 6 is replaced with "ミュージアム", which has the highest score among the similar expressions "ミュージアム", "美術館", and "博物館" belonging to "Cluster A". Likewise, the cluster name "Cluster B" is replaced with "お風呂", which has the highest score among the similar expressions "お風呂" and "浴室" belonging to "Cluster B". As a result, the similar expressions associated with the cluster name "ミュージアム" are "ミュージアム", "美術館", and "博物館", and the similar expressions associated with the cluster name "お風呂" are "お風呂" and "浴室".
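Replacing each temporary cluster name with its highest-scoring member, as in the step from FIG. 6 to FIG. 7, can be sketched as below. The function name and the membership scores are hypothetical; only the cluster members and the fact that "ミュージアム" and "お風呂" score highest come from the text.

```python
def name_clusters_by_top_expression(clusters):
    """Rename each cluster after its highest-scoring member expression.

    clusters maps a temporary cluster name to a list of (expression, score) pairs.
    """
    renamed = {}
    for members in clusters.values():
        top_expression, _ = max(members, key=lambda member: member[1])
        renamed[top_expression] = sorted(
            members, key=lambda member: member[1], reverse=True
        )
    return renamed


# Members follow FIG. 6 as described; the scores are made up.
clusters = {
    "Cluster A": [("美術館", 0.7), ("ミュージアム", 0.9), ("博物館", 0.6)],
    "Cluster B": [("お風呂", 0.8), ("浴室", 0.5)],
}
similar_result = name_clusters_by_top_expression(clusters)
```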

When an input operation for expression editing is performed on the input device 1 while the similar expression extraction result shown in FIG. 7 is displayed on the display device 6, the expression editing unit 4 edits the displayed similar expression extraction result and overwrites the result stored in the storage device 2 with the edited version.

Next, the expression generation condition setting unit 33 of the registration candidate expression generation unit 3 sets the generation conditions for expressions to be registered in the dictionary, and displays them on the display device 6 (step S4).
Specifically, the expression generation condition setting unit 33 sets, as the first generation condition, the feature expression extraction result generated by the feature expression extraction unit 31 in step S2. As the second generation condition, it sets a headword expression and a plurality of synonyms belonging to that headword expression, based on the similar expression extraction result generated by the similar expression extraction unit 32 in step S3.

FIG. 8 is a diagram showing, in table form, an example of the expression generation conditions derived from the similar expression extraction result by the registration candidate expression generation unit of the dictionary construction support apparatus according to the first embodiment of the present invention.
In the expression generation conditions shown in FIG. 8, the similar expressions associated with "ミュージアム" in the similar expression extraction result of FIG. 7 are assigned to the headword expression "ミュージアム" as follows: "ミュージアム" is set as synonym A (the first synonym), "美術館" as synonym B (the second synonym), and "博物館" as synonym C (the third synonym). Similarly, of the similar expressions associated with "お風呂" in FIG. 7, "お風呂" is set as synonym A (the first synonym) of the headword expression "お風呂", and "浴室" is set as synonym B (the second synonym).

When an input operation for condition editing is performed on the input device 1 while the generation conditions shown in FIG. 8 are displayed on the display device 6, the expression generation condition setting unit 33 edits the displayed expression generation conditions and overwrites the conditions stored in the storage device 2 with the edited version.

Next, based on the various expression generation conditions set by the expression generation condition setting unit 33, the expression generation unit 34 of the registration candidate expression generation unit 3 extracts stand-alone expressions and co-occurrence expressions from the analysis target data group, and thereby generates registration candidate expressions for the dictionary information (step S5).

The expression generation unit 34 extracts co-occurrence expressions by extracting expressions that co-occur with the expressions in the expression generation conditions set in step S4. What counts as co-occurrence is determined by any of various existing criteria; for example, co-occurring expressions may be adjacent to each other or separated from each other, measured in morphemes. The expression generation unit 34 may also extract co-occurrence expressions using syntax analysis.
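A window-based criterion is one simple way to decide that two expressions co-occur. The sketch below counts tokens that appear within a fixed number of morphemes of a target expression; the function name, the `window` parameter, and the sample token lists are assumptions, and the description leaves the actual criterion open.

```python
from collections import Counter


def cooccurring_tokens(docs, target, window=2):
    """Count tokens that appear within `window` positions of `target`.

    docs is a list of token lists; window=1 means directly adjacent tokens only.
    """
    counts = Counter()
    for tokens in docs:
        for i, token in enumerate(tokens):
            if token != target:
                continue
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[tokens[j]] += 1
    return counts


# Hypothetical, already tokenized responses.
token_docs = [
    ["hotel", "meal", "good", "satisfied"],
    ["meal", "satisfied"],
    ["tour", "price", "satisfied"],
]
neighbors = cooccurring_tokens(token_docs, "satisfied", window=2)
```

Here "meal" is counted twice because it falls inside the window in two documents, while "hotel" lies outside the window and is not counted.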

FIG. 9 is a diagram showing, in table form, an example of the dictionary registration candidate generation result produced by the registration candidate expression generation unit of the dictionary construction support apparatus according to the first embodiment of the present invention.
In the dictionary registration candidate generation result shown in FIG. 9, each expression obtained by the expression generation unit 34 as a registration candidate for the dictionary information is associated with its TF and DF and with the name of the group to which it belongs.

In this dictionary registration candidate generation result, the group name associated with a co-occurrence expression is generated by the expression generation unit 34 from the expressions that make up the co-occurrence expression and from the group names and cluster names associated with those expressions in the expression generation conditions, the feature expression extraction result, and the similar expression extraction result.
Similarly, the TF and DF associated with a co-occurrence expression are calculated by the expression generation unit 34 on the basis of, for example, the scores of the relevant expressions in the expression generation conditions, the feature expression extraction result, and the similar expression extraction result.

By using the feature expressions "満足" (satisfied), "抜群" (excellent), "良い" (good), and "汚い" (dirty) from the feature expression extraction result of FIG. 5, which was set as the first generation condition, the expression generation unit 34 obtains the registration candidate expressions shown in FIG. 9: "満足" (satisfied), "食事満足" (meal satisfied), "ツアー価格満足" (tour price satisfied), "景色が抜群" (the scenery is excellent), "食事が満足" (the meal is satisfying), "食事が良い" (the meal is good), and "部屋が汚い" (the room is dirty). The group name of each of these registration candidate expressions follows the group name, "favorable opinion" or "unfavorable opinion", of the corresponding feature expression in the generation condition from which it was obtained.

Of the expressions obtained, "満足" in the first row of the table in FIG. 9 is the feature expression "満足" from the first generation condition, adopted unchanged by the expression generation unit 34 as a registration candidate. "食事が良い", enclosed by line L1 in FIG. 9, matches an expression in an analysis target text of the analysis target data group. In contrast, "食事満足", "ツアー価格満足", "景色が抜群", "食事が満足", and "部屋が汚い" do not match any expression in the analysis target texts. They are expressions newly generated by the expression generation unit 34 from the feature expressions "満足", "抜群", "良い", and "汚い" in the first generation condition and from the expressions in the analysis target texts.
In other words, by using the first generation condition, the expression generation unit 34 can obtain expressions that include the subject of a predicate such as the feature expression "良い" in the first generation condition.

The storage device 2 also stores expression generation rules such as "the expression generation unit 34 may generate, as a registration candidate for the dictionary information, an expression that includes a particle between '食事' (meal) and '満足' (satisfied)". Based on this rule, for the feature expression "満足" in the first generation condition, the expression generation unit 34 generates two or more expressions concerning "食事", such as the registration candidates "食事満足" and "食事が満足", as shown in FIG. 9. In this way, the expression generation unit 34 can generate appropriate variations of an expression.
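The particle rule described above can be illustrated with a minimal Python sketch. The function name `generate_variants` and the representation of the rule as a tuple of allowed particles are assumptions made for illustration; the specification does not state how the rule is encoded.

```python
def generate_variants(subject: str, predicate: str, particles=("", "が")) -> list[str]:
    """Join a subject and a predicate with each allowed particle.

    An empty string stands for "no particle". The particle list is a
    hypothetical encoding of the stored expression generation rule.
    """
    return [f"{subject}{particle}{predicate}" for particle in particles]


variants = generate_variants("食事", "満足")
print(variants)  # ['食事満足', '食事が満足']
```

Keeping the particles as data rather than hard-coding them would let further rules be added without changing the generation logic.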

Here, from the expressions obtained as registration candidates for the dictionary information, the expression generation unit 34 extracts those with a TF (term frequency) of 10 or more and takes them as the dictionary registration candidate generation result. In other words, the expression generation unit 34 functions as selection means that selects, from the generated expressions, those that satisfy a predetermined condition.
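As a rough sketch of this selection step, the filter below keeps only candidates whose TF reaches the threshold of 10. The function name and the example counts are hypothetical; how TF is counted for newly generated expressions is not specified here, so the counts are simply passed in.

```python
MIN_TF = 10  # threshold stated in the description


def select_candidates(tf_by_expression: dict[str, int], min_tf: int = MIN_TF) -> list[str]:
    """Return the expressions whose term frequency is at least min_tf."""
    return [expr for expr, tf in tf_by_expression.items() if tf >= min_tf]


# Illustrative counts only; they are not taken from the specification.
example_tf = {"食事が良い": 14, "食事満足": 10, "ツアー価格満足": 3}
selected = select_candidates(example_tf)
```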

The expression generation unit 34 also uses the synonym sets in the second generation condition shown in FIG. 8, namely the set of "ミュージアム", "美術館" (art museum), and "博物館" (museum), and the set of "お風呂" (bath) and "浴室" (bathroom). As enclosed by line L2 in FIG. 9, it thereby generates registration candidates such as "ミュージアムは面白い" and "美術館は面白い" (the museum is interesting), a group of expressions whose subjects are synonyms sharing the same predicate, and "お風呂が広い" (the bath is spacious) and "浴室が汚い" (the bathroom is dirty), a group of expressions about baths. The group name of each candidate follows the cluster name of the corresponding expression in the generation condition from which it was obtained, namely "ミュージアム" or "お風呂".

"ミュージアムは面白い" is a sentence that does not exist in the analysis target data group, whereas "博物館は面白い" does exist there. Thus, by generating expressions under this synonym condition, the expression generation unit 34 can also obtain expressions that are absent from the input analysis target data group. This generation method has the effect of producing dictionary registration expressions that are applicable to classifying other document data groups and that give high classification accuracy.
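The synonym-based generation can be sketched as follows. This is only an illustration under stated assumptions: the function `expand_with_synonyms`, the split of an observed expression into a subject and the remainder, and the choice to substitute every member of a cluster are all hypothetical and not taken from the specification.

```python
def expand_with_synonyms(observed, clusters):
    """Generate (cluster_name, expression) pairs by swapping synonym subjects.

    observed: list of (subject, rest) pairs found in the analysis data.
    clusters: dict mapping a cluster name to its list of synonyms.
    """
    generated = []
    seen = set()
    for subject, rest in observed:
        for cluster_name, members in clusters.items():
            if subject not in members:
                continue
            for member in members:
                expression = member + rest
                if expression not in seen:  # avoid duplicate candidates
                    seen.add(expression)
                    generated.append((cluster_name, expression))
    return generated


clusters = {
    "ミュージアム": ["ミュージアム", "美術館", "博物館"],
    "お風呂": ["お風呂", "浴室"],
}
# Illustrative input: expressions assumed to occur in the analysis data.
observed = [("博物館", "は面白い"), ("浴室", "が汚い")]
candidates = expand_with_synonyms(observed, clusters)
```

With this input, "ミュージアムは面白い" appears among the candidates although only "博物館は面白い" was observed, which mirrors the effect described above.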

Next, the dictionary registration determination unit 51 of the dictionary registration unit 5 determines, for the dictionary registration candidate generation result obtained by the expression generation unit 34, whether each expression is to be registered in the dictionary, according to the dictionary registration criterion stored in the storage device 2 (step S6).
Specifically, the dictionary registration determination unit 51 uses as the criterion an index that generally indicates the weight of a word, such as the product of TF and IDF, and determines that an expression whose index is at or above a certain value should be registered in the dictionary information. This value may be an empirical value, for example a given percentage of the sum of the TF-IDF products over all words, or a value based on the results of statistical analysis.
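A minimal sketch of such a determination is shown below. It assumes one common definition of the index, TF as the total number of occurrences and IDF as log(N / document frequency), and a threshold given as a ratio of the summed scores; the specification fixes neither the exact formula nor the ratio, and the function names are hypothetical.

```python
import math


def tf_idf_scores(expressions, documents):
    """Score each expression by TF x IDF over a list of document strings."""
    n_docs = len(documents)
    scores = {}
    for expr in expressions:
        tf = sum(doc.count(expr) for doc in documents)        # total occurrences
        df = sum(1 for doc in documents if expr in doc)       # documents containing expr
        scores[expr] = tf * math.log(n_docs / df) if df else 0.0
    return scores


def decide_registration(scores, ratio):
    """Register expressions whose score is at least ratio x (sum of all scores)."""
    threshold = ratio * sum(scores.values())
    return [expr for expr, score in scores.items() if score >= threshold]


# Illustrative data only.
documents = ["食事が良い", "食事が良い", "部屋が汚い", "満足", "満足"]
scores = tf_idf_scores(["食事が良い", "部屋が汚い", "満足"], documents)
registered = decide_registration(scores, ratio=0.33)
```

In practice the ratio would be tuned empirically or statistically, as the description suggests.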

The dictionary registration unit 5 registers the dictionary information by storing, in the dictionary information storage unit 22 of the storage device 2, those expressions in the dictionary registration candidate generation result obtained by the expression generation unit 34 that the dictionary registration determination unit 51 has determined should be registered, together with the group names associated with them in that result (step S7). A serial number is assigned to each expression registered as dictionary information.
FIG. 10 is a diagram showing, in table form, an example of the dictionary information registration result produced by the dictionary registration unit of the dictionary construction support apparatus according to the first embodiment of the present invention.

Then, in response to input operations on the input device 1 that select "Dictionary creation support" and "Dictionary editing", which are menus for verifying registered dictionary information, the dictionary registration unit 5 causes the display device 6 to display the dictionary information stored in the dictionary information storage unit 22 of the storage device 2 (step S8).

FIG. 11 is a diagram showing a display example of dictionary information by the dictionary construction support apparatus according to the first embodiment of the present invention.
When the icon labeled "Save As" on this screen is selected by an operation on the input device 1, the dictionary registration unit 5 causes the display device 6 to display an input screen for a dictionary name, which is the identifying name of the displayed dictionary information. The dictionary information is then stored in the dictionary information storage unit 22 under the dictionary name entered on this screen. This completes the construction of the dictionary information.

The contents of the dictionary information can also be changed on the screen shown in FIG. 11 by input operations on the input device 1. When the icon labeled "Overwrite Save" is selected by an operation on the input device 1 after the changes are complete, the dictionary information is overwritten in the dictionary information storage unit 22.

After the dictionary information has been constructed, the classification processing unit 7 newly reads from the storage device 2 a document data group to be classified, which is separate from the analysis target data group used to construct the dictionary information. The document data group to be classified may be newly created by operating the input device 1 and stored in the storage device 2, or may be stored in the storage device 2 from an external device via an input/output interface (not shown).

The classification processing unit 7 then collates one item of the read document data group with the expressions in the dictionary information. If an expression in the document data matches an expression in the dictionary information, the unit acquires the group name associated with that expression in the dictionary information and classifies the collated document data as document data related to that group name (step S9). The classification processing unit 7 performs this classification for each item of the document data group.
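The classification step can be sketched as simple substring matching. This is a hedged illustration: the function `classify`, the data layout, and the example documents are assumptions, and a real system might match on morphological analysis results rather than raw substrings.

```python
def classify(documents, dictionary):
    """Map each document id to the group names whose expressions occur in it.

    documents:  dict of document id -> text.
    dictionary: list of (group_name, expression) pairs.
    """
    result = {}
    for doc_id, text in documents.items():
        groups = []
        for group_name, expression in dictionary:
            if expression in text and group_name not in groups:
                groups.append(group_name)
        result[doc_id] = groups
    return result


dictionary = [
    ("好評意見", "満足"),
    ("好評意見", "食事が良い"),
    ("不評意見", "部屋が汚い"),
]
# Illustrative documents to be classified.
documents = {1: "食事が良いので満足", 2: "部屋が汚い", 3: "特になし"}
classified = classify(documents, dictionary)
```

A document that matches no dictionary expression, such as document 3 here, simply receives an empty list of group names.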

As described above, in the dictionary construction support apparatus according to the first embodiment of the present invention, the registration candidate expression generation unit 3 and the dictionary registration unit 5 construct the dictionary information automatically, without the user having to devise expressions from scratch and register them manually, so the dictionary can be constructed easily.

In addition, this dictionary construction support apparatus does not limit the expressions of the dictionary information to those written verbatim in the sentences of the analysis target data group. Based on expressions scattered across the individual sentence data of the analysis target data group, it can also adopt expressions that are not written verbatim in those sentences. As a result, after the dictionary information has been constructed, a separate document data group to be classified can be classified accurately.

The present invention is also applicable to constructing dictionaries for other document analysis systems, such as document proofreading systems. It is further effective in reducing the time spent on operating a document analysis system, for example on updating dictionary information.

(Second Embodiment)
Next, a second embodiment of the present invention will be described. Of the configuration of the dictionary construction support apparatus according to this embodiment, description of the parts identical to those shown in FIG. 1 is omitted.

FIG. 12 is a block diagram showing a configuration example of the dictionary construction support apparatus according to the second embodiment of the present invention.
As shown in FIG. 12, compared with the apparatus according to the first embodiment, the dictionary construction support apparatus according to the second embodiment further includes a dictionary verification unit 41 and a dictionary editing unit 42, which is means for editing the dictionary information. Each of these is connected to the bus 8.

The dictionary verification unit 41 causes the display device 6 to display, for the user, the correspondence between the analysis target data group and the dictionary information constructed from that data group.
The dictionary editing unit 42 edits the dictionary information stored in the dictionary information storage unit 22 in accordance with input operations on the input device 1.

FIG. 13 is a flowchart showing an example of the processing operation of the dictionary construction support apparatus according to the second embodiment of the present invention.
In the second embodiment, after the processing of steps S1 to S8 described in the first embodiment has been performed, the user selects on the input device 1 the menus for verifying registered dictionary information, namely "Dictionary creation support", "Dictionary verification", and "Dictionary verification result display", and enters the name of the dictionary to be verified. The dictionary verification unit 41 then reads, from the dictionary information stored in the dictionary information storage unit 22 of the storage device 2, the dictionary information corresponding to that dictionary name, and reads the analysis target data group used to construct it from the analysis target data storage unit 21 of the storage device 2. Based on these, it causes the display device 6 to display a dictionary information verification screen that includes a text display field and a dictionary display field (step S11).

FIG. 14 is a diagram showing a display example of the dictionary information verification result produced by the dictionary verification unit of the dictionary construction support apparatus according to the second embodiment of the present invention.
As shown in FIG. 14, the text display field of the dictionary information verification result screen displays the sentence number of each document data item in the analysis target data group in association with its sentence, that is, its body text. The sentences shown in the text display field at one time are only part of the analysis target data group, but sentences whose numbers are not currently shown can be displayed by operating the on-screen scroll bar with the input device 1.

When the user selects one of the sentences on the screen by operating the input device 1, the dictionary verification unit 41 displays in the dictionary display field those expressions of the dictionary information read in step S11 that match expressions in the selected sentence, together with the group name and serial number associated with each in the dictionary information.

For example, when the sentence with "sentence number 1" is selected with a cursor or the like in FIG. 10, the dictionary display field shows the expressions registered in the dictionary information that were extracted from this sentence, that is, the expressions "食事が良い" and "満足", which are underlined in the selected sentence in the text display field. This lets the user easily confirm which expressions in the sentences of the analysis target data group are reflected in the constructed dictionary information.
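The lookup behind this verification display might be sketched as follows. The function `matching_entries`, the tuple layout of a dictionary entry, the serial numbers, and the example sentence are all assumptions for illustration; the character offsets are returned so that a display layer could underline the matched spans.

```python
def matching_entries(sentence, dictionary):
    """Return dictionary entries found in the sentence, with character offsets.

    dictionary: list of (serial_number, group_name, expression) tuples.
    Each hit is (serial_number, group_name, expression, start, end).
    """
    hits = []
    for serial, group_name, expression in dictionary:
        start = sentence.find(expression)  # -1 when the expression is absent
        if start != -1:
            hits.append((serial, group_name, expression, start, start + len(expression)))
    return hits


# Hypothetical serial numbers and sentence, for illustration only.
dictionary = [
    (1, "好評意見", "満足"),
    (2, "好評意見", "食事が良い"),
    (3, "不評意見", "部屋が汚い"),
]
sentence = "食事が良いので満足。でも対応悪かった"
hits = matching_entries(sentence, dictionary)
```

This sketch reports only the first occurrence of each expression; a fuller version would loop over repeated occurrences.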

This dictionary construction support apparatus has a function that lets the user, while referring to the dictionary information verification result screen, add dictionary information expressions related to expressions in the sentence selected on that screen.
For example, suppose the user notices the non-underlined expression "対応悪かった" (the service was poor) in the sentence of "sentence number 1" and wishes to register it in the dictionary information as an expression belonging to the group name "不評意見" (unfavorable opinion). Suppose further that, by input operations on the input device 1, the user adds a row to the dictionary display field on the screen of FIG. 10 and enters the group name "不評意見" and the expression "対応悪かった".

Then, when the icon labeled "Overwrite Save" on the screen is selected by an operation on the input device 1, the dictionary editing unit 42 edits the dictionary information by overwriting the dictionary information storage unit 22 so that it includes the added expression (step S12).

Then, as in step S11, the dictionary verification unit 41 reads, from the dictionary information stored in the dictionary information storage unit 22 of the storage device 2, the dictionary information corresponding to the dictionary name to be verified, and reads the analysis target data group used to construct it from the analysis target data storage unit 21 of the storage device 2. Based on these, it causes the display device 6 to display the dictionary information verification screen, including the text display field and the dictionary display field, as they stand after the dictionary information has been edited (step S13).

FIG. 15 is a diagram showing a display example of the dictionary information verification result after the dictionary information has been edited by the dictionary editing unit of the dictionary construction support apparatus according to the second embodiment of the present invention.
On the screen shown in FIG. 15, compared with the screen shown in FIG. 14, "対応悪かった" in the selected sentence with sentence number "1" is newly underlined in the text display field, and, corresponding to this underlined portion, the group name "不評意見", the number "9", and the expression "対応悪かった" are added to the dictionary display field. Processing then proceeds to steps S9 and S10 described in the first embodiment.

As described above, in addition to the effects of the first embodiment, the dictionary construction support apparatus according to the second embodiment of the present invention allows the user to edit the dictionary after verifying the correspondence between the constructed dictionary information and the sentences of the analysis target data group used to construct it, and then to verify the edited dictionary information as well.

The present invention is not limited to the embodiments exactly as described above. At the implementation stage, it can be embodied with the constituent elements modified without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the constituent elements disclosed in the embodiments. For example, some constituent elements may be omitted from the full set shown in an embodiment. Constituent elements from different embodiments may also be combined as appropriate.

FIG. 1 is a block diagram showing a configuration example of the dictionary construction support apparatus according to the first embodiment of the present invention.
FIG. 2 is a flowchart showing an example of the processing operation of the dictionary construction support apparatus according to the first embodiment.
FIG. 3 is a diagram showing, in table form, an example of the analysis target data group stored in the analysis target data storage unit of the dictionary construction support apparatus according to the first embodiment.
FIG. 4 is a diagram showing, in table form, an example of the processing data used for feature expression extraction by the registration candidate expression generation unit of the dictionary construction support apparatus according to the first embodiment.
FIG. 5 is a diagram showing, in table form, an example of the feature expression extraction result produced by the registration candidate expression generation unit according to the first embodiment.
FIG. 6 is a diagram showing, in table form, an example of the processing data used for similar expression extraction by the registration candidate expression generation unit according to the first embodiment.
FIG. 7 is a diagram showing, in table form, an example of the similar expression extraction result produced by the registration candidate expression generation unit according to the first embodiment.
FIG. 8 is a diagram showing, in table form, an example of the expression generation conditions based on the similar expression extraction result produced by the registration candidate expression generation unit according to the first embodiment.
FIG. 9 is a diagram showing, in table form, an example of the dictionary registration candidate generation result produced by the registration candidate expression generation unit according to the first embodiment.
FIG. 10 is a diagram showing, in table form, an example of the dictionary information registration result produced by the dictionary registration unit of the dictionary construction support apparatus according to the first embodiment.
FIG. 11 is a diagram showing a display example of dictionary information by the dictionary construction support apparatus according to the first embodiment.
FIG. 12 is a block diagram showing a configuration example of the dictionary construction support apparatus according to the second embodiment of the present invention.
FIG. 13 is a flowchart showing an example of the processing operation of the dictionary construction support apparatus according to the second embodiment.
FIG. 14 is a diagram showing a display example of the dictionary information verification result produced by the dictionary verification unit of the dictionary construction support apparatus according to the second embodiment.
FIG. 15 is a diagram showing a display example of the dictionary information verification result after dictionary information editing by the dictionary editing unit of the dictionary construction support apparatus according to the second embodiment.

Explanation of symbols

1: input device; 2: storage device; 3: registration candidate expression generation unit; 4: expression editing unit; 5: dictionary registration unit; 6: display device; 7: classification processing unit; 8: bus; 21: analysis target data storage unit; 22: dictionary information storage unit; 31: feature expression extraction unit; 32: similar expression extraction unit; 33: expression generation condition setting unit; 34: expression generation unit; 41: dictionary verification unit; 42: dictionary editing unit; 51: dictionary registration determination unit.

Claims

A dictionary construction support apparatus comprising:
storage means for storing dictionary information for analysis of a document data group;
input means for inputting a document data group for constructing the dictionary information;
first extraction means for extracting feature expressions in the input document data group;
second extraction means for extracting similar expressions in the input document data group;
generating means for generating an expression different from the expressions included in the input document data group, based on the feature expressions extracted by the first extraction means and the similar expressions extracted by the second extraction means; and
storage control means for storing dictionary information including the expression generated by the generating means in the storage means.

The dictionary construction support apparatus according to claim 1, wherein
the dictionary information is information in which a classification item name and an expression in a document data group related to the classification item name are associated with each other,
each expression of the input document data group is associated with a classification item name related to the expression,
the first extraction means extracts the feature expression in association with the classification item name associated with the document data group from which the feature expression is extracted,
the generating means associates the generated expression with the classification item name associated with the feature expression on which generation of the expression is based, and
the storage control means stores, in the storage means, dictionary information including the different expression and the classification item name associated with that expression,
the apparatus further comprising:
second input means for inputting a document data group to be classified; and
classification processing means for classifying each item of the document data group to be classified as document data related to a classification item name, based on the document data group input by the second input means and the dictionary information stored in the storage means.

The dictionary construction support apparatus according to claim 1, further comprising:
editing input means for receiving editing input for the dictionary information stored in the storage means; and
editing means for editing the dictionary information based on the content input via the editing input means.

The dictionary construction support apparatus according to claim 1, further comprising selecting means for selecting, from the expressions generated by the generating means, an expression that satisfies a predetermined condition,
wherein the storage control means stores dictionary information including the selected expression in the storage means.

The dictionary construction support apparatus according to claim 1, further comprising:
display control means for displaying, on a display device, the feature expressions extracted by the first extraction means and the similar expressions extracted by the second extraction means;
editing input means for receiving editing input for any of the displayed feature expressions and similar expressions; and
editing means for editing the corresponding expression in accordance with the input via the editing input means.

The dictionary construction support apparatus according to claim 1, further comprising:
editing input means for receiving editing input for the dictionary information stored in the storage means;
editing means for editing the dictionary information based on the content input via the editing input means;
selection input means for receiving selection input of one document data item from the document data group;
selecting means for selecting document data from the document data group stored in the storage means in accordance with the input via the selection input means; and
display control means for highlighting, on a display device, those expressions in the document data selected by the selecting means that match expressions of the dictionary information edited by the editing means, and for displaying the matching expressions of the dictionary information on the display device.