JP2011081495A - Document data analysis device, method and program

Info

Publication number: JP2011081495A
Application number: JP2009231525A
Authority: JP
Inventors: Keizo Uchiyama (内山恵三); Akiko Shibata (柴田明子)
Original assignee: Tokyo Electric Power Co Inc
Current assignee: Tokyo Electric Power Company Holdings Inc
Priority date: 2009-10-05
Filing date: 2009-10-05
Publication date: 2011-04-21

Abstract

PROBLEM TO BE SOLVED: To more accurately estimate the meaning of text contained in target data.

SOLUTION: A document data analysis device for analyzing document data containing a plurality of target data items includes: a holding part 22 which holds a plurality of search keys defined using regular expressions; a search part 24 which searches the plurality of target data items using the held search keys; a clustering part 28 which sets the information obtained from the search as variables and performs clustering processing to group similar target data items; and an information acquiring/estimating part 27 which acquires, as the attribute information of a target data item, the attribute information set in other target data items belonging to the same group in the clustering result.

COPYRIGHT: (C)2011,JPO&INPIT

Description

The present invention relates to a technique for analyzing document data.

Conventionally, there is a technique that decomposes natural-language sentences into morphemes, determines whether a causal relationship exists by checking whether the sequence of morphemes corresponds to a previously prepared event pattern, and, when a causal relationship exists, performs data clustering that takes that relationship into account (see Patent Document 1). There is also a technique that searches using predicate keywords, extracts each matched predicate keyword together with the preceding and following sentences, removes noise from the extracted sentences by narrowing searches and sorting, and stores the result in a knowledge base together with the surrounding sentences (see Patent Document 2).

JP 2008-203964 A; JP 2005-115882 A; JP 2005-063470 A; JP 2001-84250 A

Akihiro Suyama (酢山明弘) and three others, "Filtering Free Questionnaire Sentences Using Decision Trees", Proceedings of the National Conference of the Japanese Society for Artificial Intelligence, 2004, 1A4-05
Masaaki Tsuchida (土田正明) and two others, "Construction of a Dictionary System Using the World Wide Web", Proceedings of the National Conference of the Japanese Society for Artificial Intelligence, 2004, 1A3-04
Risa Nishiyama (西山莉紗) and four others, "Extraction of Feature Expressions for Mining Technical Documents", Proceedings of the National Conference of the Japanese Society for Artificial Intelligence, 2008, 3K3-2
Shinichi Kobayashi (小林慎一) and five others, "A Time-Series Analysis of Technology Trends Using Internet Resources", IEEJ Transactions C (Electronics, Information and Systems), 2005, Vol. 125-C, Number 5

Conventionally, as a method for extracting knowledge from a large amount of document data, a technique has been used that cuts words out of a document by morphological analysis, estimates the dependency relations between words with independent words as the reference, and builds a syntax tree from those dependency relations. In such morphological analysis, frequent patterns (knowledge) that are contained in many of the constructed syntax trees are discovered under given pattern constraints, and documents that match substitutions into the discovered frequent patterns are retrieved.

However, the natural-language sentences that are actually analyzed vary widely in notation, and subjects or predicates are sometimes omitted. It has therefore been difficult to mechanically extract the dependency relations in the text by morphological analysis, or to estimate the exact meaning of the text from the extraction results.

In view of the above problem, an object of the present invention is to estimate more accurately the semantic content of text included in target data.

The present invention solves the above problem with the following configuration. That is, the present invention is a document data analysis apparatus for analyzing document data including a plurality of target data items, comprising: holding means for holding a plurality of search keys defined using regular expressions; search means for searching the plurality of target data items using the plurality of search keys held by the holding means; clustering means for performing clustering processing that groups similar target data items, using information obtained from the search by the search means as variables; and information estimation means for acquiring, as attribute information of a target data item, attribute information set in other target data items that belong to the same group in the result of the clustering processing.

According to the present invention, using search keys defined by regular expressions makes it possible to find the necessary characteristic portions without being affected by notational variation in the target data. To find such characteristic portions more accurately, the search keys are preferably regular expressions of features that are unaffected by notational variation such as colloquial style or omission of the subject.

Furthermore, in the present invention, clustering processing is performed that groups similar target data items, using the information obtained from the search by the search means as variables. Clustering processing here is a conventional technique that groups target data by aggregating items that are close to one another; for example, grouping can be performed by calculating the distances between cases and successively merging the two cases that are closest. The clustering processing groups target data items with similar meanings. Thus, according to the present invention, feature patterns of regular expressions can be clustered, and attribute information can be acquired from similar cases and set. That is, even for target data for which no attribute information has been set, attribute information can be acquired from other target data whose content is estimated to be similar; in other words, the semantic content of the text included in the target data can be estimated based on other, similar target data.
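The successive merging of the two closest cases described above can be sketched in Python as follows. This is only an illustrative sketch: the Euclidean distance, the single-linkage rule, the stopping condition `n_groups`, and all function and variable names are assumptions made for the example and are not specified in this document.

```python
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def agglomerate(vectors, n_groups):
    """Group cases by repeatedly merging the two closest groups.

    vectors: {case_id: tuple of numeric variables} (hypothetical layout)
    n_groups: number of groups at which merging stops (an assumption)
    """
    clusters = [{case_id} for case_id in vectors]

    def linkage(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(euclidean(vectors[a], vectors[b]) for a in c1 for b in c2)

    while len(clusters) > n_groups:
        # combinations() yields i < j, so deleting index j keeps i valid.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters


example_vectors = {
    "c1": (1, 0, 2),
    "c2": (1, 0, 3),
    "c3": (0, 5, 0),
    "c4": (0, 6, 0),
}
groups = agglomerate(example_vectors, n_groups=2)
```

In this toy input, c1 and c2 end up in one group and c3 and c4 in the other, because each pair differs in only one variable.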

The document data analysis apparatus according to the present invention may further comprise: determination means for determining, by referring to the result of the search by the search means, whether each target data item includes a search key relationship defined in advance as a combination of related search keys; and attribute acquisition means for acquiring, when the determination means determines that a target data item includes the search key relationship, attribute information associated in advance with that search key relationship as the attribute information of the target data item. In this case, the information estimation means may acquire attribute information, based on the result of the clustering processing, for target data items for which no attribute information was acquired by the determination means and the attribute acquisition means.

Here, a search key relationship is a combination of search keys that are related in meaning; in other words, it is a relationship defined in advance as representing a predetermined semantic content (attribute) when the keys appear together in the target data. By determining whether the target data includes such a search key relationship, the present invention makes it possible to extract dependency relations that conventional language analysis could not extract.

The information estimation means may further acquire, as information related to a target data item, information related to the representative search key relationship of the group, among the groups formed by the clustering means, to which the search key relationship included in that target data item belongs.

By replacing or supplementing information related to the target data based on the search key relationship of the representative case identified by the grouping, it becomes possible to estimate the search keys of the target data in detail and to make the search keys more concrete, more detailed, or more uniform.

Furthermore, the present invention can also be understood as a method executed by a computer, or as a program that causes a computer to execute the method. The present invention may also be such a program recorded on a recording medium readable by a computer or another device, machine, or the like. Here, a recording medium readable by a computer or the like is a recording medium that stores information such as data and programs by electrical, magnetic, optical, mechanical, or chemical action and from which a computer or the like can read that information.

According to the present invention, the semantic content of text included in target data can be estimated more accurately.

A diagram showing the hardware configuration of the document data analysis apparatus 1 according to the embodiment.
A diagram showing an outline of the functional configuration of the document data analysis apparatus 1 according to the embodiment.
A diagram showing the configuration of the base dictionary according to the embodiment.
A diagram showing the configuration of the relation dictionary according to the embodiment.
A flowchart showing the flow of the data clustering process according to the embodiment.
A diagram showing the result of the search process using the base dictionary in the embodiment.
A diagram showing the configuration of the variable data generated for clustering in the embodiment.
A diagram showing the result of data clustering in the embodiment.
A flowchart showing the flow of the pattern extraction process according to the embodiment.
A diagram showing the result of the attribute determination process using the relation dictionary in the embodiment.
A diagram showing the result of the pattern extraction process in the embodiment.

Hereinafter, an embodiment of the document data analysis apparatus 1 according to the present invention will be described with reference to the drawings.

<System configuration>
FIG. 1 is a diagram showing the hardware configuration of the document data analysis apparatus 1 according to the present embodiment. The document data analysis apparatus 1 is a computer (information processing device) comprising a CPU (Central Processing Unit) 11, a RAM (Random Access Memory) 13 as a main storage device, a ROM (Read Only Memory) 12, an auxiliary storage device 14 such as an HDD (Hard Disk Drive) or SSD (Solid State Drive), a display as a display device 15, and a keyboard, mouse, or the like as an input device 16.

FIG. 2 is a diagram showing an outline of the functional configuration of the document data analysis apparatus 1 according to the present embodiment. In a computer having the configuration shown in FIG. 1, the document data analysis program recorded in the auxiliary storage device 14 is read into the RAM 13 and executed by the CPU 11, whereby the computer functions as the document data analysis apparatus 1 comprising an input receiving unit 21, a holding unit 22, an output unit 23, a search unit 24, a determination unit 25, a pattern extraction unit 26, an information acquisition/estimation unit 27, a clustering unit 28, and a notation control unit 29.

When the determination unit 25 determines that a target case includes a search key relationship, the information acquisition/estimation unit 27 acquires the attribute information associated in advance with that search key relationship as the attribute information of the target case. When no attribute information is acquired in this way, the information acquisition/estimation unit 27 acquires, as the attribute information of the target case, the attribute information set in other cases that belong to the same group in the result of the clustering processing.
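The two-stage acquisition just described (relation dictionary first, clustering result as fallback) might look like the following Python sketch. The function name, the data layouts, and in particular the majority vote used to pick one attribute when group members disagree are assumptions for illustration; this document only states that the attribute set in other cases of the same group is acquired.

```python
from collections import Counter


def estimate_attribute(case_id, relation_attrs, groups, known_attrs):
    """Return an attribute for case_id, or None if none can be found.

    relation_attrs: {case_id: attribute found via a search key relationship}
    groups: list of sets of case IDs produced by clustering
    known_attrs: {case_id: attribute already set on the case}
    All three layouts are hypothetical.
    """
    # Stage 1: attribute associated with a matched search key relationship.
    attribute = relation_attrs.get(case_id)
    if attribute is not None:
        return attribute

    # Stage 2: fall back to the attributes of other cases in the same group.
    for group in groups:
        if case_id in group:
            votes = Counter(
                known_attrs[other]
                for other in group
                if other != case_id and known_attrs.get(other)
            )
            if votes:
                # Majority vote is an assumption, not taken from the source.
                return votes.most_common(1)[0][0]
    return None


example_groups = [{"c1", "c2", "c3"}, {"c4"}]
example_known = {"c1": "claim", "c2": "claim"}
example_relation = {"c4": "opinion/request"}
```

With these inputs, case c3 has no attribute of its own and no relation-dictionary hit, so it takes "claim" from c1 and c2 in its group.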

The holding unit 22 holds various dictionaries (a base dictionary, a relation dictionary, and the like) that include a plurality of search keys defined in advance using ordinary character strings or regular expressions, together with attribute information corresponding to the search keys. Preferably, a different dictionary is prepared for each field, and the dictionary for the appropriate field is used preferentially according to the field to which the document data belongs.

FIG. 3 is a diagram showing the configuration of the base dictionary according to the present embodiment. The base dictionary is dictionary data that accumulates patterns, each having a search key ID for identifying a search key, one search key defined by an ordinary character string or a regular-expression character string, and attribute information corresponding to that search key. Preferably, a different base dictionary is prepared for each field of the document data to be analyzed. In principle, the search keys are defined by regular expressions, and each search key is associated with attribute information indicating the semantic content of text that matches it. For example, text that matches the regular expression ".*苦労(した|しました)" (roughly, "... had a hard time") is considered to express a complaint, so attribute information indicating "claim" is associated with that search key.

FIG. 4 is a diagram showing the configuration of the relation dictionary according to the present embodiment. The relation dictionary is dictionary data that accumulates patterns, each having a search key ID, a plurality of search keys, and attribute information corresponding to that combination of search keys. In principle, the search keys are defined by regular expressions, and each set of search keys is associated with attribute information indicating the semantic content of text that matches all of the regular expressions in the pattern (the set of search keys). For example, text that matches the regular expression ".*苦労(した|しました)", which on its own indicates a complaint, but that also matches the regular expression ".*(良いのでは|助かる)" (roughly, "... would be good" or "... would help") is considered, taken as a whole, to express an opinion or request rather than a complaint, so attribute information indicating "opinion/request" is associated with that set. The search keys of the relation dictionary need not be stored directly as regular expressions; they may instead be set by specifying search key IDs (search key identification information) in the base dictionary.
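As a rough illustration of how the two dictionaries could interact, the following Python sketch stores the base dictionary as compiled regular expressions and lets the relation dictionary refer to them by search key ID, as the text permits. The key IDs "K1" and "K2", the data layout, the `classify` function, and the choice to leave K2 without an attribute of its own are all assumptions made for the example.

```python
import re

# Base dictionary: search key ID -> (regular expression, attribute).
# The two patterns are the examples quoted in the text; the IDs are made up.
BASE_DICTIONARY = {
    "K1": (re.compile(r".*苦労(した|しました)"), "claim"),
    "K2": (re.compile(r".*(良いのでは|助かる)"), None),  # no stand-alone attribute assumed
}

# Relation dictionary: (set of search key IDs that must all match, attribute).
RELATION_DICTIONARY = [
    ({"K1", "K2"}, "opinion/request"),
]


def classify(sentence):
    """Return the attribute for a sentence, preferring relation matches."""
    matched = {
        key_id
        for key_id, (pattern, _) in BASE_DICTIONARY.items()
        if pattern.search(sentence)
    }
    # A matching combination overrides the attributes of the single keys.
    for key_ids, attribute in RELATION_DICTIONARY:
        if key_ids <= matched:
            return attribute
    for key_id in sorted(matched):
        attribute = BASE_DICTIONARY[key_id][1]
        if attribute is not None:
            return attribute
    return None
```

For instance, `classify("設定に苦労した")` yields "claim", while a sentence that also contains "助かる" is classified as "opinion/request" because the relation entry takes precedence.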

<Process flow>
Next, the flow of processing executed by the document data analysis apparatus 1 according to the present embodiment will be described. The following describes the processing flow when the document data analysis apparatus 1 analyzes a reception log entered by operators and accumulated at a call center. At the call center, telephone calls such as customer inquiries are received, and the operator enters a record of the customer interaction into a computer. Part of each reception log entry entered by the operator and accumulated is free text. Note, however, that the document data analysis apparatus 1 according to the present embodiment can also be used to analyze various other texts (for example, questionnaire results) besides call center reception logs.

FIG. 5 is a flowchart showing the flow of the data clustering process according to the present embodiment. The process shown in this flowchart starts when the user has specified the document data to be analyzed and an operation instructing the start of the data clustering process has been accepted. The processing order and the specific processing shown in this flowchart are one example that can be adopted in carrying out the present invention; the actual processing order and specific processing may be any of the various processing orders and specific processing that a person skilled in the art could adopt to carry out the present invention.

In step S101, input of document data is accepted. The input receiving unit 21 accepts input of document data via a network such as a LAN, or via a portable recording medium such as a USB memory or CD-ROM. The document data processed in the present embodiment is a reception log from a call center, and the reception log contains the text entered by the operator, as character-coded data, for each call or for each series of related matters. Hereinafter, the per-call or per-matter data contained in the reception log is referred to as a "case".

Each case includes a case ID for identifying the case and attribute information of the case entered by the operator. Attribute information is information indicating the attribute of a case, for example "claim", "opinion/request", or "praise"; it is set by the operator who handled the call so that the meaning of the call or matter related to the case can be determined. However, attribute information is not necessarily set for every case, and the reception log may contain cases for which no attribute information has been set, for example because the operator forgot to set it. When the input document data has been recorded in the RAM and input acceptance is complete, the process proceeds to step S102.

In step S102, the notation in the document is controlled. The notation control unit 29 controls the notation in the document by executing search-and-replace processing, using a notation control dictionary, on the document data input in step S101. The notation control dictionary contains search keys, written as ordinary character strings or regular-expression character strings, that define notational variations and frequently occurring errors, together with the replacement character string corresponding to each search key, that is, the character string after notation control. Applying search-and-replace processing to the document data with this dictionary controls the notation in the document and produces document data that is free of notational variations and typographical errors (or in which they are reduced).

Here, control means standardizing the expressions used in the text under a fixed criterion. Specifically, it includes correcting errors; unifying words that have several notations (for example, replacing character strings such as "ファックス" and "ファクシミリ" with "FAX"); controlling synonyms (for example, replacing character strings such as "手早く", "素早く", and "迅速に" with "すぐに"); deleting prefixes; correcting roundabout phrasing; unifying the text to the plain "de aru" style; simplification; and deleting or correcting sentence endings that carry no meaning.

When call center response records, questionnaires, and the like are analyzed, the text is written in a colloquial style that contains typographical errors, omissions, and kana-kanji conversion mistakes, and its notation varies widely. As a result, dependency relations with independent words are often not expressed cleanly in the text. For example, given "デンわたいおうが、○○です" (a garbled entry of "電話対応が、○○です", "the telephone response is ○○"), a person can read it as intended and can therefore extract the dependency relation between "電話対応" (telephone response) and "○○", but extracting it mechanically is difficult. For this reason, in the present embodiment, notational variation is removed by the control processing before analysis. In addition, in document data such as questionnaires, the subject or predicate is often omitted, so dependency relations may be missed. For example, in the text "電話対応 昨日○○" ("telephone response, yesterday ○○"), there is a blank character between "電話対応" and "昨日○○"; although a person reading it can see the dependency relation, conventional morphological analysis can extract the dependency relation between "昨日" (yesterday) and "○○" but not the relation with "電話対応". In the text data analysis processing according to the present embodiment, the control processing tidies the expressions in the text, so that the dictionary search processing using regular expressions and the data clustering processing described later become more efficient, and highly accurate analysis results can be expected.

In the document data analysis apparatus 1 according to the present embodiment, the replacement character string can also be specified with a regular expression, which makes it possible to replace only part of the character string found by the regular expression. To this end, in the present embodiment, the portion to be quoted in the replacement character string is marked in advance within the matching regular expression, for example by enclosing it in tags such as "\1". For example, to unify only the unit in "100ワット、200ワット、..." ("100 watts, 200 watts, ...") to "100W、200W、..." while leaving the numerals as they are, the text is searched with a regular expression defined as "\1[0-9]+\1ワット" to find the locations to replace, and the replacement character string is defined as "\1[0-9]+\1W"; the numeric part alone is then taken from the original character string and carried into the replacement. When a single regular expression contains several portions to be quoted at replacement time, enclosing tags such as "\2" and "\3" can be used. With this kind of replacement processing, there is no need to prepare every replacement pattern, and the likelihood of notation errors and omissions is reduced. When notation control ends, the process proceeds to step S103.
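The partial replacement described above can be reproduced with Python's standard `re` module, as sketched below. Note that Python marks the quoted portion with a capture group `(...)` and refers to it as `\1` in the replacement, which differs from the enclosing-tag notation used in this document; the rule list and the `normalize` function are hypothetical names introduced for the example, and the fixed-string rules are the ones mentioned for step S102.

```python
import re

# Each rule is (search key, replacement string); the list layout is an assumption.
NORMALIZATION_RULES = [
    (re.compile(r"ファックス|ファクシミリ"), "FAX"),
    (re.compile(r"手早く|素早く|迅速に"), "すぐに"),
    # The capture group keeps the digits; only the unit is rewritten.
    (re.compile(r"([0-9０-９]+)ワット"), r"\1W"),
]


def normalize(text):
    """Apply every search-and-replace rule in order and return the result."""
    for pattern, replacement in NORMALIZATION_RULES:
        text = pattern.sub(replacement, text)
    return text


normalized = normalize("100ワット、200ワットの電球をファックスで迅速に注文")
```

Here `normalized` becomes "100W、200Wの電球をFAXですぐに注文": the numerals are carried over unchanged while the unit and the variant spellings are unified.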

In step S103, search processing using the base dictionary is performed. The search unit 24 searches the notation-controlled document data generated in step S102 with the search keys included in the base dictionary. For each hit, it associates the found character string (hereinafter, "found character string"), the case ID of the target case containing the found character string, the position of the found character string in the document, the search key that produced the hit, the attribute of the found character string, and so on with one another, and holds them in the RAM 13 as the search result.

FIG. 6 is a diagram showing the result of the search processing using the base dictionary in the present embodiment. In a search using regular expressions, the search is performed by finding character strings that match the regular expression, with the delimiter character specified in the system treated as the boundary between data units. The delimiter can be specified freely; for example, "。" or a line break can be specified as the delimiter. The position of a found character string can be specified using the number of characters, the number of bytes, the number of logical lines, or the like from the beginning of the document. When the base dictionary search ends, the process proceeds to step S104.
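A minimal Python sketch of such a delimiter-bounded search follows. It assumes "。" and the line break as delimiters and records positions as character counts from the start of the document; the function name, the dictionary layout, and the field names of each search result record are hypothetical.

```python
import re


def search_case(case_id, text, search_keys, segment_re=re.compile(r"[^。\n]+")):
    """Search one case, treating "。" and line breaks as data boundaries.

    search_keys: {key_id: (compiled regular expression, attribute)}
    Returns one record per hit, with the position counted in characters
    from the beginning of the text.
    """
    hits = []
    for segment in segment_re.finditer(text):
        for key_id, (pattern, attribute) in search_keys.items():
            for match in pattern.finditer(segment.group()):
                hits.append({
                    "case_id": case_id,
                    "key_id": key_id,
                    "string": match.group(),
                    "position": segment.start() + match.start(),
                    "attribute": attribute,
                })
    return hits


example_keys = {"K1": (re.compile(r"苦労(した|しました)"), "claim")}
example_hits = search_case("c1", "電話した。設定に苦労した。", example_keys)
```

Because matching is done per segment, a pattern can never span a "。" or a line break, which is one way to honor the delimiter setting described in the text.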

In step S104, the relationships between found character strings are extracted. For each case, the pattern extraction unit 26 extracts the relationships between the found character strings present in the same case, based on, for example, the order of the found character strings that matched search keys in step S103. Here, the pattern extraction unit 26 compares the distance between the appearance positions of the found character strings with a predetermined threshold, and excludes from extraction any found character strings that are separated by the threshold or more, regarding them as weakly related. The pattern extraction unit 26 also compares the proportion of the case occupied by each found character string (number of characters in the found character string / number of characters in the document) with a predetermined threshold, and excludes from extraction any found character string whose proportion is at or below the threshold. Through this processing, the relationships between found character strings can be extracted for each case. The definitions used for extraction, including the thresholds described above, are preferably prepared separately for each field of the document data to be analyzed. The process then proceeds to step S105.
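The two exclusion rules of step S104 can be sketched in Python as below. The sketch assumes that each hit is a record with "key_id", "string", and "position" fields, that relationships are ordered pairs of search key IDs, and that the distance is measured from the last character of the first string to the first character of the second; these choices, and all names, are illustrative assumptions.

```python
from itertools import combinations


def extract_relations(hits, doc_length, max_distance, min_ratio):
    """Return ordered pairs of search key IDs that pass both thresholds.

    hits: records with "key_id", "string", and "position" (hypothetical layout)
    doc_length: number of characters in the case
    """
    # Exclude found strings whose share of the case is at or below min_ratio.
    kept = [h for h in hits if len(h["string"]) / doc_length > min_ratio]
    kept.sort(key=lambda h: h["position"])

    relations = []
    for first, second in combinations(kept, 2):
        first_end = first["position"] + len(first["string"]) - 1
        distance = second["position"] - first_end
        # Exclude pairs separated by max_distance or more.
        if distance < max_distance:
            relations.append((first["key_id"], second["key_id"]))
    return relations


sample_hits = [
    {"key_id": "K1", "string": "苦労した", "position": 0},
    {"key_id": "K2", "string": "助かる", "position": 4},
    {"key_id": "K3", "string": "あ", "position": 50},
]
sample_relations = extract_relations(sample_hits, doc_length=60,
                                     max_distance=10, min_ratio=0.02)
```

In this sample, K3 is dropped because a single character is too small a share of a 60-character case, leaving only the K1 to K2 relationship.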

In step S105, the relationships between matched character strings are narrowed down. The pattern extraction unit 26 narrows down the relationships extracted in step S104 (the regular-expression key relationships) based on their appearance frequency or appearance rate (for example, an appearance rate calculated as "number of cases in which the relationship appears / total number of cases"). One possible method is to compare the calculated frequency or rate with a predetermined threshold and keep only the relationships whose frequency or rate is at or above the threshold. The process then proceeds to step S106.
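A minimal sketch of the appearance-rate filter follows, assuming each case has already been reduced to a list of key relationships. The function name and the default threshold are hypothetical.

```python
from collections import Counter

def narrow_relations(relations_per_case, min_rate=0.5):
    """Keep relations whose appearance rate is at or above min_rate.

    relations_per_case: one list of hashable relations per case.
    Appearance rate = cases containing the relation / total cases.
    """
    total = len(relations_per_case)
    if total == 0:
        return {}
    counts = Counter()
    for relations in relations_per_case:
        counts.update(set(relations))  # count each relation once per case
    return {rel: n / total for rel, n in counts.items() if n / total >= min_rate}
```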

In step S106, similar cases (similar patterns) and their representative cases (representative patterns) are extracted by data clustering. The clustering unit 28 clusters the regular-expression key relationships extracted and narrowed down through step S105, using as variables the case, each regular expression and its order of appearance, the attribute, the number of matching documents, the number of matching sentences, the distance between regular expressions, the appearance frequency, and the appearance rate, and extracts similar cases and their representative cases.

FIG. 7 shows the structure of the variable data generated for clustering in the present embodiment. The distance between regular expressions is treated in the same way as a distance in data clustering. For example, in a key relationship between two regular expressions (hereinafter a "binary relationship"), when the two regular expressions appear consecutively in the text, the distance is (start position of the second regular expression's match) - (end position of the first regular expression's match), which equals 1.
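The distance rule can be checked with a short sketch. It assumes the "end position" is the offset of the last matched character (inclusive), which is what makes two adjacent matches come out at distance 1; Python's `Match.end()` is exclusive, so 1 is subtracted. The function name is hypothetical.

```python
import re

def key_distance(text, first_regex, second_regex):
    """Distance between the first key's match and the next match of the second key.

    Returns None when either key does not match.
    """
    first = re.search(first_regex, text)
    if first is None:
        return None
    # Look for the second key only after the end of the first match.
    second = re.compile(second_regex).search(text, first.end())
    if second is None:
        return None
    last_char_of_first = first.end() - 1  # inclusive end position
    return second.start() - last_char_of_first
```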

FIG. 8 shows the result of data clustering in the present embodiment. Here, the representative case is the case, among the similar cases, that is closest in distance to the other cases; that is, the case that the clustering result places nearest the centroid. When a regular-expression key relationship exists in the relation dictionary, attribute information is set for the extracted case.

Data clustering is a conventional technique that groups objects (cases, variables) by aggregating those that are close to one another. By specifying key relationship 1, key relationship 2, ..., key relationship k (where k is the number of key-relationship cases), together with the order of appearance, attribute, frequency, and character length of each key relationship, as variables and performing clustering, key-relationship patterns with similar meanings are grouped. For example, with clustering by Ward's method, the distances between cases are calculated and the two closest cases are merged successively to form groups.

More specifically, in data clustering, m variables x_i1, x_i2, ..., x_im are set for each of n cases as the initial cases (i = 1, 2, ..., n). As the first-stage merge, the distances between cases are calculated from the distance d(x_a, x_b) (dissimilarity) between cases a and b, and the two closest cases are merged successively. The following equation is used at this point.

Then, as the merges of the second and later stages, c, which results from merging cases a and b, is merged with d. The following equation is used at this point. Here, n_a, n_b, n_c, and n_d are the numbers of individuals (data items) contained in cases a, b, c, and d.
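The equations referred to in the two preceding paragraphs are not reproduced in this text. For orientation only, the textbook form of Ward's method is given below, assuming squared Euclidean distance as the dissimilarity; whether the omitted equations take exactly this form is an assumption. The first line is the initial distance between cases a and b, and the second is the Lance-Williams update for the distance between the merged cluster c (formed from a and b) and another cluster d.

```latex
d(x_a, x_b) = \sum_{k=1}^{m} \left( x_{ak} - x_{bk} \right)^2

d(x_c, x_d) = \frac{(n_a + n_d)\, d(x_a, x_d) + (n_b + n_d)\, d(x_b, x_d) - n_d\, d(x_a, x_b)}{n_a + n_b + n_d},
\qquad n_c = n_a + n_b
```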

When merging is repeated in this way, a single group is eventually formed. Therefore, the distance between cases is compared with a predetermined threshold, and the cases that lie within the threshold are determined to be a final group. Within a group, the case closest to the other cases, that is, the case at the centroid, becomes the representative case and holds the representative regular-expression description of that group.
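The merge-until-threshold procedure and the choice of a representative case can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the specification's implementation: it uses the standard Lance-Williams update for Ward's method with squared Euclidean distance, it assumes every case has already been encoded as a numeric vector, and the function names and threshold semantics (stop when the closest pair exceeds the threshold) are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two numeric vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def ward_clusters(points, threshold):
    """Merge the closest pair repeatedly until it exceeds the threshold.

    Returns a list of groups, each a list of indices into points.
    """
    clusters = {i: [i] for i in range(len(points))}
    dist = {(i, j): sq_dist(points[i], points[j])
            for i in clusters for j in clusters if i < j}
    next_id = len(points)
    while len(clusters) > 1:
        (a, b), d_ab = min(dist.items(), key=lambda kv: kv[1])
        if d_ab > threshold:
            break  # remaining clusters are the final groups
        n_a, n_b = len(clusters[a]), len(clusters[b])
        new_dist = {}
        for d in clusters:
            if d in (a, b):
                continue
            n_d = len(clusters[d])
            d_ad = dist[(min(a, d), max(a, d))]
            d_bd = dist[(min(b, d), max(b, d))]
            # Lance-Williams update for Ward's method
            new_dist[(d, next_id)] = (
                (n_a + n_d) * d_ad + (n_b + n_d) * d_bd - n_d * d_ab
            ) / (n_a + n_b + n_d)
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        dist.update(new_dist)
        next_id += 1
    return list(clusters.values())

def representative(points, members):
    """Index of the member nearest the centroid of its group."""
    dim = len(points[0])
    centroid = [sum(points[i][k] for i in members) / len(members)
                for k in range(dim)]
    return min(members, key=lambda i: sq_dist(points[i], centroid))
```

In practice a library routine for hierarchical clustering would normally replace the hand-written loop; the explicit version is shown only so that the merge rule and the threshold cut are visible.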

The result of data clustering can be used as a dictionary generated automatically from accumulated document data rather than defined in advance (hereinafter a "real-time dictionary"), and also as candidate data for adding entries to the relation dictionary. How it is used as a real-time dictionary is described later in the explanation of the pattern extraction process.

FIG. 9 is a flowchart showing the flow of the pattern extraction process according to the present embodiment. The process shown in this flowchart starts when the user specifies the document data to be analyzed and an operation instructing the start of the pattern extraction process is accepted. The order and details of the processing shown in the flowchart are one example that can be adopted in carrying out the present invention; an actual implementation may adopt any of the various processing orders and specific processes available to a person skilled in the art.

In steps S201 to S203, input of document data is accepted, the notation in the document is standardized, and the search process using the base dictionary is performed. The details are substantially the same as the processing described for steps S101 to S103, so their description is omitted. When the search process using the base dictionary ends, the process proceeds to step S204.

In step S204, the semantic content (attribute) of the text contained in the target case is determined using the relation dictionary. The determination unit 25 and the information acquisition/estimation unit 27 compare the search result of step S203 with the relation dictionary to determine the semantic content of the text in the target case and set attribute information. More specifically, the determination unit 25 refers to the search result of step S203 and checks whether it contains strings matched by the same search keys as those in the relation dictionary, thereby determining whether the multiple search keys defined in each record of the relation dictionary are all present in the target case (the document being searched). When they are, the determination unit 25 further determines whether the lengths of the matched strings, the intervals between them, and so on satisfy the conditions defined in the relation dictionary. If the conditions are satisfied, the information acquisition/estimation unit 27 sets attribute information on the basis that the target case has the attribute attached to the corresponding record of the relation dictionary.
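The relation-dictionary check of step S204 might look like the sketch below. The record fields (`keys`, `max_gap`, `min_length`, `attribute`), the function name, and the sample attribute label are assumptions made for illustration; the specification only states that each record defines several search keys plus conditions on length and spacing.

```python
def determine_attribute(hits, relation_dictionary):
    """Return the attribute of the first dictionary record the case satisfies.

    hits: list of dicts with "key", "start", "end" (end exclusive).
    relation_dictionary: list of records, checked in definition order.
    """
    by_key = {}
    for h in hits:
        by_key.setdefault(h["key"], []).append(h)
    for record in relation_dictionary:
        first_key, second_key = record["keys"]
        # Both search keys of the record must be present in the case.
        for first in by_key.get(first_key, []):
            for second in by_key.get(second_key, []):
                gap = second["start"] - first["end"]
                long_enough = all(
                    h["end"] - h["start"] >= record["min_length"]
                    for h in (first, second)
                )
                if 0 <= gap <= record["max_gap"] and long_enough:
                    return record["attribute"]
    return None  # no record matched: left for cluster-based estimation
```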

FIG. 10 shows the result of the attribute determination process using the relation dictionary in the present embodiment. Consider, for example, the text "I had trouble with ○○. Wouldn't it be better to do ××?" Because it consists of two sentences, "I had trouble with ○○." and "Wouldn't it be better to do ××?", conventional morphological analysis can extract the dependency relation between "××" and "better", but not the relationship between these expressions and "○○". The document data analysis process according to the present embodiment makes possible not only analysis of morpheme dependencies within a single sentence but also analysis that considers the context across sentences and analysis that incorporates characteristic expressions of human behavior, emotion, and the like.

The extraction is performed in the order defined in the definition dictionary. Search results with a larger number of matching search keys, and search results in which the appearance positions of the terms are closer together, are extracted preferentially. Priorities may also be assigned to attributes so that extraction follows those priorities; for example, the attribute "complaint" may be given a higher priority than "opinion/request" so that complaints are extracted first. In the present embodiment, whether the multiple search keys in each record of the relation dictionary are all present in the target case is determined by comparing the search result of step S203 with the relation dictionary. Alternatively, the same determination may be made by searching the target case directly with the search keys contained in the relation dictionary. When the determination of semantic content ends, the process proceeds to step S205.

In step S205, it is determined whether the relationship between the matched character strings in the target case was extracted using the relation dictionary. The determination unit 25 determines whether the relationship among the multiple strings matched in step S203 was extracted by the semantic-content extraction using the relation dictionary in step S204. In other words, the determination unit 25 determines whether the relationship of the matched strings in the target case exists in the relation dictionary. If the relationship was extracted using the relation dictionary, the process proceeds to step S207. If it was not, the process proceeds to step S206.

In step S206, the semantic content of the target case is extracted based on the result of data clustering. The information acquisition/estimation unit 27 refers to the clustering result (the real-time dictionary), acquires the attribute information of a similar case whose pattern is close to that of the target case (a combination of multiple matched strings), and extracts it as the attribute information (semantic content) of the target case (see FIG. 10). The priority for this complementing is highest for the representative case nearest the centroid and decreases with distance from the centroid. The attribute information taken from the clustering result may be that of the representative case of the group to which the target case (or the search key relationship found in the target case) belongs; if no attribute information is available from the representative case, that of another case in the same group may be used. The information acquisition/estimation unit 27 may also acquire, as information related to the target case, any information related to the representative case of the group to which the target case or its search key relationship belongs (that is, not only attribute information but a broad range of information such as search keys and matched strings). This allows information related to the target case to be replaced or supplemented, so that the search keys of the target case can be estimated in detail and made more concrete, more detailed, or more uniform. The process then proceeds to step S207.
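The fallback order described for step S206 (representative case first, then other cases by increasing distance from the centroid) can be sketched as follows. The group layout (`centroid`, `cases`, each case with a `vector` and an `attribute` that may be `None`) and the function name are assumptions; assigning the target case to the group with the nearest centroid is also a simplification made for this sketch.

```python
def estimate_attribute(target_vector, groups):
    """Borrow an attribute from the group whose centroid is nearest the target.

    Cases are tried in order of distance from the centroid, so the
    representative case has the highest priority.
    """
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    nearest = min(groups, key=lambda g: sq_dist(target_vector, g["centroid"]))
    ranked = sorted(nearest["cases"],
                    key=lambda c: sq_dist(c["vector"], nearest["centroid"]))
    for case in ranked:
        if case["attribute"] is not None:
            return case["attribute"]
    return None  # no case in the group carries attribute information
```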

In step S207, pattern extraction is performed. The pattern extraction unit 26 extracts the patterns of the cases contained in the reception log, based on the search result from the base dictionary in step S203, the search result from the relation dictionary in step S204, and the semantic-content estimate from step S206.

FIG. 11 shows the result of the pattern extraction process in the present embodiment. In the present embodiment, the search keys contained in a case and its attribute are extracted as the pattern of that case. The extracted cases are also tallied to calculate the appearance frequency, appearance rate, and so on of the search key relationships contained in the reception log. Pattern extraction narrows down the relationships found through step S206 by comparing their appearance frequency or appearance rate (number of cases in which the relationship appears / total number of cases) and their character-length ratio (number of characters matched by the regular expressions / number of characters in the document) with thresholds specified in advance. With this threshold-based narrowing, patterns below the threshold can be presented as candidates for deletion from the base dictionary and the relation dictionary, and patterns at or above the threshold that are not yet defined in a dictionary can be presented as candidates for addition to those dictionaries. When pattern extraction is complete, the process proceeds to step S208.
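Deriving deletion and addition candidates from the thresholded rates can be sketched as below. The function name and the representation of a pattern as a tuple of search key names are assumptions; only the appearance rate is used here, although the text also mentions a character-length ratio.

```python
def dictionary_candidates(pattern_rates, dictionary_patterns, threshold):
    """Split patterns into deletion and addition candidates.

    pattern_rates: observed appearance rate per pattern.
    dictionary_patterns: set of patterns currently defined in a dictionary.
    """
    # Defined in the dictionary but below the threshold in the data.
    deletion = sorted(p for p in dictionary_patterns
                      if pattern_rates.get(p, 0.0) < threshold)
    # At or above the threshold but not yet defined in the dictionary.
    addition = sorted(p for p, rate in pattern_rates.items()
                      if rate >= threshold and p not in dictionary_patterns)
    return deletion, addition
```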

In step S208, the extracted patterns are output. The output unit 23 outputs the patterns extracted in step S207, the calculated tallies, and so on to the display device 15 for display to the user. The output unit 23 orders the output so that key relationships and search keys with higher appearance frequencies or appearance rates come first. The output unit 23 also presents patterns with a low appearance frequency or rate as candidates for deletion from the dictionary, and presents patterns that are currently defined as separate entries but whose similarity is at or above a predetermined threshold as candidates for consolidation into a single dictionary entry. Referring to the output, the user creates, updates, or deletes regular-expression dictionaries such as the base dictionary and the relation dictionary. The process shown in this flowchart then ends.

In the present embodiment, display on the display device 15 is used as the method of output to the user, but other output methods may be adopted. For example, information may be provided to the user as printed output from a printer.

With the document data analysis apparatus 1 according to the present embodiment, forms that take the context between sentences into account in documents composed of multiple sentences, as well as intuitive human linguistic expressions, which conventional knowledge acquisition methods based on text analysis could not extract, can be expressed as regular expressions and registered in the language analysis dictionary. This widens the scope of analysis and makes it possible to analyze, classify, and acquire knowledge from documents with many abbreviated forms, such as questionnaires. In addition, because the document data analysis apparatus according to the present embodiment has a document analysis function based on regular-expression feature patterns, it extracts relationships between patterns as actually written more accurately than conventional pattern analysis of dependency relations extracted after the text has been finely segmented into morphemes, and it makes intuitive human expression patterns easy to extract.

Description of Symbols

1 Document data analysis apparatus
22 Holding unit
24 Search unit
25 Determination unit
26 Pattern extraction unit
27 Information acquisition/estimation unit
28 Clustering unit
29 Notation control unit

Claims

A document data analysis apparatus for analyzing document data including a plurality of target data, comprising:
holding means for holding a plurality of search keys defined using regular expressions;
search means for searching the plurality of target data using the plurality of search keys held by the holding means;
clustering means for performing a clustering process that groups similar target data, using information obtained as a result of the search by the search means as variables; and
information estimation means for acquiring, as attribute information of target data, attribute information set in other target data belonging to the same group in the result of the clustering process.

The document data analysis apparatus according to claim 1, further comprising:
determination means for determining, by referring to the search result of the search means, whether the target data includes a search key relationship defined in advance as a combination of related search keys; and
attribute acquisition means for acquiring, when the determination means determines that the target data includes the search key relationship, attribute information associated in advance with the search key relationship as the attribute information of the target data,
wherein the information estimation means acquires attribute information, based on the result of the clustering process, for target data for which attribute information has not been acquired by the determination means and the attribute acquisition means.

The document data analysis apparatus according to claim 1, wherein the information estimation means further acquires, as information related to the target data, information related to the representative search key relationship of the group, among the groups formed by the clustering means, to which the search key relationship included in the target data belongs.

A document data analysis method for analyzing document data including a plurality of target data, the method comprising:
a holding step of holding a plurality of search keys defined using regular expressions;
a search step of searching the plurality of target data using the plurality of search keys held in the holding step;
a clustering step of performing a clustering process that groups similar target data, using information obtained as a result of the search in the search step as variables; and
an information estimation step of acquiring, as attribute information of target data, attribute information set in other target data belonging to the same group in the result of the clustering process.

The document data analysis method according to claim 4, further causing a computer to execute:
a determination step of determining, by referring to the search result of the search step, whether the target data includes a search key relationship defined in advance as a combination of related search keys; and
an attribute acquisition step of acquiring, when it is determined in the determination step that the target data includes the search key relationship, attribute information associated in advance with the search key relationship as the attribute information of the target data,
wherein, in the information estimation step, attribute information is acquired, based on the result of the clustering process, for target data for which attribute information has not been acquired in the determination step and the attribute acquisition step.

The document data analysis method according to claim 4, wherein, in the information estimation step, information related to the representative search key relationship of the group, among the groups formed in the clustering step, to which the search key relationship included in the target data belongs is further acquired as information related to the target data.

A document data analysis program for analyzing document data including a plurality of target data, the program causing a computer to function as:
holding means for holding a plurality of search keys defined using regular expressions;
search means for searching the plurality of target data using the plurality of search keys held by the holding means;
clustering means for performing a clustering process that groups similar target data, using information obtained as a result of the search by the search means as variables; and
information estimation means for acquiring, as attribute information of target data, attribute information set in other target data belonging to the same group in the result of the clustering process.

The document data analysis program according to claim 7, further causing the computer to function as:
determination means for determining, by referring to the search result of the search means, whether the target data includes a search key relationship defined in advance as a combination of related search keys; and
attribute acquisition means for acquiring, when the determination means determines that the target data includes the search key relationship, attribute information associated in advance with the search key relationship as the attribute information of the target data,
wherein the information estimation means acquires attribute information, based on the result of the clustering process, for target data for which attribute information has not been acquired by the determination means and the attribute acquisition means.

The document data analysis program according to claim 7, wherein the information estimation means further acquires, as information related to the target data, information related to the representative search key relationship of the group, among the groups formed by the clustering means, to which the search key relationship included in the target data belongs.